Apply Scaler

Applies a pre-fitted Scikit-Learn Scaler to the dataset.

Processing

This brick transforms the numerical values in your dataset using a pre-configured mathematical rule set (a "Scaler"). In data science, it is common to adjust numbers so they fit within a specific range or follow a standard distribution. For example, converting a salary range of 30,000–100,000 and an age range of 20–60 into a unified scale between 0 and 1.

This brick applies those specific calculation rules to your data. Crucially, it does not "learn" the rules from this data; it expects you to provide a Scaler that has already been trained (fitted) in a previous step.
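The fit/apply split can be sketched with plain Scikit-Learn (the column names and values here are illustrative, not part of the brick):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# "Learning" step: a previous brick fits the scaler on reference data.
train = pd.DataFrame({"age": [20, 40, 60], "salary": [30_000, 65_000, 100_000]})
scaler = StandardScaler().fit(train)

# This brick's step: apply the already-fitted rules to new data.
new = pd.DataFrame({"age": [30], "salary": [50_000]})
scaled = scaler.transform(new)  # uses train's mean/std, learns nothing from `new`
```

Because `transform` reuses the statistics captured at fit time, the same rules are applied consistently to any dataset passed through the brick.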

Inputs

data
The dataset containing the numerical values you want to transform.
Scaler
The pre-fitted Scikit-Learn scaler object containing the mathematical rules to apply. This is usually the output of a previous brick.

Input Types

data    DataFrame, ArrowTable
Scaler  Any

You can check the list of supported types here: Available Type Hints.

Outputs

scaled data
The resulting dataset with the target columns transformed according to the Scaler's rules. Non-numerical columns are preserved when Safe Mode is enabled.

Output Types

scaled data  DataFrame

You can check the list of supported types here: Available Type Hints.

Options

The Apply Scaler brick provides the following configurable options:

Safe Mode
Controls how the brick handles non-numeric data (like text or dates). When enabled, the brick automatically identifies numerical columns and only applies the scaler to them. Text or date columns are ignored and left unchanged, preventing errors.
Verbose
Controls the level of detail logged during processing.
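The column filtering performed by Safe Mode relies on pandas dtype selection, as shown in the brick's source below. A minimal standalone sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20, 40],          # numeric -> will be scaled
    "name": ["Ann", "Bob"],   # text    -> ignored, left unchanged
    "active": [True, False],  # bool    -> treated as numeric
})

# Same dtype-based selection the brick uses in Safe Mode:
numeric_cols = df.select_dtypes(include=["number", "bool"]).columns.tolist()
```

Text and date columns simply fall out of `numeric_cols`, so the scaler never sees them and no error is raised.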

Example

Input (data):

Age Salary
20 50000
40 100000

Input (Scaler): A MinMaxScaler that has been previously fitted to this data range.

Output (scaled data):

Age Salary
0.0 0.0
1.0 1.0

Explanation: The Scaler rescaled each column so its lowest value became 0.0 and its highest became 1.0, putting the two columns, originally on very different scales, on a common footing.
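The example above can be reproduced outside the brick with a plain Scikit-Learn `MinMaxScaler` (illustrative sketch; in the flow, fitting would happen in a separate brick):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({"Age": [20, 40], "Salary": [50_000, 100_000]})

# Previously fitted to this data range (a separate step in the flow):
scaler = MinMaxScaler().fit(data)

# Apply the fitted rules; each column maps its min to 0.0 and its max to 1.0.
scaled = pd.DataFrame(scaler.transform(data), columns=data.columns)
```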

import logging
import pandas as pd
import polars as pl
import pyarrow as pa
from coded_flows.types import Union, DataFrame, ArrowTable, Any, List, Str
from coded_flows.utils import CodedFlowsLogger

logger = CodedFlowsLogger(name="Apply Scaler", level=logging.INFO)


def apply_scaler(
    data: Union[DataFrame, ArrowTable], Scaler: Any, options: dict = None
) -> DataFrame:
    options = options or {}
    verbose = options.get("verbose", True)
    safe_mode = options.get("safe_mode", True)
    scaled_data = None
    try:
        verbose and logger.info("Starting Scaler application process.")
        if isinstance(data, pl.DataFrame):
            verbose and logger.info("Detected Polars DataFrame. Converting to Pandas.")
            scaled_data = data.to_pandas()
        elif isinstance(data, pa.Table):
            verbose and logger.info("Detected PyArrow Table. Converting to Pandas.")
            scaled_data = data.to_pandas()
        elif isinstance(data, pd.DataFrame):
            verbose and logger.info("Detected Pandas DataFrame.")
            scaled_data = data.copy()
        else:
            raise ValueError(
                "Input data must be a Pandas DataFrame, Polars DataFrame, or Arrow Table."
            )
        target_columns = scaled_data.columns.tolist()
        verbose and logger.info("No specific columns selected. Targeting all columns.")
        if safe_mode:
            numeric_cols = scaled_data.select_dtypes(
                include=["number", "bool"]
            ).columns.tolist()
            valid_cols = [col for col in target_columns if col in numeric_cols]
            dropped_cols = set(target_columns) - set(valid_cols)
            if dropped_cols and verbose:
                logger.warning(
                    f"Safe Mode: Ignored non-numerical columns: {dropped_cols}"
                )
            target_columns = valid_cols
            verbose and logger.info(
                f"Safe Mode enabled. Final target columns: {target_columns}"
            )
        if not target_columns:
            verbose and logger.warning(
                "No valid columns remaining to scale. Returning original data."
            )
        else:
            missing_cols = [
                col for col in target_columns if col not in scaled_data.columns
            ]
            if missing_cols:
                raise ValueError(f"Columns not found in dataset: {missing_cols}")
            verbose and logger.info(
                f"Applying scaler ({type(Scaler).__name__}) transform."
            )
            data_subset = scaled_data[target_columns]
            scaled_values = Scaler.transform(data_subset)
            scaled_data[target_columns] = scaled_values
            verbose and logger.info("Scaler transform applied successfully.")
    except Exception as e:
        logger.error(f"Error during scaler application: {e}")
        raise
    return scaled_data

Brick Info

version v0.1.4
python 3.11, 3.12, 3.13
requirements
  • shap>=0.47.0
  • scikit-learn
  • pandas
  • pyarrow
  • numba>=0.56.0
  • polars