Convert to Categorical

Convert DataFrame object (string) columns to categorical types.

Processing

This brick optimizes your data by converting text (string) columns into categorical variables. The conversion reduces memory usage most when a column holds few distinct values relative to its number of rows, and it is often a required step when preparing data for machine learning models that expect categorical inputs.

The brick accepts a dataset and checks for text columns. If you specify columns, it converts only those; otherwise, it automatically detects all text-based columns. The displayed values stay the same, but each converted column is stored internally as a set of categories rather than as raw text.
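
The memory effect described above can be checked with plain pandas. This is a minimal sketch, not the brick itself: the sample data and the variable names are made up for illustration, and the actual savings depend on how many distinct values a column holds.

```python
import pandas as pd

# Hypothetical sample: a text column with only three distinct values.
df = pd.DataFrame(
    {"city": ["Paris", "Lyon", "Nice"] * 10_000, "sales": range(30_000)}
)

# Memory used by the text column before conversion (deep=True counts the strings).
before = df["city"].memory_usage(deep=True)

# Same conversion call the brick relies on: astype with a "category" target.
converted = df.astype({"city": "category"})
after = converted["city"].memory_usage(deep=True)

print(f"before: {before} bytes, after: {after} bytes")
```

With many distinct values (for example unique IDs), the gain shrinks and can disappear, so the comparison is worth running on your own data.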

Inputs

data: The dataset you want to process. This contains the columns that need to be converted from standard text to categorical types.

Inputs Types

Input	Types
`data`	`DataFrame`, `ArrowTable`

You can check the list of supported types here: Available Type Hints.

Outputs

result: The resulting dataset with the transformation applied. It contains the exact same information as the input, but the specified columns are now stored as categorical data types.
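
One way to confirm that the output carries the same information is to compare values and inspect the categorical metadata. The sketch below uses plain pandas on a made-up frame; `original` and `result` are illustrative names, not objects produced by the brick.

```python
import pandas as pd

# Hypothetical input with one text column and one numeric column.
original = pd.DataFrame(
    {"grade": ["low", "high", "low", "mid"], "score": [1, 9, 2, 5]}
)
result = original.astype({"grade": "category"})

# The values read back identically; only the storage changed.
same_values = result["grade"].tolist() == original["grade"].tolist()

# Each row now stores a small integer code pointing into the category list.
print(list(result["grade"].cat.categories))  # ['high', 'low', 'mid']
print(result["grade"].cat.codes.tolist())    # [1, 0, 1, 2]

# Columns that were not converted keep their original dtype.
print(result["score"].dtype == original["score"].dtype)  # True
```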

Outputs Types

Output	Types
`result`	`DataFrame`

You can check the list of supported types here: Available Type Hints.

Options

The Convert to Categorical brick has the following options:

Columns to Convert: A list of specific column names to convert. If you provide names, only those columns are converted to categorical types, and any name not present in the dataset is skipped with a warning. If you leave the list empty, the brick scans the dataset and converts every text (object/string) column it finds.
Verbose: Turns the brick's log messages on or off. It is enabled by default.
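
The way these options drive column selection can be sketched as follows. The `columns` and `verbose` keys match the ones read in the brick's source; `pick_columns` is a hypothetical helper written for this illustration and is not part of the brick.

```python
import pandas as pd


def pick_columns(df, options=None):
    """Hypothetical helper: decide which columns would be converted."""
    options = options or {}
    requested = options.get("columns", [])
    if requested:
        # Explicit list: keep only the names that exist in the frame.
        return [c for c in requested if c in df.columns]
    # Empty list: fall back to every text (object/string) column.
    return df.select_dtypes(include=["object", "string"]).columns.tolist()


df = pd.DataFrame(
    {"city": ["Paris", "Lyon"], "tier": ["a", "b"], "sales": [10, 20]}
)

explicit = pick_columns(df, {"columns": ["city", "missing"], "verbose": False})
automatic = pick_columns(df)

print(explicit)   # ['city']
print(automatic)  # ['city', 'tier']
```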

import logging
import polars as pl
import pyarrow as pa
from coded_flows.types import Union, DataFrame, ArrowTable, List, Str
from coded_flows.utils import CodedFlowsLogger

logger = CodedFlowsLogger(name="Convert to Categorical", level=logging.INFO)


def objects_to_categoricals(
    data: Union[DataFrame, ArrowTable], options=None
) -> DataFrame:
    options = options or {}
    verbose = options.get("verbose", True)
    target_columns = options.get("columns", [])
    output_format = "pandas"
    result = None
    try:
        verbose and logger.info("Starting categorical conversion.")
        if output_format == "pandas":
            # Polars frames and Arrow tables both expose to_pandas();
            # anything else is assumed to already be a pandas DataFrame.
            if isinstance(data, (pl.DataFrame, pa.Table)):
                df = data.to_pandas()
            else:
                df = data
            if target_columns:
                cols_to_convert = [c for c in target_columns if c in df.columns]
                # Report which requested names are absent instead of a generic notice.
                missing = [c for c in target_columns if c not in df.columns]
                if missing:
                    verbose and logger.warning(
                        f"Specified columns not found in the DataFrame: {missing}"
                    )
            else:
                cols_to_convert = df.select_dtypes(
                    include=["object", "string"]
                ).columns.tolist()
            if not cols_to_convert:
                verbose and logger.info("No columns found or selected for conversion.")
                result = df
            else:
                verbose and logger.info(
                    f"Converting {len(cols_to_convert)} columns to categorical: {cols_to_convert}"
                )
                conversion_dict = {col: "category" for col in cols_to_convert}
                result = df.astype(conversion_dict)
        else:
            raise ValueError(f"Unsupported output format: {output_format}")
        verbose and logger.info(f"Conversion complete. Final shape: {result.shape}")
    except Exception as e:
        verbose and logger.error(f"Error during categorical conversion: {str(e)}")
        raise
    return result

Brick Info

version v0.1.5

python 3.11, 3.12, 3.13

requirements

pyarrow
polars
pandas
numba>=0.56.0
shap>=0.47.0