ROC Curve
Computes the Receiver Operating Characteristic (ROC) curve. Returns a DataFrame with 'fpr', 'tpr', and 'thresholds'.
Processing
This brick evaluates the performance of a classification model by calculating the Receiver Operating Characteristic (ROC) curve. It compares the ground-truth labels against the model's predicted scores to generate three key outputs: the False Positive Rate (FPR), the True Positive Rate (TPR), and the decision thresholds.
The resulting data allows you to visualize how well your model distinguishes between classes and helps you select the optimal probability threshold for your specific use case.
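For example, a common way to choose a threshold from the brick's output is Youden's J statistic (tpr - fpr). Below is a minimal sketch, assuming the ROC Table output has been captured as a pandas DataFrame named roc_table (the values shown are toy data, not real brick output):

```python
import pandas as pd

# Toy stand-in for the brick's ROC Table output.
roc_table = pd.DataFrame({
    "fpr": [0.0, 0.0, 0.5, 1.0],
    "tpr": [0.0, 0.5, 1.0, 1.0],
    "thresholds": [float("inf"), 0.8, 0.4, 0.1],
})

# Pick the row maximizing Youden's J (tpr - fpr).
best = roc_table.loc[(roc_table["tpr"] - roc_table["fpr"]).idxmax()]
print(f"Best threshold: {best['thresholds']} (TPR={best['tpr']}, FPR={best['fpr']})")
```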
Inputs
- data
- The dataset containing your model's predictions and the actual ground truth labels.
Inputs Types
| Input | Types |
|---|---|
| data | DataFrame, ArrowTable |
You can check the list of supported types here: Available Type Hints.
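All three accepted input types carry the same tabular data; the brick normalizes them to pandas internally. A minimal sketch of building an equivalent toy input with each library (the column names here are illustrative):

```python
import pandas as pd
import polars as pl
import pyarrow as pa

# The same toy dataset in each of the accepted input types.
records = {"target": [0, 1, 1, 0], "score": [0.1, 0.9, 0.7, 0.4]}

pandas_input = pd.DataFrame(records)  # pandas DataFrame
polars_input = pl.DataFrame(records)  # Polars DataFrame
arrow_input = pa.table(records)       # Arrow Table
```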
Outputs
- ROC Table
- A summary table containing the calculated curve points. Each row corresponds to a specific decision threshold and its resulting performance metrics.
The ROC Table output contains the following fields (illustrated in the sketch below):
- fpr: The False Positive Rate (the fraction of negatives incorrectly classified as positive).
- tpr: The True Positive Rate (the fraction of positives correctly classified as positive, also known as Recall).
- thresholds: The specific probability score used as the cut-off to determine the associated FPR and TPR.
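To illustrate these fields, here is a small example calling scikit-learn's roc_curve (the function the brick uses internally) on toy data:

```python
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)         # e.g. [0.  0.  0.5 0.5 1. ]
print(tpr)         # e.g. [0.  0.5 0.5 1.  1. ]
print(thresholds)  # e.g. [inf 0.8 0.4 0.35 0.1] (the first value is a
                   # sentinel; its exact form varies by sklearn version)
```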
Outputs Types
| Output | Types |
|---|---|
| ROC Table | DataFrame |
You can check the list of supported types here: Available Type Hints.
Options
The ROC Curve brick exposes the following configurable options:
- True Label Column
- The name of the column in your input data that contains the actual class labels (the "Ground Truth"). Common names include "target", "actual", or "y_true".
- Score Column
- The name of the column that contains the probability or confidence scores output by your model. Common names include "score", "probability", or "y_score".
- Positive Class Label
- The specific label that represents the "positive" class (the class you are trying to predict, e.g., "Fraud", "Spam", or "Win"). If left empty, the brick defaults to using 1 as the positive label.
- Drop Intermediate
- An optimization setting that reduces the size of the output table. When enabled, it drops suboptimal thresholds that would not appear on a plotted ROC curve, yielding a lighter, easier-to-plot dataset (see the sketch after this list).
- Verbose
- Controls whether the brick writes detailed logs during processing. Useful for debugging if the calculation fails.
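As referenced above, the effect of Drop Intermediate can be seen by calling scikit-learn's roc_curve directly with both settings. A minimal sketch (the exact counts depend on the data):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # random 0/1 labels
y_score = rng.random(size=1000)          # random scores in [0, 1)

_, _, kept = roc_curve(y_true, y_score, drop_intermediate=True)
_, _, full = roc_curve(y_true, y_score, drop_intermediate=False)
print(len(kept), "<=", len(full))  # the pruned curve typically keeps far fewer thresholds
```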
```python
import logging

import pandas as pd
import polars as pl
import pyarrow as pa
from sklearn.metrics import roc_curve

from coded_flows.types import Union, DataFrame, ArrowTable
from coded_flows.utils import CodedFlowsLogger

logger = CodedFlowsLogger(name="ROC Curve", level=logging.INFO)


def compute_roc_curve(data: Union[DataFrame, ArrowTable], options=None) -> DataFrame:
    """Compute the ROC curve, returning a DataFrame with 'fpr', 'tpr', and 'thresholds'."""
    options = options or {}
    verbose = options.get("verbose", True)
    target_col = options.get("target_column", "target")
    score_col = options.get("score_column", "score")
    pos_label_input = str(options.get("pos_label", "")).strip()
    drop_intermediate = options.get("drop_intermediate", False)

    ROC_Table = None
    try:
        if verbose:
            logger.info(
                f"Starting ROC computation. Target: '{target_col}', Score: '{score_col}'"
            )

        # Normalize the input to a pandas DataFrame.
        if isinstance(data, pd.DataFrame):
            df_pandas = data
            input_type = "pandas"
        elif isinstance(data, pl.DataFrame):
            df_pandas = data.to_pandas()
            input_type = "polars"
        elif isinstance(data, pa.Table):
            df_pandas = data.to_pandas()
            input_type = "arrow"
        else:
            raise ValueError(
                "Input data must be a pandas DataFrame, Polars DataFrame, or Arrow Table"
            )
        if verbose:
            logger.info(f"Converted input from {input_type} to pandas.")

        if target_col not in df_pandas.columns:
            raise ValueError(f"Target column '{target_col}' not found in dataset.")
        if score_col not in df_pandas.columns:
            raise ValueError(f"Score column '{score_col}' not found in dataset.")

        y_true = df_pandas[target_col]
        y_score = df_pandas[score_col]

        # Cast the user-provided positive label to match the target column's
        # dtype, so that e.g. the string "1" matches an integer label 1.
        pos_label = None
        if pos_label_input:
            col_dtype = y_true.dtype
            if verbose:
                logger.info(f"Target column dtype detected: {col_dtype}")
            if pd.api.types.is_bool_dtype(col_dtype):
                if pos_label_input.lower() == "true":
                    pos_label = True
                elif pos_label_input.lower() == "false":
                    pos_label = False
                else:
                    # Any other non-empty string coerces to True.
                    pos_label = bool(pos_label_input)
                if verbose:
                    logger.info(f"Cast pos_label to BOOLEAN: {pos_label}")
            elif pd.api.types.is_numeric_dtype(col_dtype):
                try:
                    if float(pos_label_input).is_integer():
                        pos_label = int(float(pos_label_input))
                    else:
                        pos_label = float(pos_label_input)
                    if verbose:
                        logger.info(f"Cast pos_label to NUMERIC: {pos_label}")
                except ValueError:
                    pos_label = pos_label_input
                    if verbose:
                        logger.warning(
                            f"Target is numeric but pos_label '{pos_label_input}' could not be cast. Using string."
                        )
            else:
                pos_label = pos_label_input
                if verbose:
                    logger.info(
                        f"Target is categorical/string. Using STRING pos_label: '{pos_label}'"
                    )
        elif verbose:
            logger.info("No positive label provided. Brick will infer the positive class.")

        # Compute the curve and assemble the output table.
        fpr, tpr, thresholds = roc_curve(
            y_true, y_score, pos_label=pos_label, drop_intermediate=drop_intermediate
        )
        ROC_Table = pd.DataFrame({"fpr": fpr, "tpr": tpr, "thresholds": thresholds})
        if verbose:
            logger.info(f"Computation successful. Result shape: {ROC_Table.shape}")
    except Exception as e:
        if verbose:
            logger.error(f"Error during ROC computation: {e}")
        raise

    return ROC_Table
```
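For reference, a hypothetical invocation of the brick function with toy data (the option keys mirror the settings described above, and the function is assumed to be importable from the listing):

```python
import pandas as pd

toy = pd.DataFrame({
    "target": [0, 1, 1, 0, 1],
    "score": [0.2, 0.9, 0.65, 0.3, 0.8],
})

roc_table = compute_roc_curve(
    toy,
    options={
        "target_column": "target",
        "score_column": "score",
        "pos_label": "1",          # cast to the integer 1 to match the target dtype
        "drop_intermediate": False,
        "verbose": False,
    },
)
print(roc_table)
```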
Brick Info
- shap>=0.47.0
- scikit-learn
- pandas
- pyarrow
- polars[pyarrow]
- numba>=0.56.0