🔧 Data Preparation Package
The Data Preparation package is a collection of tools within Coded Flows designed to help you clean, organize, and transform your data before analysis or export. It provides straightforward operations that work on tables you've loaded into your flow.
What This Package Does
This package offers building blocks for common data handling tasks. You can use these tools to:
- Reorganize your data (sorting, filtering, grouping)
- Modify column content (renaming, splitting, combining values)
- Handle missing or problematic data (filling gaps, removing empty rows)
- Transform data structure (pivoting, unpivoting, transposing)
- Calculate new values based on existing columns
- Join data from multiple sources
- Extract specific parts of dates and times
- Sample or subset your data
You build workflows by connecting these tools together, with each step producing output that feeds into the next.
Important Limitation
This package is designed for data that fits in your computer's local memory. It is not suitable for large datasets or big data processing. If you're working with:
- Large files (typically over several gigabytes)
- Datasets that exceed your available RAM
- Data requiring distributed or cloud-based processing
...you'll need alternative solutions outside this package.
How It Works
Each tool in this package performs a specific operation on your data table. Most tools follow this pattern:
- Input: You provide your data table
- Configuration: You set options that define what the tool should do
- Output: The tool returns your data (modified or filtered) for the next step
Common operations include filtering rows, adding calculated columns, reorganizing structure, or merging tables.
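The input → configuration → output pattern maps onto familiar table operations. As a rough illustration in pandas (the data and column names here are made up; the actual tools are configured visually, not in code):

```python
import pandas as pd

# Input: a small sales table
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "units":  [12, 7, 30, 5],
    "price":  [2.5, 4.0, 2.5, 10.0],
})

# Configuration + step 1: keep only rows where units >= 10
step1 = df[df["units"] >= 10]

# Step 2: add a calculated column based on existing columns
step2 = step1.assign(revenue=step1["units"] * step1["price"])

# Output: each step's result feeds the next step in the flow
```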
Tools Available
Reading & Writing Data
- Read Data File: Load data from CSV, Excel, Parquet, or JSON files
- Data to File: Save your processed data back to a file
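Conceptually, these two tools behave like the standard read/write round trip below (a pandas sketch with a hypothetical file path; the real tools also handle Excel, Parquet, and JSON):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# "Data to File": save processed data (CSV here; Excel, Parquet,
# and JSON have analogous writers)
path = os.path.join(tempfile.gettempdir(), "example_output.csv")
df.to_csv(path, index=False)

# "Read Data File": load it back into a fresh table
loaded = pd.read_csv(path)
```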
Organizing Data
- Sort Data: Arrange rows by one or more columns
- Keep Columns: Select only the columns you need
- Delete Columns: Remove unwanted columns
- Rename Columns: Change column names
- Move Columns: Reorder your columns
- Remove Empty Rows: Delete rows with missing values
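Chained together, these organizing tools correspond to operations like the following (an illustrative pandas sketch with invented data, not the package's API):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ana", "Bo", "Cy", None],
    "score": [88, 95, None, 70],
    "notes": ["x", "y", "z", "w"],
})

# Remove Empty Rows: drop rows containing any missing value
clean = df.dropna()

# Delete Columns: remove an unwanted column
clean = clean.drop(columns=["notes"])

# Rename Columns: change a column name
clean = clean.rename(columns={"score": "exam_score"})

# Sort Data: arrange rows by a column, highest first
clean = clean.sort_values("exam_score", ascending=False)
```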
Filtering Data
- Filter Rows by Values: Keep or remove rows based on text content
- Filter Rows by Numeric Range: Keep or remove rows based on number ranges
- Filter Rows by Date: Keep or remove rows based on date criteria
- Filter Rows by Boolean Value: Keep or remove rows based on true/false columns
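Each of the four filter tools keeps (or removes) rows by a different column type. In pandas terms, the operations look roughly like this (example data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Paris", "Lyon", "Paris", "Nice"],
    "amount": [120, 45, 300, 80],
    "when":   pd.to_datetime(["2024-01-05", "2024-02-10",
                              "2024-03-15", "2024-04-20"]),
    "active": [True, False, True, True],
})

# Filter Rows by Values: keep rows whose text matches
by_value = df[df["city"] == "Paris"]

# Filter Rows by Numeric Range: keep amounts between 50 and 200
by_range = df[df["amount"].between(50, 200)]

# Filter Rows by Date: keep rows on or after a cutoff date
by_date = df[df["when"] >= "2024-03-01"]

# Filter Rows by Boolean Value: keep rows where the flag is true
by_flag = df[df["active"]]
```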
Grouping & Aggregating
- Group Data: Combine rows into groups and calculate totals, averages, counts, etc.
- Top N: Extract top or bottom N rows (useful for rankings)
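The grouping and Top N operations behave much like this pandas sketch (column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "team":  ["A", "A", "B", "B", "B"],
    "sales": [10, 20, 5, 15, 25],
})

# Group Data: combine rows by team and compute a total and a count
summary = (df.groupby("team", as_index=False)
             .agg(total=("sales", "sum"), orders=("sales", "count")))

# Top N: extract the 2 rows with the highest sales
top2 = df.nlargest(2, "sales")
```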
Handling Missing Data
- Fill Empty Cells: Replace missing values with a fixed value
- Fill Forward/Backward: Fill gaps using the previous or next value
- Impute Missing Values: Replace gaps with statistical measures (mean, median, mode)
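The three strategies differ in what they put into the gaps: a constant, a neighboring value, or a statistic. A minimal pandas illustration (the series is invented):

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, None])

# Fill Empty Cells: replace every gap with a fixed value
fixed = s.fillna(0)

# Fill Forward: carry the previous value into each gap
forward = s.ffill()

# Impute Missing Values: replace gaps with a statistic (mean here;
# median and mode work the same way)
imputed = s.fillna(s.mean())
```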
Text Operations
- Concatenate Columns: Combine multiple columns into one with a separator
- Split Column: Break a column into multiple columns using a delimiter
- Find and Replace Strings: Search for text patterns and replace them
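In table terms, these three text tools do roughly the following (a pandas sketch; the names and domains are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["Ada", "Alan"],
    "last":  ["Lovelace", "Turing"],
    "email": ["ada@old-domain.org", "alan@old-domain.org"],
})

# Concatenate Columns: join two columns with a separator
df["full_name"] = df["first"] + " " + df["last"]

# Split Column: break one column into several on a delimiter
df[["user", "domain"]] = df["email"].str.split("@", expand=True)

# Find and Replace Strings: substitute a text pattern
df["email"] = df["email"].str.replace("old-domain.org",
                                      "new-domain.org", regex=False)
```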
Column Operations
- Copy Column: Duplicate a column
- Unfold Column: Convert categories into separate binary columns (one-hot encoding)
- Negate Boolean: Flip true/false values
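One-hot encoding ("unfolding") turns each category into its own binary column; the other two tools are simple column transforms. A hedged pandas illustration with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "color":    ["red", "blue", "red"],
    "in_stock": [True, False, True],
})

# Copy Column: duplicate a column under a new name
df["color_backup"] = df["color"]

# Unfold Column: one-hot encode categories into binary columns,
# one per distinct value
onehot = pd.get_dummies(df["color"], prefix="color")

# Negate Boolean: flip true/false values
df["out_of_stock"] = ~df["in_stock"]
```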
Numerical Operations
- Numerical Combinations: Add, subtract, multiply, or divide two columns
- Round Numbers: Round values to decimal places or significant digits
- Force Numerical Range: Clip values to stay within min/max bounds
- Compute Average: Calculate row averages across multiple columns
- Bin Numbers: Group continuous numbers into ranges or categories
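The numerical tools map onto column arithmetic and standard transforms like these (a pandas sketch; the quarter columns and bin edges are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "q1": [10.0, 55.0, 90.0],
    "q2": [14.6, 61.2, 99.9],
})

# Numerical Combinations: derive a column from two others
df["total"] = df["q1"] + df["q2"]

# Round Numbers: round to a fixed number of decimal places
df["q2_rounded"] = df["q2"].round(0)

# Force Numerical Range: clip values to stay within [0, 95]
df["q2_clipped"] = df["q2"].clip(0, 95)

# Compute Average: row-wise mean across several columns
df["avg"] = df[["q1", "q2"]].mean(axis=1)

# Bin Numbers: group continuous values into labeled ranges
df["band"] = pd.cut(df["q1"], bins=[0, 50, 100], labels=["low", "high"])
```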
Date & Time
- Dates from Strings: Convert text into date values with format detection
- Extract Date Elements: Pull out year, month, day, hour, etc. into separate columns
- Date Difference: Calculate the time between two dates
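In pandas terms, the three date tools correspond to conversion, element extraction, and subtraction (the dates below are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "start": ["2024-01-15", "2024-03-01"],
    "end":   ["2024-02-01", "2024-03-10"],
})

# Dates from Strings: convert text into real date values
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])

# Extract Date Elements: pull out parts into separate columns
df["start_year"] = df["start"].dt.year
df["start_month"] = df["start"].dt.month

# Date Difference: number of days between two dates
df["duration_days"] = (df["end"] - df["start"]).dt.days
```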
Data Structure Changes
- Fold Multiple Columns: Convert wide format to long format (unpivot)
- Transpose Data: Flip rows and columns
- Pivot Table: Convert long format to wide format with aggregation
- Stack Data: Combine multiple tables vertically
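These reshaping tools are the classic wide/long conversions. A pandas sketch of the same ideas (store and month columns are made up):

```python
import pandas as pd

wide = pd.DataFrame({
    "store": ["S1", "S2"],
    "jan":   [100, 80],
    "feb":   [110, 90],
})

# Fold Multiple Columns (unpivot): wide format -> long format
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Pivot Table: long format -> wide format, aggregating with sum
back = long.pivot_table(index="store", columns="month",
                        values="sales", aggfunc="sum").reset_index()

# Transpose Data: flip rows and columns
flipped = wide.set_index("store").T

# Stack Data: combine two tables vertically
stacked = pd.concat([wide, wide], ignore_index=True)
```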
Joining & Combining
- Join Data: Merge two tables based on matching columns
- Distinct Rows: Remove duplicate rows
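A join matches rows across tables on a shared key column; distinct removes exact duplicates. Roughly, in pandas (tables are invented):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2],
                       "amount": [50, 30, 30]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "name": ["Ana", "Bo"]})

# Join Data: merge two tables on a matching key column
joined = orders.merge(customers, on="customer_id", how="left")

# Distinct Rows: drop exact duplicate rows
unique_orders = orders.drop_duplicates()
```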
Analysis & Reporting
- Data Schema: Inspect your data structure and get statistics
- Window Computation: Perform advanced calculations like moving averages or rankings
- Sampling: Extract a representative subset of rows
- Flag Rows by Numeric Range: Mark rows that meet criteria (creates a new column with 1 or empty)
- Flag Rows by Values: Mark rows matching specific text (creates a new column with 1 or empty)
- Flag Rows by Date: Mark rows in specific date ranges (creates a new column with 1 or empty)
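Window computations, sampling, and row flagging can be pictured like this pandas sketch (traffic data and the threshold are invented; flag columns here use 0/1 rather than the package's 1-or-empty convention):

```python
import pandas as pd

df = pd.DataFrame({"day": range(1, 7),
                   "visits": [10, 12, 9, 20, 18, 25]})

# Window Computation: 3-day moving average of visits
df["visits_ma3"] = df["visits"].rolling(window=3).mean()

# Sampling: take a reproducible random subset of rows
sample = df.sample(n=3, random_state=42)

# Flag Rows by Numeric Range: mark rows that meet a criterion
df["high_traffic"] = (df["visits"] > 15).astype(int)
```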
Filling & Creating Values
- Fill All Values: Replace every value in a column with the same value
Output Formats
Most tools allow you to choose the output format for your final data:
- Pandas DataFrame: Standard table format (default)
- Polars DataFrame: Faster alternative for large operations
- Arrow Table: Efficient columnar format
For most workflows, the default option works fine.
Safe Mode
Many tools include a "Safe Mode" option. When enabled, the tool will skip over problems (like missing columns) and continue working instead of stopping. This is useful if you're building flows that might receive slightly different data shapes. When disabled, the tool will report errors, which helps catch mistakes in your setup.
Tips for Building Workflows
- Start simple with one operation at a time
- Test your flow with a small sample of data first
- Use the Data Schema tool to inspect your data at different stages
- Break complex transformations into multiple smaller steps
- Check output at key stages to verify the results make sense
Performance Notes
These tools use DuckDB, a fast in-process analytical database engine, to perform operations efficiently. While fast for local data, keep in mind:
- All data must fit in your computer's RAM
- Very large tables may slow down your computer or cause it to run out of memory
- Simple operations (filtering, sorting) are generally faster than complex aggregations
If you regularly hit these limits, consider working with smaller date ranges, sampling larger datasets, or archiving older data.