Data Preparation

Clean, organize, and transform your data.

🔧 Data Preparation Package

The Data Preparation package is a collection of tools within Coded Flows designed to help you clean, organize, and transform your data before analysis or export. It provides straightforward operations that work on tables you've loaded into your flow.

What This Package Does

This package offers building blocks for common data handling tasks. You can use these tools to:

  • Reorganize your data (sorting, filtering, grouping)
  • Modify column content (renaming, splitting, combining values)
  • Handle missing or problematic data (filling gaps, removing empty rows)
  • Transform data structure (pivoting, unpivoting, transposing)
  • Calculate new values based on existing columns
  • Join data from multiple sources
  • Extract specific parts of dates and times
  • Sample or subset your data

You build workflows by connecting these tools together, with each step producing output that feeds into the next.

Important Limitation

This package is designed for data that fits in your computer's memory. It is not intended for big-data processing. If you're working with:

  • Large files (typically over several gigabytes)
  • Datasets that exceed your available RAM
  • Data requiring distributed or cloud-based processing

...you'll need alternative solutions outside this package.

How It Works

Each tool in this package performs a specific operation on your data table. Most tools follow this pattern:

  1. Input: You provide your data table
  2. Configuration: You set options that define what the tool should do
  3. Output: The tool returns your data (modified or filtered) for the next step

Common operations include filtering rows, adding calculated columns, reorganizing structure, or merging tables.
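
In the flow editor, each tool is a node you configure rather than code you write. As a rough mental model, though, every step behaves like a small function over a table. The pandas sketch below (the function name and sample columns are invented for illustration, not the package's API) shows the input → configuration → output pattern and how one step feeds the next:

```python
# Minimal sketch of the input -> configuration -> output pattern.
import pandas as pd

def filter_rows_by_values(table: pd.DataFrame, column: str, keep: list[str]) -> pd.DataFrame:
    """Input: a table. Configuration: a column and the values to keep. Output: the filtered table."""
    return table[table[column].isin(keep)]

orders = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [120, 80, 200, 50],
})

# Each step's output feeds the next step's input.
step1 = filter_rows_by_values(orders, "region", ["North"])
step2 = step1.sort_values("amount", ascending=False)
print(step2)
```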

Tools Available

Reading & Writing Data

  • Read Data File: Load data from CSV, Excel, Parquet, or JSON files
  • Data to File: Save your processed data back to a file
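
In pandas terms, these two tools correspond roughly to the snippet below (the file names are placeholders):

```python
import pandas as pd

df = pd.read_csv("sales.csv")         # Read Data File (CSV, Excel, Parquet, or JSON)
df.to_parquet("sales_clean.parquet")  # Data to File: save the processed result
```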

Organizing Data

  • Sort Data: Arrange rows by one or more columns
  • Keep Columns: Select only the columns you need
  • Delete Columns: Remove unwanted columns
  • Rename Columns: Change column names
  • Move Columns: Reorder your columns
  • Remove Empty Rows: Delete rows with missing values
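
Sketched with pandas for orientation (not the package's own syntax), a few of these organizing steps chained together might look like:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", None, "Alan"], "score": [95, None, 88], "notes": ["", "", ""]})

tidy = (
    df.sort_values("score", ascending=False)      # Sort Data
      .loc[:, ["name", "score"]]                  # Keep Columns
      .rename(columns={"score": "final_score"})   # Rename Columns
      .dropna(how="any")                          # Remove Empty Rows
)
```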

Filtering Data

  • Filter Rows by Values: Keep or remove rows based on text content
  • Filter Rows by Numeric Range: Keep or remove rows based on number ranges
  • Filter Rows by Date: Keep or remove rows based on date criteria
  • Filter Rows by Boolean Value: Keep or remove rows based on true/false columns
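
The four filter tools map roughly onto the following pandas expressions (sample data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Nice"],
    "sales": [150, 90, 210],
    "date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-15"]),
    "active": [True, False, True],
})

by_values = df[df["city"].isin(["Paris", "Nice"])]   # Filter Rows by Values
by_range  = df[df["sales"].between(100, 250)]        # Filter Rows by Numeric Range
by_date   = df[df["date"] >= "2024-02-01"]           # Filter Rows by Date
by_bool   = df[df["active"]]                         # Filter Rows by Boolean Value
```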

Grouping & Aggregating

  • Group Data: Combine rows into groups and calculate totals, averages, counts, etc.
  • Top N: Extract top or bottom N rows (useful for rankings)
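
As a rough pandas equivalent, grouping and Top N might look like this:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South", "South"],
                      "amount": [120, 200, 80, 50]})

# Group Data: totals, averages, and counts per region.
summary = sales.groupby("region", as_index=False).agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
    count=("amount", "count"),
)

# Top N: the two largest rows by amount.
top2 = sales.nlargest(2, "amount")
```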

Handling Missing Data

  • Fill Empty Cells: Replace missing values with a fixed value
  • Fill Forward/Backward: Fill gaps using the previous or next value
  • Impute Missing Values: Replace gaps with statistical measures (mean, median, mode)
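
The three strategies differ mainly in where the replacement value comes from; in pandas terms:

```python
import pandas as pd

temps = pd.Series([21.0, None, 23.5, None, 22.0], name="temperature")

fixed   = temps.fillna(0)             # Fill Empty Cells: one fixed value
forward = temps.ffill()               # Fill Forward: reuse the previous value
imputed = temps.fillna(temps.mean())  # Impute Missing Values: a statistical measure (here, the mean)
```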

Text Operations

  • Concatenate Columns: Combine multiple columns into one with a separator
  • Split Column: Break a column into multiple columns using a delimiter
  • Find and Replace Strings: Search for text patterns and replace them
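
A quick pandas illustration of the same three text operations (column names invented):

```python
import pandas as pd

people = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})

people["full_name"] = people["first"] + " " + people["last"]      # Concatenate Columns
parts = people["full_name"].str.split(" ", expand=True)           # Split Column on a delimiter
people["last"] = people["last"].str.replace("Turing", "TURING")   # Find and Replace Strings
```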

Column Operations

  • Copy Column: Duplicate a column
  • Unfold Column: Convert categories into separate binary columns (one-hot encoding)
  • Negate Boolean: Flip true/false values
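
Roughly equivalent pandas operations, for reference:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "in_stock": [True, False, True]})

df["color_copy"] = df["color"]                          # Copy Column
dummies = pd.get_dummies(df["color"], prefix="color")   # Unfold Column (one-hot encoding)
df["out_of_stock"] = ~df["in_stock"]                    # Negate Boolean
```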

Numerical Operations

  • Numerical Combinations: Add, subtract, multiply, or divide two columns
  • Round Numbers: Round values to decimal places or significant digits
  • Force Numerical Range: Clip values to stay within min/max bounds
  • Compute Average: Calculate row averages across multiple columns
  • Bin Numbers: Group continuous numbers into ranges or categories
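
In pandas, the same numerical steps might be written as follows (the sample columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [19.987, 45.5, 120.0], "qty": [2, 1, 3]})

df["total"] = df["price"] * df["qty"]                         # Numerical Combinations
df["price_rounded"] = df["price"].round(2)                    # Round Numbers
df["price_clipped"] = df["price"].clip(lower=20, upper=100)   # Force Numerical Range
df["row_avg"] = df[["price", "qty"]].mean(axis=1)             # Compute Average across columns
df["price_band"] = pd.cut(df["price"], bins=[0, 50, 200],
                          labels=["low", "high"])             # Bin Numbers
```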

Date & Time

  • Dates from Strings: Convert text into date values with format detection
  • Extract Date Elements: Pull out year, month, day, hour, etc. into separate columns
  • Date Difference: Calculate the time between two dates
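
A pandas sketch of the three date tools:

```python
import pandas as pd

df = pd.DataFrame({"start": ["2024-01-01", "2024-02-15"],
                   "end":   ["2024-01-10", "2024-03-01"]})

df["start"] = pd.to_datetime(df["start"])                 # Dates from Strings
df["end"] = pd.to_datetime(df["end"])

df["year"] = df["start"].dt.year                          # Extract Date Elements
df["month"] = df["start"].dt.month

df["days_between"] = (df["end"] - df["start"]).dt.days    # Date Difference
```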

Data Structure Changes

  • Fold Multiple Columns: Convert wide format to long format (unpivot)
  • Transpose Data: Flip rows and columns
  • Pivot Table: Convert long format to wide format with aggregation
  • Stack Data: Combine multiple tables vertically
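
If you are familiar with pandas, these reshaping tools correspond roughly to melt, pivot_table, transpose, and concat:

```python
import pandas as pd

wide = pd.DataFrame({"product": ["A", "B"], "q1": [10, 7], "q2": [12, 9]})

# Fold Multiple Columns (unpivot): wide -> long.
long = wide.melt(id_vars="product", var_name="quarter", value_name="units")

# Pivot Table: long -> wide, with aggregation.
back_to_wide = long.pivot_table(index="product", columns="quarter",
                                values="units", aggfunc="sum")

flipped = wide.set_index("product").T                   # Transpose Data
stacked = pd.concat([wide, wide], ignore_index=True)    # Stack Data (vertical combine)
```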

Joining & Combining

  • Join Data: Merge two tables based on matching columns
  • Distinct Rows: Remove duplicate rows
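
A minimal pandas illustration of joining and deduplicating (tables invented for the example):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [50, 75, 75]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Alan"]})

joined = orders.merge(customers, on="customer_id", how="left")   # Join Data on matching columns
deduped = orders.drop_duplicates()                               # Distinct Rows
```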

Analysis & Reporting

  • Data Schema: Inspect your data structure and get statistics
  • Window Computation: Perform advanced calculations like moving averages or rankings
  • Sampling: Extract a representative subset of rows
  • Flag Rows by Numeric Range: Mark rows whose values fall within a numeric range (adds a new column containing 1 for matches, empty otherwise)
  • Flag Rows by Values: Mark rows matching specific text (adds a new column containing 1 for matches, empty otherwise)
  • Flag Rows by Date: Mark rows within specific date ranges (adds a new column containing 1 for matches, empty otherwise)
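
For orientation, a window computation, a sample, and a flag column might look like this in pandas:

```python
import pandas as pd

df = pd.DataFrame({"day": range(1, 8), "visits": [10, 12, 9, 15, 20, 18, 22]})

# Window Computation: a 3-day moving average.
df["visits_ma3"] = df["visits"].rolling(window=3).mean()

# Sampling: a random 50% subset of rows.
sample = df.sample(frac=0.5, random_state=42)

# Flag Rows by Numeric Range: 1 for matches, empty otherwise.
df["high_traffic"] = df["visits"].between(15, 25).map({True: 1, False: None})
```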

Filling & Creating Values

  • Fill All Values: Replace every value in a column with the same value

Output Formats

Most tools allow you to choose the output format for your final data:

  • Pandas DataFrame: Standard table format (default)
  • Polars DataFrame: Faster alternative for large operations
  • Arrow Table: Efficient columnar format

For most workflows, the default option works fine.
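
The three formats are interchangeable table representations; assuming pandas, Polars, and PyArrow are installed, converting between them looks like this:

```python
import pandas as pd
import polars as pl
import pyarrow as pa

pdf = pd.DataFrame({"x": [1, 2, 3]})   # Pandas DataFrame (the default)
pldf = pl.from_pandas(pdf)             # Polars DataFrame
tbl = pa.Table.from_pandas(pdf)        # Arrow Table
```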

Safe Mode

Many tools include a "Safe Mode" option. When enabled, the tool will skip over problems (like missing columns) and continue working instead of stopping. This is useful if you're building flows that might receive slightly different data shapes. When disabled, the tool will report errors, which helps catch mistakes in your setup.
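
Conceptually, Safe Mode works like the sketch below (the function and option names are illustrative, not the package's actual API):

```python
import pandas as pd

def delete_columns(table: pd.DataFrame, columns: list[str], safe_mode: bool = True) -> pd.DataFrame:
    """Illustrative sketch of Safe Mode behaviour for a column-deletion step."""
    missing = [c for c in columns if c not in table.columns]
    if missing and not safe_mode:
        raise KeyError(f"Columns not found: {missing}")   # strict: stop and report the error
    # Safe Mode: skip the missing columns and keep going.
    return table.drop(columns=[c for c in columns if c in table.columns])
```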

Tips for Building Workflows

  • Start simple with one operation at a time
  • Test your flow with a small sample of data first
  • Use the Data Schema tool to inspect your data at different stages
  • Break complex transformations into multiple smaller steps
  • Check output at key stages to verify the results make sense

Performance Notes

These tools use DuckDB, an in-process analytical database engine, to perform operations efficiently on data held in memory. While fast for local data, keep in mind:

  • All data must fit in your computer's RAM
  • Very large tables may slow down your computer or cause it to run out of memory
  • Simple operations (filtering, sorting) are generally faster than complex aggregations

If you regularly hit these limits, consider working with smaller date ranges, sampling larger datasets, or archiving older data.
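
For context, DuckDB can query a DataFrame that is already sitting in memory, which is why everything has to fit in RAM; a minimal illustration:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "North"], "amount": [120, 80, 200]})

# DuckDB scans the in-memory DataFrame directly by its variable name.
result = duckdb.sql("SELECT region, SUM(amount) AS total FROM df GROUP BY region").df()
```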

Package Info

Version: v0.1.3
Python: 3.10, 3.11, 3.12, 3.13