🔧 Data Preparation Package
The Data Preparation package is a collection of tools within Coded Flows designed to help you clean, organize, and transform your data before analysis or export. It provides straightforward operations that work on tables you've loaded into your flow.
What This Package Does
This package offers building blocks for common data handling tasks. You can use these tools to:
- Reorganize your data (sorting, filtering, grouping)
- Modify column content (renaming, splitting, combining values)
- Handle missing or problematic data (filling gaps, removing empty rows)
- Transform data structure (pivoting, unpivoting, transposing)
- Calculate new values based on existing columns
- Join data from multiple sources
- Extract specific parts of dates and times
- Sample or subset your data
You build workflows by connecting these tools together, with each step producing output that feeds into the next.
Important Limitation
This package is designed for data that fits in your computer's local memory. It is not suitable for large datasets or big data processing. If you're working with:
- Large files (typically over several gigabytes)
- Datasets that exceed your available RAM
- Data requiring distributed or cloud-based processing
...you'll need alternative solutions outside this package.
How It Works
Each tool in this package performs a specific operation on your data table. Most tools follow this pattern:
- Input: You provide your data table
- Configuration: You set options that define what the tool should do
- Output: The tool returns your data (modified or filtered) for the next step
Common operations include filtering rows, adding calculated columns, reorganizing structure, or merging tables.
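The input → configuration → output pattern maps onto familiar table operations. As a rough illustration in pandas (the data and column names here are made up; the actual tools are configured visually, not in code):

```python
import pandas as pd

# Input: a small sales table
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "units":  [12, 7, 30, 5],
    "price":  [2.5, 4.0, 2.5, 10.0],
})

# Configuration + step 1: keep only rows where units >= 10
step1 = df[df["units"] >= 10]

# Step 2: add a calculated column based on existing columns
step2 = step1.assign(revenue=step1["units"] * step1["price"])

# Output: each step's result feeds the next step in the flow
```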
Tools Available
Reading & Writing Data
- Read Data File: Load data from CSV, Excel, Parquet, or JSON files
- Data to File: Save your processed data back to a file
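Conceptually, these two tools behave like the standard read/write round trip below (a pandas sketch with a hypothetical file path; the real tools also handle Excel, Parquet, and JSON):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# "Data to File": save processed data (CSV here; Excel, Parquet,
# and JSON have analogous writers)
path = os.path.join(tempfile.gettempdir(), "example_output.csv")
df.to_csv(path, index=False)

# "Read Data File": load it back into a fresh table
loaded = pd.read_csv(path)
```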
Organizing Data
- Sort Data: Arrange rows by one or more columns
- Keep Columns: Select only the columns you need
- Delete Columns: Remove unwanted columns
- Rename Columns: Change column names
- Move Columns: Reorder your columns
- Remove Empty Rows: Delete rows with missing values
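Chained together, these organizing tools correspond to operations like the following (an illustrative pandas sketch with invented data, not the package's API):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ana", "Bo", "Cy", None],
    "score": [88, 95, None, 70],
    "notes": ["x", "y", "z", "w"],
})

# Remove Empty Rows: drop rows containing any missing value
clean = df.dropna()

# Delete Columns: remove an unwanted column
clean = clean.drop(columns=["notes"])

# Rename Columns: change a column name
clean = clean.rename(columns={"score": "exam_score"})

# Sort Data: arrange rows by a column, highest first
clean = clean.sort_values("exam_score", ascending=False)
```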
Filtering Data
- Filter Rows by Values: Keep or remove rows based on text content
- Filter Rows by Numeric Range: Keep or remove rows based on number ranges
- Filter Rows by Date: Keep or remove rows based on date criteria
- Filter Rows by Boolean Value: Keep or remove rows based on true/false columns
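Each of the four filter tools keeps (or removes) rows by a different column type. In pandas terms, the operations look roughly like this (example data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Paris", "Lyon", "Paris", "Nice"],
    "amount": [120, 45, 300, 80],
    "when":   pd.to_datetime(["2024-01-05", "2024-02-10",
                              "2024-03-15", "2024-04-20"]),
    "active": [True, False, True, True],
})

# Filter Rows by Values: keep rows whose text matches
by_value = df[df["city"] == "Paris"]

# Filter Rows by Numeric Range: keep amounts between 50 and 200
by_range = df[df["amount"].between(50, 200)]

# Filter Rows by Date: keep rows on or after a cutoff date
by_date = df[df["when"] >= "2024-03-01"]

# Filter Rows by Boolean Value: keep rows where the flag is true
by_flag = df[df["active"]]
```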
Grouping & Aggregating
- Group Data: Combine rows into groups and calculate totals, averages, counts, etc.
- Top N: Extract top or bottom N rows (useful for rankings)
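The grouping and Top N operations behave much like this pandas sketch (column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "team":  ["A", "A", "B", "B", "B"],
    "sales": [10, 20, 5, 15, 25],
})

# Group Data: combine rows by team and compute a total and a count
summary = (df.groupby("team", as_index=False)
             .agg(total=("sales", "sum"), orders=("sales", "count")))

# Top N: extract the 2 rows with the highest sales
top2 = df.nlargest(2, "sales")
```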
Handling Missing Data
- Fill Empty Cells: Replace missing values with a fixed value
- Fill Forward/Backward: Fill gaps using the previous or next value
- Impute Missing Values: Replace gaps with statistical measures (mean, median, mode)
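The three strategies differ in what they put into the gaps: a constant, a neighboring value, or a statistic. A minimal pandas illustration (the series is invented):

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, None])

# Fill Empty Cells: replace every gap with a fixed value
fixed = s.fillna(0)

# Fill Forward: carry the previous value into each gap
forward = s.ffill()

# Impute Missing Values: replace gaps with a statistic (mean here;
# median and mode work the same way)
imputed = s.fillna(s.mean())
```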
Text Operations
- Concatenate Columns: Combine multiple columns into one with a separator
- Split Column: Break a column into multiple columns using a delimiter
- Find and Replace Strings: Search for text patterns and replace them
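In table terms, these three text tools do roughly the following (a pandas sketch; the names and domains are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["Ada", "Alan"],
    "last":  ["Lovelace", "Turing"],
    "email": ["ada@old-domain.org", "alan@old-domain.org"],
})

# Concatenate Columns: join two columns with a separator
df["full_name"] = df["first"] + " " + df["last"]

# Split Column: break one column into several on a delimiter
df[["user", "domain"]] = df["email"].str.split("@", expand=True)

# Find and Replace Strings: substitute a text pattern
df["email"] = df["email"].str.replace("old-domain.org",
                                      "new-domain.org", regex=False)
```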
Column Operations
- Copy Column: Duplicate a column
- Unfold Column: Convert categories into separate binary columns (one-hot encoding)
- Negate Boolean: Flip true/false values
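One-hot encoding ("unfolding") turns each category into its own binary column; the other two tools are simple column transforms. A hedged pandas illustration with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "color":    ["red", "blue", "red"],
    "in_stock": [True, False, True],
})

# Copy Column: duplicate a column under a new name
df["color_backup"] = df["color"]

# Unfold Column: one-hot encode categories into binary columns,
# one per distinct value
onehot = pd.get_dummies(df["color"], prefix="color")

# Negate Boolean: flip true/false values
df["out_of_stock"] = ~df["in_stock"]
```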
Numerical Operations
- Numerical Combinations: Add, subtract, multiply, or divide two columns
- Round Numbers: Round values to decimal places or significant digits
- Force Numerical Range: Clip values to stay within min/max bounds
- Compute Average: Calculate row averages across multiple columns
- Bin Numbers: Group continuous numbers into ranges or categories
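The numerical tools map onto column arithmetic and standard transforms like these (a pandas sketch; the quarter columns and bin edges are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "q1": [10.0, 55.0, 90.0],
    "q2": [14.6, 61.2, 99.9],
})

# Numerical Combinations: derive a column from two others
df["total"] = df["q1"] + df["q2"]

# Round Numbers: round to a fixed number of decimal places
df["q2_rounded"] = df["q2"].round(0)

# Force Numerical Range: clip values to stay within [0, 95]
df["q2_clipped"] = df["q2"].clip(0, 95)

# Compute Average: row-wise mean across several columns
df["avg"] = df[["q1", "q2"]].mean(axis=1)

# Bin Numbers: group continuous values into labeled ranges
df["band"] = pd.cut(df["q1"], bins=[0, 50, 100], labels=["low", "high"])
```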
Date & Time
- Dates from Strings: Convert text into date values with format detection
- Extract Date Elements: Pull out year, month, day, hour, etc. into separate columns
- Date Difference: Calculate the time between two dates
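In pandas terms, the three date tools correspond to conversion, element extraction, and subtraction (the dates below are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "start": ["2024-01-15", "2024-03-01"],
    "end":   ["2024-02-01", "2024-03-10"],
})

# Dates from Strings: convert text into real date values
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])

# Extract Date Elements: pull out parts into separate columns
df["start_year"] = df["start"].dt.year
df["start_month"] = df["start"].dt.month

# Date Difference: number of days between two dates
df["duration_days"] = (df["end"] - df["start"]).dt.days
```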
Data Structure Changes
- Fold Multiple Columns: Convert wide format to long format (unpivot)
- Transpose Data: Flip rows and columns
- Pivot Table: Convert long format to wide format with aggregation
- Stack Data: Combine multiple tables vertically
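These reshaping tools are the classic wide/long conversions. A pandas sketch of the same ideas (store and month columns are made up):

```python
import pandas as pd

wide = pd.DataFrame({
    "store": ["S1", "S2"],
    "jan":   [100, 80],
    "feb":   [110, 90],
})

# Fold Multiple Columns (unpivot): wide format -> long format
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Pivot Table: long format -> wide format, aggregating with sum
back = long.pivot_table(index="store", columns="month",
                        values="sales", aggfunc="sum").reset_index()

# Transpose Data: flip rows and columns
flipped = wide.set_index("store").T

# Stack Data: combine two tables vertically
stacked = pd.concat([wide, wide], ignore_index=True)
```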
Joining & Combining
- Join Data: Merge two tables based on matching columns
- Distinct Rows: Remove duplicate rows
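A join matches rows across tables on a shared key column; distinct removes exact duplicates. Roughly, in pandas (tables are invented):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2],
                       "amount": [50, 30, 30]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "name": ["Ana", "Bo"]})

# Join Data: merge two tables on a matching key column
joined = orders.merge(customers, on="customer_id", how="left")

# Distinct Rows: drop exact duplicate rows
unique_orders = orders.drop_duplicates()
```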
Analysis & Reporting
- Data Schema: Inspect your data structure and get statistics
- Window Computation: Perform advanced calculations like moving averages or rankings
- Sampling: Extract a representative subset of rows
- Flag Rows by Numeric Range: Mark rows that meet criteria (creates a new column with 1 or empty)
- Flag Rows by Values: Mark rows matching specific text (creates a new column with 1 or empty)
- Flag Rows by Date: Mark rows in specific date ranges (creates a new column with 1 or empty)
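Window computations, sampling, and row flagging can be pictured like this pandas sketch (traffic data and the threshold are invented; flag columns here use 0/1 rather than the package's 1-or-empty convention):

```python
import pandas as pd

df = pd.DataFrame({"day": range(1, 7),
                   "visits": [10, 12, 9, 20, 18, 25]})

# Window Computation: 3-day moving average of visits
df["visits_ma3"] = df["visits"].rolling(window=3).mean()

# Sampling: take a reproducible random subset of rows
sample = df.sample(n=3, random_state=42)

# Flag Rows by Numeric Range: mark rows that meet a criterion
df["high_traffic"] = (df["visits"] > 15).astype(int)
```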
Filling & Creating Values
- Fill All Values: Replace every value in a column with the same value
Output Formats
Most tools allow you to choose the output format for your final data:
- Pandas DataFrame: Standard table format (default)
- Polars DataFrame: Faster alternative for large operations
- Arrow Table: Efficient columnar format
For most workflows, the default option works fine.
Safe Mode
Many tools include a "Safe Mode" option. When enabled, the tool will skip over problems (like missing columns) and continue working instead of stopping. This is useful if you're building flows that might receive slightly different data shapes. When disabled, the tool will report errors, which helps catch mistakes in your setup.
Tips for Building Workflows
- Start simple with one operation at a time
- Test your flow with a small sample of data first
- Use the Data Schema tool to inspect your data at different stages
- Break complex transformations into multiple smaller steps
- Check output at key stages to verify the results make sense
Performance Notes
These tools use DuckDB, a fast in-process analytical database engine, to perform operations efficiently. While fast for local data, keep in mind:
- All data must fit in your computer's RAM
- Very large tables may slow down your computer or cause it to run out of memory
- Simple operations (filtering, sorting) are generally faster than complex aggregations
If you regularly hit these limits, consider working with smaller date ranges, sampling larger datasets, or archiving older data.