Tutorials · 9 min read

AI-Powered Data Preparation with the Dataset Builder

CorePlexML Team

Introduction

Data preparation is the most time-consuming phase of any machine learning project. Industry surveys consistently report that data scientists spend 60-80% of their time cleaning, transforming, and engineering features before a single model is trained. The work is tedious, error-prone, and often repeated across projects with only minor variations.

CorePlexML's Dataset Builder reimagines this process as a conversation. Instead of clicking through a rigid wizard or writing ad-hoc pandas scripts, you describe what you want in natural language and the Builder generates, executes, and validates the transformation code for you. The underlying LLM understands data preparation patterns and translates your intent into executable Python. Every generated script is shown to you for full transparency.

This tutorial walks through a complete data preparation workflow using a customer churn dataset as our example.

How the Dataset Builder Works

The Dataset Builder is a multi-step conversational agent powered by an LLM. When you start a new session, the Builder profiles your data and guides you through a structured pipeline: profiling, cleaning, column management, feature engineering, and export. At each step, the Builder asks targeted questions about your data and preferences, generates a Python transformation script based on your answers, executes it, and shows you the results.

The key architectural decision is that the Builder uses predefined questions at each step rather than open-ended LLM-generated queries. This keeps the conversation focused and predictable. However, your responses are interpreted by the LLM using semantic intent classification, so you can answer naturally rather than selecting from rigid options.

The pipeline itself is deterministic. The LLM guides the conversation, but every transformation runs as a standard pandas script. You can review, modify, or reject any generated code before it touches your data.

Step-by-Step Walkthrough

Let's work through preparing a customer churn dataset. The raw CSV has 7,043 rows with columns including customerID, gender, SeniorCitizen, tenure, MonthlyCharges, TotalCharges, and Churn.

Step 1: Upload and Profile

After uploading the CSV, the Builder automatically profiles every column. You will see a summary like this:

  • 7,043 rows, 21 columns detected
  • Column types: 16 categorical, 4 numeric, 1 identifier
  • Missing values: TotalCharges has 11 blanks (0.16%)
  • Unique counts: customerID is unique (likely an identifier), gender has 2 values, Contract has 3

The profiling step runs without any input from you. It sets the context for the cleaning decisions ahead. Pay attention to the missing value counts and the column type inference, as these will drive the next steps.
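The Builder's profiler runs automatically, but its output can be approximated in a few lines of pandas. The `profile` helper below is illustrative only, not part of the CorePlexML SDK; it reproduces the same per-column summary of dtype, missing counts, and unique values:

```python
import numpy as np
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count and percent, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),
    })

# Tiny stand-in for the churn CSV: a blank TotalCharges forces object dtype
df = pd.DataFrame({
    "customerID": ["a1", "b2", "c3", "d4"],
    "tenure": [1, 34, 2, 45],
    "TotalCharges": ["29.85", "1889.5", np.nan, "1840.75"],
})
print(profile(df))
```

Note that TotalCharges still shows dtype `object` even though it holds numbers, which is exactly the kind of inference problem the profiling step is meant to surface.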

Step 2: Cleaning

The Builder asks about your cleaning preferences. For our churn dataset, the conversation might go like this:

Builder: "I found 11 missing values in TotalCharges (0.16%). How would you like to handle missing values? Options include mean imputation, median imputation, mode imputation, or dropping rows with missing values."

You: "Use median for TotalCharges since it's a monetary column and might have outliers."

The Builder generates a script that converts TotalCharges to numeric (the blanks were causing it to be read as a string), imputes the 11 missing values with the column median, and validates that zero missing values remain. The executed script is displayed in a terminal-style output block so you can verify exactly what happened.

If duplicates exist, the Builder asks whether to remove them based on all columns or a subset. For our dataset, no exact duplicates are found, so this sub-step is skipped automatically.
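The duplicate check the Builder performs maps directly onto pandas' `duplicated`, which supports exactly the all-columns-versus-subset choice described above. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": ["a1", "b2", "b2"],
    "tenure": [1, 34, 34],
})

# Exact duplicates across all columns
n_dupes = int(df.duplicated().sum())

# Duplicates on a subset of columns (e.g. just the identifier)
n_id_dupes = int(df.duplicated(subset=["customerID"]).sum())

if n_dupes:
    df = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, n_id_dupes, len(df))  # 1 1 2
```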

Step 3: Column Dropping

Not every column belongs in a training dataset. The Builder analyzes column characteristics and asks which columns to remove.

Builder: "I identified the following columns that may not be useful for modeling: customerID (unique identifier, 7043 distinct values). Would you like to drop any columns?"

You: "Drop customerID. It's just a row identifier with no predictive value."

The Builder also flags columns with very low variance or near-constant values. If a column has 99.9% of the same value, it contributes almost no information to the model and can safely be removed. You make the final call on each suggestion.
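The near-constant check described above can be sketched with `value_counts`. The `near_constant_columns` helper here is a hypothetical illustration of the idea, not the Builder's actual implementation:

```python
import pandas as pd

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.999) -> list:
    """Flag columns where a single value accounts for >= threshold of rows."""
    flagged = []
    for col in df.columns:
        top_freq = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_freq >= threshold:
            flagged.append(col)
    return flagged

df = pd.DataFrame({
    "country": ["US"] * 999 + ["CA"],  # 99.9% one value
    "tenure": range(1000),
})
print(near_constant_columns(df))  # ['country']
```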

This step is critical for avoiding data leakage. If a column contains information that would not be available at prediction time, such as a post-event label or a derived aggregate, including it will produce artificially inflated metrics during training that collapse in production.

Step 4: Feature Engineering

This is where the Builder adds the most value. Feature engineering transforms raw columns into representations that ML algorithms can learn from more effectively.

Builder: "I see several categorical columns. Would you like to encode them? Options include one-hot encoding (creates binary columns for each category) or label encoding (maps categories to integers)."

You: "One-hot encode the low-cardinality categoricals like gender, Partner, Dependents, and Contract. Label encode InternetService since it has an ordinal nature."

The Builder generates a script that applies pd.get_dummies() for the one-hot targets and LabelEncoder for the ordinal ones. For our churn dataset, one-hot encoding Contract (Month-to-month, One year, Two year) produces three binary columns: Contract_Month-to-month, Contract_One year, and Contract_Two year.

Scaling is another common transformation. The Builder will ask whether to apply standard scaling (zero mean, unit variance) or min-max scaling (0-1 range) to numeric columns like tenure, MonthlyCharges, and TotalCharges. Tree-based models are largely insensitive to feature scale, but scaling matters for linear models, neural networks, and distance-based algorithms: standard scaling is a good default for linear models and SVMs, while min-max scaling suits neural networks and distance-based methods like k-NN.
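The difference between the two scalers is easy to see on a single column with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

tenure = np.array([[1.0], [12.0], [24.0], [72.0]])

standard = StandardScaler().fit_transform(tenure)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(tenure)      # squeezed into [0, 1]

print(round(standard.mean(), 6), round(standard.std(), 6))  # 0.0 1.0
print(minmax.min(), minmax.max())                            # 0.0 1.0
```

Both preserve the ordering of values; they differ only in the range and in how sensitive the result is to outliers (a single extreme value compresses everything else under min-max).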

If your dataset contains date columns, the Builder can extract temporal features: year, month, day of week, hour, and time-since features. For example, a signup_date column can be transformed into signup_month, signup_day_of_week, and days_since_signup.
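The temporal extraction described above corresponds to pandas' `.dt` accessor. A sketch using the hypothetical signup_date column (the reference date here is an arbitrary assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"signup_date": ["2023-01-15", "2023-06-01", "2024-02-29"]})
df["signup_date"] = pd.to_datetime(df["signup_date"])

df["signup_year"] = df["signup_date"].dt.year
df["signup_month"] = df["signup_date"].dt.month
df["signup_day_of_week"] = df["signup_date"].dt.dayofweek  # Monday = 0

# Elapsed time relative to a fixed reference date (illustrative choice)
reference = pd.Timestamp("2024-06-01")
df["days_since_signup"] = (reference - df["signup_date"]).dt.days
print(df)
```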

Step 5: Review and Export

After all transformations, the Builder presents a summary of everything that changed:

  • Rows: 7,043 (unchanged)
  • Columns: 21 original, 20 after dropping customerID, 31 after encoding
  • Missing values: 0
  • Transformations applied: median imputation, one-hot encoding (6 columns), label encoding (1 column), standard scaling (3 columns)

You can preview the first few rows of the transformed dataset. If everything looks correct, export the result as a new dataset version. The original data is never modified. CorePlexML stores every version with full lineage, so you can always trace back to the raw upload.

# The Builder generates and executes scripts like this:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Convert TotalCharges to numeric, coerce errors to NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Impute missing values with median
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# Drop identifier column
df.drop(columns=["customerID"], inplace=True)

# One-hot encode categorical columns
df = pd.get_dummies(df, columns=["gender", "Partner", "Dependents",
                                  "PhoneService", "Contract",
                                  "PaymentMethod"])

# Label encode ordinal column
le = LabelEncoder()
df["InternetService"] = le.fit_transform(df["InternetService"])

# Standard scale numeric columns
scaler = StandardScaler()
df[["tenure", "MonthlyCharges", "TotalCharges"]] = scaler.fit_transform(
    df[["tenure", "MonthlyCharges", "TotalCharges"]]
)

SDK Integration

You can also drive the Dataset Builder programmatically through the SDK.

from coreplexml import CorePlexMLClient

client = CorePlexMLClient(
    base_url="https://api.coreplexml.io",
    api_key="sk_your_api_key"
)

# Start a builder session
session = client.builder.create_session(
    dataset_version_id="dv_abc123"
)

# Send a message to the builder
response = client.builder.chat(
    session_id=session["session_id"],
    message="Impute missing values with median for all numeric columns"
)
print(response["reply"])
print(response["script"])  # The generated Python code

# Advance through steps
response = client.builder.chat(
    session_id=session["session_id"],
    message="Drop customerID column"
)

# Export the result
export = client.builder.export(session_id=session["session_id"])
print(f"New version: {export['dataset_version_id']}")

Common Cleaning Patterns

Numerical Missing Values

For numerical columns, median imputation is the safest default. It is robust to outliers, unlike mean imputation, which can be skewed by extreme values. If the column has a known domain (e.g., age cannot be negative), consider adding a validation step after imputation.
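Combining median imputation with a domain-validation step looks like this in pandas; the Age values here are invented for illustration:

```python
import numpy as np
import pandas as pd

age = pd.Series([34, 51, np.nan, 28, -3, np.nan], name="Age")

# Treat domain-impossible values as missing before imputing
age = age.mask(age < 0)

# Median is robust to outliers, unlike the mean
age = age.fillna(age.median())

# Validation step: every value must now be a plausible age
assert age.between(0, 130).all()
print(age.tolist())  # [34.0, 51.0, 34.0, 28.0, 34.0, 34.0]
```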

Categorical Missing Values

Mode imputation (replacing with the most frequent category) works for low-cardinality columns. For higher cardinality, consider creating an explicit "Unknown" category. This preserves the signal that the value was missing, which can itself be predictive.
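Creating an explicit "Unknown" category is a one-liner, and it keeps missingness visible to the model rather than disguising it as the most common value:

```python
import pandas as pd

city = pd.Series(["Austin", None, "Boston", None, "Austin"], name="city")

# Preserve missingness as its own category instead of guessing a value
city = city.fillna("Unknown")

print(city.value_counts().to_dict())  # {'Austin': 2, 'Unknown': 2, 'Boston': 1}
```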

High-Cardinality Categoricals

Columns with hundreds or thousands of unique values (like zip codes or product SKUs) should not be one-hot encoded directly. The Builder can apply frequency encoding (replacing each category with its count in the dataset) or target encoding (replacing with the mean of the target variable for that category). Both produce a single numeric column instead of hundreds of sparse binary ones.
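Both encodings reduce to a `value_counts` or `groupby` followed by a `map`. A sketch with an invented zip-code column:

```python
import pandas as pd

df = pd.DataFrame({
    "zip": ["78701", "02134", "78701", "78701", "10001"],
    "churn": [1, 0, 0, 1, 0],
})

# Frequency encoding: replace each category with how often it appears
counts = df["zip"].value_counts()
df["zip_freq"] = df["zip"].map(counts)

# Target encoding: replace each category with the mean target value
target_means = df.groupby("zip")["churn"].mean()
df["zip_target"] = df["zip"].map(target_means)

print(df["zip_freq"].tolist())  # [3, 1, 3, 3, 1]
```

One caveat on target encoding: computed naively over the full dataset as above, it leaks the target into the features. In practice the means should be fit on training folds only and applied to held-out data.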

Date Feature Extraction

Date columns contain rich temporal information compressed into a single value. The Builder extracts multiple features: year, month, day of week, and elapsed time since a reference date. A column like last_purchase_date might yield days_since_last_purchase, which is far more useful to a model than the raw date string.

Tips for Best Results

Be specific with your instructions. Instead of saying "clean the data," say "impute missing values in the Age column with the median and drop rows where Income is negative." The more precise your request, the more accurate the generated script.

Review the generated scripts. The Builder shows every line of code it executes. Take a moment to verify the logic. If a script does something unexpected, you can reject it and rephrase your instruction.

Use "force" for edge cases. If a transformation reduces your dataset to zero rows (e.g., an overly aggressive filter), the Builder will block the operation and ask for confirmation. If you genuinely want to proceed, type "force" to override the guard.

Iterate in small steps. Rather than describing all transformations in one message, work through each step individually. This gives you a chance to inspect intermediate results and catch issues early. The conversational format is designed for this iterative workflow.

Check column counts after encoding. One-hot encoding can dramatically increase the number of columns. A dataset with 20 original columns might balloon to 100+ after encoding all categoricals. Monitor column counts and consider alternative encodings for high-cardinality features.
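A quick before/after check on `pd.get_dummies` makes the growth visible; with only two small categoricals the count already more than doubles:

```python
import pandas as pd

df = pd.DataFrame({
    "plan": ["basic", "pro", "enterprise"],
    "region": ["us", "eu", "apac"],
    "tenure": [1, 12, 24],
})

before = df.shape[1]
encoded = pd.get_dummies(df, columns=["plan", "region"])
after = encoded.shape[1]
print(f"{before} -> {after} columns")  # 3 -> 7 columns
```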

The Dataset Builder turns hours of manual data wrangling into minutes of guided conversation. Combined with CorePlexML's versioning system, every transformation is tracked, reproducible, and auditable, giving you a complete lineage from raw upload to ML-ready feature set.