Dataset Builder

Conversational data preparation

Chat with an AI assistant powered by OpenAI and Anthropic LLMs to prepare your data. Upload CSV/Excel/JSON/XML, clean, transform, encode — all through natural conversation.

platform.coreplexml.io

Dataset Builder AI conversational interface with data preparation assistant

PIPELINE

6-step data preparation pipeline

From raw upload to ML-ready dataset. Every step guided by AI, every transformation shown as executable Python.

1. Load & Explore

Upload your file (CSV, Excel, JSON, XML). The AI automatically detects schema, column types, missing values, and data quality issues.

2. Clean

Handle missing values (mean, median, forward fill), remove duplicates, fix outliers, and drop irrelevant columns — all through conversation.

3. Transform

Type casting, date feature extraction, text normalization, binning, and custom expressions. Tell the AI what you need in plain English.

4. Encode & Scale

One-hot encoding, label encoding, StandardScaler, MinMaxScaler. The AI suggests the best approach based on your data and target variable.

5. Feature Selection

Select or drop features for modeling. The AI recommends based on correlation analysis and feature importance from initial profiling.

6. Export

Export as an AI-ready versioned dataset. Full lineage tracking — every transformation recorded with the generated Python script.

Key Capabilities

Everything you need to get the most out of this module.

Multi-Format Ingestion

Upload CSV, Excel, JSON, or XML files. Automatic schema detection and type inference.

Smart Transformations

Type casting, one-hot encoding, label encoding, scaling, imputation — all through natural language.

Transparent Execution

See the generated Python scripts for every transformation. Full reproducibility guaranteed.

Version Control

Dataset versioning with schema diffing. Track every change and roll back when needed.

AI-POWERED

Talk to your data

Powered by OpenAI and Anthropic LLMs. Describe what you want in plain English — the AI generates and executes the transformation code for you.

YOU

Drop columns with more than 50% missing values

I found 3 columns above the threshold: fax_number (87% null), middle_name (62% null), and alt_phone (54% null). Dropping them now.

YOU

Fill missing ages with the median and encode gender as one-hot

Done. Filled 47 missing age values with median (34.0). Created gender_male and gender_female columns. Script saved for reproducibility.

TRANSFORMATIONS

15+ transformation types

From simple imputation to complex feature engineering. All available through natural language or the API.

Imputation

Fill missing values with mean, median, mode, forward fill, backward fill, or custom values. Column-specific strategies.

One-Hot Encoding

Convert categorical columns to binary indicator columns. Handles high-cardinality with configurable thresholds.

Label Encoding

Ordinal encoding for ordered categories. Preserves natural ordering (low < medium < high).

Scaling

StandardScaler (z-score) or MinMaxScaler (0-1 range). Essential for distance-based and neural network models.

Date Extraction

Extract year, month, day, weekday, hour from datetime columns. Creates multiple numeric features from a single date.

Binning

Group continuous values into discrete bins. Equal-width, equal-frequency, or custom boundaries.

Text Normalization

Lowercase, strip whitespace, remove special characters. Standardize text columns before encoding.

Custom Drops

Remove columns by name, high-null threshold, or zero-variance detection. The AI recommends drops based on data quality.

Type Casting

Convert string to numeric, parse dates, fix mixed-type columns. Handles edge cases like currency symbols and percentages.

File Formats

CSV, Excel, JSON, XML

AI Engine

OpenAI + Anthropic

Transform Types

15+

Version History

Full lineage

Code Transparency

Python scripts shown

Schema Detection

Automatic

Export Formats

CSV, Parquet

Session State

Persistent & resumable

PYTHON SDK

Conversational ETL from code

Integrate the AI data prep pipeline into your workflows. Chat, transform, and export programmatically.

prepare_data.py

from coreplexml import CorePlexMLClient

client = CorePlexMLClient(
    base_url="https://api.coreplexml.io",
    api_key="sk_your_api_key"
)

# Start a conversational ETL session
session = client.builder.create_session(
    project_id="proj_abc",
    file_path="raw_customers.csv"
)
print(f"Session: {session['id']}")
print(f"Rows: {session['row_count']}, Cols: {session['col_count']}")

# Chat with the AI to clean data
resp = client.builder.chat(
    session_id=session["id"],
    message="Drop the customer_id column and fill missing ages with median"
)
print(resp["reply"])
print(f"Script: {resp['script']}")  # Shows generated Python

# Execute the cleaning step
result = client.builder.execute(
    session_id=session["id"],
    step="cleaning"
)
print(f"Rows after cleaning: {result['row_count']}")

# Ask for encoding recommendations
resp = client.builder.chat(
    session_id=session["id"],
    message="One-hot encode the categorical columns and scale numerics"
)

# Export the prepared dataset
final = client.builder.finalize(session_id=session["id"])
print(f"Dataset version: {final['dataset_version_id']}")
print(f"Ready for AutoML training")

Dataset Builder API

Endpoints for conversational sessions, step execution, and dataset export.

POST

/api/builder/sessions

Create a new session with file upload (multipart)

POST

/api/builder/sessions/{id}/chat

Send a natural language message, get AI response + plan

POST

/api/builder/sessions/{id}/steps/{step}/propose

Generate transformation plan and Python script

POST

/api/builder/sessions/{id}/steps/{step}/execute

Execute the proposed transformation step

POST

/api/builder/sessions/{id}/finalize

Export AI-ready dataset as a versioned resource

GET

/api/builder/sessions/{id}

Get session state, current step, and data preview

Platform Views

From upload to analysis

platform.coreplexml.io/datasets/...

Dataset overview with column statistics and data profiling

Dataset overview with column-level statistics

platform.coreplexml.io/datasets/.../visualizations

Data distribution charts and visualization analysis

Interactive data visualizations and distributions

Ready to get started?

Start building with CorePlexML today. Free tier available — no credit card required.

Start Free View Pricing