Frequently asked questions

Chaveta generates training-ready synthetic datasets from natural-language prompts. It produces JSONL, Parquet, HDF5, and JSON trajectory files across 8 categories: LLM conversations, agent tool-use traces, RL/world-model transitions, robotics trajectories, code generation pairs, chain-of-thought reasoning traces, tabular records, and generic structured data. Every dataset is backed by a reproducible generator script, a validation report, and curlable export endpoints.

The pipeline runs in 8 steps. (1) Your prompt is classified into a training category. (2) A generation spec is compiled with coverage axes, constraints, and difficulty tiers. (3) A generator package is authored with source code and validation rules. (4) A smoke batch is generated and validated. (5) Failed samples are repaired via semantic regeneration or blueprint merging. (6) The full-scale batch is generated. (7) Artifacts are exported in your requested formats. (8) A quality report scores the entire run on structural integrity, semantic fidelity, distribution health, and policy posture.
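As a rough illustration of the ordering described above, the eight steps can be modeled as a fixed sequence that a runner walks through. This is a minimal sketch, not Chaveta's implementation: the step names, `RunState`, and `run_pipeline` are hypothetical labels chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical step names mirroring the eight steps in the text;
# they are illustrative labels, not Chaveta's real identifiers.
PIPELINE_STEPS = [
    "classify",
    "compile_spec",
    "author_generator",
    "smoke_batch",
    "repair",
    "scale_batch",
    "export",
    "quality_report",
]


@dataclass
class RunState:
    """Minimal record of a run: the prompt and the steps finished so far."""
    prompt: str
    completed: list[str] = field(default_factory=list)


def run_pipeline(
    prompt: str,
    handlers: dict[str, Callable[[RunState], None]],
) -> RunState:
    """Run every step in order; a step without a handler is a no-op."""
    state = RunState(prompt=prompt)
    for step in PIPELINE_STEPS:
        handler = handlers.get(step)
        if handler is not None:
            handler(state)
        state.completed.append(step)
    return state
```

Calling `run_pipeline("customer support chats", {})` returns a state whose `completed` list matches `PIPELINE_STEPS`; real handlers would do the work of each step.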

Eight categories: (1) LLM training data — multi-turn conversations with tool calling, (2) Agent training — tool-use traces with error recovery and policy dispositions, (3) RL/world model — state-action-reward transitions with physics simulation, (4) Robotics — trajectories with forward kinematics, sensor noise, and calibration drift, (5) Code generation — instruction-code-test triples in Python, TypeScript, SQL, and Bash, (6) Reasoning — chain-of-thought traces with self-correction, (7) Tabular — structured rows for analytics and warehousing, (8) Generic — fallback JSONL records.

Yes. Every dataset session has an editable generator script panel. You can modify the source code directly, reset it, and apply changes to regenerate the dataset. The workspace agent can also edit scripts and datasets for you via natural language — it calls the edit tools automatically and shows confirmations before applying changes.

Every generated batch passes through four validation layers: structural (required fields are present), semantic (tool grounding, turn ordering, state transitions), policy (safety and compliance language), and distribution (deduplication and coverage of the spec's axes). Failed samples are repaired up to 3 times. A quality report scores the run on schema validity, deduplication rate, diversity, coverage, constraint satisfaction, and policy pass rate. A benchmark proxy score determines whether the gate passes, warns, or fails.
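The repair limit and the three-way gate can be sketched in a few lines. This is an illustrative sketch under stated assumptions: only the structural layer is shown, the function names are hypothetical, and the gate thresholds (0.8 and 0.6) are placeholders, since the actual cutoffs are not given here.

```python
from typing import Callable, Optional

MAX_REPAIRS = 3  # the text states failed samples are repaired up to 3 times


def structural_check(sample: dict, required: set[str]) -> bool:
    """Structural layer: every required field is present."""
    return required.issubset(sample.keys())


def validate_with_repair(
    sample: dict,
    required: set[str],
    repair: Callable[[dict], dict],
) -> tuple[Optional[dict], int]:
    """Return (valid sample, repairs used), or (None, repairs used) on failure."""
    attempts = 0
    while not structural_check(sample, required):
        if attempts == MAX_REPAIRS:
            return None, attempts
        sample = repair(sample)
        attempts += 1
    return sample, attempts


def gate(score: float, pass_at: float = 0.8, warn_at: float = 0.6) -> str:
    """Map a proxy score to pass/warn/fail. Thresholds are assumed values."""
    if score >= pass_at:
        return "pass"
    if score >= warn_at:
        return "warn"
    return "fail"
```

A sample that a repair function cannot fix is dropped after the third attempt rather than retried indefinitely, which keeps the cost of a bad batch bounded.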

JSONL (all categories), Parquet (agent, RL, tabular, code), HDF5 and JSON trajectories (robotics), and ROS bag (robotics). Every session gets a curlable dataset endpoint — pull the latest generated data from scripts, notebooks, or CI pipelines. Export artifacts are available for download with signed URLs when using S3/Tigris storage.
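For scripts and notebooks, pulling a JSONL export might look like the sketch below. It assumes the dataset endpoint returns newline-delimited JSON over plain HTTP(S) without extra authentication; the endpoint URL format is not specified here, so `fetch_records` takes whatever URL your session provides. Both function names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of JSONL text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def fetch_records(url: str) -> list[dict]:
    """Download a JSONL export and parse it.

    Assumes the endpoint serves UTF-8 JSONL and needs no auth headers.
    """
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return list(parse_jsonl(text.splitlines()))
```

Keeping the parser separate from the download makes it easy to reuse on a file already saved to disk, for example `list(parse_jsonl(open("data.jsonl", encoding="utf-8")))`.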

Free tier: up to 1 GB of generated data per request. Paid tier: per-GB metered billing via Stripe. For paid users, the LLM-driven classification step uses the prompt context to recommend a sample count that maximizes the volume of data generated. Pricing varies by data type: tabular data starts at $10-50/GB, tool-calling datasets at $50-250/GB, robotics trajectories at $100-500/GB, and geological/gaming data at $100-1,000+/GB.
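To get a feel for the numbers, the per-GB ranges quoted above can be turned into a rough cost bracket. This sketch assumes billing is linear in gigabytes and treats each quoted range as a lower and upper bound; neither assumption is confirmed here. The open-ended geological/gaming range is left out because it has no stated upper limit, and the dictionary keys are hypothetical labels.

```python
# Per-GB price ranges in USD, copied from the figures quoted in the text.
PRICE_RANGES_USD_PER_GB = {
    "tabular": (10, 50),
    "tool_calling": (50, 250),
    "robotics": (100, 500),
}


def estimate_cost(data_type: str, gigabytes: float) -> tuple[float, float]:
    """Return a (low, high) USD estimate, assuming linear per-GB billing."""
    if gigabytes < 0:
        raise ValueError("gigabytes must be non-negative")
    low, high = PRICE_RANGES_USD_PER_GB[data_type]
    return gigabytes * low, gigabytes * high
```

Under these assumptions, 4 GB of robotics trajectories would fall between $400 and $2,000.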

No. Chaveta is a compiler, not a wrapper. It classifies your request, compiles a deterministic generation spec, authors a reproducible generator package with source code, generates samples via templates or AI, validates them through four layers, repairs failures, exports artifacts, and produces a quality report. The generator is the asset — you can inspect it, edit it, version it, and run it again later. The LLM is one component in a larger deterministic pipeline, not the whole product.