Datasetgen

Datasetgen: rules, eval suites, and regressions after every change

This layer grew out of four deployments where the same work kept repeating: pull rules out of documents and chats, collect similar real cases, and find out before release where the agent still follows the old scenario.

In one deployment, 47% of all changes came from live conversations. In another, corporate tone went through 60+ iterations. In a third, the confidence formula grew to more than 30 parameters. In a fourth, every intermediate step needed its own verification set.

After several deployments the same pattern became obvious: the real cost is not model hosting or RAG wiring. The expensive part starts later. Rules have to be pulled out of PDFs, DOCX files, spreadsheets, and chats. Production mistakes have to become test cases. And the quality loop has to survive every contract or scenario change without being rebuilt from scratch. Datasetgen packages that repeated work.

The full platform map lives on the platform overview page.

How it looks in one real change

A patient writes in the VHI chat: “I need an MRI, tomorrow afternoon works for me.” Until yesterday the agent could prepare a clinic booking right away. Now the insurance program says this route must go through telemedicine first, and no guarantee letter can be issued before that step.

Without a layer like this, the change lives in a PDF, in messages from the business team, and in the analyst’s head. The agent can easily keep following the old route in some scenarios: trying to book immediately or issuing the guarantee letter too early.

Datasetgen takes the new rule, pulls similar real chats, and builds the verification set from them: where direct booking is allowed, where the flow must stop at telemedicine, and where the case must go to an operator.

The team then runs the new agent version through that set. The result is not an abstract “quality loop” but a concrete list of scenarios where the agent still confuses the route, the clinic choice, or the timing of the guarantee letter.
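A minimal sketch of what such a verification set and run could look like. The route labels, the `Case` fields, the two extra chat messages, and the `old_agent` stand-in are all hypothetical and exist only for illustration; the real agent contract and runner are not described on this page.

```python
from dataclasses import dataclass

# Hypothetical route labels; the real agent contract may name these differently.
DIRECT_BOOKING = "direct_booking"
TELEMEDICINE_FIRST = "telemedicine_first"
OPERATOR = "operator"


@dataclass(frozen=True)
class Case:
    case_id: str
    message: str
    service: str
    expected_route: str
    guarantee_letter_allowed: bool


def old_agent(case: Case) -> dict:
    """Stand-in for the agent before the rule change: always books directly."""
    return {"route": DIRECT_BOOKING, "guarantee_letter": True}


def run_suite(agent, cases):
    """Return (case_id, problems) for every case the agent gets wrong."""
    failures = []
    for case in cases:
        out = agent(case)
        problems = []
        if out["route"] != case.expected_route:
            problems.append(f"route: got {out['route']}, expected {case.expected_route}")
        if out["guarantee_letter"] and not case.guarantee_letter_allowed:
            problems.append("guarantee letter issued too early")
        if problems:
            failures.append((case.case_id, problems))
    return failures


# Illustrative cases only; real sets are built from real chats.
cases = [
    Case("mri-01", "I need an MRI, tomorrow afternoon works for me.",
         "mri", TELEMEDICINE_FIRST, False),
    Case("gp-01", "Can I see a general practitioner on Friday?",
         "gp_visit", DIRECT_BOOKING, True),
    Case("mri-02", "I was referred for an MRI, but the clinic refuses to book me.",
         "mri", OPERATOR, False),
]

failures = run_suite(old_agent, cases)
for case_id, problems in failures:
    print(case_id, "->", "; ".join(problems))
```

With the stand-in agent, the two MRI cases are reported and the direct-booking case passes, which is the kind of per-scenario list the paragraph above describes.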

Documents and chats turned into one rule base
Test sets under the exact agent contract
Fixes and migration instead of one-off generation
BA and QA inside the same loop
A reusable layer grown out of four deployments

Which case studies created it

This was not invented in isolation. Every block appeared after a concrete pain point in a live deployment.

AI agent replaces the personal manager for small investors

In the investment project, 47% of all changes came from live conversations. That made one thing clear: production conversations must become new rules and test cases, not a postmortem discussion on a call.

One of the Largest Telecom Operators

In telecom, corporate tone and phrasing went through 60+ iterations. We needed a reusable way to lock scenarios down and verify quality after each change instead of rebuilding that loop from scratch.

Operator of an Urban Transport System

On the transport project, the confidence formula grew to 30+ parameters, and every penalty came from a concrete production failure. That showed that failures often originate in the data and the checks, before anything reaches the model.

Luchi: a decision system for the VHI service workflow

In the VHI workflow, we had to keep separate verification sets for chat parsing, service matching, visits, notes, and operator QA. One final score was useless; the loop had to be step-by-step.
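A small sketch of why a single score hides the problem. The step names follow the paragraph above; the results themselves are invented toy data, not measurements from the deployment.

```python
from collections import Counter

# Toy results: (pipeline step, did the case pass). Invented for illustration.
results = [
    ("chat_parsing", True), ("chat_parsing", True), ("chat_parsing", True),
    ("service_matching", True), ("service_matching", False), ("service_matching", False),
    ("notes", True), ("notes", True), ("notes", True),
]


def pass_rates(results):
    """Pass rate per step instead of one aggregate number."""
    total, passed = Counter(), Counter()
    for step, ok in results:
        total[step] += 1
        passed[step] += int(ok)
    return {step: passed[step] / total[step] for step in total}


overall = sum(ok for _, ok in results) / len(results)
rates = pass_rates(results)
print(f"overall: {overall:.2f}")
for step, rate in rates.items():
    print(f"{step}: {rate:.2f}")
```

In this toy data the aggregate looks tolerable while one step fails most of its cases, which is only visible in the per-step breakdown.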

What datasetgen actually does

Builds one rule base

We pull requirements out of documents, spreadsheets, diagrams, and examples so the team has one place describing rules, constraints, and edge cases.

Builds test sets under the exact agent contract

The sets are not created “for the domain in general” but under the precise input/output schema, scenario types, and boundaries the agent must hold.
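As a rough illustration, a test case can be checked against the contract before it enters a set. The `CONTRACT` layout, field names, and allowed values below are assumptions made for this sketch, not the actual datasetgen schema format.

```python
# Hypothetical contract: field names and allowed values are illustrative only.
CONTRACT = {
    "input": {"message": str, "service": str},
    "output": {"route": str, "guarantee_letter": bool},
    "allowed": {"route": {"direct_booking", "telemedicine_first", "operator"}},
}


def contract_errors(case: dict, contract: dict = CONTRACT) -> list[str]:
    """List every way a test case deviates from the input/output contract."""
    errors = []
    for section in ("input", "output"):
        fields = contract[section]
        payload = case.get(section, {})
        # Required fields must be present and have the declared type.
        for name, expected_type in fields.items():
            if name not in payload:
                errors.append(f"{section}.{name}: missing")
            elif not isinstance(payload[name], expected_type):
                errors.append(f"{section}.{name}: expected {expected_type.__name__}")
        # Fields outside the contract are rejected rather than silently kept.
        for name in payload:
            if name not in fields:
                errors.append(f"{section}.{name}: not in contract")
    # Enumerated output fields must use one of the allowed values.
    for name, allowed in contract["allowed"].items():
        value = case.get("output", {}).get(name)
        if isinstance(value, str) and value not in allowed:
            errors.append(f"output.{name}: {value!r} is not an allowed value")
    return errors


good = {
    "input": {"message": "I need an MRI", "service": "mri"},
    "output": {"route": "telemedicine_first", "guarantee_letter": False},
}
bad = {
    "input": {"message": "I need an MRI"},
    "output": {"route": "book_now", "guarantee_letter": "no", "clinic": "A"},
}
print(contract_errors(good))
print(contract_errors(bad))
```

Rejecting unknown fields is a deliberate choice in this sketch: a case that carries extra fields was probably written against a different contract.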

Checks meaning, not just format

The target is not merely valid YAML or JSON. We check requirement coverage, negative and boundary cases, contradictions in expected output, and signs of quality drift.
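A hedged sketch of three such checks: requirement coverage, missing negative cases, and contradictory expected outputs. The case layout (`requirements`, `kind`, `input`, `expected`) and the requirement IDs are assumptions for this example; quality drift is not covered here.

```python
from collections import defaultdict


def uncovered_requirements(requirement_ids, cases):
    """Requirements that no case references at all."""
    covered = {rid for case in cases for rid in case["requirements"]}
    return sorted(set(requirement_ids) - covered)


def requirements_without_negatives(requirement_ids, cases):
    """Requirements that are never exercised by a negative case."""
    with_negative = {
        rid
        for case in cases
        if case["kind"] == "negative"
        for rid in case["requirements"]
    }
    return sorted(set(requirement_ids) - with_negative)


def contradictions(cases):
    """Inputs that appear more than once with different expected outputs."""
    by_input = defaultdict(set)
    for case in cases:
        key = tuple(sorted(case["input"].items()))
        by_input[key].add(tuple(sorted(case["expected"].items())))
    return [dict(key) for key, outputs in by_input.items() if len(outputs) > 1]


# Illustrative data: R1..R3 are hypothetical requirement IDs.
requirements = ["R1", "R2", "R3"]
cases = [
    {"input": {"service": "mri", "referral": False},
     "expected": {"route": "telemedicine_first"},
     "requirements": ["R1"], "kind": "positive"},
    {"input": {"service": "mri", "referral": False},
     "expected": {"route": "direct_booking"},
     "requirements": ["R1"], "kind": "negative"},
    {"input": {"service": "gp_visit", "referral": False},
     "expected": {"route": "direct_booking"},
     "requirements": ["R2"], "kind": "positive"},
]

print(uncovered_requirements(requirements, cases))
print(requirements_without_negatives(requirements, cases))
print(contradictions(cases))
```

All three cases here would pass a pure format check, yet the set misses one requirement entirely and contains two cases that disagree about the same input.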

Keeps the sets alive after changes

When the agent schema or product contract changes, we try not to rebuild everything from scratch. The sets are patched, extended, and moved to the new contract where possible.
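One way to picture such a migration, assuming a hypothetical contract change in which the `route` field is renamed to `next_step` and a new required field `guarantee_letter` appears. Neither the change nor the `needs_review` flag comes from the source; they only illustrate patching a case instead of regenerating it.

```python
import copy

# Hypothetical contract change, used only for this sketch.
RENAMES = {"route": "next_step"}
DEFAULTS = {"guarantee_letter": False}


def migrate_case(case: dict) -> dict:
    """Move one test case to the new contract without mutating the original."""
    migrated = copy.deepcopy(case)
    expected = migrated["expected"]
    # Renamed fields keep their reviewed values.
    for old, new in RENAMES.items():
        if old in expected:
            expected[new] = expected.pop(old)
    # New fields get a placeholder default and are flagged for human review,
    # because a default is not a verified expected value.
    needs_review = []
    for field, default in DEFAULTS.items():
        if field not in expected:
            expected[field] = default
            needs_review.append(field)
    migrated["needs_review"] = needs_review
    return migrated


old_case = {"id": "mri-01", "expected": {"route": "telemedicine_first"}}
new_case = migrate_case(old_case)
print(new_case)
```

Flagging defaulted fields keeps the "where possible" honest: values that were reviewed before carry over, and only the genuinely new parts of the contract need another look.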

What is harder here than an ordinary open-source setup

The challenge is not producing a file. The challenge is making the expected output reflect the real process logic and hidden constraints instead of merely looking plausible.

The hardest part is translating analyst, QA, and operator judgment into fields, negative cases, tolerances, and verifiable scenarios. That is exactly where an ordinary open-source setup stops being enough.

And this is not one-off work. Documents change, agent contracts change, scenarios expand. The test sets and the quality loop must evolve with the system instead of being thrown away after every change.

Where it sits inside the platform

This is not a separate product standing next to the platform. It is the junction of three modules that already exist on the platform map.

Evaluation

The main loop: synthetic and test sets, LLM judges, regressions, and checks for degradation after changes.

Documents

The intake into the loop: documents, spreadsheets, guidelines, and examples become one rule base for generation and verification.

Chat & Agents

BA and QA workflows on top of the agent runtime: intake, requirement normalization, set preparation, and targeted fixes after changes.

How we assemble this loop

INPUTS
PDF / DOCX / spreadsheets · production conversations · QA artifacts · input / output schemas

In parallel:

CONTEXT: turn fragmented evidence into one source of truth
Requirement normalization (documents → requirements): rules, constraints, examples, and edge cases are extracted from documents.
Research and intake (BA workflow): when context is still raw, the layer first exposes uncertainty and locks down what is actually known.

CONTRACT: fix the exact contract the agent must satisfy
Schema and examples (strict shape): input / output structure, field order, allowed values, and invariants.
Domain rules (project-specific): RAG, subscriber mocks, Excel scenarios, and other inputs get their own generation logic.

One source of truth → requirements.md, agent schema, examples and constraints

GENERATION: dataset creation
Positive, negative, edge, and cross-entity cases are generated under the exact schema.
Expected outputs must stay grounded in source documents or known process rules.
Light checks stop broken format from moving to the next stage.

EVALUATION: quality proof
Dataset review (coverage + semantics): coverage, suspicious cases, repetition, and internal contradictions are checked.
Fast and detailed reports (judge loop): a quick sanity check during generation and a deeper offline report before release.

EVOLUTION: after the agent changes
Targeted fixes (patch, not rewrite): problematic cases are fixed or added without rebuilding the whole set.
Schema migration (contract drift): existing eval suites survive a new agent contract instead of dying after every refactor.
Reverse spec (handover): when code outruns documentation, the layer reconstructs the spec from the actual implementation.

documents → requirements → datasets → eval reports → fixes / migrations · BA + QA workflows in one loop · reusable across projects while domain logic stays client-specific

What changes in the project

A new project no longer starts its quality loop from a blank page. Repeated steps are already packaged, so the team spends time on domain specifics instead of mechanical artifact assembly.

After changes in the agent or the documents, the whole loop does not have to be rebuilt blindly. The sets can be updated selectively and checked against the exact place where drift appeared.

For the client this means something simple: after an agent change, the team does not guess what broke. It gets updated rules, a verification set, and a report showing which scenarios no longer pass.

The four deployments that created this layer live on the cases page.

Tell us which process you want to break down.

We will tell you whether the task fits AI agents and, if it does, outline a concrete plan.

or write directly to ilya@manaraga.ai