Chemin

AI Council 2026: Why Verification Now Decides Model Quality

26 June, 2026Insights
AI Council 2026: Why Verification Now Decides Model Quality

Executive Summary

Zhi Xiong, Head of Client Solutions at Chemin, attended AI Council 2026 in San Francisco. Over three days, he sat in on sessions covering training, agents, inference, and AI security.

image.pngimage.png

One idea surfaced repeatedly: building better AI increasingly depends on deciding what information deserves to influence a model.

The pattern appeared across four stages of the AI lifecycle:

  • Training data: A 4B model outperformed one nearly 60 times its size because experts verified the data it learned from.
  • Evaluation: Open-ended tasks require experts to define what good performance looks like before a model can learn from feedback.
  • Synthetic data: Machine-generated data only improves models when quality checks filter it before training.
  • Runtime agents: AI agents need safeguards that distinguish trusted instructions from untrusted inputs before they act.

Each stage tackles a different problem. Together, they show how model quality depends on the decisions made before, during, and after training.

Training Data: Why a Smaller Model Beat a Bigger One

Charles Dickens, Senior Research Scientist at Snorkel AI, presented a 4B-parameter model that outperformed a 235B-parameter model on finance questions. Both models came from the same family.

The smaller model was trained in about 21 hours using eight NVIDIA H100 GPUs. The full training cost was under US$500, including the GPU resources needed to run a second model that graded responses.

Snorkel built the dataset through the review pipeline shown below.

Figure 1. Snorkel's dataset curation and verification pipeline.

image.png

AI-generated question-answer pairs passed automated and expert review before entering model training.

Snorkel extracted financial tables from SEC filings, the annual reports filed by US public companies. A separate model generated question-answer pairs from those tables.

Before entering training, every question-answer pair passed progressively stricter review:

  • A programmatic check confirmed every answer matched its source table.
  • Two language models independently reviewed the calculations and extracted values.
  • Human financial experts audited the most difficult cases before approving the dataset.

The team trained the model on simple single-table questions. That capability was extended to more complex questions that drew on information from multiple tables. The verified dataset helped the model generalize beyond the examples it had seen.

The findings also have practical implications for deployment. A 4B model can run on a single production GPU, while a 235B model requires a multi-GPU cluster. 

For enterprise teams, a carefully trained smaller model can deliver comparable performance with substantially lower infrastructure costs.

Evaluation: Why Open-Ended Work Is Hard To Grade

A reinforcement learning panel featuring Nic Ouporov (Fleet), Thais Castello Branco (Taste), and Vincent Weisser (Prime Intellect) addressed how to evaluate models when there is no single correct answer.

Reinforcement learning improves a model through feedback. The model attempts a task, receives a score, and uses that score to improve future responses.

Some tasks are straightforward to evaluate. Code either runs or it fails. Open-ended tasks, such as writing a customer reply or designing a user interface, have no single correct answer. Before a model can learn from these tasks, a domain expert has to define what a good result looks like.

As more teams build and share these evaluation environments, the quality of the grading becomes just as important as the model itself. A weak scoring system teaches the wrong behavior, regardless of how capable the model is.

Synthetic Data: Which Generated Examples Survive

NVIDIA described how it trained Nemotron 3 Super using more than 7 million training examples. Much of the dataset was synthetic, generated by other AI models rather than collected from people.

Generating millions of examples is no longer the difficult part. The challenge is deciding which ones deserve to reach training.

Before entering the dataset, every machine-generated example passed automated quality checks:

  • Math answers were tested for correctness.  
  • Generated code was run and checked for errors. 
  • Outputs that needed a fixed format were validated against schemas.  
  • A second model graded each conversation's quality.  
  • When multiple responses existed, the system kept the one with the strongest agreement.
  • Data that overlapped with evaluation benchmarks was removed to keep testing fair.

NVIDIA's automation mirrors Snorkel's earlier curation logic, but scales it up from hundreds of human-verified pairs to millions of machine-verified samples. In both cases, the filtering process decides what the model learns.

As synthetic datasets grow, filtering becomes as important as generation.

Runtime Agents: Verifying AI Agents In Production

Filtering doesn’t stop once training is complete. Diana Kelley, Chief Information Security Officer at Noma Security, showed how the same question returns when AI agents begin interacting with live data.

AI agents work with information from many sources, including system instructions, user requests, retrieved documents, and web pages. 

Everything the model reads arrives as a single stream of text. User instructions and webpage content look the same to the model, with no built-in way to tell which to trust.

This creates a prompt injection risk. An attacker can hide instructions inside a document or webpage that the agent reads. Without additional safeguards, the agent may follow those instructions as though they came from the user.

Teams reduce this risk with layered controls:

  • Keep system instructions separate from retrieved content.
  • Verify and clean external inputs before they reach the model.
  • Log tool calls and agent actions.
  • Require human approval before high-impact actions.

These recommendations align with the OWASP Agentic AI Risk Framework, which provides guidance for securing AI agents in production.

The Constraint Behind Modern AI

Each session examined a different stage of AI development. The common requirement across all of them was deciding what information should shape a model's behavior.

image.png

The week ended on a lighter note, with AI Council founder Pete Soderling performing an AI song he wrote himself at the piano. Behind the humour was a conference focused on a serious engineering challenge: building AI systems that people can trust.

It is also the principle that guides our work at Chemin. We build verified datasets, expert review pipelines, and evaluation workflows that help teams deploy AI systems with confidence.

Interested in exploring the talks in more detail? Recordings from AI Council 2026 are available on the AI Council YouTube channel.

If you’re building models or agents and verification has become your bottleneck, talk to us.

Share