Adversarial Audit for AI Verifiers and Evals

Your eval has a
hidden objective.

We audit AI evaluations, verifiers, and reward functions for exploitability, hidden incentives, and brittle scoring failures.

Request an audit →

Audits for

Coding-agent evals RLVR environments Benchmark scorers Tool-use tasks Reward functions

01

The Problem

The scorer is the attack surface.

Modern AI systems are trained and evaluated against tasks with automated scoring. If the scorer is weak, the model learns to game the score instead of learning the intended skill.

This appears across reinforcement learning, RLVR post-training, coding-agent evals, browser benchmarks, tool-use environments, and long-horizon agent tasks.

A benchmark can look rigorous while quietly training the wrong behavior. Teams overestimate capability. Training runs optimize the wrong target. Benchmark gains don't transfer. Decisions get made on corrupted signals.

We believe evaluation integrity will become one of the central bottlenecks in AI development. No one is systematically auditing the score itself.

Common exploit classes

  • Test leakageGround-truth answers accessible to the model at scoring time.
  • Brittle matchingRegex or keyword scorers that pass on formatting rather than correctness.
  • DeterminismFixed seeds make evaluation order predictable and memorizable.
  • Reward hackingMulti-term composition dominated by gaming a single auxiliary term.
  • ShortcutsDegenerate outputs that satisfy the verifier without solving the task.
  • Sandbox escapeEnvironments that allow writes to grading code or shared state.
  • Blind spotsSystematic failure modes the scorer never checks.

02

Tooling

A CLI for evaluation integrity.

We are starting narrow. The first product is a command-line tool that runs adversarial analysis against verifiers, scoring scripts, and benchmark configurations.

The goal is not to compute more metrics. It is to reveal concrete ways the evaluation can fail, and to produce findings engineers can act on.

Every exploit path should be reproducible, inspectable, and accompanied by a hardening path.

$ objective audit ./verifiers/coding_eval.py
Analyzing: coding_eval.py
Found 4 exploit paths across 2 severity levels.
$ objective attack --env verl --task gsm8k
Running adversarial probes...
Pass-without-solving confirmed on 3 of 12 subtasks.
$ objective report --format md
Report written to ./objective-report-2026-03-28.md

What the CLI does

  • audit — static analysis to identify likely weakness classes: leakage, brittle matching, lenient pass criteria, state dependencies.
  • attack — active probing for reward loopholes, pass-without-solving strategies, and exploitable reward composition.
  • trace — reproducible failure traces showing exactly how and when the verifier is exploited.
  • report — structured findings with severity assessments, hardening recommendations, and version diffs.
  • monitor — CI integration to detect when a benchmark or verifier becomes easier to game over time.

03

Positioning

We audit the score, not the model.

There are already many tools for running evals, tracing agent behavior, and monitoring LLM applications. Hidden Objective is not one of those tools.

The question we answer is not "did the model do well?" It is whether the model did well for the right reason — and whether its score can be produced without the intended capability.

Others answer
We answer
Did the model score well?
Can this score be produced without the intended skill?
What is the model's capability?
What is the verifier's exploitable weakness?
How do we run better evals?
How do we know this eval can be trusted?
Did benchmark performance improve?
Did it improve for the right reason?

04

Who We Work With

Teams where bad evals are expensive.

Our early work is with teams doing RL or RLVR post-training, teams building coding and tool-use agents, and researchers creating benchmarks that others will train against.

If your training runs or product decisions depend on automated scoring and you have reason to believe your evals might be brittle, we want to talk.

We start with manual audits before full tooling automation. Each audit builds a corpus of failure modes that sharpens the next one.

  • Frontier labs running RL or RLVR post-training
  • AI agent startups building coding and tool-use systems
  • Benchmark creators whose evals others will train against
  • AI safety and alignment researchers
  • Enterprises deploying agentic systems in high-stakes domains

"We are not trying to build a giant everything-platform on day one. We are starting with a narrow, painful wedge: audit whether a verifier or evaluation setup can be exploited."

— Hidden Objective research brief

Find the hidden objective
before your model does.

We are working with a small number of teams. If your evaluation pipeline is load-bearing and you want to know whether it can be gamed, get in touch.

audit@hiddenobjective.com →