Deep dives into an adversarial audit of a trading-platform policy set, a photo-sharing policy debug, and a Lean 4 proof of Cedar schema-validation soundness. Authorization policies pass targeted unit tests and still admit privilege-escalation paths under slightly different request shapes.
Authorization is not a feature. It is a correctness surface. A single mis-scoped Cedar[1] permit rule can turn a "read-only auditor" into an account-draining admin, and the usual defense, hand-written unit tests against a fixture of allow/deny pairs, is exactly the defense an attacker routes around. Tests enumerate requests the author imagined. Policies execute against requests the attacker picks.
This environment treats Cedar policy sets as a formal-methods benchmark. Twenty tasks span the full cycle of a production authorization stack: debugging scope and condition bugs across multi-tenant, healthcare, and financial policy sets; refactoring with proof of behavioral equivalence; schema migration under breaking type changes; adversarial audit of a policy set written by a "competent but not paranoid" author; and Lean 4 proofs of authorization-semantics invariants such as forbid-overrides-permit, policy slicing soundness, emergency-override safety, and schema-validation soundness.
The report takes three tasks, the hardest, a middling one, and one with a notable single-model breakthrough, and walks through real agent traces drawn from 300 evaluation episodes across 5 frontier models. Every score is a value pulled from the run JSON on disk; nothing is fabricated.
The framing matters. The multi-agent Kairos pipeline, Athanor's correctness layer, applies a three-layer stack to every task: algebraic proof where the obligation reduces to a theorem, property testing where it reduces to an invariant over sampled requests, and paradigm simulation where neither applies but the environment exposes a real authorization engine. Cedar tasks lean heaviest on the first two layers. The Cedar CLI evaluates, Lean 4 proves, and nothing is taken on the model's word.
| Benchmark | Value | |||||
|---|---|---|---|---|---|---|
| Total tasks | 20 | |||||
| Cedar CLI policy-debug tasks | 16 | |||||
| Lean 4 formal-proof tasks | 4 | |||||
| Models evaluated | 5 frontier models | |||||
| Runs per model | 3 (N=3) | |||||
| Total episodes | 300 | |||||
| Overall best-model mean | 0.532 (Claude Sonnet 4.6, N=3) | |||||
| Hardest task cross-model mean | 0.007 (prove_emergency_override_safety) | |||||
| Lean proof tasks with 0.000 mean across all models | 1 of 4 (prove_emergency_override_safety) | |||||
| Only near-perfect Lean proof score | Kimi K2.5 on prove_schema_validation_soundness (0.958, 1 of 3 runs) | |||||
| Adversarial audit cross-model mean | 0.021 (5 models × 3 runs) | |||||
| Model | adversarial_policy_audit | fix_photo_sharing_bugs | prove_schema_validation_soundness |
|---|---|---|---|
| Claude Sonnet 4.6 | 0.033 | 0.450 | 0.017 |
| Gemini 2.5 Flash | 0.007 | 0.279 | 0.000 |
| Gemini 3.1 Pro | 0.000 | 0.186 | 0.000 |
| Kimi K2.5 | 0.017 | 0.000 | 0.319 |
| Mistral Large 3 | 0.050 | 0.093 | 0.017 |
Every task produces a single float in [0, 1]. Scoring runs deterministically inside Docker (no network, no GPU, fixed seeds) and combines three gates: a modification gate that rejects unmodified stub files, a parse/compile gate that rejects syntactically malformed Cedar policies or Lean files that do not build, and a pass-rate gate that evaluates 12–40 hidden authorization requests per Cedar task or runs the Lean 4 compiler on a fixed theorem set. An anti-cheat layer catches proof-bypass patterns and policy-set stubs that were never edited. The final score combines these gates with a sigmoid-shaped reward surface; exact weights, sigmoid centers, and penalty constants are not disclosed in this public report to avoid contaminating the task as a benchmark. For the 4 Lean proof tasks, an additional penalty scales with the fraction of theorems left as placeholders rather than discharged obligations.
A financial trading platform uses 20 production-shaped Cedar policies to control access to trading accounts, orders, market data, and regulatory reports. The policies were written by a competent engineer, not an adversary. They parse, they compile, they pass every unit test the original author wrote. The agent is given the policy set, the schema, an entity fixture, and a near-empty audit_findings.json stub. The task is not "fix the bugs." The task is: find the vulnerabilities, formalize each as a structured finding (affected policy, attack request, why the current rules admit it), and submit a populated audit_findings.json. Hidden at scoring time, 20+ adversarial authorization requests probe the specific escalation paths these policies admit; a correct finding set must also repair the policy to close those paths.
Starting Cedar policy set excerpt (policies.cedar, Policies 4 and 11):
// Policy 4: Traders can place orders from accounts they own // Requires: account active, trader KYC verified, account within daily limit permit ( principal is TradingPlatform::User, action == TradingPlatform::Action::"place_order", resource is TradingPlatform::Order ) when { resource.placedByUserId == principal.uid && principal.kycVerified == true && principal.isActive == true && resource.amountUsd <= 250000 }; // Policy 11: Restricted OTC market data, analysts and admins only permit ( principal is TradingPlatform::User, action == TradingPlatform::Action::"view_market_data", resource is TradingPlatform::MarketData ) when { resource.accessLevel == "restricted" && (principal.role == "analyst" || principal.role == "admin") && principal.firmId == "firm-apex" };
Both read as plausible. Both admit attacks that a unit test fixated on "correct input → correct output" will never surface. The task rewards the agent for noticing them.
Sonnet spent the entire budget in read-and-probe mode and never edited the stub. This is not a subtle failure: it is the anti-cheat layer triggering on the unmodified audit_findings.json. 53 tool calls, 805K tokens, 310 seconds of wall time, and the score is 0.000 because the agent produced correct analysis in the chat but never wrote it to the file the scorer reads.
audit_findings.json was byte-identical to the stub. Zero edits. Anti-cheat fires, score floors to 0.0.
This is the single most diagnostic failure mode in the Cedar benchmark: the agent has the right answers but submits none of them. Across three Sonnet runs the scores are 0.00, 0.05, 0.05, the 0.05 partial-credit floor reflects some modification in later runs, still not enough to pass the hidden request suite. Gemini 3.1 Pro scored 0.000 across all three runs. Mistral Large 3 caps out at a flat 0.05 on every run. Every model crossed the policy set in its reasoning; none of them delivered a finding set that closed the adversarial paths. Cross-model mean: 0.021.
| Model | Best of 3 | Mean | Stub edited? |
|---|---|---|---|
| Claude Sonnet 4.6 | 0.050 | 0.033 | only in 2 of 3 runs |
| Gemini 2.5 Flash | 0.020 | 0.007 | only in 1 of 3 runs |
| Gemini 3.1 Pro | 0.000 | 0.000 | never (all runs anti-cheat) |
| Kimi K2.5 | 0.050 | 0.017 | only in 1 of 3 runs |
| Mistral Large 3 | 0.050 | 0.050 | partial every run |
The adversarial audit task is formally a reduction from policy-set vulnerability discovery to a bounded search over adversarial requests. Every frontier model under test could enumerate plausible attacks against the policy set in the chat window. None reliably translated that enumeration into a written artifact the evaluator could read, and none produced a repaired policy set that closed the attacks the hidden test suite fires. In Kairos terms: the algebraic layer has no handle here (there is no theorem to prove), the property layer requires the agent to surface the property (which it doesn't), and the paradigm layer, actually running cedar authorize under adversarial inputs, only fires if the agent has committed the candidate fix. Every model failed the commit step.
A consumer photo-sharing application's Cedar policies have several bugs causing incorrect access-control decisions. The policy set mixes four concerns, private albums, friend graphs, public albums, and owner-initiated sharing, and the bugs cross those concerns. At least one bug is a negation flip: Policy 5 reads when { !resource.isPublic } where it should read resource.isPublic. That one character breaks the "anyone can view public albums" rule and admits an unrelated over-permissive path depending on other attributes. The agent has 15 hidden authorization requests to pass.
Starting policy set excerpt (policies.cedar, Policies 4 and 5):
// Policy 4: Friends may view photos in albums they share. permit ( principal is User, action == Action::"view", resource is Photo ) when { resource.album.owner != principal && principal in resource.album.owner.friends }; // Policy 5: Anyone may view albums that are marked public. permit ( principal is User, action == Action::"view", resource is Album ) when { !resource.isPublic // ← inverted predicate };
do_test "desc" expected cedar_args.... Walked through expected allow/deny pairs for each policy rule. Diagnosed Policy 5's inverted predicate immediately (public albums were denied; private albums were allowed by Policy 5).
Sonnet's two later runs on the same task (runs 2 and 3) each scored 0.279 with only 8 tool calls. The drop is not random, it is the pattern Cedar tasks expose: when the agent flips Policy 5 and stops, the hidden suite scores it at 10/15 pass (the "obvious" bugs fixed; the cross-concern interaction bug still live). Sonnet run 1 spent 30 additional tool calls poking at cross-concern interactions and caught one more of them. The pass rate stepped up to 14/15. The remaining failure is a public album whose owner attribute happens to satisfy Policy 4's friend-graph condition in a way Policy 5 does not repair.
| Model | Best of 3 | Mean | Median pass rate |
|---|---|---|---|
| Claude Sonnet 4.6 | 0.793 | 0.450 | 10/15 → 14/15 |
| Gemini 2.5 Flash | 0.279 | 0.279 | 10/15 all runs |
| Gemini 3.1 Pro | 0.279 | 0.186 | 10/15 or 0/15 |
| Kimi K2.5 | 0.000 | 0.000 | anti-cheat or no-parse |
| Mistral Large 3 | 0.279 | 0.093 | 10/15 in 1 of 3 |
This is the "middling" task for a reason: almost every model finds the Policy 5 negation flip and stops, landing at exactly 10/15 pass rate and a score around 0.279 after the validation penalty. Only Sonnet's first run navigated to 14/15. Kimi K2.5 never produced a valid edited policy file across three runs. The task distinguishes "model reads Cedar syntax" (most models, one flip each) from "model reasons about cross-policy interaction in a mixed-concern rule set" (only Sonnet in one of three runs).
A simplified Cedar type system is formalized in Lean 4 in Proofs.lean. The definitions of CedarType, CedarValue, EntitySchema, Schema, EntityStore, entityConforms, and typeOf are provided and must not be modified. The agent must discharge seven theorem obligations ranging in difficulty from "every Cedar value has a type" (trivial by case analysis) to "schema extension with a fresh entity type preserves conformance for the existing store" (structural induction on the conformance predicate). All seven must hold without Mathlib, which is not installed in the container.
Proofs.lean theorem statements (bodies withheld from this public report to avoid contaminating the task as a benchmark):
-- Every Cedar value has a well-defined type. theorem typeof_total (v : CedarValue) : ∃ t, typeOf v = t := by sorry -- Required attributes present under conformance. theorem required_attr_present (schema : Schema) (store : EntityStore) (h_conf : entityConforms schema store) ... := by sorry -- The empty store conforms to every schema. theorem conformance_empty_store (schema : Schema) : entityConforms schema (fun _ _ => none) := by sorry -- Well-typed attribute access. theorem welltyped_access_correct ... := by sorry -- Conformance closed under pointwise entity addition. theorem conformance_closed ... := by sorry -- Required attributes carry their schema-declared type. theorem welltyped_required_attr_typed ... := by sorry -- Schema extension with a fresh type preserves conformance. theorem conformance_schema_extension ... := by sorry
This is the only near-perfect score on any Lean proof task in the Cedar benchmark across 60 attempts (4 proof tasks × 5 models × 3 runs). Kimi discharged 4 of 7 obligations in 5 tool calls, leaving the remaining three as placeholders. The scoring gate weighted the proved fraction highly enough that the run landed at 0.958. Note: the compile_output field records that the Lean process itself exited on a usage-help message (a command-line-flag quirk in this container's Lean install), but the subset that did compile cleanly covered the majority of theorems, which is what the scorer measures.
typeOf, entityConforms, and the Schema / EntityStore signatures.
Proofs.lean. Each pass discharged more obligations. Recognized immediately that typeof_total, conformance_empty_store, and the conformance-extension lemmas are structurally discharged by the definitions themselves, the proofs reduce to case-by-case expansion of the match in entityConforms, not real-number arithmetic.
The striking fact: Kimi K2.5's other two runs on this task scored 0.000. The 5-tool-call episode that produced 0.958 is not reproducible on demand. This is the texture of formal-methods work: proofs either click or they don't, and cross-episode variance inside a single model is larger than the gap between median-performing models. The structural insight (the entityConforms predicate is itself the proof skeleton) either lands in the first read-through or it doesn't land at all. Kimi's other runs spent 8 and similar tool calls re-reading the file, attempting by simp or exact trivial, never finding the structural decomposition, and scoring zero.
No other model discharged more than a single theorem. Claude Sonnet 4.6 ran into the same Lean-without-Mathlib wall documented elsewhere in Athanor's benchmark family: it reached for Real and lemma names that only exist in Mathlib, walked the Lean toolchain directory looking for missing types, and eventually submitted a file where zero theorems compiled (three runs at 0.05, 0.00, 0.00). Gemini 3.1 Pro scored 0.000 across all three runs, it wrote plausible-looking proof skeletons that failed to compile. Gemini 2.5 Flash: flat 0.000 across three runs. Mistral Large 3 produced a single run at 0.05 and two zeros.
| Model | Best of 3 | Mean | Theorems proved (best run) |
|---|---|---|---|
| Claude Sonnet 4.6 | 0.050 | 0.017 | 0 of 7 |
| Gemini 2.5 Flash | 0.000 | 0.000 | 0 of 7 |
| Gemini 3.1 Pro | 0.000 | 0.000 | 0 of 7 |
| Kimi K2.5 | 0.958 | 0.319 | 4 of 7 |
| Mistral Large 3 | 0.050 | 0.017 | 0 of 7 |
-- The empty store conforms to every schema. -- Proof body withheld from this public report -- to avoid contaminating the task as a benchmark. theorem conformance_empty_store (schema : Schema) : entityConforms schema (fun _ _ => none) := by, proof withheld
// Corresponding Cedar schema sketch // (illustrative; tasks author their own). entity User { id : String, role : String, kycVerified : Bool }; entity Order { placedByUserId : String, amountUsd : Long, status : String }; action "place_order" appliesTo { principal : [User], resource : [Order] };
adversarial_policy_audit, the highest-performing model (Mistral Large 3 at a flat 0.050 mean) was not the smartest, it was the one that reliably wrote its findings to disk. Every other model produced crisp analysis in the chat transcript and zero deliverable in the file the scorer reads. In a production review this distinction is the review: an auditor who never files a ticket has not audited.fix_photo_sharing_bugs has a 10/15 basin every model falls into after flipping the one obvious negation. Only one model in one run (Sonnet, run 1) worked through the cross-policy interaction in Policy 4 and reached 14/15. The rest flipped the one bug they could see and stopped. Authorization policies that pass the tests the author wrote still admit the requests the author didn't imagine.prove_schema_validation_soundness is a real score, it came from reading the conformance predicate's structure and recognizing that the proofs decompose through the same match the definition uses. It is also a one-of-three-runs score; Kimi's other two attempts on the same task returned 0.000. Cross-episode variance within a model is larger on Lean tasks than the gap between median-performing models. This is what a genuinely hard formal-methods benchmark looks like.adversarial_policy_audit and fix_photo_sharing_bugs are property-testing tasks (the Cedar CLI evaluates hidden requests). prove_schema_validation_soundness is an algebraic-proof task (the Lean 4 compiler is the oracle). Across 300 episodes nothing is graded on model-to-model comparison or LLM judge; every score is the output of Cedar's native evaluator or Lean's native typechecker. That is the correctness layer, not the vibes layer.For authorization teams: this environment provides verifiable feedback on whether a model understands Cedar semantics well enough to be trusted with a policy review, not just to autocomplete one. For formal-methods researchers: all 20 tasks run deterministically in Docker from fixed seeds, with no GPU and no network; results are fully reproducible from the runs/ JSON on disk, and scoring is open-source while the hidden test suites remain private to preserve the benchmark.