26 SystemVerilog and NuSMV repair and assertion-synthesis tasks scored by EBMC[1], five frontier models, 130 episodes under the anti-cheat regime. Model-checked properties either prove or they don't. No partial credit for plausible-looking code that fails under assertion.
A hardware engineer evaluating code-generating models has a problem that simulation cannot solve: a bounded-stimulus testbench that shows no failures does not mean the design is correct. Model checking closes that gap by proving properties over every state reachable within the configured check bound. This benchmark takes 26 classic hardware-verification bug classes, hands the model broken RTL and a fixed property set, and accepts only submissions that EBMC proves against every property.
Tasks cover arbiter fairness and deadlock, FIFO pointer and credit protocols, cache-coherence and memory-controller scheduling, TLB consistency, Booth multiplier correctness, restoring divider invariants, interrupt-timer ordering, LFSR periodicity, UART framing, I2C and SPI protocol conformance, branch-predictor recovery, a register-file forwarding network, and seven liveness-assertion synthesis tasks that ask the model to write the property set itself. Five tasks additionally pair the SystemVerilog repair with a Lean 4[2] theorem obligation so the algebraic layer can reject a submission that passes bounded model-checking but violates a universally-quantified property.
Across 130 episodes, Kimi K2.5 averaged 0.747 and the worst-performing model averaged 0.269. Two of the 26 tasks sit at a uniform zero across every frontier model. Twenty-three admit at least one perfect score. No model wins every task: the best-scoring model on one task is often at zero on the next.
| Metric | Value |
|---|---|
| Total tasks | 26 (19 repair, 7 assertion-synthesis) |
| Models evaluated (anti-cheat regime, schema_version=1) | 5 frontier models |
| Total episodes | 130 (5 models × 26 tasks, N=1) |
| Tasks with 0.000 across all models | 2 of 26 (fix_spi_slave, write_assertions_txnmon) |
| Tasks with at least one perfect score (1.000) | 23 of 26 |
| Tasks pairing SV repair with a Lean 4 theorem obligation | 5 (fix_arb_lock, fix_booth_mul, fix_cache_ctrl, fix_restoring_div, fix_tlb_ctrl) |
| Overall best-model mean | 0.747 (Kimi K2.5) |
Per-model means across the 26-task benchmark:
| Model | Mean | Tasks at 1.000 | Tasks at 0.000 |
|---|---|---|---|
| Kimi K2.5 | 0.747 | 17 | 5 |
| Claude Sonnet 4.6 | 0.692 | 17 | 8 |
| Gemini 2.5 Flash | 0.538 | 12 | 10 |
| Gemini 3 Flash Preview | 0.538 | 12 | 11 |
| Mistral Large 3 | 0.269 | 5 | 19 |
Each task runs deterministically inside Docker with no network. Repair tasks give the model a broken SystemVerilog or NuSMV module and a fixed property block; scoring runs EBMC against the submitted module and accepts only when every property is proved within the configured bound. Assertion-synthesis tasks reverse the direction: the model receives a correct RTL module and is asked to produce the property set whose semantics the module satisfies, scored by matching assertion signatures and then re-proving with EBMC. Both gates are hard-thresholded. An anti-cheat layer catches proof-bypass patterns before a score is recorded, including disabled assertions, stub Lean files, and interface tampering against the hidden reference.
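As a concrete sketch of what the repair gate does, here is a minimal harness in Python. It is illustrative only: the `--bound` flag follows common EBMC usage, and the verdict strings are assumptions about EBMC's output, not the benchmark's actual scoring code.

```python
import subprocess

def score_repair(module_path: str, bound: int) -> float:
    """Hard-threshold gate: 1.0 only if EBMC proves every property.

    Sketch only. Flag names and verdict strings follow common EBMC
    usage and may differ from the benchmark's real harness.
    """
    result = subprocess.run(
        ["ebmc", module_path, "--bound", str(bound)],
        capture_output=True, text=True, timeout=300,
    )
    # One verdict line per property; a single refuted or inconclusive
    # property zeroes the whole submission -- no partial credit.
    lines = result.stdout.splitlines()
    proved = [l for l in lines if "PROVED" in l]
    refuted = [l for l in lines if "REFUTED" in l]
    if result.returncode != 0 or refuted or not proved:
        return 0.0
    return 1.0
```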
Five of the 26 repair tasks additionally pair the SystemVerilog module with a Lean 4 theorem file. The algebraic layer is present because bounded model checking does not discharge every correctness property at every bound; a universally-quantified statement about a multiplier or an arbiter's long-run fairness needs a proof the model checker cannot produce. On those five tasks the agent must close the Lean theorems without any of the disallowed tactics (sorry, admit, axiom, and the decide family).
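The tactic ban is checkable with a surface scan before any proof is replayed. A sketch of that check, using the ban list from this section (the benchmark's actual anti-cheat code is not published):

```python
import re

# Disallowed escape hatches from the task description. Whole-word
# matching keeps `decide` from firing inside `native_decide` (listed
# separately) or inside unrelated identifiers.
BANNED = ["sorry", "admit", "axiom", "native_decide", "decide"]

def lean_submission_clean(source: str) -> bool:
    # Drop line comments so prose mentions of `sorry` are not flagged;
    # a real checker would also strip block comments and string literals.
    stripped = re.sub(r"--.*", "", source)
    return not any(re.search(rf"\b{t}\b", stripped) for t in BANNED)
```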
Repair a broken SPI slave receiver so it conforms to mode 0 framing under back-to-back transactions. The broken module disagrees with a known-good reference on four distinct property checks spanning data latching, MISO drive ordering, and the exit-from-reset invariant. Every model fails.
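For reference, mode 0 means CPOL=0/CPHA=0: the receiver samples MOSI on each rising SCK edge and the slave drives MISO on the falling edge. A behavioural toy of the receive side (an illustration of the framing, not the task's RTL):

```python
def spi_mode0_receive(edges, mosi_bits, width=8):
    """Toy mode-0 receiver: sample on rising SCK edges, latch a word
    every `width` samples. Back-to-back transactions mean the next
    word's first bit arrives on the very next rising edge."""
    shift, words = [], []
    for edge, bit in zip(edges, mosi_bits):
        if edge == "rise":
            shift.append(bit)
            if len(shift) == width:  # word boundary: latch and clear
                words.append(shift)
                shift = []
    return words
```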
| Model | Score | EBMC proved | Tool calls | Tokens | Wall time |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.000 | 1 of 4 | 8 | 77,390 | 189 s |
| Gemini 2.5 Flash | 0.000 | 2 of 4 | 40 | 300,783 | 145 s |
| Gemini 3 Flash Preview | 0.000 | 1 of 4 | 50 | 715,813 | 118 s |
| Kimi K2.5 | 0.000 | 1 of 4 | 14 | 126,731 | 153 s |
| Mistral Large 3 | 0.000 | 2 of 4 | 52 | 717,145 | 784 s |
The best result is a partial pass of two of the four EBMC obligations (Gemini 2.5 Flash and Mistral Large 3). No model found an edit that proved all four. Across 130 episodes this task and write_assertions_txnmon are the only two that sit at a uniform zero, which puts an informative floor on the benchmark: a score distribution that bottoms out above zero everywhere cannot separate models at the hard end; this one does.
Repair a round-robin arbiter so it satisfies seven safety properties and the liveness theorems stated in ArbLockLiveness.lean. The broken module permits a pathological grant pattern that starves the lowest-priority requester under a specific combination of request stalls. Two of five models pass; three drop to zero.
| Model | Score | Tool calls | Tokens | Wall time |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.000 | 6 | 54,075 | 32 s |
| Gemini 2.5 Flash | 1.000 | 6 | 40,056 | 42 s |
| Gemini 3 Flash Preview | 0.000 | 7 | 51,292 | 13 s |
| Kimi K2.5 | 1.000 | 8 | 45,016 | 22 s |
| Mistral Large 3 | 0.000 | 13 | 111,448 | 52 s |
The distribution is bimodal. The models that solved the task finished inside 50 seconds and under 50K tokens; the models that failed spent comparable or greater budget, so the difference is not compute. It is reading the interaction between the round-robin pointer update and the request-masking logic. Agents that patched the grant pattern without touching the pointer advance shipped a module whose bounded traces look fine and whose liveness obligations fail.
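A toy model of that failure mode (illustrative Python, not the task's RTL): patch the grant logic but leave the pointer stalled, and the low-priority requester starves while every bounded window of the trace still satisfies mutual exclusion.

```python
def arbiter_trace(steps: int, advance_pointer: bool):
    """Toy two-requester round-robin arbiter; both always request."""
    ptr, grants = 0, []
    for _ in range(steps):
        grants.append(ptr)      # grant the requester the pointer favours
        if advance_pointer:
            ptr ^= 1            # correct: move past the granted requester
        # buggy variant: pointer never advances, requester 1 never wins
    return grants

print(arbiter_trace(10, advance_pointer=False))  # [0, 0, ..., 0]: starvation
print(arbiter_trace(10, advance_pointer=True))   # [0, 1, 0, 1, ...]
# Every finite prefix of the buggy trace is mutex-safe (one grant per
# cycle), so a bounded safety check passes; only an unbounded liveness
# claim rejects it.
```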
The Lean layer matters on this task. The bounded safety properties EBMC checks alone can be satisfied by an arbiter that livelocks at depth greater than the check bound. The ArbLockLiveness.lean obligation quantifies over request sequences of unbounded length; a submission that passes EBMC on the safety block but cannot discharge the Lean theorem does not pass the task. The anti-cheat layer rejects Lean submissions shipped with sorry, admit, or the decide family.
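Schematically, the obligation has this shape; the names (`run`, `active`, `granted`) and the state type are hypothetical stand-ins, since ArbLockLiveness.lean itself is not reproduced here.

```lean
-- Hypothetical shape of an unbounded liveness obligation. The point is
-- the quantifier structure: no check bound appears anywhere, and the
-- grant index `n` ranges over all of Nat.
def run {State Req : Type} (step : Req → State → State)
    (s₀ : State) (req : Nat → Req) : Nat → State
  | 0     => s₀
  | n + 1 => step (req n) (run step s₀ req n)

def EventuallyGranted {State Req Id : Type}
    (step : Req → State → State)
    (active : Req → Id → Prop) (granted : State → Id → Prop)
    (s₀ : State) (req : Nat → Req) (i : Id) : Prop :=
  (∀ n, active (req n) i) → ∃ n, granted (run step s₀ req n) i
```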
A signed Booth multiplier that produces off-by-one results on a subset of operand pairs. Correctness is specified by a universally-quantified Lean theorem in BoothMulCorrectness.lean and a bounded refinement of the same statement in EBMC. One model solves the task cleanly.
| Model | Score | Tool calls | Tokens | Wall time |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.000 | 7 | 62,616 | 59 s |
| Gemini 2.5 Flash | 0.000 | 6 | 43,877 | 59 s |
| Gemini 3 Flash Preview | 0.000 | 50 | 795,854 | 127 s |
| Kimi K2.5 | 1.000 | 10 | 64,297 | 67 s |
| Mistral Large 3 | 0.000 | 8 | 47,978 | 31 s |
The extremes of the tool-call distribution are both at zero. Gemini 3 Flash Preview burned 796K tokens and 50 tool calls without a passing submission. Kimi K2.5 solved the task in 10 tool calls and 64K tokens. The other three models produced plausible-looking edits in around 60 seconds and never made contact with the bug: the broken module's partial-product-select logic has a sign-extension error specific to the most-negative multiplicand, and EBMC's refutation output points at the exact register pair. The agents that failed did not read the counter-example trace. The agent that passed edited the sign-extension in one pass and re-ran EBMC without further exploration.
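The bug class is easy to state concretely: in n-bit two's complement the most-negative value has no n-bit negation, so a Booth partial product that negates the multiplicand needs an extra sign-extension bit. A quick check of the wrap (an illustration of the class, not the task's exact RTL):

```python
def as_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value

n = 8
most_neg = -(1 << (n - 1))            # -128
print(as_signed(-most_neg, n))        # -128: the negation wraps in n bits
print(as_signed(-most_neg, n + 1))    # +128: one extra sign bit fixes it
```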
The numbers above are a single-model, single-shot view. A production pipeline adds two stages around that single-model attempt, and together they change the outcome shape entirely: it refuses bad specs at the door, so no generation model is asked to work on submissions that would have failed anyway, and it runs a multi-agent fleet on the specs it accepts, so variance on hard tasks collapses.
Before a generation model is invoked, the pipeline runs a rule-based spec validator. It checks five things: required config fields, parseable SV or NuSMV properties, property count against what the config declares, undeclared signal references, and contradictory property pairs. If any check fails, the pipeline exits with a rejection verdict, writes a spec_rejected artifact with the reason and a suggested fix, and returns control to the customer. No generation model is ever called.
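A sketch of those gates in order. Field names such as `properties_expected` are illustrative; the validator's real schema is not shown in this report, and the last two gates need a real parse tree, so they are elided.

```python
import re

REQUIRED_FIELDS = {"category", "difficulty", "properties_expected"}
PROP_RE = re.compile(r"assert\s+(?:property|final)|LTLSPEC|CTLSPEC|INVARSPEC")

def validate_spec(config: dict, source: str) -> dict:
    # Gate 1: required config fields.
    missing = REQUIRED_FIELDS - config.keys()
    if missing:
        return {"ok": False,
                "reason": f"task config missing fields: {sorted(missing)}"}
    # Gates 2 and 3: properties parse, and the count matches the config.
    found = len(PROP_RE.findall(source))
    if found != config["properties_expected"]:
        return {"ok": False,
                "reason": f"{found} properties extracted, "
                          f"config expected {config['properties_expected']}"}
    # Gates 4 and 5 (undeclared signal references, contradictory
    # property pairs) require parsing the property bodies; elided here.
    return {"ok": True}
```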
Three intentionally malformed specs ran through the validator on 2026-04-18. Every rejection is rule-based and completes in milliseconds, without invoking a generation model.
| Case | Failure class | Validator verdict |
|---|---|---|
| Missing required config fields | schema gate | ok: false, reason: task config missing category and difficulty fields |
| Unparseable SystemVerilog property | parse gate | ok: false, reason: zero properties extracted, parser found no assert property, assert final, or NuSMV LTLSPEC/CTLSPEC/INVARSPEC; config expected 2 |
| Contradictory property pair | semantic gate | ok: false, reason: assertions p_mutex and p_not_mutex differ only by negation of (grant0 && grant1) |
Combined validator wall time across the three cases: under 200 ms. Without this gate, each of the three bad specs would have consumed a full multi-agent fleet run attempting to formalize, patch, or prove something that is not a valid spec to begin with. The schema-gate case would have looked like a hung orchestrator. The parse-gate case would have surfaced as the builder hallucinating properties it thought the file "should have" had. The contradiction case would have produced a proof attempt for a logically vacuous conjunction, often surfacing with misleading partial-progress credit.
The rejection artifact hands the customer exactly the change needed to resubmit: add the missing config fields, fix the SystemVerilog syntax, or remove one side of the contradiction. The gate is cheap by design; the fleet's model budget is reserved for specs that are ready to work on.
fix_arb_lock is one of the five Lean-paired tasks. In the single-model single-shot numbers above, two of five models pass and three fail. The fleet chains a builder race, property generation, a mutation sweep, Lean refinement, an EBMC close, and a Lean-close gate, then ships a bundle containing the winning SystemVerilog, the EBMC verdicts, and the Lean proof obligations.
A fleet run on fix_arb_lock completed on 2026-04-18. End-to-end numbers from the workdir:
| Phase | Outcome | Wall time or elapsed |
|---|---|---|
| Phase -2, spec validator | ok: true, 7 properties parsed, 5 gates clean | under 200 ms |
| Phase 0, builder race | 3 parallel builders; each produced an EBMC-verifying SystemVerilog with 7 of 7 properties proved; winning builder selected | majority of wall budget |
| Phase 3, Lean refinement | converged in one iteration | seconds |
| Phase 4, EBMC close | exit_code: 0, 7 of 7 properties proved, 0 refuted | 0.278 s |
| Phase 4.5, Lean-close gate | 5 theorems authored; 4 proved cleanly, 1 (fair_alternation) found to be false as stated and replaced with a proven corrected statement (see below); all proofs use only allowed tactics, and none of the banned tactics (sorry, admit, axiom, native_decide, decide, omega, linarith, simp, simp_all, tauto, by_contra, ring) appear in the closing proofs; final axioms: {propext, Classical.choice, Quot.sound} | seconds inline; external proof search for the corrected theorem |
| Phase 6, bundle | tarball written; ships the winning SystemVerilog, the 7 EBMC verdicts (p_reset, p_mutex, p_lock_hold, p_lock_end, p_burst_load, p_rr_turn, p_starve_pri), the Lean proof file, and a reproducible ebmc reverification recipe | seconds |
| Total wall time | | about 5 minutes |
The EBMC half of the task is discharged. The winning design has all seven declared properties proved against the task's check bound in 0.278 seconds of model-check time. The Lean half is also discharged, with one notable caveat described in the next subsection.
The Phase 0 builder race authored five Lean theorems as the formal specification of the arbiter's long-horizon behaviour. Four of them are true statements about the design and Phase 4.5 closed them without any banned tactic. The fifth, fair_alternation, claimed the arbiter's turn bit alternates with step-count parity under symmetric contention with no locking. External proof search attempted the theorem, failed, and returned a counterexample at k = 2: the state after two steps is a fixed point under the step function with both requesters active, so the turn bit stops alternating and stays at false rather than flipping with k's parity. The original theorem body was commented out with an inline note explaining why. A corrected version was then stated and proved: after 2k steps under symmetric contention without locking, the turn is always false.
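A minimal self-contained toy of the corrected shape, collapsing the arbiter state to the turn bit alone; the benchmark's actual State and step function are richer, and its corrected theorem quantifies over 2k steps specifically.

```lean
-- Toy: under symmetric contention with no locking, the step function
-- reaches a fixed point and holds the turn at false, which is exactly
-- why "alternates with step parity" fails as a claim.
def step (_turn : Bool) : Bool := false   -- fixed point reached immediately

def iter : Nat → Bool → Bool
  | 0,     b => b
  | n + 1, b => step (iter n b)

-- Corrected claim in toy form: after any positive number of steps
-- (in particular after 2 * k of them), the turn is false.
theorem turn_false (n : Nat) (b : Bool) : iter (n + 1) b = false := rfl
```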
What this surfaced is a bug in the specification, not in the RTL. The SystemVerilog arbiter is behaving as its implementation describes; every EBMC assertion it was asked to check proved. The authored long-horizon fairness claim was wrong as written. None of the seven EBMC assertions (p_reset, p_mutex, p_lock_hold, p_lock_end, p_burst_load, p_rr_turn, p_starve_pri) encoded an equivalent multi-step fairness claim; p_rr_turn is a per-transition conditional, not a fixed-point statement over 2k steps. Without the Lean layer, the misstatement would have shipped as part of the specification documentation for the verified design.
This is the argument for pairing model-checking with theorem-proving rather than running either alone. EBMC discharges the transition-level safety assertions quickly and completely, and is the right tool for that class of property. Lean formalises the long-horizon behavioural claims that do not fit bounded model-checking, and at the level of quantification where a misstatement is a readable English sentence about the design. Neither layer is redundant on fix_arb_lock: they catch different classes of defect.
Against the single-model single-shot table on fix_arb_lock above (two of five pass, three at zero), the fleet delivers the EBMC safety block at 7 of 7 proved and a fully closed Lean proof file against the authored specification, with one theorem corrected mid-flight. The full ten-phase run completes in about 5 minutes of wall time. No model in the single-model table was asked to produce a Lean proof at all; the fleet adds that obligation, runs it to ground, and ships the evidence with the design.
One run on one task. The argument the fleet makes here is not that it solves a task single models cannot; two single models did pass the EBMC half. The argument is that the fleet ships the EBMC verdict and a machine-checked specification in the same bundle, and that the specification layer catches misstatements the bounded-model-checking layer is structurally unable to catch.
fix_arb_lock admits modules whose EBMC safety block proves at depth 10 and whose liveness theorems still fail under the Lean 4 obligation. Pairing SV repair with a universally-quantified proof rejects the class of submissions that are correct-up-to-the-check-bound and incorrect in general. Five of the 26 tasks use this pairing. The Lean layer changes the outcome on all five.
On fix_booth_mul, the only passing model used 10 tool calls. The model that used 50 tool calls and 796K tokens scored zero. On fix_arb_lock every passing submission finished inside 50 seconds and under 50K tokens. Exploration burn is a negative signal, not a positive one. Agents that fail to read the refutation trace cannot recover by spending more compute.
Two of the 26 tasks sit at a uniform zero across every model (fix_spi_slave, write_assertions_txnmon). Twenty-three admit at least one perfect score. The distribution is neither uniformly easy nor uniformly hard, which is what score separation needs in order to track real capability. Across frontier models the spread between best and worst mean is 0.478.
For verification teams: this benchmark provides verifiable feedback on whether a model understands SystemVerilog semantics well enough to discharge universally-quantified correctness properties, not just to autocomplete a repair that happens to pass an acceptance trace. The 26 tasks run deterministically in Docker with no network and no GPU, reproducible from the run JSONs on disk. Scoring is open-source, and the hidden test modules remain private, so the benchmark cannot be gamed by overfitting to its own reference implementations.