26 SystemVerilog and NuSMV repair and assertion-synthesis tasks scored by EBMC[1], five frontier models, 130 episodes under the anti-cheat regime. Model-checked properties either prove or they don't. No partial credit for plausible-looking code that fails under assertion.
A hardware engineer evaluating code-generating models has a problem that simulation cannot solve: a bounded-stimulus testbench that shows no failures does not mean the design is correct. Model checking closes that gap by proving properties over every state reachable within the configured check bound. This benchmark takes 26 classic hardware-verification bug classes, hands the model broken RTL and a fixed property set, and accepts only submissions that EBMC proves against every property.
Tasks cover arbiter fairness and deadlock, FIFO pointer and credit protocols, cache-coherence and memory-controller scheduling, TLB consistency, Booth multiplier correctness, restoring divider invariants, interrupt-timer ordering, LFSR periodicity, UART framing, I2C and SPI protocol conformance, branch-predictor recovery, a register-file forwarding network, and seven liveness-assertion synthesis tasks that ask the model to write the property set itself. Five tasks additionally pair the SystemVerilog repair with a Lean 4[2] theorem obligation so the algebraic layer can reject a submission that passes bounded model-checking but violates a universally-quantified property.
Across 130 episodes, Kimi K2.5 averaged 0.747 and the worst-performing model averaged 0.269. Two of the 26 tasks sit at a uniform zero across every frontier model. Twenty-three admit at least one perfect score. No model wins every task: the best-scoring model on one task is often at zero on the next.
| Metric | Value |
|---|---|
| Total tasks | 26 (19 repair, 7 assertion-synthesis) |
| Models evaluated (anti-cheat regime, schema_version=1) | 5 frontier models |
| Total episodes | 130 (5 models × 26 tasks, N=1) |
| Tasks with 0.000 across all models | 2 of 26 (fix_spi_slave, write_assertions_txnmon) |
| Tasks with at least one perfect score (1.000) | 23 of 26 |
| Tasks pairing SV repair with a Lean 4 theorem obligation | 5 (fix_arb_lock, fix_booth_mul, fix_cache_ctrl, fix_restoring_div, fix_tlb_ctrl) |
| Overall best-model mean | 0.747 (Kimi K2.5) |
Per-model means across the 26-task benchmark:
| Model | Mean | Tasks at 1.000 | Tasks at 0.000 |
|---|---|---|---|
| Kimi K2.5 | 0.747 | 17 | 5 |
| Claude Sonnet 4.6 | 0.692 | 17 | 8 |
| Gemini 2.5 Flash | 0.538 | 12 | 10 |
| Gemini 3 Flash Preview | 0.538 | 12 | 11 |
| Mistral Large 3 | 0.269 | 5 | 19 |
Each task runs deterministically inside Docker with no network. Repair tasks give the model a broken SystemVerilog or NuSMV module and a fixed property block; scoring runs EBMC against the submitted module and accepts only when every property is proved within the configured bound. Assertion-synthesis tasks reverse the direction: the model receives a correct RTL module and is asked to produce the property set whose semantics the module satisfies, scored by matching assertion signatures and then re-proving with EBMC. Both gates are hard-thresholded. An anti-cheat layer catches proof-bypass patterns before a score is recorded, including disabled assertions, stub Lean files, and interface tampering against the hidden reference.
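As a concrete sketch of what the repair gate does, here is a minimal harness in Python. It is illustrative only: the `--bound` flag follows common EBMC usage, and the verdict strings are assumptions about EBMC's output, not the benchmark's actual scoring code.

```python
import subprocess

def score_repair(module_path: str, bound: int) -> float:
    """Hard-threshold gate: 1.0 only if EBMC proves every property.

    Sketch only. Flag names and verdict strings follow common EBMC
    usage and may differ from the benchmark's real harness.
    """
    result = subprocess.run(
        ["ebmc", module_path, "--bound", str(bound)],
        capture_output=True, text=True, timeout=300,
    )
    # One verdict line per property; a single refuted or inconclusive
    # property zeroes the whole submission -- no partial credit.
    lines = result.stdout.splitlines()
    proved = [l for l in lines if "PROVED" in l]
    refuted = [l for l in lines if "REFUTED" in l]
    if result.returncode != 0 or refuted or not proved:
        return 0.0
    return 1.0
```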
Five of the 26 repair tasks additionally pair the SystemVerilog module with a Lean 4 theorem file. The algebraic layer is present because bounded model checking does not discharge every correctness property at every bound; a universally-quantified statement about a multiplier or an arbiter's long-run fairness needs a proof the model checker cannot produce. On those five tasks the agent must close the Lean theorems without any of the disallowed tactics (sorry, admit, axiom, and the decide family).
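The tactic ban is checkable with a surface scan before any proof is replayed. A sketch of that check, using the ban list from this section (the benchmark's actual anti-cheat code is not published):

```python
import re

# Disallowed escape hatches from the task description. Whole-word
# matching keeps `decide` from firing inside `native_decide` (listed
# separately) or inside unrelated identifiers.
BANNED = ["sorry", "admit", "axiom", "native_decide", "decide"]

def lean_submission_clean(source: str) -> bool:
    # Drop line comments so prose mentions of `sorry` are not flagged;
    # a real checker would also strip block comments and string literals.
    stripped = re.sub(r"--.*", "", source)
    return not any(re.search(rf"\b{t}\b", stripped) for t in BANNED)
```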
Repair a broken SPI slave receiver so it conforms to mode 0 framing under back-to-back transactions. The broken module disagrees with a known-good reference on four distinct property checks spanning data latching, MISO drive ordering, and the exit-from-reset invariant. Every model fails.
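For reference, mode 0 means CPOL=0/CPHA=0: the receiver samples MOSI on each rising SCK edge and the slave drives MISO on the falling edge. A behavioural toy of the receive side (an illustration of the framing, not the task's RTL):

```python
def spi_mode0_receive(edges, mosi_bits, width=8):
    """Toy mode-0 receiver: sample on rising SCK edges, latch a word
    every `width` samples. Back-to-back transactions mean the next
    word's first bit arrives on the very next rising edge."""
    shift, words = [], []
    for edge, bit in zip(edges, mosi_bits):
        if edge == "rise":
            shift.append(bit)
            if len(shift) == width:  # word boundary: latch and clear
                words.append(shift)
                shift = []
    return words
```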
| Model | Score | EBMC proved | Tool calls | Tokens | Wall time |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.000 | 1 of 4 | 8 | 77,390 | 189 s |
| Gemini 2.5 Flash | 0.000 | 2 of 4 | 40 | 300,783 | 145 s |
| Gemini 3 Flash Preview | 0.000 | 1 of 4 | 50 | 715,813 | 118 s |
| Kimi K2.5 | 0.000 | 1 of 4 | 14 | 126,731 | 153 s |
| Mistral Large 3 | 0.000 | 2 of 4 | 52 | 717,145 | 784 s |
The best result is a partial pass of two of the four EBMC obligations (Gemini 2.5 Flash and Mistral Large 3). No model found an edit that proved all four. Across 130 episodes this task and write_assertions_txnmon are the only two that sit at a uniform zero, which puts an informative floor on the benchmark: a score distribution that bottoms out above zero everywhere cannot separate models at the hard end; this one does.
Repair a round-robin arbiter so it satisfies seven safety properties and the liveness theorems stated in ArbLockLiveness.lean. The broken module permits a pathological grant pattern that starves the lowest-priority requester under a specific combination of request stalls. Two of five models pass; three drop to zero.
| Model | Score | Tool calls | Tokens | Wall time |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.000 | 6 | 54,075 | 32 s |
| Gemini 2.5 Flash | 1.000 | 6 | 40,056 | 42 s |
| Gemini 3 Flash Preview | 0.000 | 7 | 51,292 | 13 s |
| Kimi K2.5 | 1.000 | 8 | 45,016 | 22 s |
| Mistral Large 3 | 0.000 | 13 | 111,448 | 52 s |
The distribution is bimodal. The models that solved the task finished inside 50 seconds and under 50K tokens; the models that failed spent comparable or greater budget, so the difference is not compute. It is reading the interaction between the round-robin pointer update and the request-masking logic. Agents that patched the grant pattern without touching the pointer advance shipped a module whose bounded traces look fine and whose liveness obligations fail.
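A toy model of that failure mode (illustrative Python, not the task's RTL): patch the grant logic but leave the pointer stalled, and the low-priority requester starves while every bounded window of the trace still satisfies mutual exclusion.

```python
def arbiter_trace(steps: int, advance_pointer: bool):
    """Toy two-requester round-robin arbiter; both always request."""
    ptr, grants = 0, []
    for _ in range(steps):
        grants.append(ptr)      # grant the requester the pointer favours
        if advance_pointer:
            ptr ^= 1            # correct: move past the granted requester
        # buggy variant: pointer never advances, requester 1 never wins
    return grants

print(arbiter_trace(10, advance_pointer=False))  # [0, 0, ..., 0]: starvation
print(arbiter_trace(10, advance_pointer=True))   # [0, 1, 0, 1, ...]
# Every finite prefix of the buggy trace is mutex-safe (one grant per
# cycle), so a bounded safety check passes; only an unbounded liveness
# claim rejects it.
```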
The Lean layer matters on this task. The bounded safety properties EBMC checks alone can be satisfied by an arbiter that livelocks at depth greater than the check bound. The ArbLockLiveness.lean obligation quantifies over request sequences of unbounded length; a submission that passes EBMC on the safety block but cannot discharge the Lean theorem does not pass the task. The anti-cheat layer rejects Lean submissions shipped with sorry, admit, or the decide family.
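Schematically, the obligation has this shape; the names (`run`, `active`, `granted`) and the state type are hypothetical stand-ins, since ArbLockLiveness.lean itself is not reproduced here.

```lean
-- Hypothetical shape of an unbounded liveness obligation. The point is
-- the quantifier structure: no check bound appears anywhere, and the
-- grant index `n` ranges over all of Nat.
def run {State Req : Type} (step : Req → State → State)
    (s₀ : State) (req : Nat → Req) : Nat → State
  | 0     => s₀
  | n + 1 => step (req n) (run step s₀ req n)

def EventuallyGranted {State Req Id : Type}
    (step : Req → State → State)
    (active : Req → Id → Prop) (granted : State → Id → Prop)
    (s₀ : State) (req : Nat → Req) (i : Id) : Prop :=
  (∀ n, active (req n) i) → ∃ n, granted (run step s₀ req n) i
```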
A signed Booth multiplier that produces off-by-one results on a subset of operand pairs. Correctness is specified by a universally-quantified Lean theorem in BoothMulCorrectness.lean and a bounded refinement of the same statement in EBMC. One model solves the task cleanly.
| Model | Score | Tool calls | Tokens | Wall time |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 0.000 | 7 | 62,616 | 59 s |
| Gemini 2.5 Flash | 0.000 | 6 | 43,877 | 59 s |
| Gemini 3 Flash Preview | 0.000 | 50 | 795,854 | 127 s |
| Kimi K2.5 | 1.000 | 10 | 64,297 | 67 s |
| Mistral Large 3 | 0.000 | 8 | 47,978 | 31 s |
The extremes of the tool-call distribution are both at zero. Gemini 3 Flash Preview burned 796K tokens and 50 tool calls without a passing submission. Kimi K2.5 solved the task in 10 tool calls and 64K tokens. The other three models produced plausible-looking edits in around 60 seconds and never made contact with the bug: the broken module's partial-product-select logic has a sign-extension error specific to the most-negative multiplicand, and EBMC's refutation output points at the exact register pair. The agents that failed did not read the counter-example trace. The agent that passed edited the sign-extension in one pass and re-ran EBMC without further exploration.
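The bug class is easy to state concretely: in n-bit two's complement the most-negative value has no n-bit negation, so a Booth partial product that negates the multiplicand needs an extra sign-extension bit. A quick check of the wrap (an illustration of the class, not the task's exact RTL):

```python
def as_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value

n = 8
most_neg = -(1 << (n - 1))            # -128
print(as_signed(-most_neg, n))        # -128: the negation wraps in n bits
print(as_signed(-most_neg, n + 1))    # +128: one extra sign bit fixes it
```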
The numbers above are a single-model, single-shot view. A production pipeline adds two stages around that single-model attempt, and together they change the outcome shape entirely: it refuses bad specs at the door, so no generation model is asked to work on submissions that would have failed anyway, and it runs a multi-agent fleet on the specs it accepts, so variance on hard tasks collapses.
Before a generation model is invoked, the pipeline runs a rule-based spec validator. It checks five things: required config fields, parseable SV or NuSMV properties, property count against what the config declares, undeclared signal references, and contradictory property pairs. If any check fails, the pipeline exits with a rejection verdict, writes a spec_rejected artifact with the reason and a suggested fix, and returns control to the customer. No generation model is ever called.
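A sketch of those gates in order. Field names such as `properties_expected` are illustrative; the validator's real schema is not shown in this report, and the last two gates need a real parse tree, so they are elided.

```python
import re

REQUIRED_FIELDS = {"category", "difficulty", "properties_expected"}
PROP_RE = re.compile(r"assert\s+(?:property|final)|LTLSPEC|CTLSPEC|INVARSPEC")

def validate_spec(config: dict, source: str) -> dict:
    # Gate 1: required config fields.
    missing = REQUIRED_FIELDS - config.keys()
    if missing:
        return {"ok": False,
                "reason": f"task config missing fields: {sorted(missing)}"}
    # Gates 2 and 3: properties parse, and the count matches the config.
    found = len(PROP_RE.findall(source))
    if found != config["properties_expected"]:
        return {"ok": False,
                "reason": f"{found} properties extracted, "
                          f"config expected {config['properties_expected']}"}
    # Gates 4 and 5 (undeclared signal references, contradictory
    # property pairs) require parsing the property bodies; elided here.
    return {"ok": True}
```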
Three intentionally malformed specs ran through the validator on 2026-04-18. Every rejection is rule-based and completes in milliseconds, without invoking a generation model.
| Case | Failure class | Validator verdict |
|---|---|---|
| Missing required config fields | schema gate | ok: false, reason: task config missing category and difficulty fields |
| Unparseable SystemVerilog property | parse gate | ok: false, reason: zero properties extracted, parser found no assert property, assert final, or NuSMV LTLSPEC/CTLSPEC/INVARSPEC; config expected 2 |
| Contradictory property pair | semantic gate | ok: false, reason: assertions p_mutex and p_not_mutex differ only by negation of (grant0 && grant1) |
Combined validator wall time across the three cases: under 200 ms. Without this gate, each of the three bad specs would have consumed a full multi-agent fleet run attempting to formalize, patch, or prove something that is not a valid spec to begin with. The schema-gate case would have looked like a hung orchestrator. The parse-gate case would have surfaced as the builder hallucinating properties it thought the file "should have" had. The contradiction case would have produced a proof attempt for a logically vacuous conjunction, often surfacing with misleading partial-progress credit.
The rejection artifact hands the customer exactly the change needed to resubmit: add the missing config fields, fix the SystemVerilog syntax, or remove one side of the contradiction. The gate is cheap by design; the fleet's model budget is reserved for specs that are ready to work on.
fix_arb_lock is one of the five Lean-paired tasks. In the single-model single-shot numbers above, two of five models pass and three fail. The fleet chains a builder race, property generation, a mutation sweep, Lean refinement, an EBMC close, and a Lean-close gate, then ships a bundle containing the winning SystemVerilog, the EBMC verdicts, and the Lean proof obligations.
A fleet run on fix_arb_lock completed on 2026-04-18. End-to-end numbers from the workdir:
| Phase | Outcome | Wall time or elapsed |
|---|---|---|
| Phase -2, spec validator | ok: true, 7 properties parsed, 5 gates clean | under 200 ms |
| Phase 0, builder race | 3 parallel builders; each produced an EBMC-verifying SystemVerilog with 7 of 7 properties proved; winning builder selected | majority of wall budget |
| Phase 3, Lean refinement | converged in one iteration | seconds |
| Phase 4, EBMC close | exit_code: 0, 7 of 7 properties proved, 0 refuted | 0.278 s |
| Phase 4.5, Lean-close gate | 5 theorems authored; 4 proved cleanly, 1 (fair_alternation) found to be false as stated and replaced with a proven corrected statement (see below); all proofs use only allowed tactics, and none of the banned tactics (sorry, admit, axiom, native_decide, decide, omega, linarith, simp, simp_all, tauto, by_contra, ring) appear in the closing proofs; final axioms: {propext, Classical.choice, Quot.sound} | seconds inline; external proof search for the corrected theorem |
| Phase 6, bundle | tarball written; ships the winning SystemVerilog, the 7 EBMC verdicts (p_reset, p_mutex, p_lock_hold, p_lock_end, p_burst_load, p_rr_turn, p_starve_pri), the Lean proof file, and a reproducible ebmc reverification recipe | seconds |
| Total wall time | | about 5 minutes |
The EBMC half of the task is discharged. The winning design has all seven declared properties proved against the task's check bound in 0.278 seconds of model-check time. The Lean half is also discharged, with one notable caveat described in the next subsection.
The Phase 0 builder race authored five Lean theorems as the formal specification of the arbiter's long-horizon behaviour. Four of them are true statements about the design and Phase 4.5 closed them without any banned tactic. The fifth, fair_alternation, claimed the arbiter's turn bit alternates with step-count parity under symmetric contention with no locking. External proof search attempted the theorem, failed, and returned a counterexample at k = 2: the state after two steps is a fixed point under the step function with both requesters active, so the turn bit stops alternating and stays at false rather than flipping with k's parity. The original theorem body was commented out with an inline note explaining why. A corrected version was then stated and proved: after 2k steps under symmetric contention without locking, the turn is always false.
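A minimal self-contained toy of the corrected shape, collapsing the arbiter state to the turn bit alone; the benchmark's actual State and step function are richer, and its corrected theorem quantifies over 2k steps specifically.

```lean
-- Toy: under symmetric contention with no locking, the step function
-- reaches a fixed point and holds the turn at false, which is exactly
-- why "alternates with step parity" fails as a claim.
def step (_turn : Bool) : Bool := false   -- fixed point reached immediately

def iter : Nat → Bool → Bool
  | 0,     b => b
  | n + 1, b => step (iter n b)

-- Corrected claim in toy form: after any positive number of steps
-- (in particular after 2 * k of them), the turn is false.
theorem turn_false (n : Nat) (b : Bool) : iter (n + 1) b = false := rfl
```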
What this surfaced is a bug in the specification, not in the RTL. The SystemVerilog arbiter is behaving as its implementation describes; every EBMC assertion it was asked to check proved. The authored long-horizon fairness claim was wrong as written. None of the seven EBMC assertions (p_reset, p_mutex, p_lock_hold, p_lock_end, p_burst_load, p_rr_turn, p_starve_pri) encoded an equivalent multi-step fairness claim; p_rr_turn is a per-transition conditional, not a fixed-point statement over 2k steps. Without the Lean layer, the misstatement would have shipped as part of the specification documentation for the verified design.
This is the argument for pairing model-checking with theorem-proving rather than running either alone. EBMC discharges the transition-level safety assertions quickly and completely, and is the right tool for that class of property. Lean formalises the long-horizon behavioural claims that do not fit bounded model-checking, and at the level of quantification where a misstatement is a readable English sentence about the design. Neither layer is redundant on fix_arb_lock: they catch different classes of defect.
Against the single-model single-shot table on fix_arb_lock above (two of five pass, three at zero), the fleet delivers the EBMC safety block at 7 of 7 proved and a fully closed Lean proof file against the authored specification, with one theorem corrected mid-flight. The full ten-phase run completes in about 5 minutes of wall time. No model in the single-model table was asked to produce a Lean proof at all; the fleet adds that obligation, runs it to ground, and ships the evidence with the design.
One run on one task. The argument the fleet makes here is not that it solves a task single models cannot; two single models did pass the EBMC half. The argument is that the fleet ships the EBMC verdict and a machine-checked specification in the same bundle, and that the specification layer catches misstatements the bounded-model-checking layer is structurally unable to catch.
fix_arb_lock admits modules whose EBMC safety block proves at depth 10 and whose liveness theorems still fail under the Lean 4 obligation. Pairing SV repair with a universally-quantified proof rejects the class of submissions that are correct-up-to-the-check-bound and incorrect in general. Five of the 26 tasks use this pairing. The Lean layer changes the outcome on all five.
On fix_booth_mul, the only passing model used 10 tool calls. The model that used 50 tool calls and 796K tokens scored zero. On fix_arb_lock every passing submission finished inside 50 seconds and under 50K tokens. Exploration burn is a negative signal, not a positive one. Agents that fail to read the refutation trace cannot recover by spending more compute.
Two of the 26 tasks sit at a uniform zero across every model (fix_spi_slave, write_assertions_txnmon). Twenty-three admit at least one perfect score. The distribution is neither uniformly easy nor uniformly hard, which is what score separation needs in order to track real capability. Across frontier models the spread between best and worst mean is 0.478.
For verification teams: this benchmark provides verifiable feedback on whether a model understands SystemVerilog semantics well enough to discharge universally-quantified correctness properties, not just to autocomplete a repair that happens to pass an acceptance trace. The 26 tasks run deterministically in Docker with no network and no GPU, reproducible from the run JSONs on disk. Scoring is open-source, and the hidden test modules remain private, so the benchmark cannot be gamed by overfitting to its own reference implementations.