Research Report · April 2026

Does a formal-verification layer improve kernel quality?

A bisected case study on an AI-generated accelerator kernel.


Abstract

Property tests routinely approve kernels that are wrong outside the narrow range of shapes the harness exercises. This report bisects a concrete LLM-generated kernel that passes its property-test harness with zero error yet silently produces zero output on up to 36% of its rows for nearby input shapes. A Lean 4 theorem quantified over every index cannot be closed against this kernel, and that is precisely the property that makes formal verification a useful additional quality gate.


1. The task

Element-wise vector addition followed by a ReLU activation, executed as a tiled kernel on an AI-accelerator target:

$$\text{kernel}(a, b)_{i,j} \;=\; \max\!\big(a_{i,j} + b_{i,j},\; 0\big)$$

Inputs a, b are 2-D float32 tensors. The fixed test shape is [128, 512]. The task comes with a formal specification file (Lean 4) into which the AI-generated implementation is asked to discharge correctness properties.
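For reference, the specification has a direct dense NumPy form. This is a sketch (the name reference is ours); the accelerator target of course requires the tiled form in §2:

import numpy as np

def reference(a, b):
    # The entire contract: elementwise add, then ReLU.
    return np.maximum(a + b, 0.0)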

2. The generated implementation

def kernel(a_tensor, b_tensor, out_tensor):
    M, N = a_tensor.shape
    TILE = accel.partition_size()             # target tile size

    for blk in accel.tile_range(M // TILE):   # see §4
        i = accel.arange(TILE)[:, None]
        j = accel.arange(N)[None, :]
        a_tile = accel.load(a_tensor[blk * TILE + i, j])
        b_tile = accel.load(b_tensor[blk * TILE + i, j])
        sum_tile = accel.add(a_tile, b_tile)
        res_tile = accel.relu(sum_tile)
        accel.store(out_tensor[blk * TILE + i, j], value=res_tile)

(API symbols generalised; the underlying target is a tile-partitioned AI accelerator's intrinsic API.)

The model produced this implementation in a single revision, self-tested it on the reference shape, and reported Max error: 0.000000; PASS. A second leading model from a different lab produced the same bug pattern on its own run, so this is not an artifact of one model family.

3. Property-test result: PASS

Running the standard harness (reference NumPy implementation, tolerance 10⁻³, fixed seed 2083866274, shape [128, 512]):

Max error: 0.000000
PASS

The harness declares the kernel correct. Under a pass/fail harness of this kind, the kernel ships.
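For concreteness, a harness of this kind reduces to a few lines. The following is a sketch under assumptions (run_harness and reference are our names; the actual harness internals were not published):

import numpy as np

SEED, SHAPE, TOL = 2083866274, (128, 512), 1e-3    # fixed, as in §3

def reference(a, b):
    return np.maximum(a + b, 0.0)

def run_harness(kernel):
    rng = np.random.default_rng(SEED)
    a = rng.standard_normal(SHAPE).astype(np.float32)
    b = rng.standard_normal(SHAPE).astype(np.float32)
    out = np.zeros(SHAPE, dtype=np.float32)    # zero-initialized buffer
    kernel(a, b, out)
    err = float(np.abs(out - reference(a, b)).max())
    print(f"Max error: {err:.6f}")
    print("PASS" if err < TOL else "FAIL")

One fixed shape and one fixed seed: everything §4 varies is held constant here, which is exactly the blind spot.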

4. Bisection: the kernel is already broken

The loop bound is accel.tile_range(M // TILE). Integer division. When M is a multiple of TILE (as in the eval shape [128, 512], where TILE = 128), every row gets written. When M is not a multiple of TILE, the trailing M - (M // TILE) × TILE rows (that is, M mod TILE rows) are never touched by any accel.store and remain at the zero-initialized buffer value.
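The unwritten-row count is pure integer arithmetic, so the pattern in the table below can be checked in a few lines:

TILE = 128
for M in (128, 129, 130, 200, 255, 256, 513, 1000):
    covered = (M // TILE) * TILE     # rows reached by some accel.store
    print(M, "unwritten rows:", M - covered)
# 128→0, 129→1, 130→2, 200→72, 255→127, 256→0, 513→1, 1000→104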

Running the kernel's algebraic form (identical to what the accelerator simulator produces) against the reference on a sweep of shapes:

Input shape                   Property-test verdict   Max error   Rows silently unwritten
[128, 512] (the eval shape)   PASS                    0.000       0
[129, 512]                    FAIL                    4.30        1
[130, 512]                    FAIL                    4.43        2
[200, 512]                    FAIL                    6.14        72 (36% of output)
[255, 512]                    FAIL                    5.70        127
[256, 512]                    PASS                    0.000       0
[513, 512]                    FAIL                    4.41        1
[1000, 512]                   FAIL                    6.01        104

One additional row of input, and the maximum error is 4300× the accepted tolerance. At M = 200, a reasonable batch size in a production serving stack, more than a third of the kernel's output is zero, regardless of input.

No seeded fuzz harness with a fixed shape can detect this.

5. Why formal verification catches what property tests miss

The kernel's correctness specification, the claim the implementation is supposed to satisfy, is:

$$\forall\, M, N,\;\; \forall\, i \in [0, M),\; \forall\, j \in [0, N),\quad \text{out}[i,j] \;=\; \max\!\big(a[i,j] + b[i,j],\; 0\big)$$

This is a theorem over every valid index at every valid shape. As a Lean 4 proof obligation:

theorem kernel_correct
    (M N : ℕ) (a b out : Fin M → Fin N → ℝ)
    (h : computes_kernel a b out) :
    ∀ i : Fin M, ∀ j : Fin N, out i j = max (a i j + b i j) 0

This theorem cannot be proved against the kernel in §2. Take M = 200, i = 128, and any nonzero a[128], b[128]. The kernel writes nothing at row 128, so out[128] is the buffer's initial value (zero), while the right-hand side is in general nonzero. The specification is false on the kernel's observable behavior. No proof exists.
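To see the counterexample in Lean's own terms, here is a hedged, self-contained sketch: buggyOut and TILE are illustrative names (not from the task's spec file), the model is 1-D over rows, and Int stands in for ℝ to keep the example computable:

def TILE : Nat := 128

-- 1-D model of §2's write pattern: rows past the last full tile keep the
-- zero-initialized buffer value.
def buggyOut (M : Nat) (a b : Nat → Int) (i : Nat) : Int :=
  if i < (M / TILE) * TILE then max (a i + b i) 0 else 0

-- M = 200 has one full tile (rows 0..127); row 128 is never written, so the
-- model returns 0 where the spec demands max (1 + 1) 0 = 2.
example : buggyOut 200 (fun _ => 1) (fun _ => 1) 128 = 0 := rfl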

A pipeline that requires the proof obligation to close will reject this kernel. The pipeline treats sorry and admit (Lean's two ways for an author to leave a proof unfinished) as proof failures, so an implementation for which no closed proof of the spec exists is rejected the same way an implementation that fails the property test is rejected. The rejection is not heuristic, not probabilistic, and not sensitive to which seeds or shapes were in the test set. It follows from the logical structure of the specification and the kernel's definition.

This class of universal-index correctness proof has been closed on adjacent tasks in the same environment (a softmax theorem set quantified over every input row, discharged by a multi-step verification pipeline). The proof obligation in §5 is the same shape as the ones already closed; the kernel above happens to be the one that fails it.

6. What property tests check versus what formal verification checks

                             Property tests                        Formal verification
Scope                        Output matches on the shapes and      Output matches the specification, at
                             seeds the harness was given           every index, for every input, at
                                                                   every shape
Catches this kernel?         No. Passes.                           Yes. The specification is false on
                                                                   the kernel's behavior.
Confidence for other shapes  None implied                          Implied by the universal quantifier
                                                                   (once the proof is closed)
Confidence for other seeds   None implied                          Implied (once the proof is closed)
Cost per kernel              Milliseconds                          Tens of minutes of prover time on
                                                                   the kernels we have measured

Property tests answer: did this implementation happen to match on the inputs we tested? Formal verification answers: does this implementation provably match the specification? In production ML-infrastructure code, where a single mismatched row in an attention weight matrix can poison an entire training run, the second question is the one that matters.

7. The fair objection: couldn't a better engineer have written a test that caught this?

Yes, for this specific bug. A ten-line shape-sweep over M ∈ {1, 64, 127, 128, 129, 200, 256, 257, 513, 1024} would catch the remainder-handling defect in under a second; a sketch of such a sweep follows. The purpose of this case study is not to argue that property testing is useless. It is to locate the ceiling where testing stops in principle.
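A sketch of that sweep (function names are ours; the §2 kernel would trip the assertion at every M not divisible by 128, and at every M < 128):

import numpy as np

def reference(a, b):
    return np.maximum(a + b, 0.0)

def shape_sweep(kernel, N=512, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    for M in (1, 64, 127, 128, 129, 200, 256, 257, 513, 1024):
        a = rng.standard_normal((M, N)).astype(np.float32)
        b = rng.standard_normal((M, N)).astype(np.float32)
        out = np.zeros((M, N), dtype=np.float32)
        kernel(a, b, out)
        err = float(np.abs(out - reference(a, b)).max())
        assert err < tol, f"FAIL at [{M}, {N}]: max error {err:.2f}"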

(a) Finite sampling cannot establish universal properties. A shape-sweep catches this shape-dependent bug if the test author happens to think of the M % 128 ≠ 0 case. It does not catch the next kernel's bug: a NaN-propagation edge where one input element is -∞, an associativity drift where a tile-parallel reduction produces a different summation order than the reference, a numerical-stability bug where softmax underflows on a row of large negatives. Each of those requires its own test, written independently, by an author who thought of it. Formal verification does not require the author to enumerate failure modes. A theorem of the form "for every input, the output matches the specification" covers them all at once. Property testing is sampling. Formal verification is a proof.

(b) In AI-assisted development, the code and the tests often come from the same model. The engineer reviewing a model-generated kernel frequently asks the same model for the test suite. A model that did not reason about M % 128 when writing the kernel will not reason about it when writing the tests. Its blind spots are correlated between the two artifacts. Formal specifications break this correlation because they are external to the model: the theorem is stated up front, by the human or by a separate verifier, and the model cannot write its way around an obligation it did not author. Specification and implementation are produced by different processes, so a bug in one is detected rather than camouflaged by the other.

(c) Some correctness claims are, in principle, beyond the reach of testing. Consider the statement "softmax is shift-invariant: for every scalar c and every input x, softmax(x + c) = softmax(x)." A test can falsify this (find a c and an x where the kernel violates it) but cannot confirm it, because the space of c values is uncountable. A formal proof discharges the claim once. Shape invariance, commutativity of reductions, monotonicity of masked attention weights under mask intensity, numerical stability under the full float32 range: all universal claims that formal verification can close and testing can only probe.
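The shift-invariance claim, for instance, falls to a one-line algebraic identity, which is exactly the kind of fact a proof discharges once and for all:

$$\text{softmax}(x + c)_i \;=\; \frac{e^{x_i + c}}{\sum_j e^{x_j + c}} \;=\; \frac{e^{c}\, e^{x_i}}{e^{c} \sum_j e^{x_j}} \;=\; \text{softmax}(x)_i$$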

The objection is real on the narrow reading. On the broader reading, "can I build a kernel-engineering pipeline that relies on property tests alone and still get correctness guarantees?", the answer is no, for the three reasons just given.

8. Conclusion

On a real kernel generated by a capable LLM, the standard property-test harness certified an implementation that is silently wrong on 36% of its rows under a slightly different batch size. The same kernel fails its correctness theorem: no proof exists because the theorem is not true. A development pipeline that requires a closed formal-verification proof alongside the property tests rejects this kernel automatically. A pipeline that relies on property tests alone ships it.

The answer to the question at the top: a Lean-enabled verification layer is useful. It catches a class of correctness bug that narrow property-test harnesses miss, and it does so without requiring the test author to enumerate every possible failure mode in advance. The case study above is one such bug, reproducible in thirty lines of NumPy.


Appendix: reproducer

The bisection table in §4 is reproducible in ~30 lines of NumPy. The reproducer is available on request to verified partners; contact us via the homepage form. The published Lean theorem statement and the bug class signature below are sufficient to audit the shape-dependent failure pattern without disclosing the underlying task.
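Independently of the withheld reproducer, a minimal sketch of the §4 simulation can be reconstructed from the bug-class signature alone (exact error magnitudes depend on the random draw, so the table's values will not match digit-for-digit):

import numpy as np

TILE, N, TOL = 128, 512, 1e-3

def buggy_kernel(a, b):
    # NumPy model of the §2 kernel: only full tiles are ever stored.
    M = a.shape[0]
    out = np.zeros_like(a)                   # zero-initialized buffer
    for blk in range(M // TILE):             # integer division drops the tail
        rows = slice(blk * TILE, (blk + 1) * TILE)
        out[rows] = np.maximum(a[rows] + b[rows], 0.0)
    return out

def reference(a, b):
    return np.maximum(a + b, 0.0)

rng = np.random.default_rng(2083866274)      # the harness's fixed seed
for M in (128, 129, 130, 200, 255, 256, 513, 1000):
    a = rng.standard_normal((M, N)).astype(np.float32)
    b = rng.standard_normal((M, N)).astype(np.float32)
    err = float(np.abs(buggy_kernel(a, b) - reference(a, b)).max())
    verdict = "PASS" if err < TOL else "FAIL"
    print(f"[{M}, {N}]  {verdict}  max error {err:.2f}  "
          f"unwritten rows {M - (M // TILE) * TILE}")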