Verifying images, audio & video

OGONG verifies more than text. The same protocol covers image, audio, and video models, but the check itself is different, because these models work differently from a chat model.

Why the text check doesn’t transfer

A language model produces a probability distribution over the next token at every step, and that distribution is a fingerprint of the exact computation that made it. In the Verified-tier check (see How verification works), the provider commits to those distributions and an auditor cheaply re-derives them.

A diffusion or flow model hands you nothing like that. It starts from noise and runs N denoising steps down to a final latent, then decodes that latent to pixels or audio. There is no per-token distribution and no position to teacher-force. The only natural thing to look at is the final output, and the output is exactly what a cheaper computation can forge. So the text defenses are not weak here; they simply do not apply.

Check the process, not the output

The insight is that an output is not evidence of the computation that produced it; a process is. A diffusion model's computation is not a single result but a sequence of N steps, each one a forward pass of the same network the provider claims to run. That sequence is checkable in exactly the way a lone output is not.

So the provider commits a trajectory: a Merkle root over the latent at sampled denoising steps, plus the final latent. To verify, an auditor:

  1. draws a step at random,
  2. asks the provider to open the committed latents at that step (Merkle proofs checked first),
  3. runs one reference denoising step from the committed input, and
  4. accepts if the result matches the committed output within a tolerance.
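The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OGONG's implementation: `reference_step` is a toy stand-in for one denoising step of the reference model, latents are flat lists of floats, the tree commits every step rather than a sample, and the hash layout (SHA-256 with separate leaf and node prefixes, last node duplicated on odd levels) is chosen only for the example. All function names are hypothetical.

```python
import hashlib
import random
import struct


def leaf_hash(latent):
    """Hash one latent (a flat list of floats) with a leaf prefix."""
    return hashlib.sha256(b"\x00" + struct.pack(f"<{len(latent)}d", *latent)).digest()


def node_hash(left, right):
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves):
    """All tree levels, leaves first; an odd level duplicates its last node."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(levels, index):
    """Sibling hashes from the leaf at `index` up to (not including) the root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify_inclusion(root, latent, index, path):
    h = leaf_hash(latent)
    for sibling in path:
        h = node_hash(h, sibling) if index % 2 == 0 else node_hash(sibling, h)
        index //= 2
    return h == root


def rel_l2(a, b):
    """Relative L2 distance of a from b."""
    num = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    den = sum(y * y for y in b) ** 0.5
    return num / den if den else num


def reference_step(latent, t):
    """Toy stand-in (assumption) for one reference denoising step."""
    return [0.9 * x + 0.01 * t for x in latent]


def commit(trajectory):
    """Provider side: return the Merkle root and a function that opens step i."""
    levels = build_levels([leaf_hash(x) for x in trajectory])
    return levels[-1][0], lambda i: (trajectory[i], prove(levels, i))


def audit(root, n_steps, open_fn, tol, rng):
    """Auditor side: spot-check one randomly drawn step."""
    t = rng.randrange(n_steps)                         # 1. draw a step
    (x_in, p_in), (x_out, p_out) = open_fn(t), open_fn(t + 1)  # 2. open latents
    # Inclusion is checked first, so only committed latents reach the re-run.
    if not (verify_inclusion(root, x_in, t, p_in)
            and verify_inclusion(root, x_out, t + 1, p_out)):
        return False
    # 3. re-run one step, 4. accept within tolerance
    return rel_l2(reference_step(x_in, t), x_out) <= tol


# An honest 8-step trajectory: N + 1 latents, each produced by the reference step.
honest = [[1.0, -2.0, 0.5, 3.0]]
for t in range(8):
    honest.append(reference_step(honest[-1], t))
root, open_fn = commit(honest)
print(audit(root, 8, open_fn, tol=1e-6, rng=random.Random(0)))
```

A trajectory produced by any other step function fails the tolerance check at whichever step is drawn, and a latent swapped in after the commitment fails the inclusion check before the re-run happens.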

Re-running one step out of the N the provider performed costs the auditor ρ ≈ 1/N of the original work, the same cheap-check economics as text. Sampling k steps instead of one raises the cost to roughly k/N and raises the per-request catch rate with it.
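To make the cost and catch-rate trade-off concrete, here is a small Python sketch. It rests on assumptions the text does not state: the auditor draws k distinct steps uniformly at random, and a cheating provider has faked some number m of the N steps. Under those assumptions the chance of hitting at least one faked step is 1 − C(N−m, k)/C(N, k). The function names are illustrative.

```python
from math import comb


def audit_cost(n_steps, k):
    """Fraction of the provider's work the auditor redoes: k re-runs out of N."""
    return k / n_steps


def catch_probability(n_steps, bad_steps, k):
    """Chance that k distinct, uniformly drawn steps include at least one bad step."""
    if k > n_steps:
        raise ValueError("cannot sample more steps than exist")
    # comb returns 0 when fewer than k good steps remain, giving probability 1.
    return 1 - comb(n_steps - bad_steps, k) / comb(n_steps, k)


for k in (1, 2, 5):
    print(k, audit_cost(50, k), catch_probability(50, 5, k))
```

When the provider substitutes a different model outright, every step is wrong, so m = N and a single sampled step already catches it; larger k matters for a provider that fakes only a few steps.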

The Merkle-inclusion check runs before the tolerance check, so a provider cannot serve one trajectory, commit another, and reveal whichever is convenient. The only way to pass is for every committed step to match the reference model’s step, which is to say, to actually run the model. A final check then decodes the committed last latent and confirms it matches the served bytes, so a provider cannot run the honest trajectory and hand back a different output.

Measured on three engines

The primitive is implemented and measured on three independent engines, with no shared code:

| Modality | Engine | Honest re-run | A cheat scores |
|---|---|---|---|
| Audio | 3.5B diffusion-transformer (flow) | exact (rel-L2 = 0) | 0.27 (5% conditioning change) |
| Image | 1.5B Euler latent diffusion | exact (rel-L2 = 0) | 1.0 (changed prompt) |
| Video | Wan latent video diffusion | exact (rel-L2 = 0) | rejected (fabricated step) |

An honest re-run reproduces each step exactly; a substituted computation lands two to three orders of magnitude away.

An honest caveat (and a happy one)

The accept tolerance is a measured quantity, not a proven constant. A different GPU or kernel reproduces a latent with a small nonzero drift, so the threshold is set from the honest cross-hardware drift, and the guarantee is that this drift stays clear of a cheat’s divergence. For diffusion that separation is comfortable: the honest drift is tiny and a cheat diverges by 0.27 to 1.0. That actually makes diffusion a cleaner verification target than text, where the near-lossless quantization band is the hard case.
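One way to turn that into a procedure is sketched below in Python. It is an assumption-laden illustration, not the documented calibration method: the threshold is placed a fixed multiple (`margin`, a hypothetical knob) above the worst honest drift observed across hardware, and calibration fails loudly if that would not stay below the smallest cheat divergence. The honest drift values in the example are made up for illustration; only the 0.27 and 1.0 cheat scores come from the table above.

```python
def calibrate_tolerance(honest_drifts, cheat_divergences, margin=10.0):
    """Return a threshold `margin` times above the worst honest drift.

    Raises ValueError if that threshold would reach the smallest cheat
    divergence, meaning the two populations are not cleanly separated.
    """
    worst_honest = max(honest_drifts)
    smallest_cheat = min(cheat_divergences)
    tol = worst_honest * margin
    if tol >= smallest_cheat:
        raise ValueError("honest drift and cheat divergence are not separated")
    return tol


# Hypothetical cross-hardware rel-L2 drifts (illustrative numbers only).
honest_drifts = [2e-4, 5e-4, 3e-4]
# Cheat scores reported in the measurements table.
cheat_divergences = [0.27, 1.0]

tol = calibrate_tolerance(honest_drifts, cheat_divergences)
print(tol, tol < min(cheat_divergences))
```

The point of failing hard is that a tolerance is only meaningful while the gap exists; if a new GPU or kernel pushed honest drift toward the cheat range, the check should be re-measured rather than silently loosened.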

One subtlety: guidance schemes that carry momentum across steps are not reproducible from a single committed latent inside the guidance window, so the auditor draws its re-check step from outside that window, where a step depends only on its input.
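A minimal Python sketch of that draw, assuming the guidance window is known to the auditor as a half-open range of step indices `[start, stop)`. The function name and the window representation are hypothetical.

```python
import random


def draw_audit_step(n_steps, guidance_window, rng=random):
    """Uniformly pick a step in [0, n_steps) outside the window [start, stop)."""
    start, stop = guidance_window
    eligible = [t for t in range(n_steps) if not (start <= t < stop)]
    if not eligible:
        raise ValueError("guidance window covers every step; nothing to audit")
    return rng.choice(eligible)


# Example: 50 steps, momentum-carrying guidance active on steps 5..19.
step = draw_audit_step(50, (5, 20), rng=random.Random(7))
print(step)
```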