Split inference: models too big for one GPU
Some models are too large to fit on any single GPU a casual provider owns. OGONG serves them anyway, by splitting one model across a cohort of providers, each running a slice of its layers, with every slice independently verified and paid. This is verified split inference: decentralized inference of a frontier model across ordinary GPUs that no single machine could hold.
Why only a zero-bond network can do this
Sharding a model across machines is not new. Doing it across untrusted, unbonded machines is. In a design where each provider must post a slashable bond, sharding multiplies the capital barrier by the number of shards: ten segments, ten bonds. Casual nodes never clear that bar.
OGONG requires no correctness bond from providers (see Why there is no correctness bond), so the barrier doesn’t multiply. A cohort of ordinary, unbonded GPUs can serve a frontier model, each segment paid only for the layers it ran.
How a cohort serves one model
- A provider advertises a segment: the range of layers it can run.
- The router assembles the cheapest cohort whose segments tile the whole model, end to end.
- A lead drives the request through the cohort: the first shard runs from the prompt, each interior shard runs its layers from the previous shard’s output, and the result flows down the chain.
- Each shard commits the hidden state at its layer boundary and signs it with its provider key. The commitments chain: one segment’s output is the next’s input.
So the model is computed in a relay, and the relay leaves a signed, checkable trail. The relay is architecture-agnostic because it rides the residual stream that every decoder transformer exposes; the only per-model detail is the input embedding, which an interior shard skips.
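The router's cohort assembly can be pictured as a shortest-path problem over layer boundaries. The sketch below is only an illustration, not OGONG's router: the `Segment` fields, the flat per-request `price`, and the requirement that segments tile the model exactly with no overlap are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    provider: str   # hypothetical provider id
    start: int      # first layer, inclusive
    end: int        # last layer, exclusive
    price: float    # hypothetical per-request price for this slice

def cheapest_cohort(segments, num_layers):
    """Return (cost, chain) for the cheapest exact tiling of layers
    [0, num_layers), or None if the offers cannot cover the model."""
    # best[l] holds the cheapest chain that tiles layers [0, l) exactly.
    best = {0: (0.0, [])}
    # Processing by end layer guarantees best[seg.start] is final when used.
    for seg in sorted(segments, key=lambda s: s.end):
        if not (seg.start < seg.end <= num_layers) or seg.start not in best:
            continue
        cost, chain = best[seg.start]
        candidate = (cost + seg.price, chain + [seg])
        if seg.end not in best or candidate[0] < best[seg.end][0]:
            best[seg.end] = candidate
    return best.get(num_layers)

offers = [
    Segment("a", 0, 12, 3.0),
    Segment("b", 12, 24, 3.0),
    Segment("c", 0, 24, 9.0),
    Segment("d", 12, 24, 2.5),
    Segment("e", 0, 8, 1.0),
]
cost, cohort = cheapest_cohort(offers, 24)
print(cost, [s.provider for s in cohort])  # 5.5 ['a', 'd']
```

Provider "e" is cheap but nothing starts at layer 8, so it never appears in a complete cohort; a real router would presumably also weigh latency and availability, which this sketch ignores.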
Verified per segment
The same cheap audit that checks a whole model checks each segment. A validator re-runs a sampled segment and confirms its boundary reproduces:
- An honest re-run reproduces the boundary essentially exactly (on a 2B model: ~0% on the same engine, ~0.6% drift across backends), while a substituted sub-computation lands ~30% off, a roughly 50x separation, on the same calibration the whole-model check uses.
- Because each boundary is signed, a cheat is localized to the one provider that produced it, with no trusted lead. A caught segment withholds the whole request (the consumer is refunded) and ejects exactly that provider. An honest shard risks nothing.
- The deterrence holds per segment, and it is a formally proven, machine-checked result. A shard’s compute saving and its fee share both scale with the layers it runs, but the stake it puts at risk does not shrink with its slice. So a smaller shard is, if anything, more deterred, and sharding never weakens the honesty guarantee that protects a whole model.
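As a rough illustration of the per-segment audit and of how chained boundaries localize a cheat, the toy below re-runs each segment from its committed input and compares the result with the committed output. Everything in it is an assumption made for the example: the "layers" are a stand-in residual update, the relative L2 metric and the 5% threshold are simply picked to sit between the ~0.6% honest drift and the ~30% substitution gap quoted above, and signatures are omitted (each boundary is assumed to be already attributed to its provider).

```python
import math

# Assumed cut-off between honest drift (~0.6%) and a substitution (~30%).
THRESHOLD = 0.05

def rel_diff(observed, reference):
    """Relative L2 difference ||observed - reference|| / ||reference||."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(observed, reference)))
    den = math.sqrt(sum(r * r for r in reference))
    return num / den

def make_segment(scale):
    # Toy stand-in for a slice of layers: the residual update h + scale*tanh(h).
    return lambda h: [x + scale * math.tanh(x) for x in h]

def audit(segment_fn, committed_in, committed_out):
    """Re-run one segment from its committed input boundary and check
    that the committed output boundary reproduces within THRESHOLD."""
    rerun = segment_fn(committed_in)
    return rel_diff(committed_out, rerun) <= THRESHOLD

honest = [make_segment(0.3), make_segment(0.5), make_segment(0.2)]
served = list(honest)
served[1] = lambda h: list(h)  # shard 1 cheats by skipping its layers

# Each shard commits its output; that output is the next shard's input.
boundaries = [[0.5, -1.0, 2.0]]
for fn in served:
    boundaries.append(fn(boundaries[-1]))

verdicts = [audit(honest[i], boundaries[i], boundaries[i + 1])
            for i in range(len(honest))]
print(verdicts)  # [True, False, True]
```

Shard 2 still passes even though it consumed a bad input, because it is audited against the boundary it was actually handed; only shard 1 fails its own re-run.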
End to end, a two-shard cohort in which each shard loads only its own layers reproduces the single-machine model’s output to a relative difference of about 1e-5.
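The relay itself can be sketched with a toy residual stack. This is not a real transformer: `embed` and `layer` are hypothetical stand-ins, chosen only to show that the first shard starts from the tokens, an interior shard starts from the previous hidden state and skips the embedding, and the composition matches the single-machine run.

```python
import math

def embed(tokens, dim=4):
    # Toy input embedding; only the first shard needs it.
    return [[math.sin((t + 1) * (j + 1)) for j in range(dim)] for t in tokens]

def layer(h, k):
    # Toy residual layer k: h <- h + f_k(h), applied per position.
    return [[x + 0.1 * math.tanh((k + 1) * x) for x in row] for row in h]

def run_shard(layer_ids, x, first):
    # The first shard embeds the prompt; interior shards take the
    # previous shard's hidden state directly.
    h = embed(x) if first else x
    for k in layer_ids:
        h = layer(h, k)
    return h

tokens = [3, 1, 4]
full = run_shard(range(6), tokens, first=True)          # one machine
boundary = run_shard(range(0, 3), tokens, first=True)   # shard 0
split = run_shard(range(3, 6), boundary, first=False)   # shard 1

max_diff = max(abs(a - b)
               for row_a, row_b in zip(full, split)
               for a, b in zip(row_a, row_b))
```

In this toy the two paths perform identical arithmetic, so `max_diff` is negligible; on real hardware, differing kernels and precisions would be expected to account for a small nonzero gap such as the one reported above.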
Paid per slice, on-chain
Settlement is a single cohort settle. Each shard is paid for its slice under a conservation invariant: the per-shard amounts must sum to the provider’s share, so a release can never exceed the request’s fee. A non-conserving split is rejected on-chain with no funds moved.
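The conservation check can be sketched off-chain as follows. The function names, integer base units, and the pro-rata-by-layers split with leftover units handed out in a fixed order are assumptions for illustration; the actual contract logic is not shown in this document.

```python
def split_by_layers(provider_share, layers_run):
    """Split an integer provider share pro rata by layers run (assumed rule)."""
    total = sum(layers_run.values())
    payouts = {p: provider_share * n // total for p, n in layers_run.items()}
    # Integer division can leave a few base units over; hand them out
    # in a fixed order so the split sums exactly to provider_share.
    leftover = provider_share - sum(payouts.values())
    for p in sorted(layers_run)[:leftover]:
        payouts[p] += 1
    return payouts

def settle(fee, provider_share, payouts):
    """Accept the split only if it conserves funds; otherwise move nothing."""
    if provider_share > fee:
        raise ValueError("provider share exceeds the request fee")
    if any(a < 0 for a in payouts.values()):
        raise ValueError("negative payout")
    if sum(payouts.values()) != provider_share:
        raise ValueError("non-conserving split: rejected, no funds moved")
    return dict(payouts)

payouts = split_by_layers(1000, {"a": 8, "b": 8, "c": 8})
released = settle(fee=1200, provider_share=1000, payouts=payouts)
print(released)  # {'a': 334, 'b': 333, 'c': 333}
```

Because the per-shard amounts must equal the provider share, and the share is capped by the fee, no combination of payouts can release more than the consumer paid.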
A topology, not a tier
Split inference is a serving topology, orthogonal to the trust tiers. It composes with both, because a cohort’s guarantee follows the tier of its shards: a cohort of Verified shards is Verified, and a cohort of Confidential shards is Confidential. See How verification works for the per-segment audit it builds on.
It composes the whole stack
Split inference is not a bolt-on. It is the capstone that falls out of everything else OGONG already does:
- the zero-bond result removes the per-shard capital barrier, so a cohort of casual nodes is even possible,
- cheap per-segment verification catches a lying shard for a fraction of its compute,
- signed boundary commitments localize a cheat to the one node that produced it,
- the router assembles the cohort and on-chain cohort settlement pays each shard its slice under a conservation invariant.
Each of those was built for serving a whole model on one machine. Put together, they let a crowd of ordinary GPUs serve a model none of them could run alone, which is why it’s a headline capability rather than a feature.