Split inference: models too big for one GPU

Some models are too large to fit on any single GPU a casual provider owns. OGONG serves them anyway, by splitting one model across a cohort of providers, each running a slice of its layers, with every slice independently verified and paid. This is verified split inference: decentralized inference of a frontier model across ordinary GPUs that no single machine could hold.

Why only a zero-bond network can do this

Sharding a model across machines is not new. Doing it across untrusted, unbonded machines is. In a design where each provider must post a slashable bond, sharding multiplies the capital barrier by the number of shards: ten segments, ten bonds. Casual nodes never clear that bar.

OGONG posts no correctness bond (see Why there is no correctness bond), so the barrier doesn’t multiply. A cohort of ordinary, unbonded GPUs can serve a frontier model, each segment paid only for the layers it ran.

How a cohort serves one model

A provider advertises a segment: the range of layers it can run.
The router assembles the cheapest cohort whose segments tile the whole model, end to end.
A lead drives the request through the cohort: the first shard runs from the prompt, each interior shard runs its layers from the previous shard’s output, and the result flows down the chain.
Each shard commits the hidden state at its layer boundary and signs it with its provider key. The commitments chain: one segment’s output is the next’s input.

So the model is computed in a relay, and the relay leaves a signed, checkable trail. It is architecture-agnostic because it rides the residual stream every decoder transformer exposes; the only per-model detail is the input embedding an interior shard skips.

Verified per segment

The same cheap audit that checks a whole model checks each segment. A validator re-runs a sampled segment and confirms its boundary reproduces:

An honest re-run reproduces the boundary essentially exactly (on a 2B model: ~0% on the same engine, ~0.6% drift across backends), while a substituted sub-computation lands ~30% off, a roughly 50x separation, on the same calibration the whole-model check uses.
Because each boundary is signed, a cheat is localized to the one provider that produced it, with no trusted lead. A caught segment withholds the whole request (the consumer is refunded) and ejects exactly that provider. An honest shard risks nothing.
The deterrence holds per segment, and it is a formally proven, machine-checked result. A shard’s compute saving and its fee share both scale with the layers it runs, but the stake it puts at risk does not shrink with its slice. So a smaller shard is, if anything, more deterred, and sharding never weakens the honesty guarantee that protects a whole model.

End to end, a two-shard cohort in which each shard loads only its own layers reproduces the single-machine model’s output to a relative difference of about 1e-5.

Paid per slice, on-chain

Settlement is a single cohort settle: each shard is paid for its slice under a conservation invariant, the per-shard amounts must sum to the provider’s share, so a release can never exceed the request’s fee. A non-conserving split is rejected on-chain with no funds moved.

A topology, not a tier

Split inference is a serving topology, orthogonal to the trust tiers. It composes with both: a cohort’s guarantee follows the tier of its shards, a cohort of Verified shards is Verified, a cohort of Confidential shards is Confidential. See How verification works for the per-segment audit it builds on.

It composes the whole stack

Split inference is not a bolt-on. It is the capstone that falls out of everything else OGONG already does:

the zero-bond result removes the per-shard capital barrier, so a cohort of casual nodes is even possible,
cheap per-segment verification catches a lying shard for a fraction of its compute,
signed boundary commitments localize a cheat to the one node that produced it,
the router assembles the cohort and on-chain cohort settlement pays each shard its slice under a conservation invariant.

Each of those was built for serving a whole model on one machine. Put together, they let a crowd of ordinary GPUs serve a model none of them could run alone, which is why it’s a headline capability rather than a feature.