Advanced serving

Most providers just run ogong-provider local --model ... and never touch a flag. When you’re serving a large model, packing more concurrency onto a GPU, or splitting a model across hardware, there are extra knobs.

ogong-provider tunes its embedded engine (a llama.cpp fork’s llama-server, which it spawns and proxies) mostly through environment variables. For full control of how a model is placed on hardware, you front your own engine with --upstream instead.

It auto-sizes by default

You usually don’t need to set any of this. When the provider spawns the engine, it inspects the model (weight size, KV bytes per token, whether it’s MoE) against your machine’s memory budget (discrete VRAM, or a share of system RAM) and picks a config on its own:

  • if the model fits, it spends the spare memory on concurrent request slots (capped where a single GPU stops scaling),
  • if a MoE model is too big, it turns on the expert cache to stream cold experts,
  • if a dense model is too big, it falls back to mmap / SSD paging so it still runs.

The knobs below are overrides of those automatic choices, for when you want to tune it yourself.
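The three cases above can be pictured as a small decision function. This is only an illustrative sketch: the function name, its parameters, the "fits" test (weights plus one slot of KV), and the cap of 32 slots are all assumptions made for the example, not the provider's actual sizing code or thresholds.

```python
GIB = 1024 ** 3

def pick_config(weights_bytes, kv_bytes_per_slot, budget_bytes, is_moe, max_slots=32):
    """Illustrative only: mirrors the three documented cases, not the real logic."""
    if weights_bytes + kv_bytes_per_slot <= budget_bytes:
        # Model fits: spend the spare memory on concurrent slots, up to a cap.
        spare = budget_bytes - weights_bytes
        slots = min(spare // kv_bytes_per_slot, max_slots)
        return {"mode": "resident", "slots": slots}
    if is_moe:
        # Oversized MoE: stream cold experts through the expert cache.
        return {"mode": "expert-cache", "slots": 1}
    # Oversized dense model: fall back to mmap / SSD paging.
    return {"mode": "mmap", "slots": 1}

# An 8 GiB model with 1 GiB of KV per slot on a 24 GiB budget.
print(pick_config(8 * GIB, 1 * GIB, 24 * GIB, is_moe=False))
```

The single slot returned for the two oversized cases is a placeholder; the source does not say how many slots the provider picks there.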

Two ways to tune

  1. Embedded engine + env knobs. Keep using --embedded-text / local, and set LLAMAMP_* variables to control KV cache, MoE, speculative decoding, and GPU offload.
  2. Front your own engine (--upstream). Launch llama-server (or vLLM) yourself with any flags you like, then point ogong-provider serve (or start / local) at it with --upstream http://127.0.0.1:8080/v1. This is how you do multi-GPU tensor-split and multi-node splits, which the embedded path doesn’t expose directly. ogong-provider still commits and settles exactly the same; the placement is the engine’s concern.

GPU offload

By default the embedded engine offloads all layers to the visible GPU(s). To cap it (for example, partial offload on a small card):

LLAMAMP_NGL=40 ogong-provider local --model my-model.gguf

KV cache: quantize it and size it

The KV cache dominates memory once you serve many concurrent requests. Quantizing it frees memory that you can spend on more slots or a longer per-slot context:

LLAMAMP_CACHE_TYPE_K=q8_0 LLAMAMP_CACHE_TYPE_V=q4_0 LLAMAMP_PARALLEL=16 \
  ogong-provider local --model my-model.gguf --n-ctx 16384

Knob                                        Effect
--n-ctx <n>                                 per-slot context (default 8192). The engine’s total KV is n-ctx × parallel.
LLAMAMP_PARALLEL=<n>                        concurrent slots (auto-sized by default)
LLAMAMP_CACHE_TYPE_K, LLAMAMP_CACHE_TYPE_V  KV quantization: f16 (default), q8_0, q4_0
LLAMAMP_FLASH_ATTN=on|off                   flash attention (auto by default)
LLAMAMP_BATCH_SIZE=<n>                      engine batch size
LLAMAMP_CACHE_REUSE=<n>                     prompt-cache reuse window (default 256)

Quantized KV needs flash attention on, and is incompatible with tensor-split mode.
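To see what the cache types buy you, here is a back-of-the-envelope estimate of KV memory for n-ctx × parallel tokens. The model dimensions (32 layers, 8 KV heads, head size 128) are hypothetical, and the formula assumes a plain attention layout; the engine's real allocation may differ. The block sizes are the standard ggml ones: q8_0 and q4_0 pack 32 elements plus a 2-byte scale into 34 and 18 bytes.

```python
# (bytes per block, elements per block) for each cache type.
BLOCK = {"f16": (2, 1), "q8_0": (34, 32), "q4_0": (18, 32)}

def kv_cache_bytes(n_ctx, parallel, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Rough KV size; model dimensions are supplied by the caller."""
    elems = n_layers * n_kv_heads * head_dim  # K (or V) elements per token

    def per_token(cache_type):
        block_bytes, block_elems = BLOCK[cache_type]
        return elems * block_bytes // block_elems

    total_tokens = n_ctx * parallel  # per-slot context times slots
    return total_tokens * (per_token(type_k) + per_token(type_v))

GIB = 1024 ** 3
model = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical model
f16 = kv_cache_bytes(16384, 16, **model)
quant = kv_cache_bytes(16384, 16, type_k="q8_0", type_v="q4_0", **model)
print(f16 / GIB, quant / GIB)  # 32.0 13.0
```

For this hypothetical model, 16 slots of 16384 tokens drop from 32 GiB at f16 to 13 GiB with q8_0 keys and q4_0 values.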

MoE: run a model bigger than your VRAM

A Mixture-of-Experts model can be served even when it doesn’t fit in VRAM by keeping some experts on CPU. Three controls, in increasing order of how much they offload:

  • Partial offload (LLAMAMP_NCMOE=N): keep the experts of the first N MoE layers on CPU. The right knob when a model is only slightly too big.
  • Full offload (the engine’s -cmoe, via a catalog model’s args): all experts on CPU.
  • Expert cache (LLAMAMP_MOE_CACHE_SLOTS): for models well over budget, the OGONG engine streams cold experts through a slot cache, so an oversized MoE keeps running instead of failing to load. It auto-enables when a MoE exceeds the budget.

# partial: keep the first 12 MoE layers' experts on CPU
LLAMAMP_NCMOE=12 ogong-provider local --model big-moe.gguf

# or the streaming expert cache
LLAMAMP_MOE_CACHE_SLOTS=24 ogong-provider local --model big-moe.gguf

Speculative decoding

The engine accelerates generation by drafting tokens ahead and verifying them in a batch. ogong-provider turns it on automatically when it finds a drafter:

  • set LLAMAMP_DRAFT_MODEL=/abs/path/to/drafter.gguf, or
  • let a catalog model pull its own drafter (entries carry a draft_url).

LLAMAMP_DRAFT_MODEL=/models/drafter.gguf ogong-provider local --model gemma-4.gguf

The method defaults to MTP (multi-token prediction). Override it with LLAMAMP_SPEC_TYPE:

LLAMAMP_SPEC_TYPE          Method                    Draft model?
draft-mtp (default)        multi-token prediction    yes (mtp-*.gguf)
draft-simple, draft-eagle  draft-model speculation   yes
ngram-simple, ngram-map-k  n-gram lookup             no

The n-gram methods need no draft model at all, so they work on any model.
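If you launch the provider from a script, a small helper can catch a mismatched combination before the engine starts. This is a sketch under assumptions: the helper name is made up, it covers only a local model with no catalog draft_url, and it assumes that setting LLAMAMP_SPEC_TYPE alone is enough to select an n-gram method.

```python
# Which LLAMAMP_SPEC_TYPE values need a draft model, per the table above.
NEEDS_DRAFT = {
    "draft-mtp": True,
    "draft-simple": True,
    "draft-eagle": True,
    "ngram-simple": False,
    "ngram-map-k": False,
}

def spec_env(spec_type="draft-mtp", draft_model=None):
    """Build the environment variables for one speculative-decoding method."""
    if spec_type not in NEEDS_DRAFT:
        raise ValueError(f"unknown LLAMAMP_SPEC_TYPE: {spec_type}")
    env = {"LLAMAMP_SPEC_TYPE": spec_type}
    if NEEDS_DRAFT[spec_type]:
        if not draft_model:
            raise ValueError(f"{spec_type} needs a draft model")
        env["LLAMAMP_DRAFT_MODEL"] = draft_model
    return env

print(spec_env("ngram-simple"))
print(spec_env("draft-mtp", "/models/drafter.gguf"))
```

The returned dict can be merged into os.environ and passed as the env of the process that runs ogong-provider local.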

Multi-GPU: split a model across GPUs

The embedded engine already spreads a model across all visible GPUs in layer-split mode. For tensor parallelism across GPUs, pass the engine’s split flags through a catalog model’s args array (every flag there is appended verbatim to the engine command):

ogong-provider local --served-models \
  '[{"id":"big","kind":"Text","path":"/models/big.gguf",
     "args":["--split-mode","tensor","--tensor-split","1,1","--flash-attn","on"]}]'

Any flag the engine supports can be set this way, per model. Alternatively, launch your own llama-server with the split flags and front it with ogong-provider serve --upstream http://127.0.0.1:8080/v1 ....
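Hand-quoting JSON inside a shell string is easy to get wrong, so it can help to generate the --served-models value. The sketch below builds the same entry as the example above and prints a shell-safe command line; the helper name is hypothetical, and only the fields shown in this page (id, kind, path, args) are used.

```python
import json
import shlex

def served_models_arg(model_id, path, engine_args, kind="Text"):
    """Return the JSON string for a one-entry --served-models list."""
    entry = {"id": model_id, "kind": kind, "path": path, "args": list(engine_args)}
    return json.dumps([entry], separators=(",", ":"))

value = served_models_arg(
    "big",
    "/models/big.gguf",
    ["--split-mode", "tensor", "--tensor-split", "1,1", "--flash-attn", "on"],
)
# shlex.quote wraps the JSON so the shell passes it as a single argument.
print("ogong-provider local --served-models " + shlex.quote(value))
```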

Multi-node: split a model across machines

A single model can be sharded across the GPUs of several machines. Each worker node runs the rpc-server binary (shipped with the provider) to expose its GPU; the provider node lists the workers, and the engine splits the model’s layers across the pool.

# on each worker box, expose its GPU over RPC:
rpc-server --host 0.0.0.0 --port 50052

# on the provider box, point at the workers; the model is sharded across them:
LLAMAMP_RPC_SERVERS="10.0.0.2:50052,10.0.0.3:50052" \
  ogong-provider local --model big.gguf

Trusted networks only: The RPC transport is unauthenticated and unencrypted. Run it only over a private network you control, never the public internet.

You run your own rpc-server nodes and point the provider at them. Automatic discovery and a sharding policy (the provider spawning and balancing remote workers for you) are a separate layer still to come; for now this is the manual enable-and-point path.
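Because the path is manual, it is worth checking that every worker is listening before you start the provider. The sketch below formats the LLAMAMP_RPC_SERVERS value and probes each worker with a plain TCP connect; both helper names are hypothetical, and a successful connect only shows that the port is open, not that a healthy rpc-server is behind it.

```python
import socket

def rpc_servers_value(workers):
    """Join (host, port) pairs into a comma-separated host:port list."""
    parts = []
    for host, port in workers:
        if not 0 < int(port) < 65536:
            raise ValueError(f"bad port for {host}: {port}")
        parts.append(f"{host}:{int(port)}")
    return ",".join(parts)

def unreachable(workers, timeout=2.0):
    """Return the workers whose TCP port does not accept a connection."""
    down = []
    for host, port in workers:
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                pass
        except OSError:  # refused, timed out, or name resolution failed
            down.append((host, port))
    return down

workers = [("10.0.0.2", 50052), ("10.0.0.3", 50052)]
print("LLAMAMP_RPC_SERVERS=" + rpc_servers_value(workers))
```

Call unreachable(workers) on the provider box and fix any entry it returns before launching.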

Your machines vs. the network: The RPC path above shards a model across your own machines, which you run and trust. To serve a model too big for your hardware by joining a cohort of independent providers that each verify and get paid for their slice, see Split inference, the network-level capability.