autotune sits between your code and Ollama and applies automatic optimizations: right-sized KV buffers, KV precision tuning, system prompt caching, intelligent context management, and model keep-alive. The result: 300+ MB freed per request, first word up to 53% faster, and your computer stays responsive. No config changes. Your code stays exactly the same.
pip install llm-autotune
autotune chat --model qwen3:8b

How it works
autotune sits between your code and Ollama as a transparent proxy. Before each request reaches Ollama, autotune calculates the exact memory it needs, watches live RAM usage from your other apps, and adjusts automatically. No config. No changes to your code or output quality.
Every time Ollama runs your prompt, it must first allocate a block of RAM called the KV cache — it's where it stores the attention state for every token in the context window. By default, Ollama always allocates for 4,096 tokens. For a typical 50-word message, that's allocating 12× more RAM than the message actually needs. autotune measures the real token count, adds a safe headroom buffer, and tells Ollama the exact minimum. That freed RAM goes back to your browser, your apps, your system.
Buckets (512, 768, 1024, 1536, 2048…) prevent Ollama from reallocating the Metal buffer on every call — requests with similar lengths reuse the same pre-allocated buffer, eliminating 100–300 ms of KV thrashing overhead per request.
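If you want the intuition in code, here is a minimal sketch of the right-sizing and bucketing logic. This is not autotune's implementation; the 4-characters-per-token estimate, the headroom value, and the exact bucket list are illustrative assumptions.

```python
BUCKETS = [512, 768, 1024, 1536, 2048, 3072, 4096]

def estimate_tokens(messages):
    """Crude token estimate: roughly 4 characters per token."""
    chars = sum(len(m.get("content", "")) for m in messages)
    return max(1, chars // 4)

def right_sized_ctx(messages, reply_headroom=256):
    """Smallest bucket that fits the prompt plus headroom for the reply."""
    needed = estimate_tokens(messages) + reply_headroom
    for bucket in BUCKETS:
        if needed <= bucket:
            return bucket
    return BUCKETS[-1]  # fall back to the full default window

msgs = [{"role": "user", "content": "Summarize this paragraph for me."}]
print(right_sized_ctx(msgs))  # a short request gets a 512-token window, not 4,096
```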
Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: Chrome opens a tab, Xcode compiles, a background process wakes up. autotune reads your system's actual RAM usage before every single request and applies two independent levers — context window size and KV precision — to ensure the model never pushes your system into disk swap, even as conditions change.
KV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half instantly — with no meaningful quality impact. Q8 stores each attention value in 1 byte instead of 2; the difference in model output is undetectable in practice. These adjustments happen automatically — you see a brief note in the chat UI when one fires.
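As a rough rule, KV cache size is 2 (K and V) × layers × KV heads × head dimension × context tokens × bytes per value, so dropping from 2 bytes (F16) to 1 byte (Q8) halves it. Here is a hedged sketch of that arithmetic and the free-RAM check; the layer and head counts are placeholders rather than real model specs, and the switching threshold is purely illustrative.

```python
import psutil  # assumption: psutil is available for the free-RAM check

def kv_cache_bytes(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    # K and V are stored per layer, per KV head, per head dimension, per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value

ctx = 2048
f16 = kv_cache_bytes(ctx, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_value=2)
q8 = kv_cache_bytes(ctx, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_value=1)
print(f"F16: {f16 / 2**20:.0f} MB  Q8: {q8 / 2**20:.0f} MB")  # Q8 is exactly half

available = psutil.virtual_memory().available
# Illustrative rule: drop to Q8 when free RAM is tight.
# The type names follow Ollama's KV cache type convention (f16 / q8_0).
kv_type = "q8_0" if available < 4 * f16 else "f16"
print(kv_type)
```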
In any multi-turn chat, Ollama re-processes your entire system prompt from scratch on every message. autotune pins those tokens in the KV cache so they're only ever evaluated once — at the start. Every follow-up turn gets faster because fewer tokens need processing. The savings compound with every turn.
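The mechanics are easy to picture: when two consecutive requests share a token prefix (the system prompt plus earlier turns), only the new suffix needs prefill. A toy illustration, using words as stand-in tokens instead of a real tokenizer or a real KV cache:

```python
def shared_prefix_len(prev_tokens, new_tokens):
    """Number of leading tokens two consecutive requests have in common."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

system = "You are a concise, helpful assistant.".split()
turn_1 = system + "User: hi".split()
turn_2 = system + "User: hi Assistant: hello! User: what changed?".split()

cached = shared_prefix_len(turn_1, turn_2)
print(f"{cached} tokens reused from cache, {len(turn_2) - cached} new tokens to prefill")
```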
Ollama unloads the model after 5 minutes idle — a 1–4 second reload every time you come back to it. autotune keeps the model resident in memory between sessions. The weights were already using that RAM; keeping them there costs nothing extra and eliminates the cold-start delay entirely.
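autotune manages this for you. If you were calling Ollama's HTTP API directly, the underlying knob is the keep_alive field on the request: -1 keeps the weights resident, and Ollama's default is 5 minutes. A small example using the requests library (the model name is just an example):

```python
import requests

requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "keep_alive": -1,   # keep the weights loaded; Ollama's default is 5 minutes
        "stream": False,
    },
    timeout=120,
)
```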
Benchmark numbers use qwen3:8b / llama3.2:3b on Apple M2 16 GB. KV savings scale with model size — larger models free more RAM in absolute terms. Generation speed and output quality are unchanged: autotune touches only buffer sizes, precision, and scheduling — never model weights or sampling.
All 14 optimizations explained →

Measured results
Benchmarked on Apple M2 16 GB using Ollama's internal Go nanosecond timers — not wall-clock estimates. 3 runs × 5 prompt types, Wilcoxon signed-rank test. Every number here is reproducible with autotune proof.
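If you want to run the same kind of significance check on your own numbers, a paired Wilcoxon test in SciPy looks like this. The timings below are invented for illustration, not benchmark data.

```python
from scipy.stats import wilcoxon  # assumption: SciPy is installed

# Paired TTFT samples (ms) for the same prompts, raw Ollama vs. autotune.
ttft_raw = [812, 790, 1203, 655, 980, 870, 1010, 720, 940, 860]
ttft_autotune = [402, 415, 610, 350, 470, 430, 505, 380, 460, 445]

stat, p = wilcoxon(ttft_raw, ttft_autotune)
print(f"W={stat:.1f}, p={p:.4f}")  # a small p-value means the gap is unlikely to be noise
```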
| Model | KV: Before | KV: After | RAM freed | First word |
|---|---|---|---|---|
| qwen3:8b | 576 MB | 195 MB | 381 MB | −53% |
| llama3.2:3b | 448 MB | 155 MB | 293 MB | −35% |
| gemma4:e2b | 96 MB | 30 MB | 66 MB | −29% |
TTFT improvement is largest when the model is cold or when RAM is under pressure. Generation speed (tok/s) is Metal GPU-bound and is not affected by autotune. KV savings apply every single request regardless of hardware.
On every request you send, Ollama allocates a KV buffer for 4,096 tokens. autotune sizes it to the actual prompt — returning hundreds of MB to your system on every single call, automatically.
The KV buffer must be initialized before token 1. A smaller buffer initializes faster. On qwen3:8b, autotune cuts first-word time from the raw baseline by 53% — every new session, every cold request.
autotune changes only the KV buffer size. Model weights, sampling, and generation speed are identical. prompt_eval_count is unchanged — no tokens are dropped or skipped.
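You can sanity-check that last point against Ollama directly: send the same short prompt with the default window and with a right-sized one, and prompt_eval_count comes back identical. A small sketch (the model name and prompt are just examples):

```python
import requests

def prompt_eval_count(num_ctx):
    """prompt_eval_count for one non-streaming chat call at a given window size."""
    r = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "qwen3:8b",
            "messages": [{"role": "user", "content": "Explain KV caches in one sentence."}],
            "options": {"num_ctx": num_ctx},
            "stream": False,
        },
        timeout=120,
    )
    return r.json()["prompt_eval_count"]

# Run each against a fresh model load so Ollama's own prompt caching
# doesn't skew the comparison.
print(prompt_eval_count(4096), prompt_eval_count(512))  # expect identical counts
```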
Multi-turn & agentic workloads
Single-prompt benchmarks miss the real problem: context accumulates. Each tool call, each reasoning step, each file read appends more tokens. By turn 8, the model is processing 5–8× more tokens than turn 1 — and raw Ollama's fixed 4,096-token window runs out, forcing a full model reload mid-session.
autotune computes a session-ceiling KV window once before the loop starts and locks it for the entire session. No reloads. And because the system prompt is pinned via prefix caching, TTFT actually falls as the session grows — not climbs.
| Metric | Raw Ollama | autotune |
|---|---|---|
| Session wall time | 74 s | 40 s |
| Model reloads | 0.5 | 0.5 |
| TTFT trend per turn | −101 ms/turn | −435 ms/turn |
| Swap events | 0 | 0 |
| Context at session end | 3,043 tokens | 1,946 tokens |
The system prompt is pinned in KV after turn 1 and never re-evaluated. Each new turn only prefills the new tokens — not the full conversation from scratch. By turn 5, autotune is noticeably faster than turn 1. By turn 10, the difference compounds significantly.
autotune computes a KV window for the full session ceiling before the first turn, then holds it constant. Raw Ollama's fixed 4,096-token window fills up mid-task and forces a model reload (~1–3 s each). autotune trades a slightly higher turn-1 cost to eliminate all reloads, as sketched below.
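In sketch form, the ceiling is just "system prompt + expected turns + headroom, rounded up to a bucket", computed once before the loop. The turn estimates below are guesses for illustration, not values autotune actually uses.

```python
BUCKETS = [1024, 1536, 2048, 3072, 4096, 6144, 8192]

def session_ceiling(system_tokens, expected_turns, tokens_per_turn, reply_headroom=512):
    """One KV window for the whole session: prompt + all expected turns + headroom."""
    needed = system_tokens + expected_turns * tokens_per_turn + reply_headroom
    return next((b for b in BUCKETS if needed <= b), BUCKETS[-1])

ctx = session_ceiling(system_tokens=600, expected_turns=10, tokens_per_turn=180)
print(ctx)  # -> 3072, held constant every turn, so the model never reloads mid-session
```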
Benchmark: code_debugger task, N=2 trials, Apple M2 16 GB, llama3.2:3b balanced profile. Timings from Ollama's internal Go nanosecond timers. Full methodology in AGENT_BENCHMARK.md.
Verify it yourself
autotune ships with a built-in benchmark that runs two head-to-head tests on your hardware in about 30 seconds. It uses Ollama's own internal Go nanosecond timers — nothing estimated, nothing made up.
Works with any model you have installed in Ollama. Picks the smallest installed model automatically if you don't specify one.
autotune proof -m qwen3:8b
# Runs in ~30 seconds. Uses Ollama's own timers.
# Saves a proof_qwen3_8b.json you can share.

Run autotune proof --list-models to see which Ollama models are available on your machine.

Quickstart
No Ollama commands needed — autotune handles everything for you.
pip install llm-autotune
autotune recommend
autotune pull qwen3:8b
autotune chat --model qwen3:8b
autotune proof -m qwen3:8b
pip install "llm-autotune[mlx]"

import autotune
from openai import OpenAI
autotune.start() # start the optimizing proxy
client = OpenAI(**autotune.client_kwargs())
response = client.chat.completions.create(
model="qwen3:8b",
messages=[{"role": "user", "content": "Hello!"}],
)
# Every optimization is automatic.

autotune serve
# → http://localhost:8765/v1
# Any OpenAI client works automatically.

Docker
The Docker image bundles Ollama and autotune in a single container. No local install needed — just pull the image, mount a volume for model storage, and your OpenAI-compatible endpoint is ready on port 8765.
# Build once
docker build -t autotune .
# Run — autotune on :8765, models cached in a volume
docker run -p 8765:8765 \
-v ollama_models:/root/.ollama \
-e OLLAMA_MODEL=qwen3:8b \
  autotune

OLLAMA_MODEL auto-pulls the model on first start. Models are cached in the named volume and persist across restarts.
--profile single: Ollama + autotune in one container. Simplest setup.
--profile multi: Separate services. Lighter autotune image (~200 MB). Set AUTOTUNE_OLLAMA_URL=http://ollama:11434.
ollama/ollama:latest — includes CUDA and ROCm layers. Add --gpus all for NVIDIA, or mount /dev/kfd for AMD.
What to run
autotune works with any Ollama model. These are the best options as of April 2026. Run autotune recommend to get a hardware-specific recommendation.
| RAM / use case | Model | Size |
|---|---|---|
| 8 GB | qwen3:4b | ~2.6 GB |
| 16 GB | qwen3:8b | ~5.2 GB |
| 16 GB | gemma4:e2b | ~5.8 GB |
| 24 GB | qwen3:14b | ~9.0 GB |
| 32 GB | qwen3:30b-a3b | ~17 GB |
| Coding | qwen2.5-coder:14b | ~9.0 GB |
| Reasoning | deepseek-r1:14b | ~9.0 GB |
Open source, MIT licensed. Works with whatever Ollama models you already have. The autotune proof command will show you the exact improvement on your own hardware.
pip install llm-autotune