🧠Choosing and sizing your model
Which local model should you run on YOUR machine? The 2026 open-weight landscape (mostly Chinese), and the simple rule for fitting them to your RAM.
Here’s the truth nobody likes to hear: you don’t choose your model, your machine does. You can dream of running the biggest open-weight monster on the planet, but if you have 32 GB of RAM, it’ll never launch. The good news is that in 2026 the local landscape has become excellent, as long as you aim right. We’ll first lay down the rule that decides everything (sizing), then look at the catalog, and finally settle the real question: when your mini-PC is enough, and when Claude stays in front.
The sizing rule (the one that decides everything)
Before model names, physics. A model is billions of parameters (the “weights”), and each one takes up space in memory. The mental formula fits on one line:
RAM needed ≈ (number of parameters × bytes per weight) + the context cache (the KV cache)
The bytes per weight depend on precision (we’ll come back to that with quantization). In practice, in Q4_K_M, the standard format, count ~0.56 GB per billion parameters. Add 0.5 to 2 GB for the system, and +30 to 50% if you work with a long context or several parallel requests.
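To make the rule concrete, here is a minimal Python sketch of it. The function name and its defaults are illustrative, and applying the long-context margin to the weights only is an assumption of this sketch, not something the rule spells out.

```python
def estimate_ram_gb(params_billions, gb_per_billion=0.56,
                    system_gb=1.0, context_margin=0.0):
    """Rough RAM estimate (GB) for a model quantized to Q4_K_M.

    context_margin is the extra fraction (0.3 to 0.5 in the text) reserved
    for long contexts or parallel requests. Applying it to the weights only
    is an assumption of this sketch.
    """
    weights_gb = params_billions * gb_per_billion
    return weights_gb * (1.0 + context_margin) + system_gb


if __name__ == "__main__":
    for size in (8, 32, 70):
        short = estimate_ram_gb(size)
        long_ctx = estimate_ram_gb(size, context_margin=0.5)
        print(f"{size}B: ~{short:.1f} GB short context, "
              f"~{long_ctx:.1f} GB with a +50% margin")
```

With the defaults, this gives roughly 5.5 GB for an 8B model, 18.9 GB for a 32B and 40.2 GB for a 70B, before any long-context margin.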
A few reference points in Q4_K_M, to train your eye: a Llama 8B fits in ~6 GB, a dense 32B model in ~19 GB, a Llama 70B demands 40-43 GB. And here’s the table to keep handy, by memory tier:
| Available memory | What you run (Q4) | Recommended model |
|---|---|---|
| 16 GB | A dense 7-8B, or a small compact MoE | qwen2.5-coder:7b |
| 32 GB | A 30B Q4 comfortably ✅ | qwen3-coder:30b |
| 64 GB | A 70B Q4/Q5, or GLM-4.5-Air | now local gets serious |
| 96 GB + | KV headroom for long context, big well-quantized MoE | your call |
| 250 GB + | The giants (Qwen3-Coder 480B, GLM-4.6, Kimi) | workstation only |
The shortcut: let a tool size it for you
A small open-source tool applies the whole rule we just laid down (counting GBs, gauging the quant, spotting MoEs) for you in two seconds: llmfit (MIT license, repo AlexsJones/llmfit). It scans your machine (RAM, CPU cores, VRAM, multi-GPU) and the installed runtimes (Ollama, llama.cpp, LM Studio, MLX, Docker Model Runner), cross-references that with a catalog of models and quants, and gives you a scored ranking: what fits, what will run fast, and the quant to aim for. Enough to avoid pulling 20 GB of model for nothing.
Install it
Through your usual package manager, depending on your system:
# macOS / Linux (Homebrew)
brew install AlexsJones/llmfit/llmfit
# macOS / Linux (quick script)
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
# Windows (Scoop)
scoop install llmfit
# Without installing anything permanently: Docker
docker run ghcr.io/alexsjones/llmfit
Run the analysis
With no argument, you get an interactive terminal UI (and a web dashboard at http://<machine-ip>:8787). Prefer a raw table or JSON to script with? The options are there.
llmfit # interactive UI + web dashboard
llmfit --cli # classic table in the terminal
llmfit recommend --json --use-case coding --limit 3 # top 3 for coding, as JSON
The 2026 model landscape
The headline of the year: Chinese labs dominate open-weight. Two useful silhouettes stand out. The giant MoE (≈1000 billion total parameters, ~30B active), for the cloud or the workstation. And the compact MoE (~30B total, ~3B active), which actually runs on a mini-PC. It’s this second family that interests us.
Here are the realistic choices for a local machine:
| Model | Type / size | What for | Can you run it? |
|---|---|---|---|
| Qwen3-Coder 30B | MoE, 30B total / 3.3B active | Code, refactor, debug, 256K context | ✅ From 32 GB, the best mini-PC pick |
| Qwen2.5-Coder 7B/14B/32B | Dense | Autocomplete (FIM), completion | ✅ Depending on RAM (7B from 16 GB) |
| GLM-4.5-Air | MoE, ~106B | The realistic GLM locally | ⚠️ If you have 64-96 GB |
| Gemma 4 | MoE, ~26B (Google) | An honest non-Chinese starting point | ✅ On a decent machine |
| GLM-4.6 | ~357B, MIT license | Claude Sonnet 4.x level on real code | ❌ Workstation / API |
| DeepSeek-V3.2 | MoE + sparse attention | Top “intelligence,” excellent long context | ❌ Too big for a mini-PC |
| Kimi K2.6 (Moonshot) | 1000B MoE / 32B active | King of open-weight agentic benchmarks | ❌ API / big workstation |
The first choice on a mini-PC is qwen3-coder:30b, no hesitation. For background autocomplete, qwen2.5-coder:14b or 32b are perfect. And if you have a well-equipped machine (64-96 GB), GLM-4.5-Air is the GLM that fits.
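As a rough cross-check of the table above, the sizing rule can be run over the catalog. This is only a sketch: the parameter counts are the totals quoted in the table (a MoE still has to hold all of its experts in memory, so the total is what counts, not the active share), DeepSeek-V3.2 is left out because no size is given for it, and the 2 GB system allowance and 30% context margin are assumptions picked from the ranges in the sizing rule.

```python
# Total parameter counts (in billions) as quoted in the table above.
CATALOG = {
    "qwen2.5-coder:7b": 7,
    "qwen2.5-coder:14b": 14,
    "Gemma 4": 26,
    "qwen3-coder:30b": 30,
    "qwen2.5-coder:32b": 32,
    "GLM-4.5-Air": 106,
    "GLM-4.6": 357,
    "Kimi K2.6": 1000,
}

GB_PER_BILLION_Q4 = 0.56  # Q4_K_M rule of thumb from the sizing section
SYSTEM_GB = 2.0           # assumed system allowance (upper end of 0.5 to 2 GB)


def models_that_fit(ram_gb, context_margin=0.3):
    """Return the catalog entries whose Q4 estimate fits in ram_gb."""
    fitting = []
    for name, params_b in CATALOG.items():
        needed = params_b * GB_PER_BILLION_Q4 * (1 + context_margin) + SYSTEM_GB
        if needed <= ram_gb:
            fitting.append(name)
    return fitting


if __name__ == "__main__":
    for ram in (32, 64, 96):
        print(f"{ram} GB: {models_that_fit(ram)}")
```

Under these assumptions GLM-4.5-Air only fits in 64 GB once the context margin is dropped to zero, which is why the table flags it with a warning rather than a green light.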
Quantization, plainly
You’ve seen names like Q4_K_M go by. Let’s decode them. GGUF is the file format for local models, and the suffix indicates the precision: how many bits per weight you keep. Fewer bits = lighter file = it fits in less RAM. The trade-off is a small loss of quality.
| Level | Quality | When to use it |
|---|---|---|
| Q8_0 | Near perfect | Rarely justified for code (too heavy) |
| Q6_K / Q5_K_M | Excellent | Critical reasoning, if you have the memory headroom |
| Q4_K_M | Very good | The de facto standard: your default choice |
| < Q4 | Visible degradation | Only if memory forces you |
Q4_K_M weighs about 28% of the original FP16 model for only ~2-3% loss. For code specifically, it holds up very well: moving to Q5 or Q6 doesn’t perceptibly improve code generation. You only drop below Q4 when forced, and you feel it fast (code and reasoning are the first to suffer).
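The 28% figure and the ~0.56 GB per billion parameters from the sizing rule are the same number seen from two angles, as this small sketch shows. The function names are illustrative; the only inputs are the 2 bytes per weight of FP16 and the ratio quoted above.

```python
FP16_BYTES_PER_WEIGHT = 2.0  # 16 bits per weight
Q4_K_M_RATIO = 0.28          # share of the FP16 size quoted in the text


def fp16_size_gb(params_billions):
    # 1 billion weights x 2 bytes = 2 GB
    return params_billions * FP16_BYTES_PER_WEIGHT


def q4_k_m_size_gb(params_billions):
    # Apply the quoted ratio to the FP16 size.
    return fp16_size_gb(params_billions) * Q4_K_M_RATIO


if __name__ == "__main__":
    size = 30
    print(f"{size}B in FP16:   ~{fp16_size_gb(size):.1f} GB")
    print(f"{size}B in Q4_K_M: ~{q4_k_m_size_gb(size):.1f} GB")
    print(f"Per billion parameters: ~{q4_k_m_size_gb(1):.2f} GB")
```

For a 30B model that is about 60 GB in FP16 against roughly 16.8 GB in Q4_K_M.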
Wiring the right quant into Ollama
In practice, it’s simple: with Ollama, the tag already encodes the quant. You choose your precision by writing it into the name.
Pull the model at the right quant
The suffix at the end of the tag is the quantization. Without a suffix, Ollama takes a default (often Q4_K_M).
# The default (usually Q4_K_M), perfect to start
ollama pull qwen3-coder:30b
# Or the explicit quant, if you want to be sure
ollama pull qwen3-coder:30b-a3b-q4_K_M
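If you script your pulls, it can help to read the quant back out of a model reference. The helper below is a hypothetical illustration, not part of Ollama: it only assumes references shaped like the two shown above, where the quant is the last dash-separated piece of the tag and starts with "q" followed by a digit.

```python
def split_ollama_tag(reference):
    """Split 'name:tag' and extract a trailing '-q<digit>...' quant suffix.

    Hypothetical helper for illustration. It assumes the naming pattern
    shown in the examples above and nothing more.
    """
    name, _, tag = reference.partition(":")
    base, sep, last = tag.rpartition("-")
    looks_like_quant = last[:1].lower() == "q" and last[1:2].isdigit()
    if sep and looks_like_quant:
        return {"model": name, "variant": base, "quant": last}
    # No explicit quant in the tag: the runtime's default applies.
    return {"model": name, "variant": tag, "quant": None}


if __name__ == "__main__":
    print(split_ollama_tag("qwen3-coder:30b"))
    print(split_ollama_tag("qwen3-coder:30b-a3b-q4_K_M"))
```

A `None` quant simply means the tag does not say, so you get whatever default the library ships for that tag.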
Fit the context to the real need
Context costs RAM (the famous KV cache). By default OLLAMA_CONTEXT_LENGTH is 2048, which is ridiculously low for code. But bumping it too high inflates RAM and can push you into swap: that’s the #1 cause of “Ollama is slow.” Set num_ctx to what the task really needs, not to the maximum by reflex.
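To see why the context window weighs on RAM, here is a hedged sketch of the usual KV cache estimate (keys and values stored for every layer and every token), followed by a request body that sets num_ctx per call through the options field of Ollama’s /api/generate endpoint. The architecture numbers are illustrative placeholders, not read from any model file, and the 2 bytes per value assume a 16-bit cache.

```python
import json


def kv_cache_gib(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in GiB for one sequence of num_ctx tokens."""
    # Factor 2: one key vector and one value vector per layer and per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * num_ctx / 1024**3


def generate_request(model, prompt, num_ctx):
    """Build a JSON body for POST /api/generate with a per-request num_ctx."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    })


if __name__ == "__main__":
    # Illustrative architecture values (assumption), adjust to your model.
    arch = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for ctx in (2048, 8192, 32768):
        print(f"num_ctx={ctx}: ~{kv_cache_gib(ctx, **arch):.2f} GiB of KV cache")
    print(generate_request("qwen3-coder:30b", "Refactor this function", 8192))
```

The cache grows linearly with num_ctx, so a window 16 times larger costs 16 times the cache. That is the whole argument for sizing the window to the task.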
Honestly: local or Claude?
We won’t sell you a dream. As of mid-2026, here’s where local stands, unfiltered.
For bounded, private tasks (short-window generation, refactoring, autocomplete, targeted debugging), local is genuinely competitive. It’s private, it’s free to use, and the benchmark gap has narrowed sharply (GLM-4.6 rivals Claude Sonnet 4.x, Kimi K2.6 is strong on agentics). And a point too often forgotten: an open-weight model performs much better in a good harness (a well-configured OpenCode) than in a raw chat.
But let’s be clear on two limits:
- The absolute summit is still Claude. Claude Fable 5 scores around 95% on SWE-bench Verified, clearly above the open-weight cluster (~80%).
- The best open models don’t run on a mini-PC. Kimi (1000B), GLM-5.2, MiniMax M3: those are API or workstation territory. What fits locally (Qwen3-Coder 30B, GLM-4.5-Air) is a notch below on long-haul agentic reliability and tool use.