Step 18 · Local AI Intermediate · 16 min

🧠 Choosing and sizing your model

Which local model should you run on YOUR machine? The 2026 open-weight landscape (mostly Chinese), and the simple rule for fitting a model to your RAM.

Here’s the truth nobody likes to hear: you don’t choose your model, your machine does. You can dream of running the biggest open-weight monster on the planet, but if you have 32 GB of RAM, it’ll never launch. The good news is that in 2026 the local landscape has become excellent, as long as you aim right. We’ll first lay down the rule that decides everything (sizing), then look at the catalog, and finally settle the real question: when your mini-PC is enough, and when Claude stays in front.

The sizing rule (the one that decides everything)

Before model names, physics. A model is billions of parameters (the “weights”), and each one takes up space in memory. The mental formula fits on one line:

RAM needed ≈ (number of parameters × bytes per weight) + the context cache (the KV cache)

The bytes per weight depend on precision (we’ll come back to that with quantization). In practice, in Q4_K_M, the standard format, count ~0.56 GB per billion parameters. Add 0.5 to 2 GB of overhead for the system, and budget another 30 to 50% if you work with a long context or several parallel requests.
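As a worked example, the one-line formula can be turned into a few lines of Python. This is a minimal sketch, not an exact predictor: the function name `estimate_ram_gb` and its parameters are made up for illustration, the 0.56 GB per billion figure and the overhead ranges come from the rule above, and applying the long-context margin to the weights is an assumption of this sketch.

```python
# Rule-of-thumb figure from the text: Q4_K_M takes ~0.56 GB per billion parameters.
GB_PER_BILLION_Q4_K_M = 0.56


def estimate_ram_gb(params_billion, system_gb=1.0, context_margin=0.0):
    """Rough RAM estimate in GB for a dense model in Q4_K_M.

    system_gb: fixed overhead (the text suggests 0.5 to 2 GB).
    context_margin: extra fraction for long context or parallel requests
        (the text suggests 0.30 to 0.50); applied to the weights here,
        which is an assumption of this sketch.
    """
    weights_gb = params_billion * GB_PER_BILLION_Q4_K_M
    return weights_gb * (1 + context_margin) + system_gb


# Compare a short-context and a long-context budget for three model sizes.
for size in (8, 32, 70):
    short = estimate_ram_gb(size)
    long_ctx = estimate_ram_gb(size, context_margin=0.4)
    print(f"{size}B: ~{short:.1f} GB short context, ~{long_ctx:.1f} GB long context")
```

With the defaults, the 8B, 32B and 70B cases come out at roughly 5.5, 18.9 and 40.2 GB, the same ballpark as the usual reference points for those sizes.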

A few reference points in Q4_K_M, to train your eye: a Llama 8B fits in ~6 GB, a dense 32B model in ~19 GB, a Llama 70B demands 40-43 GB. And here’s the table to keep handy, by memory tier:

Available memory	What you run (Q4)	Recommended model
16 GB	A dense 7-8B, or a small compact MoE	`qwen2.5-coder:7b`
32 GB	A 30B Q4 comfortably ✅	`qwen3-coder:30b`
64 GB	A 70B Q4/Q5, or GLM-4.5-Air	now local gets serious
96 GB +	KV headroom for long context, big well-quantized MoE	your call
250 GB +	The giants (Qwen3-Coder 480B, GLM-4.6, Kimi)	workstation only
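The same rule can be read backwards: given a memory budget, how large a dense model could fit? A rough sketch under stated assumptions: the helper name is hypothetical, the 0.56 GB per billion figure comes from the sizing rule, and the default overhead and context margin are cautious picks from the ranges given there.

```python
GB_PER_BILLION_Q4_K_M = 0.56  # rule-of-thumb figure from the text


def max_dense_params_billion(available_gb, system_gb=2.0, context_margin=0.4):
    """Largest dense Q4_K_M model (in billions of parameters) that fits.

    Inverts: available = weights * (1 + context_margin) + system_gb.
    Both defaults are assumptions taken from the ranges in the text.
    """
    usable_gb = max(available_gb - system_gb, 0.0)
    return usable_gb / (GB_PER_BILLION_Q4_K_M * (1 + context_margin))


# One line per memory tier from the table.
for tier_gb in (16, 32, 64, 96):
    ceiling = max_dense_params_billion(tier_gb)
    print(f"{tier_gb} GB -> up to ~{ceiling:.0f}B dense in Q4_K_M")
```

Treat the result as a ceiling, not a recommendation: the table stays below it because the operating system and your other applications need their share of the same memory.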

The shortcut: let a tool size it for you

A small open-source tool applies the whole rule we just laid down (counting GBs, gauging the quant, spotting MoEs) in two seconds: llmfit (MIT license, repo AlexsJones/llmfit). It scans your machine (RAM, CPU cores, VRAM, multi-GPU) and the installed runtimes (Ollama, llama.cpp, LM Studio, MLX, Docker Model Runner), cross-references that with a catalog of models and quants, and gives you a scored ranking: what fits, what will run fast, and the quant to aim for. Enough to avoid pulling 20 GB of model for nothing.

Install it

Through your usual package manager, depending on your system:

# macOS / Linux (Homebrew)
brew install AlexsJones/llmfit/llmfit
# macOS / Linux (quick script)
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
# Windows (Scoop)
scoop install llmfit
# Without installing anything permanently: Docker
docker run ghcr.io/alexsjones/llmfit

Run the analysis

With no argument, you get an interactive terminal UI (and a web dashboard at http://<machine-ip>:8787). Prefer a raw table or JSON to script with? The options are there.

llmfit              # interactive UI + web dashboard
llmfit --cli        # classic table in the terminal
llmfit recommend --json --use-case coding --limit 3   # top 3 for coding, as JSON

The mode that helps BEFORE you buy

llmfit can simulate a config that isn’t (yet) yours. Torn between 32 and 64 GB before ordering a machine? Ask it what each tier would unlock:

llmfit --memory=24G --ram=64G --cpu-cores=8 fit

--ram = total RAM, --memory = the GPU’s VRAM. Enough to settle a purchase on real numbers rather than gut feel. It’s the “by memory tier” table from the start, but computed for YOUR hardware.
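To compare the 32 GB and 64 GB options side by side, the same command can be generated once per tier. A small sketch that only uses the flags shown in the example above; the helper name `llmfit_fit_command` is made up for illustration, and the 24 GB of VRAM and 8 cores are simply carried over from that example.

```python
import shlex


def llmfit_fit_command(ram_gb, vram_gb, cpu_cores):
    """Build the simulation command from the example above as an argument list."""
    return [
        "llmfit",
        f"--memory={vram_gb}G",      # GPU VRAM
        f"--ram={ram_gb}G",          # total system RAM
        f"--cpu-cores={cpu_cores}",
        "fit",
    ]


# Print one command per RAM tier under consideration.
for ram_gb in (32, 64):
    print(shlex.join(llmfit_fit_command(ram_gb, vram_gb=24, cpu_cores=8)))
```

Once llmfit is installed, each argument list can be handed to `subprocess.run` instead of being printed, so that you can compare the two rankings.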

The 2026 model landscape

The headline of the year: Chinese labs dominate open-weight. Two useful silhouettes stand out: the giant MoE (≈1000 billion total parameters, ~30B active), for the cloud or the workstation, and the compact MoE (~30B total, ~3B active), which actually runs on a mini-PC. It’s this second family that interests us.
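Why the compact MoE is the one that runs: every expert has to sit in memory, so the total parameter count sets the RAM footprint, while only the active parameters are read for each generated token. A rough sketch of that split, reusing the ~0.56 GB per billion Q4_K_M figure from the sizing rule. The function name is illustrative, and the per-token figure ignores the KV cache and runtime overhead.

```python
GB_PER_BILLION_Q4_K_M = 0.56  # rule-of-thumb figure from the sizing section


def moe_footprint_gb(total_billion, active_billion):
    """Return (GB of weights held in memory, GB of weights read per token)."""
    resident_gb = total_billion * GB_PER_BILLION_Q4_K_M    # all experts stay loaded
    per_token_gb = active_billion * GB_PER_BILLION_Q4_K_M  # only active weights are used
    return resident_gb, per_token_gb


# Parameter counts as quoted in this section; a dense model has active == total.
models = {
    "compact MoE (30B total, 3B active)": (30, 3),
    "dense 32B": (32, 32),
    "giant MoE (1000B total, 30B active)": (1000, 30),
}
for name, (total, active) in models.items():
    resident, per_token = moe_footprint_gb(total, active)
    print(f"{name}: ~{resident:.0f} GB in memory, ~{per_token:.1f} GB read per token")
```

The compact MoE needs about as much memory as a dense 32B but touches roughly a tenth of the weights per token, which is why it stays responsive on modest hardware.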

Here are the realistic choices for a local machine:

Model	Type / size	What for	Can you run it?
Qwen3-Coder 30B	MoE, 30B total / 3.3B active	Code, refactor, debug, 256K context	✅ From 32 GB, the best mini-PC pick
Qwen2.5-Coder 7B/14B/32B	Dense	Autocomplete (FIM), completion	✅ Depending on RAM (7B from 16 GB)
GLM-4.5-Air	MoE, ~106B	The realistic GLM locally	⚠️ If you have 64-96 GB
Gemma 4	MoE, ~26B (Google)	An honest non-Chinese starting point	✅ On a decent machine
GLM-4.6	~357B, MIT license	Claude Sonnet 4.x level on real code	❌ Workstation / API
DeepSeek-V3.2	MoE + sparse attention	Top “intelligence,” excellent long context	❌ Too big for a mini-PC
Kimi K2.6 (Moonshot)	1000B MoE / 32B active	King of open-weight agentic benchmarks	❌ API / big workstation

The first choice on a mini-PC is qwen3-coder:30b, no hesitation. For background autocomplete, qwen2.5-coder:14b or 32b are perfect. And if you have a well-equipped machine (64-96 GB), GLM-4.5-Air is the GLM that fits.

Quantization, plainly

You’ve seen names like Q4_K_M go by. Let’s decode them. GGUF is the file format used by llama.cpp and Ollama for local models, and the suffix indicates the precision: how many bits per weight you keep. Fewer bits = lighter file = it fits in less RAM. The trade-off is a small loss of quality.

Level	Quality	When to use it
`Q8_0`	Near perfect	Rarely justified for code (too heavy)
`Q6_K` / `Q5_K_M`	Excellent	Critical reasoning, if you have the memory headroom
`Q4_K_M`	Very good	The de facto standard: your default choice
`< Q4`	Visible degradation	Only if memory forces you

Q4_K_M weighs about 28% of the original FP16 model for only ~2-3% quality loss. For code specifically, it holds up very well: moving to Q5 or Q6 doesn’t perceptibly improve code generation. You only drop below Q4 when forced, and you feel it fast (code and reasoning are the first to suffer).
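The 28% figure and the 0.56 GB per billion parameters used for sizing are the same number seen from two sides, since FP16 stores each weight in 2 bytes. A short sketch of that arithmetic: the helper names are illustrative, the 28% ratio comes from the text, and the 8.5 bits per weight for Q8_0 follows from its block layout (32 one-byte weights plus a 2-byte scale).

```python
FP16_BYTES_PER_WEIGHT = 2.0  # 16 bits per weight


def gb_per_billion_from_ratio(ratio_of_fp16):
    """GB per billion parameters for a quant, given its size relative to FP16."""
    # 1 billion weights * N bytes each = N GB (decimal gigabytes)
    return FP16_BYTES_PER_WEIGHT * ratio_of_fp16


def gb_per_billion_from_bits(bits_per_weight):
    """GB per billion parameters, given the average bits stored per weight."""
    return bits_per_weight / 8.0


q4_k_m = gb_per_billion_from_ratio(0.28)  # ratio quoted in the text
q8_0 = gb_per_billion_from_bits(8.5)      # 34 bytes per block of 32 weights

for label, per_billion in (("FP16", FP16_BYTES_PER_WEIGHT), ("Q8_0", q8_0), ("Q4_K_M", q4_k_m)):
    size_30b = 30 * per_billion
    print(f"{label}: {per_billion:.2f} GB per billion -> a 30B model is ~{size_30b:.0f} GB")
```

On a 32 GB machine, that is the difference between a 30B model that fits with room for context (Q4_K_M) and one whose weights alone fill the memory (Q8_0).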

Wiring the right quant into Ollama

In practice, it’s simple: with Ollama, the tag already encodes the quant. You choose your precision by writing it into the name.

Pull the model at the right quant

The suffix at the end of the tag is the quantization. Without a suffix, Ollama takes a default (often Q4_K_M).

# The default (usually Q4_K_M), perfect to start
ollama pull qwen3-coder:30b
# Or the explicit quant, if you want to be sure
ollama pull qwen3-coder:30b-a3b-q4_K_M

Fit the context to the real need

Context costs RAM (the famous KV cache). Ollama’s default context window is small (2048 tokens in older releases, 4096 in more recent ones), which is ridiculously low for code. But bumping it too high inflates RAM and can push you into swap: that’s the #1 cause of “Ollama is slow.” Set the context (OLLAMA_CONTEXT_LENGTH for the whole server, num_ctx per model or per request) to what the task really needs, not to the maximum by reflex.
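To see how fast context eats memory, the KV cache can be estimated from the model's shape: for each token, every layer stores one key and one value vector per KV head. A minimal sketch of that standard formula. The layer count, KV head count and head dimension below are illustrative placeholders rather than the published configuration of any particular model, and an unquantized FP16 cache is assumed.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence of num_ctx tokens."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Hypothetical architecture, for illustration only.
SHAPE = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for num_ctx in (2048, 8192, 32768, 262144):
    gib = kv_cache_bytes(num_ctx, **SHAPE) / 1024**3
    print(f"num_ctx={num_ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

With this shape the cache grows from under 0.2 GiB at 2048 tokens to 24 GiB at 262144, on top of the weights. Parallel requests multiply it again, since each sequence keeps its own cache.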

Honestly: local or Claude?

We won’t sell you a dream. As of mid-2026, here’s where local stands, unfiltered.

For bounded and private tasks (short-window generation, refactoring, autocomplete, targeted debugging), local is genuinely competitive. It’s private, it’s free to use, and the benchmark gap has narrowed sharply (GLM-4.6 rivals Claude Sonnet 4.x, Kimi K2.6 is strong on agentics). And a point too often forgotten: an open-weight model performs much better in a good harness (a well-configured OpenCode) than in a raw chat.

But let’s be clear on two limits:

The absolute summit is still Claude. Claude Fable 5 scores around 95% on SWE-bench Verified, clearly above the open-weight cluster (~80%).
The best open models don’t run on a mini-PC. Kimi (1000B), GLM-5.2, MiniMax M3: those are API or workstation territory. What fits locally (Qwen3-Coder 30B, GLM-4.5-Air) is a notch below on long-haul agentic reliability and tool use.