Step 17 · Local AI Easy · 12 min

🦙Ollama & local models

Run real AI models at home, free and private. Installation, your first model, and how to wire it up to your coding agent.

Up to now, your coding agent was talking to models in the cloud. Now we run real AI models directly on your mini-PC. And the thing that makes that painless is Ollama.

Ollama is the simplest way to run open-weight LLMs locally. One command to install it, one to download a model, and you end up with a local API on port 11434 that speaks the OpenAI format, so it’s compatible with just about every tool out there. Three advantages that need no comment: it’s private (nothing leaves the machine), it’s free, and it works offline.

Installing Ollama

One line. The script installs the binary and launches it as a system service, it runs in the background, ready to answer.

curl -fsSL https://ollama.com/install.sh | sh

That’s it. Ollama is now listening on http://localhost:11434.

Your first model

Let’s start small and useful: qwen2.5-coder:7b, a 7-billion-parameter coding model, a solid starting point that fits on a modest machine.

ollama run qwen2.5-coder:7b

The first launch downloads the model (a few GB, be patient). Then you land in a chat right in the terminal: ask it a question, ask it for a snippet of code, check that it answers. Type /bye to exit.

See what you have

ollama list   # every downloaded model, with its size

See what's running

ollama ps     # the models loaded in memory right now

Clean house

ollama rm qwen2.5-coder:7b   # delete a model to free up space

Wiring it up to your coding agent

This is where the two agents really diverge. Pick your tab, because reality isn’t the same on both sides.

Claude Code is built to run on Anthropic’s Claude models. It doesn’t natively plug into a local Ollama model, there’s no magic flag for that, and I’m not going to invent one for you.

So for a fully local coding agent, OpenCode is what we use (the tab next door). Claude Code, for its part, stays your cloud reactor: the big jobs, the long reasoning, top-tier reliability.

That doesn’t mean Ollama is useless alongside Claude Code. You can absolutely keep Ollama’s local API for side tasks: embeddings, quick classification, summaries, small scripts that hit http://localhost:11434 without costing a cent or sending your data elsewhere. The cloud for the brain, local for the plumbing.

This is the royal road to a 100% local coding agent. OpenCode talks natively to Ollama. You declare Ollama as a provider by pointing it at the local API:

http://localhost:11434/v1

Concretely, you configure the Ollama provider in OpenCode (via /login or your config file), then you pick the local model with /models, for example qwen2.5-coder. From there, your agent thinks, reads your code, and writes files without a single byte leaving the machine. Free, private, offline. This is exactly the scenario OpenCode exists for.

The three settings that matter

Ollama works right out of the box, but three environment variables make all the difference when you push it. You set them in the service’s environment (systemctl edit ollama then Environment="...").

The trap to burn into your memory

Nine times out of ten, when someone says “Ollama is slow on my machine,” the culprit is the context set too high.

Here’s why. The bigger the context window, the more the KV cache eats RAM, and it grows fast. If you ask for a huge context on a tight machine, you spill over into swap (the disk used as backup RAM), and then everything collapses: the model crawls, every token takes forever, you think your hardware is junk when it’s actually suffocating.

The rule: set the context to what the task needs, not to the maximum. A small script? 4096 is plenty. A big multi-file refactor? Bump it up, but watch your RAM (ollama ps shows you the size actually loaded). The right setting is the smallest one that does the job.