Running LLMs Locally

About this post

Recently I have seen many coders, researchers, and others looking for ways to benefit from AI without subscribing to a cloud AI/LLM service. I did some research, and this post sums up what I found. This knowledge is quite volatile and might be outdated next month! But if it helps someone, all the better.

Who this is for

You want to use an AI assistant but you do not want to pay a monthly subscription, and you do not want your conversations, documents, or code sent to a company’s servers. This guide is for you.

It is not for people who need maximum performance at any cost. Local models are genuinely good now — but they are not GPT-4o or Claude Opus. I will be straight about that.


The summary upfront

Running a large language model locally is now accessible to most people with a modern computer. The tooling has matured enormously in 2025–2026. If you have a recent Mac with an M-series chip, or a PC with a decent GPU, you can have a capable AI assistant running in under 10 minutes, completely offline, for free.

The catch: the experience is meaningfully worse than frontier cloud models, and the hardware requirements for the best local models are not trivial.


How it actually works

A large language model is a file — a .gguf or similar format, typically 2–25 GB depending on the model. Your computer loads it into RAM or GPU memory and runs inference locally. No internet required after the initial download.

The tool that made this accessible to non-engineers is Ollama. It handles model downloading, memory management, and exposes a local API. You do not need to compile anything.
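To give a sense of what that local API looks like, here is a minimal sketch that builds a request for Ollama's generate endpoint on its default port (the actual network call is commented out so it only runs once Ollama is installed and running):

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default and accepts plain JSON.
payload = {
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue?",
    "stream": False,  # ask for one complete response instead of a token stream
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With Ollama running, uncomment these two lines to print the answer:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read())["response"])
```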


Getting started: step by step

1. Install Ollama

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: download the installer from ollama.com.

That is it. Ollama runs as a background service.

2. Pull a model

ollama pull gemma3:4b        # lightweight, fast, good for most tasks
ollama pull gemma3:27b       # much better quality, needs ~20 GB RAM
ollama pull llama3.2:3b      # Meta's small model, very fast
ollama pull mistral:7b       # strong reasoning, good default
ollama pull phi4-mini        # excellent for coding, tiny footprint

These commands are run in your terminal. If you are not used to it, a short command-line tutorial is worth working through first.

Start with gemma3:4b or llama3.2:3b if you are unsure about your hardware.

3. Chat immediately in the terminal

ollama run gemma3:4b

You are now talking to a local AI. No account. No API key. No data leaving your machine.

The terminal works, but a chat UI is more comfortable for daily use.

Open WebUI — the most polished option. Runs in your browser, connects to Ollama automatically.

docker run -d -p 3000:80 \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000.

Alternatives: Chatbox, LM Studio (includes its own model downloader, no Docker required).


Hardware: the honest picture

This is where most guides get vague. Here is the reality.

RAM is the bottleneck

Models run in RAM (or VRAM if you have a GPU). If the model does not fit in RAM, it spills to disk and becomes unusably slow.

Model size          RAM needed   What you get
3B parameters       ~3 GB        Fast, limited reasoning, good for simple tasks
7–8B parameters     ~6–8 GB      Solid daily driver, handles most tasks
14–27B parameters   ~12–20 GB    Close to early GPT-4 quality
70B+ parameters     ~48 GB+      Frontier quality — requires a workstation

Practical baseline: 16 GB RAM gets you a good 7B model with room to breathe. 32 GB RAM gives you access to 27B models, which are genuinely impressive.
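As a rough rule of thumb (my own heuristic, not an official formula): a 4-bit quantized model weighs a bit over half a byte per parameter, plus a couple of GB of headroom for the context cache and the runtime. In code:

```python
def estimate_ram_gb(params_billions: float) -> float:
    """Rough RAM estimate for a 4-bit (Q4) quantized model.

    Heuristic: ~0.57 bytes per parameter for the weights,
    plus ~1.5 GB headroom for KV cache and runtime overhead.
    """
    return params_billions * 0.57 + 1.5

print(round(estimate_ram_gb(7), 1))   # a 7B model: roughly 5.5 GB
print(round(estimate_ram_gb(27), 1))  # a 27B model: roughly 16.9 GB
```

The numbers line up with the table above; higher-precision quantizations (Q8, FP16) need proportionally more.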

Apple Silicon is exceptional for this

M1/M2/M3/M4 Macs have unified memory, meaning the CPU and GPU share RAM. A MacBook Pro M3 with 36 GB RAM runs a 27B model smoothly. This is not possible on most PC laptops. If privacy and local AI matter to you and you are buying hardware, Apple Silicon is currently the best value for this use case.

NVIDIA GPUs on Windows/Linux

If you have an NVIDIA GPU with enough VRAM (8 GB+), Ollama uses it automatically. A 4090 with 24 GB VRAM can run a 27B model at high speed. Older cards with less VRAM will offload layers to RAM, which works but is slower.

No GPU? It still works

CPU-only inference is slow but usable for 3B–7B models. Expect 5–15 tokens per second on a modern CPU. Enough for writing and Q&A. Not great for real-time coding assistance.
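To make those numbers concrete, the back-of-the-envelope arithmetic looks like this: at CPU speeds, a typical answer takes tens of seconds rather than seconds.

```python
def seconds_for_answer(output_tokens: int, tokens_per_second: float) -> float:
    """How long you wait for a complete answer at a given generation speed."""
    return output_tokens / tokens_per_second

# A ~400-token answer (a few paragraphs):
print(seconds_for_answer(400, 10))   # CPU-only 7B model: 40.0 seconds
print(seconds_for_answer(400, 100))  # frontier cloud API: 4.0 seconds
```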


Which models to actually use

As of April 2026, these are the best options for local use:

For general use:

  • gemma3:27b (Google, Apache 2.0) — excellent instruction following, long context (128K tokens), multilingual
  • llama3.3:70b (Meta, Llama license) — best open-weight general model if you have the RAM
  • mistral-small:22b — strong reasoning, good coding

For coding:

  • qwen2.5-coder:14b (Alibaba) — best dedicated coding model in this size range
  • phi4:14b (Microsoft) — punches above its weight for reasoning and code

For low-end hardware (≤8 GB RAM):

  • gemma3:4b — best small model overall
  • llama3.2:3b — very fast, decent quality
  • phi4-mini — excellent at coding for its size

To check what is currently available:

ollama list        # models you have downloaded
# or browse: https://ollama.com/library

What local LLMs are actually good at

  • Summarizing documents and PDFs you do not want to upload anywhere
  • Drafting emails, reports, and text
  • Explaining code you paste in
  • Writing and refactoring code (especially with a good coding model)
  • Translation
  • Brainstorming and ideation
  • Answering questions about local files via tools like AnythingLLM or Open WebUI’s document upload

Where they fall short — honestly

Reasoning on hard problems. The best local 27B model is roughly equivalent to mid-2023 frontier models. For complex multi-step reasoning, math proofs, or nuanced analysis, cloud models are still ahead.

Speed. A 27B model on a MacBook Pro generates 20–40 tokens per second. Frontier cloud APIs return 100+ tokens per second. For long outputs, local can feel slow.

Context window in practice. Models may advertise 128K context but performance degrades in the latter half of a long context. Keep important information near the end of your prompt.
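One practical consequence: when you stuff a long document into the prompt, structure it so the question comes last. A trivial helper (my own convention, nothing Ollama-specific):

```python
def build_prompt(document: str, question: str) -> str:
    # Long-context models attend most reliably near the end of the prompt,
    # so the document goes first and the question goes last.
    return (
        f"Here is a document:\n\n{document}\n\n"
        f"Based on the document above: {question}"
    )
```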

Hallucination. Local models hallucinate as much as, or more than, comparable cloud models. Verify factual claims. Do not use them to answer questions about recent events without enabling web search tools.

Vision / multimodal. Gemma 3 supports image inputs. Most other local models do not, or the image understanding is weak. If you need vision, check model support explicitly before relying on it.

No automatic updates. Cloud models improve silently. With local models, you must manually pull new versions to get improvements.


Privacy: what “local” actually means

When you run a model with Ollama and no external tools:

  • Nothing leaves your machine. The model runs entirely on your CPU/GPU. No telemetry, no logging to external servers.
  • Your prompts are not used for training. There is nothing to send.
  • Your documents stay local. If you load a contract, a medical record, or sensitive code into the context, it goes nowhere.

Caveats:

  • Ollama itself checks for updates on startup. You can disable this if needed.
  • If you use Open WebUI or similar interfaces with cloud features enabled (web search, external integrations), those specific requests leave your machine.
  • The model file was trained on internet data by Google, Meta, or whichever organization released it. You are trusting their training process, not their servers.

Cost

Item                                   Cost
Ollama                                 Free, open source
Models (Gemma, Llama, Mistral, etc.)   Free to download
Open WebUI / LM Studio                 Free
Electricity                            Negligible for inference (a 7B model on CPU draws ~20–40 W)
Hardware                               This is where it gets real

If you already have a computer with 16 GB+ RAM, your cost is zero. If you need to buy hardware to run good models, the economics change. A used Mac Mini M2 with 16 GB RAM is the best entry point for a dedicated local AI machine.
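To put the electricity line in perspective, here is the arithmetic under illustrative assumptions (30 W average draw, two hours of inference per day, $0.20/kWh; adjust to your own usage and rates):

```python
def monthly_electricity_usd(watts: float, hours_per_day: float,
                            usd_per_kwh: float, days: int = 30) -> float:
    """Monthly electricity cost of running local inference."""
    kwh = watts * hours_per_day * days / 1000
    return kwh * usd_per_kwh

# 30 W average draw, 2 h of inference per day, $0.20 per kWh:
print(round(monthly_electricity_usd(30, 2, 0.20), 2))  # 0.36 dollars per month
```

Even heavy use on a power-hungry GPU stays well under the price of a single monthly subscription.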

There is no subscription to cancel. There is no per-token cost. There is no API rate limit at 2 AM.


Connecting local models to your existing tools

Ollama exposes a local API compatible with the OpenAI format. Many tools support this out of the box.

VS Code / Cursor / Continue.dev — use local models for code completion and chat in your editor.

Obsidian — plugins like Smart Connections connect to Ollama for local AI in your notes.

Python / scripts:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Summarize this text: ..."}]
)
print(response.choices[0].message.content)

No API key needed. No billing. Runs offline.


Should you use a local model or a cloud model?

This is not a binary choice. Many people use both.

Use local when:

  • The content is sensitive (medical, legal, financial, personal)
  • You need to process many documents without per-token costs
  • You are offline or on a slow connection
  • You want zero dependency on a company’s pricing or availability

Use cloud when:

  • You need the best possible quality for a high-stakes output
  • You need the latest information (local models have a training cutoff)
  • You are doing something computationally heavy and your hardware is limited
  • Speed matters more than privacy for this specific task

Quick start summary

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model (adjust to your RAM)
ollama pull gemma3:4b      # ≤8 GB RAM
ollama pull gemma3:27b     # 24+ GB RAM

# 3. Chat
ollama run gemma3:4b

# 4. Optional: run Open WebUI for a browser interface
docker run -d -p 3000:80 \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
# then open http://localhost:3000



Last updated: April 2026. Model recommendations change fast — check the Ollama library for what is current.

Pierre Beaucoral
PhD Candidate in Development Economics

I’m a development economist in training at CERDI. I spend most of my time debugging my R and Python code, trying to understand where “climate money” goes, what it is for, what it changes locally, and how data and ML can help answer these questions.