Local LLMs with OpenClaw: Ollama, Llama 3.3, Qwen 3.5 & MiniMax M2.5 – A Practical Benchmark

    Till Freitag · 28 February 2026 · 6 min read

    TL;DR: "Local LLMs with OpenClaw are production-ready in 2026. Llama 3.3 is the all-rounder, Qwen 3.5 the efficiency champion, MiniMax M2.5 the coding beast. All run via Ollama – no cloud, no cost, no privacy trade-offs."

    — Till Freitag

    Why Local LLMs?

    Cloud APIs are convenient – but they come with three problems:

    1. Cost: GPT-4o costs ~$10 per million output tokens. With heavy agent use, $300–700/month is realistic.
    2. Privacy: Every API call sends data to US servers. GDPR-compliant? Only with a data processing agreement and risk assessment.
    3. Dependency: API down? Rate limit reached? Your agent stops working.

    Local LLMs solve all three problems. And in 2026, they're finally good enough for production use.

    30-second version: Install Ollama, pull a model, connect OpenClaw – done. No API key, no per-token cost, no data shared with third parties.

    The Candidates

    We tested four models suitable for local use with OpenClaw:

    Model            | Provider | Parameters | Active Params | Context | Architecture
    Llama 3.3        | Meta     | 70B        | 70B           | 128K    | Dense
    Qwen 3.5 27B     | Alibaba  | 27B        | 27B           | 256K    | Dense
    Qwen 3.5 35B-A3B | Alibaba  | 35B        | 3B            | 256K    | MoE
    MiniMax M2.5     | MiniMax  | 230B       | 10B           | 200K    | MoE

    What Does MoE Mean?

    Mixture of Experts (MoE) is the trick behind the newest models: MiniMax M2.5, for example, has 230B parameters in total, but only 10B of them are activated for any given token. The result: GPT-4-level quality at a fraction of the compute.
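The routing idea can be sketched in a few lines of Python. This is a deliberately tiny toy (a learned gate picks the top-k experts per token), not the actual MiniMax router:

```python
import numpy as np

# Toy Mixture-of-Experts step: only k of n experts run per token.
def moe_forward(x, gate_w, experts, k=2):
    scores = gate_w @ x                      # one gating score per expert
    top = np.argsort(scores)[-k:]            # indices of the k highest-scoring experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                 # softmax over the selected experts only
    # Experts outside `top` are never evaluated -- that's the compute saving
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# Each "expert" is just a random linear map in this toy
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, d))
y = moe_forward(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (8,)
```

With 4 experts and k=2, half the expert weights sit idle on every token – the same reason a 230B MoE can run with ~10B active parameters.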

    Installation via Ollama

    All models can be downloaded with a single command:

    # Install Ollama (if not already installed)
    curl -fsSL https://ollama.com/install.sh | sh
    
    # Pull models
    ollama pull llama3.3           # 40 GB – needs 48 GB RAM
    ollama pull qwen3.5:27b        # 16 GB – runs on 22 GB RAM
    ollama pull qwen3.5:35b        # 20 GB – only 3B active (MoE)
    ollama pull minimax-m2.5       # 101 GB (3-bit) – needs 128 GB RAM
    
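A rough fit check based on the RAM figures quoted in the listing above (this is a hypothetical sizing helper; the numbers are this article's estimates, not Ollama's):

```python
# RAM requirements from the pull commands above (GB); hypothetical helper
MODELS = {
    "llama3.3":     {"download_gb": 40,  "ram_gb": 48},
    "qwen3.5:27b":  {"download_gb": 16,  "ram_gb": 22},
    "qwen3.5:35b":  {"download_gb": 20,  "ram_gb": 22},
    "minimax-m2.5": {"download_gb": 101, "ram_gb": 128},
}

def models_that_fit(available_ram_gb):
    """Return the models from the table that fit into the given RAM."""
    return [name for name, spec in MODELS.items()
            if spec["ram_gb"] <= available_ram_gb]

print(models_that_fit(24))   # ['qwen3.5:27b', 'qwen3.5:35b']
```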

    Connect to OpenClaw

    openclaw config set models.providers.ollama.apiKey "ollama-local"
    openclaw config set agents.defaults.model.primary "ollama/qwen3.5:27b"
    

    Performance Benchmarks

    Tested on Apple M3 Max (128 GB RAM) and NVIDIA RTX 4090 (24 GB VRAM):

    Speed (Tokens/Second)

    Model            | M3 Max (128 GB) | RTX 4090 (24 GB) | Notes
    Llama 3.3 70B    | ~18 t/s         | ~25 t/s          | Needs a lot of RAM
    Qwen 3.5 27B     | ~35 t/s         | ~55 t/s          | Best speed/quality trade-off
    Qwen 3.5 35B-A3B | ~60 t/s         | ~80 t/s          | MoE turbo: only 3B active
    MiniMax M2.5     | ~15 t/s         | Not possible*    | Needs >24 GB VRAM

    *MiniMax M2.5 requires at least 64 GB RAM or a multi-GPU setup.

    Quality (Benchmarks)

    Model            | MMLU-Pro | HumanEval | SWE-Bench | Agentic Use
    Llama 3.3 70B    | 68.9     | 82.5      | –         | ★★★★☆
    Qwen 3.5 27B     | 71.2     | 85.1      | –         | ★★★★☆
    Qwen 3.5 35B-A3B | 69.5     | 83.8      | –         | ★★★★☆
    MiniMax M2.5     | 74.1     | 89.3      | 80.2%     | ★★★★★

    Result: Qwen 3.5 27B offers the best trade-off between speed, quality, and resource consumption. MiniMax M2.5 is the strongest model but requires significantly more hardware.

    Cost Comparison: Cloud vs. Local

    Cloud Costs (per month, estimated at 50M tokens)

    Provider  | Model             | Input    | Output   | Total/Month
    OpenAI    | GPT-4o            | $2.50/1M | $10/1M   | ~$300
    Anthropic | Claude 3.5 Sonnet | $3/1M    | $15/1M   | ~$400
    OpenAI    | GPT-4o mini       | $0.15/1M | $0.60/1M | ~$20

    Local Costs (one-time + electricity)

    Setup                   | Hardware   | One-time | Electricity/Month | Break-Even
    Mac mini M4 Pro         | 48 GB RAM  | ~$2,400  | ~$15              | 7–8 months
    Mac Studio M3 Max       | 128 GB RAM | ~$4,900  | ~$25              | 12–15 months
    Linux Server + RTX 4090 | 64 GB RAM  | ~$3,200  | ~$40              | 8–10 months
    Raspberry Pi 5          | 8 GB RAM   | ~$130    | ~$5               | 1 month

    Bottom line: After ~8 months, self-hosting is cheaper than any cloud API. With heavy usage (>100M tokens/month), break-even drops to 3–4 months.
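The break-even figures follow from simple arithmetic on the tables above (electricity costs vary by region, so treat the result as an estimate):

```python
def break_even_months(hardware_cost, cloud_per_month, electricity_per_month):
    """Months until the one-time hardware cost beats ongoing cloud spend."""
    monthly_savings = cloud_per_month - electricity_per_month
    return hardware_cost / monthly_savings

# Mac mini M4 Pro (~$2,400) vs a ~$300/month GPT-4o bill, ~$15/month electricity
months = break_even_months(2400, 300, 15)
print(round(months, 1))  # 8.4
```

At heavier usage the cloud bill scales with tokens while the local cost stays flat, which is why break-even drops to 3–4 months above 100M tokens/month.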

    Offline Scenarios

    Local LLMs have one decisive advantage no cloud can offer: They work without internet.

    When Is Offline Relevant?

    • On the road: On trains, planes, construction sites – anywhere without stable internet
    • Air-gapped environments: Security-critical infrastructure (government, military, healthcare)
    • Edge deployments: IoT gateways, factory floors, remote offices
    • Resilience: When the cloud API goes down, your agent keeps running

    # Compact model for offline use on modest hardware
    ollama pull qwen3.5:35b    # MoE: only 3B active, runs on 22 GB RAM
    
    # For Raspberry Pi / edge devices
    ollama pull phi-3:mini      # 3.8B parameters, 4 GB RAM
    

    OpenClaw Offline Config

    {
      "agents": {
        "defaults": {
          "model": {
            "primary": "ollama/qwen3.5:35b",
            "fallbacks": ["ollama/phi-3:mini"]
          }
        }
      },
      "network": {
        "offline_mode": true,
        "web_search": false
      }
    }
    

    Which Model for Which Use Case?

    Use Case        | Recommended Model | Why
    Email triage    | Qwen 3.5 27B      | Fast, 256K context for long threads
    Code analysis   | MiniMax M2.5      | SWE-Bench 80.2%, best coding model
    Quick responses | Qwen 3.5 35B-A3B  | MoE: 60+ t/s on Apple Silicon
    Summarization   | Llama 3.3 70B     | Solid quality, broad language understanding
    Offline / edge  | Qwen 3.5 35B-A3B  | MoE + 256K context at low resource use
    Raspberry Pi    | Phi-3 Mini        | Only model under 4 GB RAM

    Qwen 3.5: The Newcomer in Detail

    Alibaba's Qwen 3.5 deserves special attention. The model family brings several firsts in 2026:

    • 256K context: Twice as much as Llama 3.3 – ideal for long email threads or document analysis
    • 201 languages: A true multilingual model, perfect for international teams
    • Multimodal: The 27B and 122B variants can also process images
    • Thinking mode: Built-in chain-of-thought reasoning that can be toggled per request
    • MoE variants: 35B-A3B activates only 3B parameters – runs on a MacBook Air

    # Enable thinking mode (for complex tasks)
    ollama run qwen3.5:27b --thinking
    

    MiniMax M2.5: The Coding Beast

    MiniMax M2.5 from Shanghai took the AI community by surprise:

    • SWE-Bench Verified: 80.2% – on par with Claude Opus 4.6
    • 230B parameters, 10B active: MoE architecture for efficiency
    • Agentic design: Natively optimized for tool calling and search
    • 200K context: Enough for complete codebases

    The catch: You need at least 64 GB RAM (ideally 128 GB) for the 3-bit quantized model. But if you have the hardware, you get a model that competes with the best cloud APIs – at zero cost.
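The memory requirement follows directly from the parameter count and quantization level. Rough arithmetic (real model files add overhead for embeddings, activations, and the KV cache, which is where the ~101 GB download comes from):

```python
# Why 3-bit MiniMax M2.5 needs ~100 GB: 230B parameters at 3 bits each
params = 230e9
bits_per_param = 3
weight_gb = params * bits_per_param / 8 / 1e9   # bits -> bytes -> GB (decimal)
print(round(weight_gb))  # 86  (plus runtime overhead -> ~101 GB on disk)
```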

    # MiniMax M2.5 via Ollama (needs a lot of RAM!)
    ollama pull minimax-m2.5
    openclaw config set agents.defaults.model.primary "ollama/minimax-m2.5"
    

    Hybrid Strategy: Best of Both Worlds

    Our recommendation for productive teams:

    Task                             | Model             | Local/Cloud
    Email & customer data            | Qwen 3.5 27B      | 🏠 Local
    Code reviews                     | MiniMax M2.5      | 🏠 Local
    Quick routine tasks              | Qwen 3.5 35B-A3B  | 🏠 Local
    Complex analysis (non-sensitive) | Claude 3.5 Sonnet | ☁️ Cloud
    Image generation                 | DALL-E 3 / Flux   | ☁️ Cloud

    Rule of thumb: Personal data → always local. Everything else → based on budget and quality requirements.
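The rule of thumb can be encoded as a small routing policy. This is a hypothetical helper (the function and constants are ours, not part of OpenClaw); model names follow the table above:

```python
# Hypothetical router implementing "personal data -> always local"
LOCAL_DEFAULT = "ollama/qwen3.5:27b"
LOCAL_CODING = "ollama/minimax-m2.5"
CLOUD_DEFAULT = "anthropic/claude-3.5-sonnet"

def pick_model(task_type, contains_personal_data):
    """Route a task to a local or cloud model per the hybrid strategy."""
    if contains_personal_data:
        # Personal data never leaves the machine
        return LOCAL_CODING if task_type == "code" else LOCAL_DEFAULT
    if task_type == "code":
        return LOCAL_CODING          # code reviews stay local per the table
    return CLOUD_DEFAULT             # non-sensitive analysis may use the cloud

print(pick_model("email", True))      # ollama/qwen3.5:27b
print(pick_model("analysis", False))  # anthropic/claude-3.5-sonnet
```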

    Conclusion

    Local LLMs are no longer a compromise in 2026 – they're a strategic decision. With Qwen 3.5 as the efficiency champion, MiniMax M2.5 as the coding powerhouse, and Llama 3.3 as the proven all-rounder, there's a model for every use case.

    Combined with OpenClaw and Ollama, you get an AI agent stack that:

    • Costs nothing (after hardware amortization)
    • Works offline
    • Is GDPR-compliant (no data shared with third parties)
    • Matches cloud APIs in many scenarios

    Break-even is at 3–8 months. After that, every token is free.


    Want to run local LLMs with OpenClaw in production? Talk to us – we help with hardware recommendations, setup, and model selection.

    More on this topic: What is OpenClaw? · OpenClaw Self-Hosting Guide · NanoClaw: The lean successor
