Local LLMs with OpenClaw: Ollama, Llama 3.3, Qwen 3.5 & MiniMax M2.5 – A Practical Benchmark

    Till Freitag · 28 February 2026 · 6 min read

    TL;DR: "Local LLMs with OpenClaw are production-ready in 2026. Llama 3.3 is the all-rounder, Qwen 3.5 the efficiency champion, MiniMax M2.5 the coding beast. All run via Ollama – no cloud, no cost, no privacy trade-offs."

    — Till Freitag

    Why Local LLMs?

    Cloud APIs are convenient – but they come with three problems:

    1. Cost: GPT-4o costs ~$10 per million output tokens. With heavy agent use, $300–700/month is realistic.
    2. Privacy: Every API call sends data to US servers. GDPR-compliant? Only with a data processing agreement and risk assessment.
    3. Dependency: API down? Rate limit reached? Your agent stops working.

    Local LLMs solve all three problems. And in 2026, they're finally good enough for production use.

    30-second version: Install Ollama, pull a model, connect OpenClaw – done. No API key, no per-token cost, no data shared with third parties.

    The Candidates

    We tested four models suitable for local use with OpenClaw:

    Model            | Provider | Parameters | Active Params | Context | Architecture
    Llama 3.3        | Meta     | 70B        | 70B           | 128K    | Dense
    Qwen 3.5 27B     | Alibaba  | 27B        | 27B           | 256K    | Dense
    Qwen 3.5 35B-A3B | Alibaba  | 35B        | 3B            | 256K    | MoE
    MiniMax M2.5     | MiniMax  | 230B       | 10B           | 200K    | MoE

    What Does MoE Mean?

    Mixture of Experts (MoE) is the trick behind the newest models: MiniMax M2.5, for example, has 230B parameters in total, but only 10B of them are activated for any given token. The result: GPT-4-level quality at a fraction of the compute.
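The routing idea can be sketched in a few lines of Python. This is a deliberately tiny toy (a learned gate picks the top-k experts per token), not the actual MiniMax router:

```python
import numpy as np

# Toy Mixture-of-Experts step: only k of n experts run per token.
def moe_forward(x, gate_w, experts, k=2):
    scores = gate_w @ x                      # one gating score per expert
    top = np.argsort(scores)[-k:]            # indices of the k highest-scoring experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                 # softmax over the selected experts only
    # Experts outside `top` are never evaluated -- that's the compute saving
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# Each "expert" is just a random linear map in this toy
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, d))
y = moe_forward(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (8,)
```

With 4 experts and k=2, half the expert weights sit idle on every token – the same reason a 230B MoE can run with ~10B active parameters.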

    Installation via Ollama

    All models can be downloaded with a single command:

    # Install Ollama (if not already installed)
    curl -fsSL https://ollama.com/install.sh | sh
    
    # Pull models
    ollama pull llama3.3           # 40 GB – needs 48 GB RAM
    ollama pull qwen3.5:27b        # 16 GB – runs on 22 GB RAM
    ollama pull qwen3.5:35b        # 20 GB – only 3B active (MoE)
    ollama pull minimax-m2.5       # 101 GB (3-bit) – needs 128 GB RAM
    
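A rough fit check based on the RAM figures quoted in the listing above (this is a hypothetical sizing helper; the numbers are this article's estimates, not Ollama's):

```python
# RAM requirements from the pull commands above (GB); hypothetical helper
MODELS = {
    "llama3.3":     {"download_gb": 40,  "ram_gb": 48},
    "qwen3.5:27b":  {"download_gb": 16,  "ram_gb": 22},
    "qwen3.5:35b":  {"download_gb": 20,  "ram_gb": 22},
    "minimax-m2.5": {"download_gb": 101, "ram_gb": 128},
}

def models_that_fit(available_ram_gb):
    """Return the models from the table that fit into the given RAM."""
    return [name for name, spec in MODELS.items()
            if spec["ram_gb"] <= available_ram_gb]

print(models_that_fit(24))   # ['qwen3.5:27b', 'qwen3.5:35b']
```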

    Connect to OpenClaw

    openclaw config set models.providers.ollama.apiKey "ollama-local"
    openclaw config set agents.defaults.model.primary "ollama/qwen3.5:27b"
    

    Performance Benchmarks

    Tested on Apple M3 Max (128 GB RAM) and NVIDIA RTX 4090 (24 GB VRAM):

    Speed (Tokens/Second)

    Model            | M3 Max (128 GB) | RTX 4090 (24 GB) | Notes
    Llama 3.3 70B    | ~18 t/s         | ~25 t/s          | Needs a lot of RAM
    Qwen 3.5 27B     | ~35 t/s         | ~55 t/s          | Best speed/quality trade-off
    Qwen 3.5 35B-A3B | ~60 t/s         | ~80 t/s          | MoE turbo: only 3B active
    MiniMax M2.5     | ~15 t/s         | Not possible*    | Needs >24 GB VRAM

    *MiniMax M2.5 requires at least 64 GB RAM or a multi-GPU setup.

    Quality (Benchmarks)

    Model            | MMLU-Pro | HumanEval | SWE-Bench | Agentic Use
    Llama 3.3 70B    | 68.9     | 82.5      | –         | ★★★★☆
    Qwen 3.5 27B     | 71.2     | 85.1      | –         | ★★★★☆
    Qwen 3.5 35B-A3B | 69.5     | 83.8      | –         | ★★★★☆
    MiniMax M2.5     | 74.1     | 89.3      | 80.2%     | ★★★★★

    Result: Qwen 3.5 27B offers the best trade-off between speed, quality, and resource consumption. MiniMax M2.5 is the strongest model but requires significantly more hardware.

    Cost Comparison: Cloud vs. Local

    Cloud Costs (per month, estimated at 50M tokens)

    Provider  | Model             | Input    | Output   | Total/Month
    OpenAI    | GPT-4o            | $2.50/1M | $10/1M   | ~$300
    Anthropic | Claude 3.5 Sonnet | $3/1M    | $15/1M   | ~$400
    OpenAI    | GPT-4o mini       | $0.15/1M | $0.60/1M | ~$20

    Local Costs (one-time + electricity)

    Setup                   | Hardware   | One-time | Electricity/Month | Break-Even
    Mac mini M4 Pro         | 48 GB RAM  | ~$2,400  | ~$15              | 7–8 months
    Mac Studio M3 Max       | 128 GB RAM | ~$4,900  | ~$25              | 12–15 months
    Linux Server + RTX 4090 | 64 GB RAM  | ~$3,200  | ~$40              | 8–10 months
    Raspberry Pi 5          | 8 GB RAM   | ~$130    | ~$5               | 1 month

    Bottom line: After ~8 months, self-hosting is cheaper than any cloud API. With heavy usage (>100M tokens/month), break-even drops to 3–4 months.
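The break-even figures follow from simple arithmetic on the tables above (electricity costs vary by region, so treat the result as an estimate):

```python
def break_even_months(hardware_cost, cloud_per_month, electricity_per_month):
    """Months until the one-time hardware cost beats ongoing cloud spend."""
    monthly_savings = cloud_per_month - electricity_per_month
    return hardware_cost / monthly_savings

# Mac mini M4 Pro (~$2,400) vs a ~$300/month GPT-4o bill, ~$15/month electricity
months = break_even_months(2400, 300, 15)
print(round(months, 1))  # 8.4
```

At heavier usage the cloud bill scales with tokens while the local cost stays flat, which is why break-even drops to 3–4 months above 100M tokens/month.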

    Offline Scenarios

    Local LLMs have one decisive advantage no cloud can offer: They work without internet.

    When Is Offline Relevant?

    • On the road: On trains, planes, construction sites – anywhere without stable internet
    • Air-gapped environments: Security-critical infrastructure (government, military, healthcare)
    • Edge deployments: IoT gateways, factory floors, remote offices
    • Resilience: When the cloud API goes down, your agent keeps running

    # Compact model for offline use on modest hardware
    ollama pull qwen3.5:35b    # MoE: only 3B active, runs on 22 GB RAM
    
    # For Raspberry Pi / edge devices
    ollama pull phi-3:mini      # 3.8B parameters, 4 GB RAM
    

    OpenClaw Offline Config

    {
      "agents": {
        "defaults": {
          "model": {
            "primary": "ollama/qwen3.5:35b",
            "fallbacks": ["ollama/phi-3:mini"]
          }
        }
      },
      "network": {
        "offline_mode": true,
        "web_search": false
      }
    }
    

    Which Model for Which Use Case?

    Use Case        | Recommended Model | Why
    Email triage    | Qwen 3.5 27B      | Fast, 256K context for long threads
    Code analysis   | MiniMax M2.5      | SWE-Bench 80.2%, best coding model
    Quick responses | Qwen 3.5 35B-A3B  | MoE: 60+ t/s on Apple Silicon
    Summarization   | Llama 3.3 70B     | Solid quality, broad language understanding
    Offline / edge  | Qwen 3.5 35B-A3B  | MoE + 256K context at low resource use
    Raspberry Pi    | Phi-3 Mini        | Only model under 4 GB RAM

    Qwen 3.5: The Newcomer in Detail

    Alibaba's Qwen 3.5 deserves special attention. The model family brings several firsts in 2026:

    • 256K context: Twice as much as Llama 3.3 – ideal for long email threads or document analysis
    • 201 languages: A true multilingual model, perfect for international teams
    • Multimodal: The 27B and 122B variants can also process images
    • Thinking mode: Built-in chain-of-thought reasoning that can be toggled per request
    • MoE variants: 35B-A3B activates only 3B parameters – runs on a MacBook Air

    # Enable thinking mode (for complex tasks)
    ollama run qwen3.5:27b --thinking
    

    MiniMax M2.5: The Coding Beast

    MiniMax M2.5 from Shanghai took the AI community by surprise:

    • SWE-Bench Verified: 80.2% – on par with Claude Opus 4.6
    • 230B parameters, 10B active: MoE architecture for efficiency
    • Agentic design: Natively optimized for tool calling and search
    • 200K context: Enough for complete codebases

    The catch: You need at least 64 GB RAM (ideally 128 GB) for the 3-bit quantized model. But if you have the hardware, you get a model that competes with the best cloud APIs – at zero cost.
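The memory requirement follows directly from the parameter count and quantization level. Rough arithmetic (real model files add overhead for embeddings, activations, and the KV cache, which is where the ~101 GB download comes from):

```python
# Why 3-bit MiniMax M2.5 needs ~100 GB: 230B parameters at 3 bits each
params = 230e9
bits_per_param = 3
weight_gb = params * bits_per_param / 8 / 1e9   # bits -> bytes -> GB (decimal)
print(round(weight_gb))  # 86  (plus runtime overhead -> ~101 GB on disk)
```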

    # MiniMax M2.5 via Ollama (needs a lot of RAM!)
    ollama pull minimax-m2.5
    openclaw config set agents.defaults.model.primary "ollama/minimax-m2.5"
    

    Hybrid Strategy: Best of Both Worlds

    Our recommendation for productive teams:

    Task                             | Model             | Local/Cloud
    Email & customer data            | Qwen 3.5 27B      | 🏠 Local
    Code reviews                     | MiniMax M2.5      | 🏠 Local
    Quick routine tasks              | Qwen 3.5 35B-A3B  | 🏠 Local
    Complex analysis (non-sensitive) | Claude 3.5 Sonnet | ☁️ Cloud
    Image generation                 | DALL-E 3 / Flux   | ☁️ Cloud

    Rule of thumb: Personal data → always local. Everything else → based on budget and quality requirements.
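The rule of thumb can be encoded as a small routing policy. This is a hypothetical helper (the function and constants are ours, not part of OpenClaw); model names follow the table above:

```python
# Hypothetical router implementing "personal data -> always local"
LOCAL_DEFAULT = "ollama/qwen3.5:27b"
LOCAL_CODING = "ollama/minimax-m2.5"
CLOUD_DEFAULT = "anthropic/claude-3.5-sonnet"

def pick_model(task_type, contains_personal_data):
    """Route a task to a local or cloud model per the hybrid strategy."""
    if contains_personal_data:
        # Personal data never leaves the machine
        return LOCAL_CODING if task_type == "code" else LOCAL_DEFAULT
    if task_type == "code":
        return LOCAL_CODING          # code reviews stay local per the table
    return CLOUD_DEFAULT             # non-sensitive analysis may use the cloud

print(pick_model("email", True))      # ollama/qwen3.5:27b
print(pick_model("analysis", False))  # anthropic/claude-3.5-sonnet
```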

    Conclusion

    Local LLMs are no longer a compromise in 2026 – they're a strategic decision. With Qwen 3.5 as the efficiency champion, MiniMax M2.5 as the coding powerhouse, and Llama 3.3 as the proven all-rounder, there's a model for every use case.

    Combined with OpenClaw and Ollama, you get an AI agent stack that:

    • Costs nothing (after hardware amortization)
    • Works offline
    • Is GDPR-compliant (no data shared with third parties)
    • Matches cloud APIs in many scenarios

    Break-even is at 3–8 months. After that, every token is free.


    Want to run local LLMs with OpenClaw in production? Talk to us – we help with hardware recommendations, setup, and model selection.

    More on this topic: What is OpenClaw? · OpenClaw Self-Hosting Guide · NanoClaw: The lean successor
