Ollama’s new desktop app makes local LLM onboarding dead simple.
So here’s something that made my week. Ollama just dropped their official macOS and Windows desktop app in v0.10.0. You no longer need to explain to your teammates how to install the CLI, manage services, or troubleshoot Docker containers just to run a local model. You literally download an app, click a checkbox, and boom – you’ve got an OpenAI-compatible API running on your machine.
Attach Gateway speaks the exact same API, so all your secure multi-agent workflows just got way easier to set up.
What actually changed in Ollama 0.10
I went through the release notes so you don’t have to. Here’s what’s new and why you should care:
- ollama ps shows context length now – Finally! You can see how much memory each loaded model is actually using. Super useful when you’re running multiple models and wondering why your laptop is melting.
- Gemma models are 2-3x faster – If you’re using any of the Gemma variants for code generation or reasoning, you’ll notice this immediately. Same quality, way less waiting around.
- Multi-GPU performance boost (10-30%) – This is huge if you’re running on cloud instances with multiple GPUs. Better performance = lower costs when you’re scaling agent workloads.
- Parallel requests default to 1 – Okay, this one’s important. They changed the default behavior for concurrent requests. If you’re running production workloads, you’ll want to set OLLAMA_NUM_PARALLEL explicitly (there’s a quick check right after this list). Check their FAQ for details.
- Tool calling actually works now – Fixed some annoying bugs with the granite3.3 and mistral-nemo models. Also solved that weird issue where tools with similar names (like add and get_address) would confuse the model.
- WebP image support in the API – This unlocks all sorts of vision workflows. You can now do “screenshot debugging” demos without jumping through image format hoops.
- Better error messages – ollama run won’t just silently fail anymore, and they fixed some bugs in ollama show. Small stuff that makes debugging way less painful.
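If you want to sanity-check the new context-length reporting and pin the parallelism behavior yourself, here’s a minimal check. The model name and the value 4 are just examples; size them to your own hardware:
# Load a model, then look at what's resident in memory (0.10 adds context length to this output)
ollama run llama3.2 "say hi"
ollama ps
# Pin concurrency explicitly instead of relying on the new default of 1
export OLLAMA_NUM_PARALLEL=4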
Why this matters for your headless stack
Look, I’ve been running local LLMs in production for a while now, and the friction has always been in the setup. You’d spend an hour getting Ollama running properly, another hour figuring out auth, and then realize you still need memory persistence for your agents to actually work.
The desktop app kills that friction. Your teammate can download it, flip the “API server” toggle, and they’re ready to integrate with your gateway in under two minutes.
But here’s the real win – performance matters when you’re running dozens of agents. That multi-GPU boost and Gemma optimization translate to actual cost savings. When you’re processing hundreds of requests per minute across multiple models, every millisecond counts.
And the tool calling fixes? Game changer for agent workflows. I can’t tell you how many times we’ve had mysterious failures when our planner agent tries to hand off work to the coder agent. Those days are hopefully behind us.
Getting up and running
If you’re on macOS or Windows
This is stupidly simple now:
- Uninstall the CLI version if you have it (just to avoid conflicts)
- Download the desktop app from ollama.com/download
- Enable the API server in settings (it exposes port 11434)
- Done. Your existing OPENAI_BASE_URL=http://localhost:11434/v1 config just works.
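To confirm the toggle actually exposed the API, a quick curl against the default port is enough (swap llama3.2 for whatever model you’ve pulled):
# List models through the OpenAI-compatible endpoint
curl http://localhost:11434/v1/models
# Or go straight to a chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"ping"}]}'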
If you’re running headless (Linux servers)
You’ll still use the regular installation method, but make sure to bump the version:
# Install/update Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Or if you're using their Docker setup
docker pull ollama/ollama:0.10.0
# Important: set parallel processing for production
export OLLAMA_NUM_PARALLEL=4
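One caveat on that export: if you installed with the script above, Ollama runs as a systemd service, so a variable exported in your shell won’t reach it. Setting it on the service itself is the more reliable route; a sketch, assuming the default ollama.service unit:
# Persist the setting on the systemd service instead of your shell
sudo systemctl edit ollama.service
# add the following in the editor that opens:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
sudo systemctl daemon-reload
sudo systemctl restart ollama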
Setting up the Attach Gateway
Here’s where it gets interesting. Copy the .env.example to .env and fill in your details:
# Copy the example config
cp .env.example .env
Your .env should look something like this:
# Required: Your Auth0 or OIDC setup
OIDC_ISSUER=https://your-domain.auth0.com
OIDC_AUD=your-api-identifier
# Point to your Ollama instance
ENGINE_URL=http://localhost:11434
# Optional: Enable memory (I recommend it)
MEM_BACKEND=weaviate
WEAVIATE_URL=http://localhost:8081
# Production settings
MAX_TOKENS_PER_MIN=60000
USAGE_METERING=prometheus
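If the gateway doesn’t pick the file up from the directory you launch it in (or you simply want those values available to the curl examples below), a plain export works. This is generic dotenv shell glue, nothing attach-specific:
# Export everything in .env into the current shell
set -a
source .env
set +a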
Then install and run:
# Install the gateway
pip install attach-dev
# Start Weaviate for memory (separate terminal)
docker run --rm -p 8081:8080 \
-e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true \
semitechnologies/weaviate:1.30.5
# Run the gateway
attach-gateway --port 8080
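Before wiring in agents, a quick smoke test helps. The assumption here is that the gateway rejects unauthenticated calls on the same route it proxies, so hitting it without a token should get you a 401/403 rather than a model reply:
# No token: the gateway should refuse the request
curl -i http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"ping"}]}'
# The authenticated version of this call is shown below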
Now you can make authenticated requests that get logged to memory:
# Get your JWT token (from Auth0 or wherever)
export JWT="your-token-here"
# Make a request through the gateway
curl -H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}' \
http://localhost:8080/v1/chat/completions
The gateway adds X-Attach-User and X-Attach-Session headers so every downstream service knows who’s making the request. Plus everything gets logged to Weaviate automatically.
Vision workflows are now possible
This is pretty cool. You can now send WebP images directly:
curl -H "Authorization: Bearer $JWT" \
-H "Content-Type: application/json" \
-d '{
"model": "llava",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What bug do you see in this screenshot?"},
{"type": "image_url", "url": "data:image/webp;base64,iVBORw0KGgoAAAANS..."}
]
}]
}' \
http://localhost:8080/v1/chat/completions
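If the screenshot lives on disk, you don’t have to paste base64 by hand; a bit of jq glue builds the same payload (screenshot.webp and llava are placeholders, and the body mirrors the shape used above):
# Encode a local WebP screenshot and pipe the request body straight into curl
IMG=$(base64 -w0 screenshot.webp)   # on macOS use: base64 -i screenshot.webp
jq -n --arg img "data:image/webp;base64,$IMG" '{
  model: "llava",
  messages: [{role: "user", content: [
    {type: "text", text: "What bug do you see in this screenshot?"},
    {type: "image_url", url: $img}
  ]}]
}' | curl -H "Authorization: Bearer $JWT" \
         -H "Content-Type: application/json" \
         -d @- http://localhost:8080/v1/chat/completions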
Perfect for building “AI debugger” demos or visual QA agents.
What I’m excited about
Honestly, this feels like a tipping point. The desktop app removes the last barrier for teams who want to try local LLMs without committing to a full infrastructure setup.
I’m seeing more teams adopt the “hybrid” approach – use OpenAI for prototyping, then move critical workflows to self-hosted models with the gateway. You get the security and cost benefits of local inference while keeping the same API interface.
The tool calling improvements are huge too. We’ve been building multi-agent workflows where a planning agent analyzes requirements, then hands off to specialized agents for coding, testing, and deployment. When tool calling works reliably, these workflows become incredibly powerful.
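To make that concrete, here’s what a hand-off looks like as a plain tool definition in the standard OpenAI tools format, which Ollama accepts on /v1/chat/completions. The hand_off_to_coder function is a made-up name for illustration, not part of any real framework:
curl -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-nemo",
    "messages": [{"role": "user", "content": "Plan the fix, then hand the task to the coder"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "hand_off_to_coder",
        "description": "Send a task description to the coding agent",
        "parameters": {
          "type": "object",
          "properties": {"task": {"type": "string"}},
          "required": ["task"]
        }
      }
    }]
  }' \
  http://localhost:8080/v1/chat/completions
If the model decides to call the tool, the response comes back with a tool_calls entry instead of plain text, and your orchestrator routes it to the coder agent.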
Quick demo tomorrow
I’m recording a walkthrough tomorrow showing the complete setup – from desktop app install to authenticated multi-agent chat in under 60 seconds. I’ll drop the link here when it’s ready.
What’s next
If this workflow clicks for you, star the gateway repo. We’re seeing serious adoption from teams who want Auth0 SSO and memory persistence without the infrastructure overhead.
Share your setup on Twitter with #SelfHostedLLM – I love seeing how people are combining Ollama + Attach for everything from code review bots to customer support agents.
And keep an eye out for Attach Studio next month. Think Cursor meets Retool for multi-agent workflows. Early access list opens soon.