Just use Llama.cpp
2025-10-12 12:42:04 +01:00 by Mark Smith
It occurred to me this morning that in yesterday’s piece I forgot to mention the route I ended up going with for local LLMs. I had been using Ollama for several weeks, and while the features are pretty good, certainly as far as downloading and managing models, it doesn’t support Vulkan, and that’s the only way to get GPU acceleration working in containers on a Mac. What’s more, it appears they have no real interest in adding this functionality. Ollama actually runs on top of llama.cpp, and that project does support Vulkan.
Since I’d gotten the impression that at least some people have been going back to running the core project, the next step was to try out llama.cpp directly. It certainly wasn’t without issues, but the llama.cpp folks were helpful. It’s a bit more hardcore, because it doesn’t do things like model switching out of the box, but in reality you don’t need anything too fancy. There are model-switching proxies you can set up, but I found these overly complex. For my uses, a few simple npm scripts are good enough, and there are likely some advantages to connecting directly to the models. I also had to write a simple script to download models, basically curl inside an until loop, with a persistent cache folder (a rough sketch of the idea is below).
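To give a flavour of what I mean, here’s a minimal sketch, not my exact script: it assumes a GGUF download URL is passed in, uses a made-up `MODEL_CACHE` location for the persistent cache, and finishes by starting llama.cpp’s `llama-server`. The npm scripts just wrap commands like these, one per model and port.

```sh
#!/usr/bin/env sh
# Sketch only: fetch a GGUF into a persistent cache, retrying until the
# download succeeds, then serve it with llama.cpp's llama-server.
set -eu

MODEL_URL="$1"                                     # e.g. a GGUF download URL
CACHE_DIR="${MODEL_CACHE:-$HOME/.cache/models}"    # assumed cache location
MODEL_FILE="$CACHE_DIR/$(basename "$MODEL_URL")"

mkdir -p "$CACHE_DIR"

# curl inside an until loop: -C - resumes partial downloads, --fail avoids
# saving HTML error pages as the model file, and we back off between tries.
until curl --fail -L -C - -o "$MODEL_FILE" "$MODEL_URL"; do
  echo "download interrupted, retrying..." >&2
  sleep 5
done

# Serve the model over llama.cpp's built-in HTTP server.
# -ngl 99 offloads as many layers as possible to the GPU (Vulkan build).
llama-server -m "$MODEL_FILE" --host 0.0.0.0 --port 8080 -ngl 99
```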
I'm trying to avoid unnecessary complexity; things are complicated enough when you're developing in containers. #