Are my GPUs being used?
2025-10-15 17:45:06 +01:00 by Mark Smith
I’ve been using the local LLMs quite a lot today, mostly via Open WebUI. I started to suspect that the GPUs weren’t being used, because the conversations got very slow, with long wait times between prompts. And then the models appeared to run out of context window, with errors in the UI. So I have spent the past few hours verifying that the GPUs were still being used. Interactive conversations using llama-cli in the backend are extremely quick, yet over HTTP they get very slow.
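For anyone wanting to reproduce the comparison, here’s a rough sketch of the two paths: talking to the backend directly with llama-cli, versus hitting the same model over HTTP via llama-server’s OpenAI-compatible endpoint, which is roughly what a web UI does. The model path and port are placeholders.

```sh
# Direct interactive session against the backend (model path is a placeholder):
llama-cli -m ~/models/some-model.gguf -ngl 99 -cnv

# Same model behind llama-server, queried over HTTP the way a web UI would:
llama-server -m ~/models/some-model.gguf --port 8080 &
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hi"}]}'
```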
In the end I was able to verify the GPUs were being used in both cases. First by looking at the startup logs: there are messages saying the Vulkan device (GPU) was detected and that a number of layers have been offloaded to it. So at least at startup things are operating correctly. But how to verify during normal operation?
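The quickest way to spot those messages is to filter the startup output. The exact log text varies between llama.cpp versions, so the grep pattern below is just a guess at the relevant keywords, and the model path is a placeholder.

```sh
# llama.cpp reports the detected device and layer offload at startup;
# filter for the interesting lines (pattern is an assumption, log wording
# differs between versions):
llama-server -m ~/models/some-model.gguf -ngl 99 2>&1 | grep -iE "vulkan|offload"
```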
Turns out you can do this using just Activity Monitor. I had of course tried this, but I didn’t know you could open Window -> GPU History and get a nice real-time graph of utilization. And sure enough, during conversations with the local LLMs you can see the utilization spike and then return to baseline.
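If you’d rather watch from a terminal than from Activity Monitor, macOS also ships powermetrics, which can sample GPU activity directly (it needs root). A minimal invocation looks something like this:

```sh
# Sample GPU activity once per second (interval is in milliseconds):
sudo powermetrics --samplers gpu_power -i 1000
```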
I also installed fluidtop, which is a CLI tool for real-time utilization of CPU and GPU cores, specifically aimed at the new M* series Macs. Works really well. It was also my first time installing Python stuff using uv, which also works really well, though the commands you end up having to use to run the tools you install aren’t exactly rolling off the keyboard. I had to add an alias to my shell, which isn’t the worst thing in the world I suppose, as long as the tool works and I don’t get lost in another Python stack-trace debugging horribleness.
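For reference, a sketch of the uv workflow for Python CLI tools (assuming the package is published under the same name as the tool). `uv tool install` puts a launcher on your PATH, while `uvx` runs a tool in a throwaway environment without installing it; the alias name here is just an example.

```sh
# Install the tool with its own isolated environment:
uv tool install fluidtop

# Or run it ad hoc without installing:
uvx fluidtop

# A shell alias keeps the longer invocation out of muscle memory:
alias ftop='uvx fluidtop'
```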
It seems then that the slowdown is due to how the application, in this case Open WebUI, is using context windows. But it gets very, very complicated very quickly. With the cloud models you just don’t have to worry about this, at least not until the conversation has gotten stupidly long. With the local LLMs it seems to happen after about 10 chat messages, especially if you are posting terminal output into the chat. And then you open up the settings and peer in, and it’s the most complicated, unintelligible tangle of settings you have ever seen. And you quickly close and exit.
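One knob that does help, at least on the llama.cpp side: the context window is fixed when the server starts, and anything the frontend sends beyond it gets truncated or errors out. Starting the server with a larger context is one way to push the problem further out; the model path and size below are placeholders, and bigger contexts cost more memory.

```sh
# Set the context window at startup with -c / --ctx-size:
llama-server -m ~/models/some-model.gguf -c 16384 -ngl 99
```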
I dunno, maybe I am being overly dramatic. I do like all the functionality, but you almost need an LLM just to help you figure out the settings that control the LLMs. Surely that’s not a good sign. #