In part 3 of this series, I set up my headless local LLM execution environment with Ollama, so I’m ready for further experiments. One of the questions I wanted to answer with this setup is how much faster LLMs run on a GPU than on a CPU. Each LLM one can download comes with a configuration file that defines how the model is set up for execution. One parameter in this file defines how many of the model’s neural network layers are executed on the GPU and how many on the CPU. This is called ‘layer offloading’, and splitting the work between CPU and GPU can be useful when GPU memory is too small to run the LLM on the GPU alone.
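In Ollama, this knob is exposed as the `num_gpu` parameter, which can be set in a Modelfile to create a variant of a model with a fixed layer split. A hypothetical fragment (the base model name is a placeholder, not necessarily the model I used):

```
# Hypothetical Ollama Modelfile fragment.
# num_gpu sets how many layers are offloaded to the GPU;
# 0 keeps everything on the CPU.
FROM llama3
PARAMETER num_gpu 0
```

Running `ollama create llama3-cpu -f Modelfile` would then register a CPU-only variant alongside the original model.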
Fast, Faster, Fastest!
For my experiment, I used one of the LLMs I had previously downloaded to translate two pages of German text into English on the GPU, and then repeated the same task on the CPU by changing the parameters in the model’s configuration file. Testing the extremes, if you will. The speed difference: on the GPU, the translation took 31 seconds. On the CPU, the same task took 1 minute and 22 seconds. In other words, the LLM I used was almost three times faster on the GPU than on the CPU for this particular task. For the details of my setup, see my previous posts. My results are pretty much in line with what I have read online, so I’m happy with the outcome.
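Instead of editing the configuration file for each run, the same split can also be requested per call through Ollama’s REST API: the `options.num_gpu` field of a `/api/generate` request overrides the layer count for that single run. A minimal sketch of building such a request (the model name and prompt are placeholders, and I’m assuming a default Ollama install listening on localhost:11434):

```python
import json

def generate_request(model: str, prompt: str, num_gpu: int) -> str:
    """JSON body for POST http://localhost:11434/api/generate.
    num_gpu is the number of layers offloaded to the GPU;
    0 forces a CPU-only run."""
    return json.dumps({
        "model": model,        # placeholder model name
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},
    })

# Request body for a CPU-only run of the translation task
print(generate_request("llama3",
                       "Translate this German text into English: ...", 0))
```

A nice side effect of using the API: the non-streaming response includes `eval_count` and `eval_duration` fields, so tokens per second can be computed directly instead of timing with a stopwatch.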
Why are LLMs Faster on the GPU?
So why are CPUs slower at LLM tasks than GPUs? It’s one of the foundational questions, and this article on Medium gives a good introduction to this particular aspect. Here’s my elevator pitch:
LLMs require a massive number of mathematical calculations that can be run in parallel. The CPU is very fast but can only perform a few tasks in parallel. Its other limitation is the interface to system memory, which is used heavily while an LLM is running. The GPU, on the other hand, is optimized to run the mathematical calculations required for LLMs in a massively parallel way, and has a very fast interface to its dedicated video memory (VRAM) for shifting data back and forth to the units doing the parallel calculations. So why are graphics cards used for LLM acceleration in a ‘home environment’? Because the mathematical operations required for fast 2D/3D graphics rendering are largely the same as those needed for running LLMs.
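A rough way to put numbers on the memory argument: generating one token requires reading (roughly) all model weights once, so generation speed is bounded by memory bandwidth divided by model size. The sizes and bandwidths below are illustrative assumptions, not measurements of my machine:

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed if every token requires one
    full pass over the model weights (memory-bandwidth-bound regime)."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 4.0   # assumption: ~7B parameter model at ~4-bit quantization
CPU_BW = 50.0    # assumption: dual-channel DDR4 system memory, GB/s
GPU_BW = 450.0   # assumption: mid-range GDDR6 graphics card, GB/s

print(f"CPU bound: ~{max_tokens_per_second(MODEL_GB, CPU_BW):.1f} tokens/s")
print(f"GPU bound: ~{max_tokens_per_second(MODEL_GB, GPU_BW):.1f} tokens/s")
```

Real-world throughput is lower than these bounds, and a partial layer split blurs the picture further, but the bandwidth gap illustrates why the GPU pulls ahead even before its parallelism comes into play.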
Comparison with Higher-End Setups and Massive Online LLMs
Obviously, my 8 GB VRAM setup at home is very limited compared to the kind of hardware that public LLMs such as ChatGPT, Perplexity, Claude, etc. run on. The article linked above also gives some details on what kind of LLM models can be run at home if one is willing to invest 5 to 10k euros in hardware to get 48 GB of VRAM spread over two Nvidia graphics cards or, alternatively, spend around the same amount of money on a Mac system with 96 GB of unified memory. Well, that’s a bit beyond where I’m willing to go for this.
But if I had such a setup, the benefits would be significant. For one thing, I could run much bigger and more capable models. It would also allow me to significantly increase the size of the context window, i.e. the amount of input and output text a model can work on. As far as I understand, most of the models I can run on 8 GB of VRAM are limited to 4096 input tokens, while very high-end LLMs use context windows of several hundred thousand tokens, some even going beyond the one million mark. I don’t have personal experience with what kind of context window is possible on a system with 48 GB of VRAM, but a number of sources I found on the Internet say that the maximum context window for such a setup is around 32k tokens. That’s one to two orders of magnitude away from what AI companies are doing with their huge setups. Scaling up by an order of magnitude also increases the price tag of the required hardware by an order of magnitude, and at this point one must wonder whether such setups are commercially sustainable in the long run. That’s a question for another post, however.
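In Ollama, the context window is controlled by the `num_ctx` parameter. Raising it grows the key/value cache that has to fit into VRAM alongside the model weights, which is why bigger context windows demand bigger cards. A hypothetical Modelfile fragment for a derived model with a larger window (the base model name is a placeholder):

```
# Hypothetical Modelfile fragment.
# num_ctx sets the context window in tokens; raising it increases
# VRAM use for the KV cache on top of the model weights.
FROM llama3
PARAMETER num_ctx 8192
```

On my 8 GB card, pushing `num_ctx` too high would simply force layers back onto the CPU, trading context size for speed.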
Summary and Next Steps
You can see where this is going, but it is important to understand what kind of scales we are talking about. And while my setup at home is tiny compared to what is done at the very high end today, it’s still useful for experimenting and perhaps for doing a number of interesting and useful things. Next up in this blog series: let’s put an open-source web GUI around my local setup and use a reverse proxy for HTTPS encryption so I can use my LLMs over the Internet.