Why LLMs Crave GPUs and VRAM: A Quick Guide

Why GPUs and VRAM

Running Large Language Models (LLMs) locally with tools like Ollama often hits a hardware wall. The reason? They have an insatiable appetite for GPUs and VRAM. Here’s a simple breakdown of why.

An LLM’s performance hinges on three key factors: its size, memory speed, and processing architecture.

  1. Massive Size (Parameters): LLMs store their knowledge in billions of “parameters.” To generate text, the entire model—all these parameters—must be loaded into your computer’s memory.
  2. Need for Speed (VRAM > RAM): VRAM is the ultra-fast memory located directly on your GPU. Fetching parameters from regular system RAM is far slower than fetching them from VRAM. For an LLM to generate responses quickly, its parameters need to live in this high-speed memory, right next to the cores that use them.
  3. Parallel Power (GPU > CPU): A CPU is a versatile generalist with a handful of powerful cores that work through tasks largely one at a time. A GPU is a specialist with thousands of smaller cores built for parallel processing: doing many similar things at once. The massive mathematical workload of an LLM, mostly enormous matrix multiplications, is perfectly suited to this parallelism, making a GPU orders of magnitude faster than a CPU for this task.

In short, the VRAM on your GPU becomes the primary bottleneck. The model must fit in this fast memory to be processed efficiently by the GPU’s parallel cores.
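
That parallel-power point is easy to see in practice. Here is a minimal sketch, assuming PyTorch is installed and a CUDA-capable GPU is available, that times the same large matrix multiplication (the workhorse operation inside an LLM) on the CPU and on the GPU:

```python
import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    """Time one n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    _ = a @ b                      # warm-up: exclude one-time allocation/setup costs
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()   # wait for the GPU kernel to finish before stopping the clock
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.4f} s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s")
```

On a typical machine the GPU version finishes far faster, which is exactly the gap described in point 3 above.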

The good news? A technique called quantization shrinks models by reducing the precision of their parameters. This allows larger, more capable models to fit into the limited VRAM of consumer-grade GPUs, making local AI more accessible for everyone.

Quantization

Quantization works by reducing the numerical precision of a model’s parameters (its “weights”), which is like storing its knowledge with less detail.

How Quantization Works: From High Precision to Smart Approximation

Think of an LLM’s knowledge as a massive collection of numbers. Typically, these are stored as 16-bit floating-point numbers (FP16), which can represent a wide range of values with high precision (like 3.14159).

Quantization is the process of converting these high-precision numbers into a format with fewer bits, like 4-bit integers (INT4). A 4-bit integer can only represent 16 unique values (2⁴).

The process essentially involves two steps:

  1. Find the Range: The algorithm first looks at all the FP16 weights in a layer of the model and finds their range (e.g., the minimum value is -5.0 and the maximum is +5.0).
  2. Map to “Buckets”: It then maps this continuous range onto the 16 available “buckets” that the INT4 format can represent. Each high-precision FP16 weight is rounded to the nearest available INT4 value.
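
Here is a toy sketch of those two steps, assuming NumPy. Real quantization schemes are more elaborate, but the core idea is the same:

```python
import numpy as np

def quantize_int4(weights: np.ndarray) -> tuple[np.ndarray, float, float]:
    # Step 1: find the range of this layer's weights.
    w_min, w_max = float(weights.min()), float(weights.max())
    # Step 2: split that range into 16 buckets (4 bits -> 2**4 levels, i.e. 15 steps).
    scale = (w_max - w_min) / 15
    q = np.clip(np.round((weights - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_int4(q: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    # Recover an approximation of the original weights from the bucket indices.
    return q.astype(np.float32) * scale + w_min

weights = np.random.uniform(-5.0, 5.0, size=8).astype(np.float16)
q, scale, w_min = quantize_int4(weights)
print("original :", weights)
print("buckets  :", q)                                  # integers in [0, 15]
print("recovered:", dequantize_int4(q, scale, w_min))   # close to the originals, but rounded
```

The recovered values are no longer exact, which is precisely the small loss of detail the next analogy describes.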

A great analogy is reducing the color depth of an image. An FP16 model is like a high-resolution photo with millions of distinct color shades. A quantized INT4 model is like a GIF image that can only use a palette of 16 colors. You lose some of the fine-grained color detail, but the overall picture remains clear and the file size is dramatically smaller.

Why 16-bit to 4-bit is a 4x Reduction

The math is straightforward. The size of a model is determined by the number of parameters multiplied by the number of bits required to store each parameter.

The reduction factor is simply the ratio of the bit-depths: 16 bits ÷ 4 bits = 4.

So, the model becomes exactly 4 times smaller.

For example, for a model with 7 billion parameters:

At 16-bit: 7B parameters × 16 bits/parameter ÷ 8 bits/byte ≈ 14 GB

At 4-bit: 7B parameters × 4 bits/parameter ÷ 8 bits/byte ≈ 3.5 GB
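
The same arithmetic as a tiny, dependency-free Python helper, handy for estimating whether a model will fit in your VRAM (the 7B figure is just the example above; substitute your own parameter count):

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    # parameters * bits each -> total bits; / 8 -> bytes; / 1e9 -> gigabytes
    return num_params * bits_per_param / 8 / 1e9

for bits in (16, 4):
    print(f"7B parameters at {bits}-bit: ~{model_size_gb(7e9, bits):.1f} GB")

# Output:
# 7B parameters at 16-bit: ~14.0 GB
# 7B parameters at 4-bit: ~3.5 GB
```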

The Inevitable Trade-Off: A Small Loss for Big Gains

Reducing precision isn’t free. This “rounding” process introduces a small amount of error, which can lead to a slight degradation in the model’s performance and nuance.

However, modern quantization techniques are very sophisticated, minimizing this quality loss. The benefits often far outweigh the minor trade-off: