Running Large Language Models (LLMs) locally with tools like Ollama often hits a hardware wall. The reason? They have an insatiable appetite for GPUs and VRAM. Here’s a simple breakdown of why.
An LLM’s performance hinges on three key factors: its size, memory speed, and processing architecture.
In short, the VRAM on your GPU becomes the primary bottleneck. The model must fit in this fast memory to be processed efficiently by the GPU’s parallel cores.
The good news? A technique called quantization shrinks models by reducing the precision of their parameters. This allows larger, more capable models to fit into the limited VRAM of consumer-grade GPUs, making local AI more accessible for everyone.
Quantization works by reducing the numerical precision of a model’s parameters (its “weights”), which is like storing its knowledge with less detail.
Think of an LLM’s knowledge as a massive collection of numbers. Typically, these are stored as 16-bit floating-point numbers (FP16), which can represent a wide range of values with high precision (like 3.14159).
Quantization is the process of converting these high-precision numbers into a format with fewer bits, like 4-bit integers (INT4). A 4-bit integer can only represent 16 unique values.
The process essentially involves two steps:
1. Find the range: The algorithm looks at the FP16 weights in a layer of the model and finds their range (e.g., the minimum value is -5.0 and the maximum is +5.0).
2. Map and round: That range is mapped onto the 16 values the INT4 format can represent, and each high-precision FP16 weight is rounded to the nearest available INT4 value.

A great analogy is reducing the color depth of an image. An FP16 model is like a high-resolution photo with millions of distinct color shades. A quantized INT4 model is like a GIF image that can only use a palette of 16 colors. You lose some of the fine-grained color detail, but the overall picture remains clear and the file size is dramatically smaller.
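To make those two steps concrete, here is a minimal sketch in Python (NumPy) of the simplest round-to-nearest scheme: measure the range of a layer's weights, map it onto the 16 signed INT4 levels, and round. The function names and the symmetric "absmax" scaling are illustrative assumptions; the quantizers behind tools like Ollama work on small blocks of weights and store a bit of extra metadata.

```python
import numpy as np

def quantize_int4(weights_fp16: np.ndarray):
    """Round-to-nearest INT4 quantization of one layer's weights (sketch)."""
    # Step 1: find the range of the weights (symmetric around zero here).
    max_abs = float(np.abs(weights_fp16).max())

    # Step 2: map that range onto the signed INT4 grid [-8, 7] and round
    # each FP16 weight to the nearest representable value.
    scale = max_abs / 7.0
    # NumPy has no 4-bit dtype, so int8 holds the codes.
    codes = np.clip(np.round(weights_fp16 / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_int4(codes: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate weights from the INT4 codes and their scale."""
    return (codes.astype(np.float32) * scale).astype(np.float16)

# Quantize a small random "layer" and look at the rounding error.
layer = np.random.randn(4, 4).astype(np.float16)
codes, scale = quantize_int4(layer)
restored = dequantize_int4(codes, scale)
print("largest rounding error:", np.abs(layer - restored).max())
```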
The math is straightforward. The size of a model is determined by the number of parameters multiplied by the number of bits required to store each parameter.
- Original precision (FP16): Each parameter takes up 16 bits of storage.
- Quantized precision (INT4): Each parameter takes up only 4 bits of storage.

The reduction factor is the ratio of the bit depths: 16 bits ÷ 4 bits = 4.
So, the model becomes exactly 4 times smaller.
For example, for a model with 7 billion parameters:
- At 16-bit: 7B parameters × 16 bits/parameter ÷ 8 bits/byte ≈ 14 GB
- At 4-bit: 7B parameters × 4 bits/parameter ÷ 8 bits/byte ≈ 3.5 GB
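If you want to sanity-check those figures for other model sizes or bit widths, the arithmetic fits in a couple of lines (weights only; activations and the KV cache take additional memory):

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model's weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

print(weight_size_gb(7e9, 16))  # ~14.0 GB at FP16
print(weight_size_gb(7e9, 4))   # ~3.5 GB at INT4
```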
Reducing precision isn’t free. This “rounding” process introduces a small amount of error, which can lead to a slight degradation in the model’s performance and nuance.
However, modern quantization techniques are very sophisticated, minimizing this quality loss. The benefits often far outweigh the minor trade-off:
- A smaller memory footprint: At a quarter of the bits per weight, larger and more capable models fit into the same amount of VRAM.
- Faster computation: GPUs can perform calculations on simple integers (INT4) much faster than on more complex floating-point numbers (FP16), speeding up text generation.
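As a rough, self-contained illustration of that quality trade-off, the sketch below (a per-block scheme with an assumed block size of 32, not any specific tool's format) compares the rounding error of a single whole-tensor scale against per-block scales, one of the tricks modern quantizers use to keep the loss small:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 1024)).astype(np.float32)

def int4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Quantize each row to the 16-level INT4 grid [-8, 7], then dequantize."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

# One scale for the entire tensor vs. one scale per block of 32 weights.
whole_tensor = int4_roundtrip(weights.reshape(1, -1)).reshape(weights.shape)
per_block = int4_roundtrip(weights.reshape(-1, 32)).reshape(weights.shape)

for name, restored in [("whole-tensor scale", whole_tensor),
                       ("per-block scales", per_block)]:
    rel_err = np.abs(weights - restored).mean() / np.abs(weights).mean()
    print(f"{name}: mean relative rounding error = {rel_err:.1%}")
```

Running it shows the per-block variant recovering the original weights noticeably more closely, which is why quantized models keep most of their quality despite the 4x size reduction.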