# Bartowski Quantisation Analysis

Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't apply uniform quantisation as their names suggest.

1. [The Hidden Sophistication of M Variants](#the-hidden-sophistication-of-m-variants)
2. [The Complete Quantisation Map](#the-complete-quantisation-map)
3. [The Architecture of Intelligence](#the-architecture-of-intelligence)
4. [The Economics of Enhancement](#the-economics-of-enhancement)
5. [Why Q3\_K Gets Special Treatment](#why-q3_k-gets-special-treatment)
6. [Implementation Insights](#implementation-insights)
7. [The Deeper Pattern](#the-deeper-pattern)

## The Hidden Sophistication of M Variants

When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically enhances critical components: embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down projections receive the same treatment. This represents years of empirical optimisation baked directly into the quantisation logic.

The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size increases are modest relative to quality gains.

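Written out as a plain mapping, the difference between the two variants is easy to see. This is just a restatement of the assignments described above and in the table below; the keys follow llama.cpp's GGUF tensor naming, and the dicts are illustrative, not llama.cpp's internal configuration format:

```python
# Per-component quant types for Q4_K_M, as described above (illustrative only).
Q4_K_M = {
    "token_embd": "Q6_K",  # embeddings boosted above the Q4_K baseline
    "attn_q":     "Q4_K",
    "attn_k":     "Q4_K",
    "attn_v":     "Q6_K",  # value projections boosted
    "ffn_gate":   "Q4_K",
    "ffn_up":     "Q4_K",
    "ffn_down":   "Q6_K",  # down projections boosted
    "output":     "Q4_K",
}

# Q4_K_L differs from Q4_K_M in exactly one entry: embeddings move to Q8_0.
Q4_K_L = {**Q4_K_M, "token_embd": "Q8_0"}
```
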
## The Complete Quantisation Map

Here's what's actually happening inside these models, based on analysis of real GGUF files:

| Variant | Embed | Output | Q | K | V | Gate | Up | Down |
|----------|-------|--------|-------|-------|-------|-------|-------|-------|
| Q3_K_M | Q6_K | Q4_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_L | Q6_K | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_XL | Q8_0 | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q4_K_M | Q6_K | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q4_K_L | Q8_0 | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q5_K_M | Q6_K | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q5_K_L | Q8_0 | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q6_K_L | Q8_0 | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K |

Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6), and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL variant, as it has room for both improvements without competing with the next tier.

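These assignments can be checked against any of the published files by reading per-tensor metadata with the `gguf` Python package that ships with llama.cpp. A minimal sketch, assuming the package is installed and a file has been downloaded locally (the path is a placeholder):

```python
from collections import Counter

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Qwen3-4B-Q4_K_M.gguf")  # placeholder path

# Tally quant types per tensor kind (token_embd, attn_v, ffn_down, ...).
counts = Counter()
for tensor in reader.tensors:
    parts = tensor.name.split(".")
    kind = parts[-2] if len(parts) > 1 else tensor.name
    counts[(kind, tensor.tensor_type.name)] += 1

for (kind, qtype), n in sorted(counts.items()):
    print(f"{kind:12s} {qtype:6s} x{n}")
```
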
## The Architecture of Intelligence

Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically improves handling of technical terms and rare words.

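A back-of-the-envelope version of that trade-off, using approximate effective bits per weight for each type (block overheads vary, so the figures land close to, rather than exactly on, the numbers quoted above):

```python
EMBED_PARAMS = 389e6  # embedding parameters in the Qwen3 4B reference

# Approximate effective bits per weight, including block scales.
BPW = {"F16": 16.0, "Q4_K": 4.5, "Q6_K": 6.56, "Q8_0": 8.5}

def size_gb(n_params: float, qtype: str) -> float:
    """Rough tensor size in GB for a given quant type."""
    return n_params * BPW[qtype] / 8 / 1e9

for qtype in ("F16", "Q4_K", "Q6_K", "Q8_0"):
    print(f"{qtype}: {size_gb(EMBED_PARAMS, qtype):.2f} GB")

delta = size_gb(EMBED_PARAMS, "Q8_0") - size_gb(EMBED_PARAMS, "Q4_K")
print(f"Q4_K -> Q8_0 adds roughly {delta:.2f} GB")  # ~0.2 GB
```
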
Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical: they're what the model retrieves when attending to context. M variants therefore enhance V layers whilst leaving Q and K at base quantisation, improving retrieval quality without an excessive size increase.

Feed-forward network trade-offs: Gate and up projections (44.6% of parameters, 1,793M, 3.59GB) stay at base quantisation, as enhancement would double file sizes for modest gains. Down projections (22.3%, 897M, 1.79GB) get enhanced in M variants as they're the final transformation affecting all downstream processing.

The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L targets it for enhancement, as improved output precision can mean the difference between coherent and garbled text for Q3-based models.

## The Economics of Enhancement

Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (a 19% increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst maximising vocabulary understanding. A naive approach of upgrading everything would add gigabytes for marginal improvements.

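Where that extra space goes can be verified by diffing two of the published files tensor by tensor. A sketch building on the `GGUFReader` usage above, assuming the per-tensor byte counts the `gguf` package reports (paths are placeholders):

```python
from gguf import GGUFReader  # pip install gguf

def tensor_sizes(path: str) -> dict[str, int]:
    """Map tensor name -> on-disk size in bytes for a GGUF file."""
    return {t.name: int(t.n_bytes) for t in GGUFReader(path).tensors}

m_sizes = tensor_sizes("Qwen3-4B-Q4_K_M.gguf")  # placeholder paths
l_sizes = tensor_sizes("Qwen3-4B-Q4_K_L.gguf")

# Only the tensors whose quantisation changed should differ in size.
for name in sorted(m_sizes):
    if name in l_sizes and l_sizes[name] != m_sizes[name]:
        print(f"{name}: {m_sizes[name] / 1e6:.0f} MB -> {l_sizes[name] / 1e6:.0f} MB")
```
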
Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant represents a local optimum: better quality requires jumping tiers, while a smaller size sacrifices key capabilities.

## Why Q3_K Gets Special Treatment

Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides granular control for memory-constrained environments, with each step adding roughly 0.1-0.2GB while delivering meaningful quality improvements.

Q4_K_XL and Q5_K_XL don't exist because they'd compete with the next tier. A hypothetical Q4_K_XL at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides better quality than selectively enhanced Q4_K layers.

The pattern is consistent: once a variant would need substantial Q5_K or Q6_K enhancements, you're better off jumping to the next base type. Sweet spots: the Q3 family for extreme memory constraints, Q4/Q5 for mainstream use, and Q6/Q8 when quality matters more than size.

## Implementation Insights

Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's variants requires minimal configuration:

```python
# Q3_K_L: Only upgrade output from M baseline
config = {
    "base": "Q3_K_M",     # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K"      # Single surgical change
}

# Q4_K_L: Only upgrade embeddings from M baseline
config = {
    "base": "Q4_K_M",     # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0"  # Single surgical change
}

# Q3_K_XL: The only variant needing two changes
config = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K"
}
```

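In practice these configs map onto two command-line overrides. A sketch using llama.cpp's llama-quantize tool, assuming a build recent enough to expose the `--token-embedding-type` and `--output-tensor-type` flags and that the binary is on PATH; the override values follow ggml's type spelling (e.g. `q8_0`, `q5_K`), and all paths are placeholders:

```python
import subprocess

def make_variant(src_f16: str, dst: str, base: str,
                 embeddings: str | None = None, output: str | None = None) -> None:
    """Quantise src_f16 to `base`, optionally overriding embedding/output types."""
    cmd = ["llama-quantize"]
    if embeddings:   # e.g. "q8_0" for the L variants
        cmd += ["--token-embedding-type", embeddings]
    if output:       # e.g. "q5_K" for Q3_K_L / Q3_K_XL
        cmd += ["--output-tensor-type", output]
    cmd += [src_f16, dst, base]
    subprocess.run(cmd, check=True)

# Q4_K_L: Q4_K_M baseline with embeddings bumped to Q8_0 (placeholder paths).
make_variant("model-f16.gguf", "model-Q4_K_L.gguf", "Q4_K_M", embeddings="q8_0")

# Q3_K_XL: Q3_K_M baseline with both overrides applied.
make_variant("model-f16.gguf", "model-Q3_K_XL.gguf", "Q3_K_M",
             embeddings="q8_0", output="q5_K")
```
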
This minimalist approach recognises that M variants already embody years of community optimisation. Bartowski's contribution lies in identifying where small adjustments yield outsized returns.

## The Deeper Pattern

This system evolved through countless experiments rather than top-down design. M variants encode hard-won knowledge about critical layers. L variants build on this foundation. The absence of most XL variants shows where diminishing returns set in.

Bartowski's quantisations work because they embody years of collective learning about what matters in practice. They demonstrate that the best solutions often come from understanding and building upon what already works, rather than grand redesigns.