
# Bartowski Quantisation Analysis
Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't
apply uniform quantisation as their names suggest.
1. [The Hidden Sophistication of M Variants](#the-hidden-sophistication-of-m-variants)
2. [The Complete Quantisation Map](#the-complete-quantisation-map)
3. [The Architecture of Intelligence](#the-architecture-of-intelligence)
4. [The Economics of Enhancement](#the-economics-of-enhancement)
5. [Why Q3\_K Gets Special Treatment](#why-q3_k-gets-special-treatment)
6. [Implementation Insights](#implementation-insights)
7. [The Deeper Pattern](#the-deeper-pattern)
## The Hidden Sophistication of M Variants
When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically
enhances critical components: embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down
projections receive the same treatment. This represents years of empirical optimisation baked
directly into the quantisation logic.
The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply
takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size
increases are modest relative to quality gains.
## The Complete Quantisation Map
Here's what's actually happening inside these models, based on analysis of real GGUF files:
| Variant | Embed | Output | Q | K | V | Gate | Up | Down |
|----------|-------|--------|-------|-------|-------|-------|-------|-------|
| Q3_K_M | Q6_K | Q4_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_L | Q6_K | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_XL | Q8_0 | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q4_K_M | Q6_K | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q4_K_L | Q8_0 | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q5_K_M | Q6_K | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q5_K_L | Q8_0 | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q6_K_L | Q8_0 | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K |
Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6),
and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL
variant as it has room for both improvements without competing with the next tier.
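For convenience, the same map can be expressed as a lookup structure. The sketch below mirrors the table above; the tensor-group names and helper functions are illustrative rather than part of llama.cpp or this repository's API:

```python
# Per-tensor-group quantisation types for each Bartowski-style variant,
# transcribed from the table above. Group names are illustrative.
TENSOR_GROUPS = ("embed", "output", "q", "k", "v", "gate", "up", "down")

VARIANT_MAP = {
    "Q3_K_M":  ("Q6_K", "Q4_K", "Q3_K", "Q3_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K"),
    "Q3_K_L":  ("Q6_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K"),
    "Q3_K_XL": ("Q8_0", "Q5_K", "Q3_K", "Q3_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K"),
    "Q4_K_M":  ("Q6_K", "Q4_K", "Q4_K", "Q4_K", "Q6_K", "Q4_K", "Q4_K", "Q6_K"),
    "Q4_K_L":  ("Q8_0", "Q4_K", "Q4_K", "Q4_K", "Q6_K", "Q4_K", "Q4_K", "Q6_K"),
    "Q5_K_M":  ("Q6_K", "Q5_K", "Q5_K", "Q5_K", "Q6_K", "Q5_K", "Q5_K", "Q6_K"),
    "Q5_K_L":  ("Q8_0", "Q5_K", "Q5_K", "Q5_K", "Q6_K", "Q5_K", "Q5_K", "Q6_K"),
    "Q6_K_L":  ("Q8_0", "Q6_K", "Q6_K", "Q6_K", "Q6_K", "Q6_K", "Q6_K", "Q6_K"),
}

def tensor_map(variant: str) -> dict[str, str]:
    """Return {tensor_group: quant_type} for a named variant."""
    return dict(zip(TENSOR_GROUPS, VARIANT_MAP[variant]))

def upgrades_over(base: str, variant: str) -> dict[str, tuple[str, str]]:
    """Tensor groups where `variant` differs from `base`, e.g. Q4_K_M -> Q4_K_L."""
    a, b = tensor_map(base), tensor_map(variant)
    return {g: (a[g], b[g]) for g in TENSOR_GROUPS if a[g] != b[g]}

# upgrades_over("Q4_K_M", "Q4_K_L") == {"embed": ("Q6_K", "Q8_0")}
```

Diffing an L variant against its M baseline makes the single-change pattern explicit: only the embedding or output entry differs.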
## The Architecture of Intelligence
Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at
F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the
model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically
improves handling of technical terms and rare words.
Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical:
they're what the model retrieves when attending to context. M variants enhance V layers whilst
leaving Q and K at base quantisation, improving information retrieval without excessive size increase.
Feed-forward network trade-offs: Gate and up projections (44.6% of parameters, 1,793M, 3.59GB)
stay at base quantisation as enhancement would double file sizes for modest gains. Down projections
(22.3%, 897M, 1.79GB) get enhanced in M variants as they're the final transformation affecting all
downstream processing.
The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L
targets it for enhancement as improved output precision can mean the difference between coherent
and garbled text for Q3-based models.
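To put rough numbers on these trade-offs, the sketch below estimates tensor-group sizes from the parameter counts quoted above and approximate bits-per-weight figures for llama.cpp's k-quant and Q8_0 block formats. These are ballpark estimates and won't exactly reproduce on-disk file sizes, which include metadata and per-block overheads:

```python
# Approximate bits per weight, derived from GGML block layouts
# (e.g. Q8_0 packs 32 weights into 34 bytes; Q6_K packs 256 into 210 bytes).
BITS_PER_WEIGHT = {
    "Q3_K": 3.4375,
    "Q4_K": 4.5,
    "Q5_K": 5.5,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
    "F16": 16.0,
}

# Parameter counts (millions) for the Qwen3 4B reference model, as quoted above.
PARAMS_M = {
    "embed": 389,
    "attention": 566,   # Q, K and V combined
    "gate_up": 1793,    # gate + up projections
    "down": 897,
    "output": 378,
}

def size_gb(params_millions: float, quant: str) -> float:
    """Approximate tensor size in GB at the given quantisation type."""
    return params_millions * 1e6 * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"Embeddings at F16:  {size_gb(PARAMS_M['embed'], 'F16'):.2f} GB")   # ~0.78 GB
print(f"Embeddings at Q4_K: {size_gb(PARAMS_M['embed'], 'Q4_K'):.2f} GB")
print(f"Embeddings at Q8_0: {size_gb(PARAMS_M['embed'], 'Q8_0'):.2f} GB")
print(f"Gate/up at Q4_K:    {size_gb(PARAMS_M['gate_up'], 'Q4_K'):.2f} GB")
```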
## The Economics of Enhancement
Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (19%
increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst maximising
vocabulary understanding. A naive approach of upgrading everything would add gigabytes for marginal
improvements.
Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant
represents a local optimum: better quality requires jumping tiers, while a smaller size sacrifices
key capabilities.
## Why Q3_K Gets Special Treatment
Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room
for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides
granular control for memory-constrained environments, with each 15-20% size increase delivering
meaningful quality improvements.
Q4_K_XL and Q5_K_XL don't exist because they would compete with the next tier. A hypothetical Q4_K_XL
at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides better
quality than selectively enhanced Q4_K layers.
The pattern is consistent: once a variant would need significant Q5_K or Q6_K enhancements, it's
better to jump to the next base type. The sweet spots are the Q3 family for extreme memory
constraints, Q4/Q5 for mainstream use, and Q6/Q8 when quality matters more than size.
## Implementation Insights
Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's
variants requires minimal configuration:
```python
# Q3_K_L: only upgrade the output layer from the M baseline
q3_k_l = {
    "base": "Q3_K_M",    # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K",    # Single surgical change
}

# Q4_K_L: only upgrade the embeddings from the M baseline
q4_k_l = {
    "base": "Q4_K_M",      # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0",  # Single surgical change
}

# Q3_K_XL: the only variant needing two changes
q3_k_xl = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K",
}
```
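As a usage sketch, such a config dict could drive llama.cpp's llama-quantize tool, assuming a build that exposes the `--token-embedding-type` and `--output-tensor-type` overrides; the helper, file names, and exact type-name spellings below are illustrative assumptions rather than a documented interface:

```python
import subprocess

def build_quantize_command(config: dict, src: str, dst: str) -> list[str]:
    """Translate a config dict into a llama-quantize invocation (illustrative)."""
    cmd = ["llama-quantize"]
    # NB: the spelling of type names the binary accepts (e.g. "q8_0" vs "Q8_0")
    # is an assumption here; check your build's --help output.
    if "embeddings" in config:
        cmd += ["--token-embedding-type", config["embeddings"].lower()]
    if "output" in config:
        cmd += ["--output-tensor-type", config["output"].lower()]
    # The base M variant supplies every other per-tensor choice.
    cmd += [src, dst, config["base"]]
    return cmd

# Example: produce a Q4_K_L-style file from an F16 GGUF.
cmd = build_quantize_command(
    {"base": "Q4_K_M", "embeddings": "Q8_0"},
    "model-f16.gguf",
    "model-Q4_K_L.gguf",
)
subprocess.run(cmd, check=True)
```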
This minimalist approach recognises that M variants already embody years of community optimisation.
Bartowski's contribution lies in identifying where small adjustments yield outsized returns.
## The Deeper Pattern
This system evolved through countless experiments rather than top-down design. M variants encode
hard-won knowledge about critical layers. L variants build on this foundation. The absence of most
XL variants shows where diminishing returns set in.
Bartowski's quantisations work because they embody years of collective learning about what matters
in practice. They demonstrate that the best solutions often come from understanding and building
upon what already works, rather than grand redesigns.