# Bartowski Quantisation Analysis

Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't apply uniform quantisation as their names suggest.

  1. The Hidden Sophistication of M Variants
  2. The Complete Quantisation Map
  3. The Architecture of Intelligence
  4. The Economics of Enhancement
  5. Why Q3_K Gets Special Treatment
  6. Implementation Insights
  7. The Deeper Pattern

## The Hidden Sophistication of M Variants

When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically enhances critical components: embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down projections receive the same treatment. This represents years of empirical optimisation baked directly into the quantisation logic.
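
To make the shape of that logic concrete, here's a minimal sketch of per-tensor selection for a Q4_K_M-style mix. The tensor name substrings (`token_embd`, `attn_v`, `ffn_down`) follow common GGUF naming conventions but are assumptions here, and the real llama.cpp logic handles far more cases:

```python
def q4_k_m_tensor_type(tensor_name: str) -> str:
    """Illustrative per-tensor choice for a Q4_K_M-style mix.

    Embeddings, attention V, and FFN down projections are promoted to
    Q6_K; everything else stays at the Q4_K base, matching the map below.
    """
    if "token_embd" in tensor_name:
        return "Q6_K"  # vocabulary understanding benefits most
    if "attn_v" in tensor_name:
        return "Q6_K"  # value vectors drive context retrieval
    if "ffn_down" in tensor_name:
        return "Q6_K"  # the final FFN transformation feeds every later layer
    return "Q4_K"      # Q, K, gate, up and output stay at the base type

print(q4_k_m_tensor_type("blk.0.attn_v.weight"))    # Q6_K
print(q4_k_m_tensor_type("blk.0.ffn_gate.weight"))  # Q4_K
```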

The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size increases are modest relative to quality gains.

## The Complete Quantisation Map

Here's what's actually happening inside these models, based on analysis of real GGUF files:

| Variant | Embed | Output | Q | K | V | Gate | Up | Down |
|---------|-------|--------|------|------|------|------|------|------|
| Q3_K_M  | Q6_K  | Q4_K   | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_L  | Q6_K  | Q5_K   | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_XL | Q8_0  | Q5_K   | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q4_K_M  | Q6_K  | Q4_K   | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q4_K_L  | Q8_0  | Q4_K   | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q5_K_M  | Q6_K  | Q5_K   | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q5_K_L  | Q8_0  | Q5_K   | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q6_K_L  | Q8_0  | Q6_K   | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K |

Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6), and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL variant as it has room for both improvements without competing with the next tier.
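
The same map, transcribed into a small data structure for programmatic use; `BARTOWSKI_MAP`, `COLUMNS`, and `diff` are illustrative names invented for this sketch rather than anything from llama.cpp or Bartowski's tooling:

```python
# Per-tensor quantisation by variant, transcribed from the table above.
COLUMNS = ("embed", "output", "q", "k", "v", "gate", "up", "down")
BARTOWSKI_MAP = {
    "Q3_K_M":  ("Q6_K", "Q4_K", "Q3_K", "Q3_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K"),
    "Q3_K_L":  ("Q6_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K"),
    "Q3_K_XL": ("Q8_0", "Q5_K", "Q3_K", "Q3_K", "Q5_K", "Q3_K", "Q3_K", "Q5_K"),
    "Q4_K_M":  ("Q6_K", "Q4_K", "Q4_K", "Q4_K", "Q6_K", "Q4_K", "Q4_K", "Q6_K"),
    "Q4_K_L":  ("Q8_0", "Q4_K", "Q4_K", "Q4_K", "Q6_K", "Q4_K", "Q4_K", "Q6_K"),
    "Q5_K_M":  ("Q6_K", "Q5_K", "Q5_K", "Q5_K", "Q6_K", "Q5_K", "Q5_K", "Q6_K"),
    "Q5_K_L":  ("Q8_0", "Q5_K", "Q5_K", "Q5_K", "Q6_K", "Q5_K", "Q5_K", "Q6_K"),
    "Q6_K_L":  ("Q8_0", "Q6_K", "Q6_K", "Q6_K", "Q6_K", "Q6_K", "Q6_K", "Q6_K"),
}

def diff(a: str, b: str) -> dict:
    """Tensor groups whose quantisation differs between two variants."""
    return {
        col: (x, y)
        for col, x, y in zip(COLUMNS, BARTOWSKI_MAP[a], BARTOWSKI_MAP[b])
        if x != y
    }

print(diff("Q4_K_M", "Q4_K_L"))   # {'embed': ('Q6_K', 'Q8_0')} -- one surgical change
print(diff("Q3_K_M", "Q3_K_XL"))  # embeddings and output -- the only two-change variant
```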

## The Architecture of Intelligence

Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically improves handling of technical terms and rare words.

Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical: they're what the model retrieves when attending to context. M variants enhance V layers whilst leaving Q and K at base quantisation, improving information retrieval without an excessive size increase.

Feed-forward network trade-offs: Gate and up projections (44.6% of parameters, 1,793M, 3.59GB) stay at base quantisation as enhancement would double file sizes for modest gains. Down projections (22.3%, 897M, 1.79GB) get enhanced in M variants as they're the final transformation affecting all downstream processing.

The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L targets it for enhancement as improved output precision can mean the difference between coherent and garbled text for Q3-based models.
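
These shares and F16 sizes follow directly from the parameter counts (F16 stores two bytes per parameter), as the following quick check shows:

```python
# Parameter counts (millions) quoted above for the Qwen3 4B reference model.
components = {
    "embeddings": 389,
    "attention (Q, K, V)": 566,
    "FFN gate + up": 1_793,
    "FFN down": 897,
    "output": 378,
}

total = sum(components.values())
for name, millions in components.items():
    share = 100 * millions / total      # percentage of total parameters
    f16_gb = millions * 1e6 * 2 / 1e9   # two bytes per parameter at F16
    print(f"{name:22s} {share:5.1f}%  {f16_gb:.2f} GB at F16")
# Reproduces the figures quoted above (9.7% / 0.78 GB for embeddings,
# 44.6% / 3.59 GB for gate + up, and so on), within rounding.
```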

## The Economics of Enhancement

Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (19% increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst maximising vocabulary understanding. A naive approach of upgrading everything would add gigabytes for marginal improvements.

Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant represents a local optimum: better quality requires jumping tiers, and a smaller size sacrifices key capabilities.

## Why Q3_K Gets Special Treatment

Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides granular control for memory-constrained environments, with each step adding only 0.1-0.2GB while delivering meaningful quality improvements.

Neither Q4_K_XL nor Q5_K_XL exists, because they would compete with the next tier. A hypothetical Q4_K_XL at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides better quality than selectively enhanced Q4_K layers.

The pattern is consistent: significant enhancements to Q5_K or Q6_K mean you should jump to the next base type. Sweet spots: Q3 family for extreme memory constraints, Q4/Q5 for mainstream use, Q6/Q8 when quality matters more than size.
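
As a rough illustration of picking within those sweet spots, the helper below chooses the largest documented variant that fits a memory budget, using only the approximate Qwen3 4B sizes quoted in this document (`SIZES_GB` and `best_variant` are invented for this sketch, and real sizes vary by model):

```python
# Approximate file sizes (GB) for the Qwen3 4B reference, as quoted above;
# Q4_K_L is Q4_K_M plus the 0.44 GB embedding upgrade, and Q5_K_M uses the
# ~2.75 GB figure from the tier-competition example. Larger variants omitted.
SIZES_GB = {
    "Q3_K_M": 1.5,
    "Q3_K_L": 1.6,
    "Q3_K_XL": 1.8,
    "Q4_K_M": 2.26,
    "Q4_K_L": 2.70,
    "Q5_K_M": 2.75,
}

def best_variant(budget_gb: float) -> str | None:
    """Largest documented variant that fits within the memory budget."""
    fitting = [(size, name) for name, size in SIZES_GB.items() if size <= budget_gb]
    return max(fitting)[1] if fitting else None

print(best_variant(2.0))  # Q3_K_XL -- the Q3 family for tight memory budgets
print(best_variant(2.8))  # Q5_K_M -- jump tiers rather than enhance Q4 further
```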

## Implementation Insights

Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's variants requires minimal configuration:

```python
# Q3_K_L: Only upgrade output from M baseline
config = {
    "base": "Q3_K_M",  # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K"   # Single surgical change
}

# Q4_K_L: Only upgrade embeddings from M baseline
config = {
    "base": "Q4_K_M",     # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0"  # Single surgical change
}

# Q3_K_XL: The only variant needing two changes
config = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K"
}
```

This minimalist approach recognises that M variants already embody years of community optimisation. Bartowski's contribution lies in identifying where small adjustments yield outsized returns.

## The Deeper Pattern

This system evolved through countless experiments rather than top-down design. M variants encode hard-won knowledge about critical layers. L variants build on this foundation. The absence of most XL variants shows where diminishing returns set in.

Bartowski's quantisations work because they embody years of collective learning about what matters in practice. They demonstrate that the best solutions often come from understanding and building upon what already works, rather than grand redesigns.