Switch to llama-cpp-python
Parent: ef7df1a8c3 · Commit: d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

docs/bartowski_analysis.md (new file, +127 lines)
# Bartowski Quantisation Analysis

Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't
apply uniform quantisation as their names suggest.

1. [The Hidden Sophistication of M Variants](#the-hidden-sophistication-of-m-variants)
2. [The Complete Quantisation Map](#the-complete-quantisation-map)
3. [The Architecture of Intelligence](#the-architecture-of-intelligence)
4. [The Economics of Enhancement](#the-economics-of-enhancement)
5. [Why Q3\_K Gets Special Treatment](#why-q3_k-gets-special-treatment)
6. [Implementation Insights](#implementation-insights)
7. [The Deeper Pattern](#the-deeper-pattern)
## The Hidden Sophistication of M Variants

When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically
enhances critical components – embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down
projections receive the same treatment. This represents years of empirical optimisation baked
directly into the quantisation logic.
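
As a rough illustration, this selection can be pictured as a per-tensor rule keyed on tensor
names. The sketch below is not llama.cpp's actual implementation; the function and the name
patterns are assumptions for illustration only:

```python
# Sketch only: an illustrative per-tensor type rule for a Q4_K_M-style mix.
# The function and name patterns are hypothetical, not llama.cpp code.
def q4_k_m_tensor_type(tensor_name: str) -> str:
    """Return the quant type an M-variant mix might assign to one tensor."""
    if "token_embd" in tensor_name:   # embeddings: boosted for vocabulary fidelity
        return "Q6_K"
    if "attn_v" in tensor_name:       # value projections: boosted for retrieval quality
        return "Q6_K"
    if "ffn_down" in tensor_name:     # down projections: affect all downstream processing
        return "Q6_K"
    return "Q4_K"                     # everything else stays at the base type


for name in ["token_embd.weight", "blk.0.attn_q.weight",
             "blk.0.attn_v.weight", "blk.0.ffn_down.weight"]:
    print(f"{name}: {q4_k_m_tensor_type(name)}")
```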

The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply
takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size
increases are modest relative to quality gains.

## The Complete Quantisation Map

Here's what's actually happening inside these models, based on analysis of real GGUF files:

| Variant | Embeddings | Output | Attn Q | Attn K | Attn V | FFN Gate | FFN Up | FFN Down |
|---------|------------|--------|--------|--------|--------|----------|--------|----------|
| Q3_K_M  | Q6_K       | Q4_K   | Q3_K   | Q3_K   | Q5_K   | Q3_K     | Q3_K   | Q5_K     |
| Q3_K_L  | Q6_K       | Q5_K   | Q3_K   | Q3_K   | Q5_K   | Q3_K     | Q3_K   | Q5_K     |
| Q3_K_XL | Q8_0       | Q5_K   | Q3_K   | Q3_K   | Q5_K   | Q3_K     | Q3_K   | Q5_K     |
| Q4_K_M  | Q6_K       | Q4_K   | Q4_K   | Q4_K   | Q6_K   | Q4_K     | Q4_K   | Q6_K     |
| Q4_K_L  | Q8_0       | Q4_K   | Q4_K   | Q4_K   | Q6_K   | Q4_K     | Q4_K   | Q6_K     |
| Q5_K_M  | Q6_K       | Q5_K   | Q5_K   | Q5_K   | Q6_K   | Q5_K     | Q5_K   | Q6_K     |
| Q5_K_L  | Q8_0       | Q5_K   | Q5_K   | Q5_K   | Q6_K   | Q5_K     | Q5_K   | Q6_K     |
| Q6_K_L  | Q8_0       | Q6_K   | Q6_K   | Q6_K   | Q6_K   | Q6_K     | Q6_K   | Q6_K     |

Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6),
and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL
variant, as it has room for both improvements without competing with the next tier.
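
One way to keep these patterns at hand is to restate the table as per-variant overrides on a
uniform base type. The dictionary below is just that restatement; the key names ("embed",
"attn_v", and so on) are informal labels rather than GGUF tensor names:

```python
# The table above, restated as overrides applied on top of a uniform base type.
# Keys are informal labels, not GGUF tensor names.
VARIANTS = {
    "Q3_K_M":  {"base": "Q3_K", "embed": "Q6_K", "output": "Q4_K", "attn_v": "Q5_K", "ffn_down": "Q5_K"},
    "Q3_K_L":  {"base": "Q3_K", "embed": "Q6_K", "output": "Q5_K", "attn_v": "Q5_K", "ffn_down": "Q5_K"},
    "Q3_K_XL": {"base": "Q3_K", "embed": "Q8_0", "output": "Q5_K", "attn_v": "Q5_K", "ffn_down": "Q5_K"},
    "Q4_K_M":  {"base": "Q4_K", "embed": "Q6_K", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q4_K_L":  {"base": "Q4_K", "embed": "Q8_0", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q5_K_M":  {"base": "Q5_K", "embed": "Q6_K", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q5_K_L":  {"base": "Q5_K", "embed": "Q8_0", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q6_K_L":  {"base": "Q6_K", "embed": "Q8_0"},
}
# Anything not listed for a variant stays at its base type.
```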

## The Architecture of Intelligence

Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at
F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the
model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically
improves handling of technical terms and rare words.
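
For a back-of-the-envelope check on those numbers, tensor size is roughly parameters ×
bits-per-weight ÷ 8. The sketch below uses approximate average bits-per-weight figures; the exact
on-disk delta depends on each format's block overhead:

```python
# Rough tensor sizes: params * bits-per-weight / 8 bytes.
# Bits-per-weight values are approximate averages for each quant type.
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q4_K": 4.5}

def tensor_gb(params: float, quant: str) -> float:
    """Approximate size in GB of a tensor with `params` weights stored at `quant`."""
    return params * BPW[quant] / 8 / 1e9

embed_params = 389e6  # Qwen3 4B embedding matrix, ~9.7% of parameters
for quant in ("F16", "Q8_0", "Q6_K", "Q4_K"):
    print(f"{quant:>4}: {tensor_gb(embed_params, quant):.2f} GB")
# F16 comes out near 0.78 GB, and the Q4 -> Q8 gap is only a fraction of a gigabyte.
```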

Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical
– they're what the model retrieves when attending to context. M variants enhance V layers whilst
leaving Q and K at base quantisation, improving retrieval quality without excessive size increase.

Feed-forward network trade-offs: gate and up projections (44.6% of parameters, 1,793M, 3.59GB)
stay at base quantisation, as enhancement would double file sizes for modest gains. Down projections
(22.3%, 897M, 1.79GB) are enhanced in M variants because they're the final transformation affecting
all downstream processing.

The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L
targets it for enhancement, as improved output precision can mean the difference between coherent
and garbled text for Q3-based models.

## The Economics of Enhancement

Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (19%
increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst maximising
vocabulary understanding. A naive approach of upgrading everything would add gigabytes for marginal
improvements.

Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant
represents a local optimum – better quality requires jumping tiers, smaller size sacrifices key
capabilities.

## Why Q3_K Gets Special Treatment

Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room
for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides
granular control for memory-constrained environments, with each step adding roughly 7-13% to file
size while delivering meaningful quality improvements.

Q4_K_XL and Q5_K_XL don't exist because they would compete with the next tier. A hypothetical
Q4_K_XL at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides
better quality than selectively enhanced Q4_K layers.

The pattern is consistent: once a variant would need significant enhancements beyond Q5_K or Q6_K,
it makes more sense to jump to the next base type. The sweet spots are the Q3 family for extreme
memory constraints, Q4/Q5 for mainstream use, and Q6/Q8 when quality matters more than size.

## Implementation Insights

Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's
variants requires minimal configuration:

```python
# Q3_K_L: Only upgrade output from M baseline
config = {
    "base": "Q3_K_M",      # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K",      # Single surgical change
}

# Q4_K_L: Only upgrade embeddings from M baseline
config = {
    "base": "Q4_K_M",      # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0",  # Single surgical change
}

# Q3_K_XL: The only variant needing two changes
config = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K",
}
```
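
A minimal sketch of how such a config could be expanded into a full per-tensor plan is shown
below. The helper, the override keys, and the baseline dictionaries are assumptions for
illustration; none of this is llama.cpp or llama-cpp-python API:

```python
# Hypothetical helper: expand a {base, embeddings, output} config into the
# per-tensor assignments from the table earlier in this document.
M_BASELINES = {
    "Q3_K_M": {"default": "Q3_K", "embeddings": "Q6_K", "attn_v": "Q5_K",
               "ffn_down": "Q5_K", "output": "Q4_K"},
    "Q4_K_M": {"default": "Q4_K", "embeddings": "Q6_K", "attn_v": "Q6_K",
               "ffn_down": "Q6_K", "output": "Q4_K"},
}

def expand(config: dict) -> dict:
    """Apply the surgical overrides on top of the chosen M baseline."""
    plan = dict(M_BASELINES[config["base"]])
    for key in ("embeddings", "output"):
        if key in config:
            plan[key] = config[key]
    return plan

# Q3_K_XL = Q3_K_M baseline + Q8_0 embeddings + Q5_K output
print(expand({"base": "Q3_K_M", "embeddings": "Q8_0", "output": "Q5_K"}))
```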

This minimalist approach recognises that M variants already embody years of community optimisation.
Bartowski's contribution lies in identifying where small adjustments yield outsized returns.

## The Deeper Pattern

This system evolved through countless experiments rather than top-down design. M variants encode
hard-won knowledge about critical layers. L variants build on this foundation. The absence of most
XL variants shows where diminishing returns set in.

Bartowski's quantisations work because they embody years of collective learning about what matters
in practice. They demonstrate that the best solutions often come from understanding and building
upon what already works, rather than grand redesigns.