Switch to llama-cpp-python
Parent: ef7df1a8c3 · Commit: d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

docs/bartowski_analysis.md (new file, +127 lines)
# Bartowski Quantisation Analysis

Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't
apply uniform quantisation as their names suggest.

1. [The Hidden Sophistication of M Variants](#the-hidden-sophistication-of-m-variants)
2. [The Complete Quantisation Map](#the-complete-quantisation-map)
3. [The Architecture of Intelligence](#the-architecture-of-intelligence)
4. [The Economics of Enhancement](#the-economics-of-enhancement)
5. [Why Q3\_K Gets Special Treatment](#why-q3_k-gets-special-treatment)
6. [Implementation Insights](#implementation-insights)
7. [The Deeper Pattern](#the-deeper-pattern)
## The Hidden Sophistication of M Variants

When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically
enhances critical components – embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down
projections receive the same treatment. This represents years of empirical optimisation baked
directly into the quantisation logic.
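
As a rough illustration, this selection can be pictured as a per-tensor rule keyed on tensor
names. The sketch below is not llama.cpp's actual implementation; the function and the name
patterns are assumptions for illustration only:

```python
# Sketch only: an illustrative per-tensor type rule for a Q4_K_M-style mix.
# The function and name patterns are hypothetical, not llama.cpp code.
def q4_k_m_tensor_type(tensor_name: str) -> str:
    """Return the quant type an M-variant mix might assign to one tensor."""
    if "token_embd" in tensor_name:   # embeddings: boosted for vocabulary fidelity
        return "Q6_K"
    if "attn_v" in tensor_name:       # value projections: boosted for retrieval quality
        return "Q6_K"
    if "ffn_down" in tensor_name:     # down projections: affect all downstream processing
        return "Q6_K"
    return "Q4_K"                     # everything else stays at the base type


for name in ["token_embd.weight", "blk.0.attn_q.weight",
             "blk.0.attn_v.weight", "blk.0.ffn_down.weight"]:
    print(f"{name}: {q4_k_m_tensor_type(name)}")
```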

The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply
takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size
increases are modest relative to quality gains.

## The Complete Quantisation Map

Here's what's actually happening inside these models, based on analysis of real GGUF files:

| Variant | Embeddings | Output | Attn Q | Attn K | Attn V | FFN Gate | FFN Up | FFN Down |
|---------|------------|--------|--------|--------|--------|----------|--------|----------|
| Q3_K_M  | Q6_K       | Q4_K   | Q3_K   | Q3_K   | Q5_K   | Q3_K     | Q3_K   | Q5_K     |
| Q3_K_L  | Q6_K       | Q5_K   | Q3_K   | Q3_K   | Q5_K   | Q3_K     | Q3_K   | Q5_K     |
| Q3_K_XL | Q8_0       | Q5_K   | Q3_K   | Q3_K   | Q5_K   | Q3_K     | Q3_K   | Q5_K     |
| Q4_K_M  | Q6_K       | Q4_K   | Q4_K   | Q4_K   | Q6_K   | Q4_K     | Q4_K   | Q6_K     |
| Q4_K_L  | Q8_0       | Q4_K   | Q4_K   | Q4_K   | Q6_K   | Q4_K     | Q4_K   | Q6_K     |
| Q5_K_M  | Q6_K       | Q5_K   | Q5_K   | Q5_K   | Q6_K   | Q5_K     | Q5_K   | Q6_K     |
| Q5_K_L  | Q8_0       | Q5_K   | Q5_K   | Q5_K   | Q6_K   | Q5_K     | Q5_K   | Q6_K     |
| Q6_K_L  | Q8_0       | Q6_K   | Q6_K   | Q6_K   | Q6_K   | Q6_K     | Q6_K   | Q6_K     |

Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6),
and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL
variant, as it has room for both improvements without competing with the next tier.
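
One way to keep these patterns at hand is to restate the table as per-variant overrides on a
uniform base type. The dictionary below is just that restatement; the key names ("embed",
"attn_v", and so on) are informal labels rather than GGUF tensor names:

```python
# The table above, restated as overrides applied on top of a uniform base type.
# Keys are informal labels, not GGUF tensor names.
VARIANTS = {
    "Q3_K_M":  {"base": "Q3_K", "embed": "Q6_K", "output": "Q4_K", "attn_v": "Q5_K", "ffn_down": "Q5_K"},
    "Q3_K_L":  {"base": "Q3_K", "embed": "Q6_K", "output": "Q5_K", "attn_v": "Q5_K", "ffn_down": "Q5_K"},
    "Q3_K_XL": {"base": "Q3_K", "embed": "Q8_0", "output": "Q5_K", "attn_v": "Q5_K", "ffn_down": "Q5_K"},
    "Q4_K_M":  {"base": "Q4_K", "embed": "Q6_K", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q4_K_L":  {"base": "Q4_K", "embed": "Q8_0", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q5_K_M":  {"base": "Q5_K", "embed": "Q6_K", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q5_K_L":  {"base": "Q5_K", "embed": "Q8_0", "attn_v": "Q6_K", "ffn_down": "Q6_K"},
    "Q6_K_L":  {"base": "Q6_K", "embed": "Q8_0"},
}
# Anything not listed for a variant stays at its base type.
```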

## The Architecture of Intelligence

Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at
F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the
model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically
improves handling of technical terms and rare words.
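
For a back-of-the-envelope check on those numbers, tensor size is roughly parameters ×
bits-per-weight ÷ 8. The sketch below uses approximate average bits-per-weight figures; the exact
on-disk delta depends on each format's block overhead:

```python
# Rough tensor sizes: params * bits-per-weight / 8 bytes.
# Bits-per-weight values are approximate averages for each quant type.
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q4_K": 4.5}

def tensor_gb(params: float, quant: str) -> float:
    """Approximate size in GB of a tensor with `params` weights stored at `quant`."""
    return params * BPW[quant] / 8 / 1e9

embed_params = 389e6  # Qwen3 4B embedding matrix, ~9.7% of parameters
for quant in ("F16", "Q8_0", "Q6_K", "Q4_K"):
    print(f"{quant:>4}: {tensor_gb(embed_params, quant):.2f} GB")
# F16 comes out near 0.78 GB, and the Q4 -> Q8 gap is only a fraction of a gigabyte.
```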

Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical
– they're what the model retrieves when attending to context. M variants enhance V layers whilst
leaving Q and K at base quantisation, improving retrieval quality without excessive size increase.

Feed-forward network trade-offs: gate and up projections (44.6% of parameters, 1,793M, 3.59GB)
stay at base quantisation, as enhancement would double file sizes for modest gains. Down projections
(22.3%, 897M, 1.79GB) are enhanced in M variants because they're the final transformation affecting
all downstream processing.

The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L
targets it for enhancement, as improved output precision can mean the difference between coherent
and garbled text for Q3-based models.

## The Economics of Enhancement

Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (19%
increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst maximising
vocabulary understanding. A naive approach of upgrading everything would add gigabytes for marginal
improvements.

Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant
represents a local optimum – better quality requires jumping tiers, smaller size sacrifices key
capabilities.

## Why Q3_K Gets Special Treatment

Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room
for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides
granular control for memory-constrained environments, with each step adding roughly 7-13% to file
size while delivering meaningful quality improvements.

Q4_K_XL and Q5_K_XL don't exist because they would compete with the next tier. A hypothetical
Q4_K_XL at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides
better quality than selectively enhanced Q4_K layers.

The pattern is consistent: once a variant would need significant enhancements beyond Q5_K or Q6_K,
it makes more sense to jump to the next base type. The sweet spots are the Q3 family for extreme
memory constraints, Q4/Q5 for mainstream use, and Q6/Q8 when quality matters more than size.

## Implementation Insights

Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's
variants requires minimal configuration:

```python
# Q3_K_L: Only upgrade output from M baseline
config = {
    "base": "Q3_K_M",      # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K",      # Single surgical change
}

# Q4_K_L: Only upgrade embeddings from M baseline
config = {
    "base": "Q4_K_M",      # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0",  # Single surgical change
}

# Q3_K_XL: The only variant needing two changes
config = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K",
}
```
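
A minimal sketch of how such a config could be expanded into a full per-tensor plan is shown
below. The helper, the override keys, and the baseline dictionaries are assumptions for
illustration; none of this is llama.cpp or llama-cpp-python API:

```python
# Hypothetical helper: expand a {base, embeddings, output} config into the
# per-tensor assignments from the table earlier in this document.
M_BASELINES = {
    "Q3_K_M": {"default": "Q3_K", "embeddings": "Q6_K", "attn_v": "Q5_K",
               "ffn_down": "Q5_K", "output": "Q4_K"},
    "Q4_K_M": {"default": "Q4_K", "embeddings": "Q6_K", "attn_v": "Q6_K",
               "ffn_down": "Q6_K", "output": "Q4_K"},
}

def expand(config: dict) -> dict:
    """Apply the surgical overrides on top of the chosen M baseline."""
    plan = dict(M_BASELINES[config["base"]])
    for key in ("embeddings", "output"):
        if key in config:
            plan[key] = config[key]
    return plan

# Q3_K_XL = Q3_K_M baseline + Q8_0 embeddings + Q5_K output
print(expand({"base": "Q3_K_M", "embeddings": "Q8_0", "output": "Q5_K"}))
```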

This minimalist approach recognises that M variants already embody years of community optimisation.
Bartowski's contribution lies in identifying where small adjustments yield outsized returns.

## The Deeper Pattern

This system evolved through countless experiments rather than top-down design. M variants encode
hard-won knowledge about critical layers. L variants build on this foundation. The absence of most
XL variants shows where diminishing returns set in.

Bartowski's quantisations work because they embody years of collective learning about what matters
in practice. They demonstrate that the best solutions often come from understanding and building
upon what already works, rather than grand redesigns.