# quantise_gguf.py - Advanced GGUF Quantisation

Transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.

1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)

## The Full Picture

GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: the K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1),
experimental integer types (IQ2-IQ4), and full-precision F16/BF16. The key is knowing when to use
each.

Replicating Bartowski's patterns revealed an interesting limitation. llama-cpp-python exposes
control over the embedding and output layers, but the more sophisticated `tensor_types` parameter
expects a pointer to a C++ `std::vector<tensor_quantization>`, which cannot be constructed from
Python. This architectural boundary between Python and C++ cannot be worked around without
significant redesign.

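What remains controllable from Python are the plain enum fields on the quantisation parameter
struct. The snippet below is a minimal sketch of that boundary, assuming a recent llama-cpp-python
build whose low-level bindings expose `llama_model_quantize` and the
`token_embedding_type`/`output_tensor_type` fields; the integer values are taken from ggml's type
enum and the file names are placeholders:

```python
import ctypes
import llama_cpp

# Values from ggml's type enum (plain integers, so they cross the C boundary easily).
GGML_TYPE_Q8_0 = 8
GGML_TYPE_Q6_K = 14

params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M  # base mixture for the whole model
params.token_embedding_type = GGML_TYPE_Q8_0        # enum field - settable from Python
params.output_tensor_type = GGML_TYPE_Q6_K          # enum field - settable from Python
# params.tensor_types would need a std::vector<tensor_quantization>* built in C++,
# which is why arbitrary per-tensor overrides stay out of reach from Python.

result = llama_cpp.llama_model_quantize(
    b"model-f16.gguf",         # placeholder input path
    b"model-quantised.gguf",   # placeholder output path
    ctypes.byref(params),
)
assert result == 0, "quantisation failed"
```

The `quantise_model_flexible` API shown under Practical Usage works within exactly this constraint.
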
Analysis of Bartowski's GGUF files shows this limitation doesn't matter in practice. M variants
already include per-layer enhancements – Q4_K_M uses Q6_K for embeddings, attention V, and FFN down
layers. Bartowski's L and XL variants only tweak embeddings and output layers, which is precisely
what we can control: working with the constraint rather than against it.

For further optimisation, importance matrix (imatrix) files guide quantisation based on observed
usage patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining
or generating these files – they are particularly crucial at lower bit rates.

## Understanding the Variants

Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't middle
ground but optimised baselines – Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K
elsewhere, a balance proven through years of community experimentation.

L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary handling). Q3_K_L upgrades the output layer to Q5_K. Q3_K_XL combines both
strategies. No Q4_K_XL or Q5_K_XL variants exist – at those sizes, Q5_K_M's superior base
quantisation wins.

Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed
architectural interactions.

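Summarising the tweaks above as data makes the family easier to compare at a glance. The mapping
below is purely illustrative – it is not the tool's internal profile table, and the dictionary name
is invented:

```python
# Illustrative summary of the variant tweaks described above (not the tool's
# internal profile table). None means "keep whatever the base profile already uses".
VARIANT_TWEAKS: dict[str, tuple[str, str | None, str | None]] = {
    #   name        base      embeddings  output
    "Q3_K_L":  ("Q3_K_M", None,   "Q5_K"),  # output upgraded from Q4_K to Q5_K
    "Q3_K_XL": ("Q3_K_M", "Q8_0", "Q5_K"),  # both upgrades combined
    "Q4_K_L":  ("Q4_K_M", "Q8_0", None),    # embeddings upgraded from Q6_K to Q8_0
    "Q5_K_L":  ("Q5_K_M", "Q8_0", None),
    "Q6_K_L":  ("Q6_K",   "Q8_0", None),
}

for name, (base, embeddings, output) in VARIANT_TWEAKS.items():
    print(f"{name:8} base={base:7} embeddings={embeddings or 'base'} output={output or 'base'}")
```
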
## Practical Usage

The tool handles the complete workflow: fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. Fire-and-forget design – start it and return to completed models.

The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):

```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None        # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K"      # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K"      # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0"      # Upgrade output to maximum precision Q8_0
)
```

Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:

```bash
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B

# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise_gguf.py <model_url> --no-upload

# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```

## The Architecture Behind the Magic

Based on analysis of Qwen3 4B: embeddings (9.7% of parameters) critically affect vocabulary
handling – upgrading them from Q4 to Q8 adds just 0.17 GB but dramatically improves rare-token
quality. Attention (14.1% of parameters in total) has its V layers (4.7%) enhanced in M variants,
whilst Q and K stay at the base type for size control.

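As a rough sanity check on that embedding figure, the size delta can be estimated from bits per
weight (Q8_0 stores 8.5 bits per weight, Q6_K about 6.56, Q4_K about 4.5). The helper below is a
back-of-envelope sketch, not part of the tool, and the parameter count is approximate; real GGUF
sizes also include block metadata and other tensors:

```python
# Back-of-envelope estimate of the size change when re-quantising a single tensor.
# Bits-per-weight figures come from llama.cpp block formats; the embedding
# parameter count below is an approximation, not a measured value.
def requant_delta_gb(n_params: float, from_bpw: float, to_bpw: float) -> float:
    """Approximate size change in gigabytes when moving between bit widths."""
    return n_params * (to_bpw - from_bpw) / 8 / 1e9

qwen3_4b_embedding_params = 0.39e9  # roughly 9.7% of 4B parameters
print(f"Q4_K -> Q8_0: +{requant_delta_gb(qwen3_4b_embedding_params, 4.5, 8.5):.2f} GB")
# Prints ~0.20 GB - the same ballpark as the ~0.17 GB quoted above.
```
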
Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at the
base type, as enhancement would double size for modest gains. Down projections (22.3%) are enhanced
in M variants for feature-transformation quality. The output layer (9.4%) gets special attention in
Q3_K_L for prediction quality.

For an 8B model, the Q4_K_M baseline is ~4.5 GB with its Q6_K enhancements. Q4_K_L adds 753 MB
(5.3 GB total) for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6 GB – at which point
Q5_K_M's superior base quantisation makes more sense.

## Environment and Performance

Configuration is via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, and `DEBUG=true` for verbose logging. The tool uses llama-cpp-python (auto-installed via
uv), benefits from imatrix files, and requires a HuggingFace account only for uploads.

Requirements scale predictably: disk needs roughly 3x the model size (original, F32, and outputs),
and memory tracks model size with streaming optimisations. Processing takes minutes to hours
depending on model size, and downloads range from a few gigabytes to 100 GB+ for the largest models.

Error handling is comprehensive: automatic retry with exponential backoff, early dependency
detection, disk space checks, actionable API error messages, and detailed conversion failure logs.
The resilient workflow keeps you informed whilst handling the challenges of large-model processing.

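The retry behaviour follows the familiar exponential backoff pattern. The snippet below is a
generic sketch of that pattern rather than the tool's actual retry helper; the function name and
delay values are illustrative:

```python
import time

# Generic exponential backoff sketch (illustrative names and delays,
# not the tool's own retry helper).
def with_retries(operation, max_attempts: int = 5, base_delay: float = 2.0):
    """Run operation(), retrying on failure with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == max_attempts:
                raise  # out of retries - surface the original error
            delay = base_delay * 2 ** (attempt - 1)  # 2s, 4s, 8s, ...
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            time.sleep(delay)
```
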
## Output and Organisation

Outputs are organised per model: the F32/F16 base, quantisation variants, imatrix files, and
documentation, following the naming pattern `model-name-variant.gguf`. Successful uploads auto-clean
their local files; failures preserve them for manual intervention. Generated READMEs document
variant characteristics and technical details.

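That clean-on-success, keep-on-failure policy is simple to express. The sketch below is schematic;
`upload_variant` is a hypothetical stand-in for the real upload code:

```python
from pathlib import Path

# Schematic sketch of the clean-on-success / keep-on-failure policy described
# above. upload_variant() is a hypothetical stand-in for the real upload code.
def upload_and_tidy(gguf_path: Path, upload_variant) -> bool:
    try:
        upload_variant(gguf_path)
    except Exception as error:
        # Keep the local file so the upload can be retried or inspected manually.
        print(f"Upload failed for {gguf_path.name}: {error} - keeping local file")
        return False
    gguf_path.unlink()  # upload succeeded, so reclaim the disk space
    return True
```
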
Uploads include metadata, quantisation tags, and model cards explaining the trade-offs. The parallel
upload system maximises throughput whilst providing full progress visibility.