# quantise_gguf.py - Advanced GGUF Quantisation

Transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.

1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)

## The Full Picture

GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: the K-quant series (Q3_K to Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1),
experimental IQ formats (IQ2 to IQ4), and full-precision F16/BF16. The key is knowing when each
is worth using.

Replicating Bartowski's patterns revealed an interesting limitation. llama-cpp-python exposes
embedding and output layer control, but the more sophisticated `tensor_types` parameter expects a
pointer to a C++ `std::vector<tensor_quantization>`, which cannot be constructed from Python. This
architectural boundary between Python and C++ cannot be worked around without significant redesign.

Analysis of Bartowski's GGUF files shows this limitation doesn't matter in practice. M variants
already include per-layer enhancements: Q4_K_M uses Q6_K for the embeddings, attention V, and FFN
down layers. Bartowski's L and XL variants only tweak the embedding and output layers, which is
precisely what we can control, so the tool works with the constraint rather than against it.
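
At the API level, the boundary looks roughly like the sketch below. This is not the tool's own
code: it assumes the low-level `llama_model_quantize` bindings and the `token_embedding_type` /
`output_tensor_type` fields exposed by recent llama-cpp-python releases, and the exact names may
differ between versions.

```python
# Rough sketch of the per-layer control reachable from Python; not the tool's
# own implementation. Field and constant names assume a recent llama-cpp-python.
import ctypes
import llama_cpp

params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M      # base profile
params.token_embedding_type = llama_cpp.GGML_TYPE_Q8_0  # embeddings -> Q8_0
# params.output_tensor_type = llama_cpp.GGML_TYPE_Q6_K  # output layer, if desired
# The full per-tensor map (params.tensor_types) expects a pointer to a C++
# std::vector<tensor_quantization>, which is the part Python cannot build.

llama_cpp.llama_model_quantize(
    b"model-f16.gguf",
    b"model-Q4_K_L.gguf",
    ctypes.byref(params),
)
```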

For further optimisation, importance matrix (imatrix) files guide quantisation based on real usage
patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining or
generating these files; they are particularly crucial at lower bit rates.

## Understanding the Variants

Our profiles match Bartowski's exact configurations, as determined from GGUF analysis. M variants
aren't a middle ground but optimised baselines: Q4_K_M uses Q6_K for critical layers whilst keeping
Q4_K elsewhere, a balance proven through years of community experimentation.

L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary handling). Q3_K_L upgrades the output layer to Q5_K. Q3_K_XL combines both
strategies. There is no Q4_K_XL or Q5_K_XL: at those sizes, Q5_K_M's superior base quantisation wins.

Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed
architectural interactions.
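
Summarised as data, the profiles above differ from their baselines in only two slots. The mapping
below is an illustrative sketch (the `PROFILES` dict and its key names are ours, not the tool's,
and the Q6_K base for Q6_K_L is inferred); `None` means the base type's own choice is kept.

```python
# Illustrative summary of the L/XL profiles described above; not the tool's own
# data structure. None means "keep the base type's choice".
PROFILES = {
    "Q3_K_L":  {"base": "Q3_K_M", "embeddings": None,   "output": "Q5_K"},
    "Q3_K_XL": {"base": "Q3_K_M", "embeddings": "Q8_0", "output": "Q5_K"},
    "Q4_K_L":  {"base": "Q4_K_M", "embeddings": "Q8_0", "output": None},
    "Q5_K_L":  {"base": "Q5_K_M", "embeddings": "Q8_0", "output": None},
    "Q6_K_L":  {"base": "Q6_K",   "embeddings": "Q8_0", "output": None},
}
```

Each entry maps directly onto the `quantise_model_flexible` call shown in the next section.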

## Practical Usage

The tool handles the complete workflow: it fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. The design is fire-and-forget: start it and come back to completed models.

The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):

```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None,       # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0",     # Upgrade output to maximum precision Q8_0
)
```

Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:
```bash
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B
# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix
# Local testing without upload
uv run quantise_gguf.py <model_url> --no-upload
# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```

## The Architecture Behind the Magic

Based on analysis of Qwen3 4B: embeddings (9.7% of parameters) critically affect vocabulary
coverage; upgrading them from Q4 to Q8 adds just 0.17GB but dramatically improves handling of rare
tokens. Attention (14.1% of parameters in total) has its V layers (4.7%) enhanced in M variants
whilst Q and K stay at the base type for size control.

Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at the
base type, as enhancing them would double the size for modest gains. Down projections (22.3%) are
enhanced in M variants for feature-transformation quality, and the output layer (9.4%) gets special
attention in Q3_K_L for prediction quality.

For an 8B model, the Q4_K_M baseline is ~4.5GB with its Q6_K enhancements. Q4_K_L adds 753MB
(5.3GB total) for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB, at which point
Q5_K_M's superior base quantisation makes more sense.
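
The arithmetic behind these figures is easy to sanity-check. The sketch below is a
back-of-the-envelope estimate only, using approximate bits-per-weight values for each type (real
GGUF files add block and metadata overhead):

```python
# Back-of-the-envelope cost of re-quantising one group of layers.
# Bits-per-weight figures are approximate; real GGUF sizes include block overheads.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q6_K": 6.56, "Q8_0": 8.5}

def upgrade_cost_gb(total_params: float, layer_fraction: float,
                    old_type: str, new_type: str) -> float:
    """Extra gigabytes from moving one layer group to a higher-precision type."""
    delta_bits = BITS_PER_WEIGHT[new_type] - BITS_PER_WEIGHT[old_type]
    return total_params * layer_fraction * delta_bits / 8 / 1e9

# Qwen3 4B embeddings (~9.7% of parameters), Q4 -> Q8: roughly 0.19GB,
# in the same ballpark as the 0.17GB figure quoted above.
print(f"{upgrade_cost_gb(4e9, 0.097, 'Q4_K', 'Q8_0'):.2f} GB")
```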

## Environment and Performance

Configuration is via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, and `DEBUG=true` for verbose logging. The tool uses llama-cpp-python (auto-installed via
uv), benefits from imatrix files, and requires a HuggingFace account only for uploads.

Requirements scale predictably: disk needs ~3x the model size (original, F32, outputs), and memory
tracks model size thanks to streaming optimisations. Processing takes minutes to hours depending on
model size, and downloads range from a few gigabytes to 100GB+ for the largest models.

Error handling is comprehensive: automatic retry with exponential backoff, early dependency
detection, disk space checks, actionable API error messages, and detailed conversion failure logs.
The resilient workflow keeps you informed whilst handling the challenges of processing large models.
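
The retry behaviour follows the standard exponential-backoff pattern. The sketch below illustrates
that pattern generically; it is not the tool's own implementation, and the function name and
defaults are ours.

```python
# Generic exponential-backoff retry, illustrating the pattern used for uploads;
# not the tool's own code.
import time

def with_retries(operation, attempts: int = 5, base_delay: float = 2.0):
    """Run operation(), retrying with exponentially growing delays on failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # the real workflow narrows this to transient errors
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```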

## Output and Organisation

Outputs are organised per model: the F32/F16 base, quantisation variants, imatrix files, and
documentation. Quantised files follow the naming pattern `model-name-variant.gguf`. Successful
uploads auto-clean their local files; failed uploads are preserved for manual intervention.
Generated READMEs document variant characteristics and technical details.

Uploads include metadata, quantisation tags, and model cards explaining the trade-offs, and the
parallel upload system maximises throughput with full progress visibility.