# quantise_gguf.py - Advanced GGUF Quantisation Transforms language models into optimised GGUF formats, from aggressive Q2 compression to high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent quality-to-size ratios whilst working within Python-to-C++ interop constraints. 1. [The Full Picture](#the-full-picture) 2. [Understanding the Variants](#understanding-the-variants) 3. [Practical Usage](#practical-usage) 4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic) 5. [Environment and Performance](#environment-and-performance) 6. [Output and Organisation](#output-and-organisation) ## The Full Picture GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp spectrum: K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1), experimental integer types (IQ2-IQ4), and full precision F16/BF16. The key is understanding strategic usage. Replicating Bartowski's patterns revealed an interesting limitation. Llama-cpp-python provides embedding and output layer control, but the sophisticated `tensor_types` parameter expects a C++ `std::vector` pointer – impossible to create from Python. This architectural boundary between Python and C++ cannot be worked around without significant redesign. Analysis of Bartowski's GGUF files shows this limitation doesn't matter. M variants already include per-layer enhancements – Q4_K_M uses Q6_K for embeddings, attention V, and FFN down layers. Bartowski's L and XL variants only tweak embeddings and output layers, precisely what we can control. Working with constraints rather than against them. For further optimisation, importance matrix (imatrix) files guide quantisation based on usage patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining or generating these files – particularly crucial at lower bit rates. ## Understanding the Variants Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't middle ground but optimised baselines – Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K elsewhere, a balance proven through years of community experimentation. L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19% size, better vocabulary). Q3_K_L upgrades output to Q5_K. Q3_K_XL combines both strategies. No Q4_K_XL or Q5_K_XL exist – at those sizes, Q5_K_M's superior base quantisation wins. Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed architectural interactions. ## Practical Usage The tool handles the complete workflow: fetches from HuggingFace, converts to GGUF, checks for imatrix files, processes multiple variants with parallel uploads, generates documentation, and uploads with metadata. Fire-and-forget design – start it and return to completed models. The Python API enables custom configurations (limited to embedding and output layers due to llama-cpp-python constraints): ```python from helpers.services.llama_python import LlamaCppPythonAPI api = LlamaCppPythonAPI() # Q4_K_L profile - upgrades embeddings to Q8_0 api.quantise_model_flexible( input_path="model-f16.gguf", output_path="model-Q4_K_L.gguf", base_type="Q4_K_M", # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!) embedding_type="Q8_0", # Further upgrade embeddings from Q6_K to Q8_0 output_type=None # Keep default from base type ) # Example 2: Q3_K_L profile - upgrades output to Q5_K api.quantise_model_flexible( input_path="model-f16.gguf", output_path="model-Q3_K_L.gguf", base_type="Q3_K_M", # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!) embedding_type=None, # Keep the already-enhanced Q6_K embeddings from base output_type="Q5_K" # Upgrade output from Q4_K to Q5_K ) # Q3_K_XL profile - upgrades both embeddings and output api.quantise_model_flexible( input_path="model-f16.gguf", output_path="model-Q3_K_XL.gguf", base_type="Q3_K_M", # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down embedding_type="Q8_0", # Further upgrade embeddings from Q6_K to Q8_0 output_type="Q5_K" # Upgrade output from Q4_K to Q5_K ) # Example 4: Custom experimental configuration api.quantise_model_flexible( input_path="model-f16.gguf", output_path="model-custom.gguf", base_type="Q5_K_M", # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down embedding_type="Q8_0", # Further upgrade embeddings from Q6_K to Q8_0 output_type="Q8_0" # Upgrade output to maximum precision Q8_0 ) ``` Command-line usage is even simpler. Just point it at a HuggingFace model and let it work: ```bash # Basic usage uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B # Skip imatrix checking for speed uv run quantise_gguf.py --no-imatrix # Local testing without upload uv run quantise_gguf.py --no-upload # Custom profiles uv run quantise_gguf.py --profiles Q3_K_M Q4_K_L Q6_K ``` ## The Architecture Behind the Magic Based on Qwen3 4B analysis: embeddings (9.7% of parameters) critically affect vocabulary – Q4 to Q8 adds just 0.17GB but dramatically improves rare tokens. Attention (14.1% total) has V layers (4.7%) enhanced in M variants whilst Q and K stay at base for size control. Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at base as enhancement would double size for modest gains. Down projections (22.3%) are enhanced in M variants for feature transformation quality. Output layer (9.4%) gets special attention in Q3_K_L for prediction quality. For an 8B model: Q4_K_M baseline is ~4.5GB with Q6_K enhancements. Q4_K_L adds 753MB (5.3GB total) for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB – at which point Q5_K_M's superior base quantisation makes more sense. ## Environment and Performance Configuration via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom binaries, `DEBUG=true` for verbose logging. Uses llama-cpp-python (auto-installed via uv), benefits from imatrix files, requires HuggingFace account only for uploads. Requirements scale predictably: disk needs ~3x model size (original, F32, outputs), memory tracks model size with streaming optimisations. Processing takes minutes to hours depending on size. Downloads range from gigabytes to 100GB+ for largest models. Comprehensive error handling: automatic retry with exponential backoff, early dependency detection, disk space checks, actionable API error messages, detailed conversion failure logs. Resilient workflow keeps you informed whilst handling large model processing challenges. ## Output and Organisation Outputs organised per model: F32/F16 base, quantisation variants, imatrix files, documentation. Naming pattern: `model-name-variant.gguf`. Successful uploads auto-clean local files; failures preserve for manual intervention. READMEs document variant characteristics and technical details. Uploads include metadata, quantisation tags, and model cards explaining trade-offs. Parallel upload system maximises throughput with full progress visibility.