quantise_gguf.py - Advanced GGUF Quantisation

quantise_gguf.py transforms language models into optimised GGUF formats, from aggressive Q2 compression to high-precision Q8_0. Its quantisation profiles are based on analysis of community quantisation patterns and achieve excellent quality-to-size ratios whilst working within Python-to-C++ interop constraints.

  1. The Full Picture
  2. Understanding the Variants
  3. Practical Usage
  4. The Architecture Behind the Magic
  5. Environment and Performance
  6. Output and Organisation

The Full Picture

GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp spectrum: the K-quant series (Q3_K to Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1), experimental integer types (IQ2 to IQ4), and full-precision F16/BF16. The key is understanding when each is the right strategic choice.

Replicating Bartowski's patterns revealed an interesting limitation. The llama-cpp-python bindings provide control over the embedding and output layers, but the more sophisticated tensor_types parameter expects a pointer to a C++ std::vector<tensor_quantization>, which cannot be created from Python. This architectural boundary between Python and C++ cannot be worked around without significant redesign.

Analysis of Bartowski's GGUF files shows this limitation matters less than it first appears. M variants already include per-layer enhancements: Q4_K_M uses Q6_K for the embeddings, attention V, and FFN down layers. Bartowski's L and XL variants only tweak the embedding and output layers, which are precisely the two we can control. The tool therefore works with the constraint rather than against it.
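
At the lowest level, those two overrides are plain enum fields, whereas tensor_types would require a C++ vector we cannot construct. A hedged sketch, assuming a recent llama-cpp-python build; the field names follow llama.cpp's llama_model_quantize_params struct and may vary between versions:

import llama_cpp

# Illustrative low-level quantisation call with the two reachable overrides
params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M        # base profile
params.token_embedding_type = llama_cpp.GGML_TYPE_Q8_0    # embedding override (settable from Python)
params.output_tensor_type = llama_cpp.GGML_TYPE_Q6_K      # output override (settable from Python)
# params.tensor_types would need a std::vector<tensor_quantization>* - not constructible from Python

llama_cpp.llama_model_quantize(
    b"model-f16.gguf",      # placeholder input path
    b"model-custom.gguf",   # placeholder output path
    params,
)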

For further optimisation, importance matrix (imatrix) files guide quantisation based on real usage patterns and outperform fixed rules. See the IMatrix Guide for obtaining or generating these files; they are particularly crucial at lower bit rates.
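
For reference, an imatrix is normally produced with llama.cpp's llama-imatrix tool against a calibration text. A sketch assuming a local llama.cpp build; file names are placeholders:

# Generate an importance matrix from a calibration corpus
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix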

Understanding the Variants

Our profiles match Bartowski's exact configurations, derived from GGUF analysis. M variants aren't a middle ground but optimised baselines: Q4_K_M uses Q6_K for critical layers whilst keeping Q4_K elsewhere, a balance proven through years of community experimentation.

L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19% size, better vocabulary handling). Q3_K_L upgrades the output layer to Q5_K. Q3_K_XL combines both strategies. No Q4_K_XL or Q5_K_XL variants exist: at those sizes, Q5_K_M's superior base quantisation wins.

Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for fine-grained size-quality control. See the Bartowski Analysis for the detailed architectural interactions.
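
Summarised as data, the variants above reduce to a base type plus optional embedding and output overrides. An illustrative sketch, not the tool's internal representation; the Q6_K_L base is inferred from the prose above:

# (base type, embedding override, output override); None keeps the base type's own choice
VARIANT_OVERRIDES = {
    "Q3_K_L":  ("Q3_K_M", None,   "Q5_K"),
    "Q3_K_XL": ("Q3_K_M", "Q8_0", "Q5_K"),
    "Q4_K_L":  ("Q4_K_M", "Q8_0", None),
    "Q5_K_L":  ("Q5_K_M", "Q8_0", None),
    "Q6_K_L":  ("Q6_K",   "Q8_0", None),
}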

Practical Usage

The tool handles the complete workflow: it fetches the model from HuggingFace, converts it to GGUF, checks for imatrix files, processes multiple variants with parallel uploads, generates documentation, and uploads everything with metadata. The design is fire-and-forget: start it and come back to completed models.

The Python API enables custom configurations (limited to embedding and output layers due to llama-cpp-python constraints):

from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",      # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",   # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None         # Keep default from base type
)

# Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",      # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,     # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K"       # Upgrade output from Q4_K to Q5_K
)

# Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",    # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0", # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K"     # Upgrade output from Q4_K to Q5_K
)

# Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",    # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0", # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0"     # Upgrade output to maximum precision Q8_0
)

Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:

# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B

# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise_gguf.py <model_url> --no-upload

# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K

The Architecture Behind the Magic

Based on analysis of Qwen3 4B: embeddings (9.7% of parameters) critically affect vocabulary handling; upgrading them from Q4 to Q8 adds just 0.17GB but dramatically improves rare-token quality. Attention (14.1% in total) has its V layers (4.7%) enhanced in M variants whilst Q and K stay at the base type for size control.
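
A rough back-of-envelope check of that 0.17GB figure, assuming approximate bits-per-weight values of 4.5 for Q4_K and 8.5 for Q8_0:

# Approximate cost of upgrading embeddings from Q4_K to Q8_0 on a ~4B-parameter model
total_params = 4e9
embedding_share = 0.097                  # embeddings as a fraction of all parameters
q4_k_bpw, q8_0_bpw = 4.5, 8.5            # approximate bits per weight

extra_bytes = total_params * embedding_share * (q8_0_bpw - q4_k_bpw) / 8
print(f"{extra_bytes / 1e9:.2f} GB")     # ~0.19 GB, in line with the quoted figure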

Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at the base type, as enhancing them would double the size for modest gains. Down projections (22.3%) are enhanced in M variants for feature-transformation quality. The output layer (9.4%) gets special attention in Q3_K_L for prediction quality.

For an 8B model: the Q4_K_M baseline is ~4.5GB with its Q6_K enhancements. Q4_K_L adds 753MB (5.3GB total) for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB, at which point Q5_K_M's superior base quantisation makes more sense.

Environment and Performance

Configuration is via environment variables: HF_TOKEN for uploads, LLAMA_CPP_DIR for custom binaries, and DEBUG=true for verbose logging. The tool uses llama-cpp-python (auto-installed via uv), benefits from imatrix files, and requires a HuggingFace account only for uploads.
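
For example (the token, paths, and model URL are placeholders; the variables are those named above):

# Upload-enabled run with verbose logging
HF_TOKEN=hf_xxx DEBUG=true uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B

# Local-only run against a custom llama.cpp build
LLAMA_CPP_DIR=/opt/llama.cpp uv run quantise_gguf.py <model_url> --no-upload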

Requirements scale predictably: disk needs roughly 3x the model size (original download, F32 conversion, and outputs), and memory tracks model size thanks to streaming optimisations. Processing takes minutes to hours depending on model size, and downloads range from a few gigabytes to 100GB+ for the largest models.

Error handling is comprehensive: automatic retry with exponential backoff, early dependency detection, disk space checks, actionable API error messages, and detailed conversion failure logs. The workflow stays resilient and keeps you informed whilst handling the challenges of large model processing.
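
The retry behaviour follows the usual exponential backoff pattern; a minimal illustrative sketch, not the tool's actual code:

import time

def with_retries(operation, max_attempts=5, base_delay=2.0):
    """Retry a flaky operation (e.g. an upload) with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                                    # out of attempts, surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))  # wait 2s, 4s, 8s, ...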

Output and Organisation

Outputs are organised per model: the F32/F16 base, quantisation variants, imatrix files, and documentation, named following the pattern model-name-variant.gguf. Successful uploads auto-clean local files; failures are preserved for manual intervention. Generated READMEs document variant characteristics and technical details.
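
An illustrative layout for one model (directory and file names are hypothetical, following the naming pattern above):

Llama-3.2-1B/
├── Llama-3.2-1B-f16.gguf
├── Llama-3.2-1B-Q4_K_M.gguf
├── Llama-3.2-1B-Q4_K_L.gguf
├── Llama-3.2-1B-Q6_K.gguf
├── Llama-3.2-1B.imatrix
└── README.md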

Uploads include metadata, quantisation tags, and model cards explaining the trade-offs. The parallel upload system maximises throughput with full progress visibility.