
# quantise.py - Advanced GGUF Quantisation

An advanced GGUF quantisation tool implementing Bartowski's quantisation pipeline.

## Overview

This tool automates the complete quantisation workflow: downloading a model from HuggingFace, converting it to GGUF, generating an importance matrix, producing multiple precision variants, and uploading the results back to HuggingFace.

## Quantisation Variants

The tool produces four quantisation variants based on Bartowski's method (a configuration sketch follows this list):

- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention layers for maximum precision
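
One way to picture the variants is as per-tensor type overrides on top of a Q4_K_M base. The sketch below is illustrative only; the mapping and names are assumptions, not the tool's actual internals.

```python
# Hypothetical representation of the four variants as overrides on a
# Q4_K_M base; None means the tensor keeps the base quantisation type.
VARIANTS: dict[str, dict[str, str | None]] = {
    "Q4_K_M":   {"embeddings": None,   "attention": None},
    "Q4_K_L":   {"embeddings": "Q6_K", "attention": "Q6_K"},
    "Q4_K_XL":  {"embeddings": "Q8_0", "attention": "Q6_K"},
    "Q4_K_XXL": {"embeddings": "Q8_0", "attention": "Q8_0"},
}
```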

## Features

- **Automatic model download**: Fetches models directly from HuggingFace
- **Importance matrix generation**: Creates an imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously (sketched after this list)
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata
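
As a sketch of the parallel upload step: uploads are network-bound, so overlapping them with a thread pool is a natural fit. This uses the real `huggingface_hub` API, but the repository name and file layout below are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from huggingface_hub import HfApi

api = HfApi()  # resolves the token from HF_TOKEN in the environment

def upload_variant(path: Path) -> str:
    # upload_file is the standard huggingface_hub upload call
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=path.name,
        repo_id="your-username/model-GGUF",  # hypothetical repo
    )
    return path.name

gguf_paths = sorted(Path("output_dir").glob("*.gguf"))
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(upload_variant, p) for p in gguf_paths]
    for done in as_completed(futures):
        print(f"Uploaded {done.result()}")
```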

## Usage

### Basic Usage

```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```

### Command Line Options

```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise.py <model_url> --no-upload

# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models

# Use a specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```

## Environment Variables

The tool reads the following variables (a sketch of how follows this list):

- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to `"true"`
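
A minimal sketch of how these might be consumed; the variable names come from this section, while the fallback default is an assumption.

```python
import logging
import os

# Variable names are from the docs above; the default path is an assumption.
hf_token = os.environ.get("HF_TOKEN")  # only needed when uploading
llama_cpp_dir = os.environ.get("LLAMA_CPP_DIR", "/usr/local/bin")

if os.environ.get("DEBUG", "").lower() == "true":
    logging.basicConfig(level=logging.DEBUG)
```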

## Requirements

- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix` (a fail-fast check is sketched below)
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)
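
A pre-flight check along these lines catches missing binaries before any work starts; this is a sketch, not the tool's actual startup code.

```python
import shutil

REQUIRED_BINARIES = ("llama-quantize", "llama-cli", "llama-imatrix")

# Fail fast if any required llama.cpp binary is missing from PATH.
missing = [name for name in REQUIRED_BINARIES if shutil.which(name) is None]
if missing:
    raise SystemExit(f"Missing llama.cpp binaries: {', '.join(missing)}")
```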

## Workflow

1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates an importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel (steps 3 and 4 are sketched after this list)
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
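
Steps 3 and 4 map onto the standard llama.cpp CLI invocations. The sketch below uses the real `llama-imatrix` and `llama-quantize` flags, with file names taken from the Output Structure section; how the tool actually wires these together is an assumption.

```python
import subprocess
from pathlib import Path

out = Path("output_dir")
f32 = out / "model-F32.gguf"
imatrix = out / "imatrix.dat"

# Step 3: generate the importance matrix from calibration data.
subprocess.run(
    ["llama-imatrix", "-m", str(f32),
     "-f", "resources/imatrix_data.txt", "-o", str(imatrix)],
    check=True,
)

# Step 4: quantise with the importance matrix applied.
subprocess.run(
    ["llama-quantize", "--imatrix", str(imatrix),
     str(f32), str(out / "model-Q4_K_M-imat.gguf"), "Q4_K_M"],
    check=True,
)
```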

## Output Structure

```text
output_dir/
├── model-F32.gguf           # Full precision conversion
├── model-Q4_K_M.gguf        # Standard quantisation
├── model-Q4_K_M-imat.gguf   # With importance matrix
├── model-Q4_K_L-imat.gguf   # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf  # High precision embeddings
├── model-Q4_K_XXL-imat.gguf # Maximum precision
└── imatrix.dat              # Generated importance matrix
```

## Error Handling

The tool includes comprehensive error handling (illustrated after this list) for:

- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures
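
For instance, wrapping a subprocess step surfaces two of these failure modes distinctly; a minimal sketch, not the tool's actual code:

```python
import subprocess

try:
    subprocess.run(
        ["llama-quantize", "in.gguf", "out.gguf", "Q4_K_M"],
        check=True, capture_output=True, text=True,
    )
except FileNotFoundError:
    # Missing binary: llama-quantize is not on PATH (or LLAMA_CPP_DIR is wrong).
    raise SystemExit("llama-quantize not found; check LLAMA_CPP_DIR")
except subprocess.CalledProcessError as err:
    # Conversion failure: the binary ran but exited non-zero.
    raise SystemExit(f"Quantisation failed: {err.stderr.strip()}")
```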

## Performance Considerations

- **Disk space**: Requires roughly 3x the model size in free space (a pre-flight check is sketched below)
- **Memory**: Needs RAM proportional to the model size
- **Processing time**: Ranges from minutes to hours depending on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
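
Given the ~3x rule of thumb, a pre-flight disk check is cheap insurance; the multiplier comes from this section, while the example size and path are assumptions.

```python
import shutil

model_size_bytes = 2 * 1024**3   # illustrative: a ~2 GB model
required = 3 * model_size_bytes  # ~3x rule of thumb from above

free = shutil.disk_usage(".").free
if free < required:
    raise SystemExit(
        f"Insufficient disk space: need ~{required / 1024**3:.1f} GiB free"
    )
```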