# quantise.py - Advanced GGUF Quantisation

An advanced GGUF quantisation tool implementing Bartowski's quantisation pipeline.
## Overview

This tool automates the complete quantisation workflow: it converts models to GGUF format in multiple precision variants, generates an importance matrix, and uploads the results to HuggingFace.
## Quantisation Variants

The tool produces four quantisation variants based on Bartowski's method:

- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention layers for maximum precision
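
A minimal sketch of how these variants could be represented internally, assuming a simple mapping from variant name to embedding and attention tensor types; the names and structure are illustrative, not quantise.py's actual code:

```python
# Illustrative only: one possible encoding of the four variants as
# (embedding type, attention type) overrides on a Q4_K_M base.
VARIANT_OVERRIDES: dict[str, tuple[str | None, str | None]] = {
    "Q4_K_M":   (None,   None),    # baseline, no overrides
    "Q4_K_L":   ("Q6_K", "Q6_K"),  # better quality
    "Q4_K_XL":  ("Q8_0", "Q6_K"),  # enhanced precision
    "Q4_K_XXL": ("Q8_0", "Q8_0"),  # maximum precision
}

for name, (embed, attn) in VARIANT_OVERRIDES.items():
    print(f"{name}: embeddings={embed or 'default'}, attention={attn or 'default'}")
```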
## Features

- **Automatic model download**: Fetches models directly from HuggingFace
- **Importance matrix generation**: Creates an imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously (see the sketch after this list)
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata
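
As a rough illustration of the parallel upload, here is a sketch using `huggingface_hub`'s `HfApi.upload_file` (a real library call); the repo name, worker count, and overall structure are assumptions, not the tool's actual implementation:

```python
# Sketch of parallel variant upload; assumes huggingface_hub is installed.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from huggingface_hub import HfApi

def upload_variant(api: HfApi, repo_id: str, gguf: Path) -> str:
    # upload_file streams the local file to the Hub at path_in_repo.
    api.upload_file(
        path_or_fileobj=str(gguf),
        path_in_repo=gguf.name,
        repo_id=repo_id,
    )
    return gguf.name

def upload_all(repo_id: str, ggufs: list[Path]) -> None:
    api = HfApi()  # resolves the token from HF_TOKEN or a cached login
    with ThreadPoolExecutor(max_workers=4) as pool:
        for name in pool.map(lambda g: upload_variant(api, repo_id, g), ggufs):
            print(f"uploaded {name}")
```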
## Usage

### Basic Usage

```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```
### Command Line Options

```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise.py <model_url> --no-upload

# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models

# Use a specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```
### Environment Variables

- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"
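
In Python these would typically be read once at startup. A sketch; the fallback defaults are assumptions rather than documented behaviour:

```python
import logging
import os

HF_TOKEN = os.environ.get("HF_TOKEN")            # needed only for uploads
LLAMA_CPP_DIR = os.environ.get("LLAMA_CPP_DIR")  # custom llama.cpp binary path
DEBUG = os.environ.get("DEBUG", "").lower() == "true"

if DEBUG:
    logging.basicConfig(level=logging.DEBUG)
```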
## Requirements

- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: for uploading quantised models (optional)
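
A pre-flight check for the binary requirement might look like this sketch; the `LLAMA_CPP_DIR`-before-`PATH` lookup order is an assumption based on the environment variable described above:

```python
# Sketch: verify the required llama.cpp binaries before starting work.
import os
import shutil
from pathlib import Path

REQUIRED = ("llama-quantize", "llama-cli", "llama-imatrix")

def find_binary(name: str) -> str | None:
    custom = os.environ.get("LLAMA_CPP_DIR")
    if custom and (Path(custom) / name).exists():
        return str(Path(custom) / name)
    return shutil.which(name)  # fall back to PATH

missing = [name for name in REQUIRED if find_binary(name) is None]
if missing:
    raise SystemExit(f"missing llama.cpp binaries: {', '.join(missing)}")
```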
## Workflow

1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to an initial full-precision GGUF (F32)
3. **Generate imatrix**: Creates an importance matrix using the calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
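
The middle steps map closely onto stock llama.cpp tooling. The following condensed sketch shows steps 2-4 as subprocess calls; the paths are placeholders and the exact invocations follow llama.cpp's standard CLI conventions rather than quantise.py's verified internals:

```python
import subprocess
from pathlib import Path

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)  # raise CalledProcessError on failure

model_dir = Path("downloaded-model")  # step 1: already fetched from HuggingFace
f32 = Path("model-F32.gguf")
imatrix = Path("imatrix.dat")

# Step 2: convert to a full-precision GGUF (convert_hf_to_gguf.py ships with llama.cpp)
run("python", "convert_hf_to_gguf.py", str(model_dir),
    "--outfile", str(f32), "--outtype", "f32")

# Step 3: build the importance matrix from the calibration text
run("llama-imatrix", "-m", str(f32),
    "-f", "resources/imatrix_data.txt", "-o", str(imatrix))

# Step 4: quantise one variant (repeated per variant, possibly in parallel)
run("llama-quantize", "--imatrix", str(imatrix),
    str(f32), "model-Q4_K_M-imat.gguf", "Q4_K_M")
```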
## Output Structure

```
output_dir/
├── model-F32.gguf            # Full-precision conversion
├── model-Q4_K_M.gguf         # Standard quantisation
├── model-Q4_K_M-imat.gguf    # With importance matrix
├── model-Q4_K_L-imat.gguf    # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf   # High-precision embeddings
├── model-Q4_K_XXL-imat.gguf  # Maximum precision
└── imatrix.dat               # Generated importance matrix
```
## Error Handling
The tool includes comprehensive error handling for:
- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures
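
A sketch of the kind of wrapper this implies, distinguishing missing binaries from failed conversions (illustrative only, not the tool's actual code):

```python
import subprocess

def run_step(name: str, cmd: list[str]) -> None:
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except FileNotFoundError:
        # binary missing entirely (e.g. llama.cpp not installed or not on PATH)
        raise SystemExit(f"{name}: command not found: {cmd[0]}")
    except subprocess.CalledProcessError as err:
        # surface the tool's stderr so conversion failures are diagnosable
        raise SystemExit(f"{name} failed:\n{err.stderr}")
```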
## Performance Considerations

- **Disk space**: Requires ~3x the model size in free space (see the check sketched below)
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours depending on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
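
The disk space rule of thumb can be turned into a pre-flight check. A sketch; the 3x multiplier mirrors the list above and is a heuristic, not a guarantee:

```python
import shutil
from pathlib import Path

def check_disk(model_bytes: int, workdir: Path, factor: float = 3.0) -> None:
    free = shutil.disk_usage(workdir).free
    needed = int(model_bytes * factor)
    if free < needed:
        raise SystemExit(
            f"need ~{needed / 2**30:.1f} GiB free, "
            f"but only {free / 2**30:.1f} GiB available in {workdir}"
        )
```

Calling this with the downloaded model's size before step 2 of the workflow would fail fast instead of running out of space mid-conversion.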