# quantise.py - Advanced GGUF Quantisation

Advanced GGUF quantisation tool implementing Bartowski's quantisation pipeline.

## Overview

This tool automates the complete quantisation workflow for converting models to GGUF format with multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.

## Quantisation Variants

The tool produces four quantisation variants based on Bartowski's method (a sketch of the per-tensor overrides involved appears under Worked Examples below):

- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention layers for maximum precision

## Features

- **Automatic model download**: Downloads models from HuggingFace automatically
- **Importance matrix generation**: Creates an imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata

## Usage

### Basic Usage

```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```

### Command Line Options

```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise.py <model_url> --no-upload

# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models

# Use a specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```

## Environment Variables

- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"

## Requirements

- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)

## Workflow

The pipeline runs the following steps in order; a manual command-line equivalent is sketched under Worked Examples at the end of this document.

1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates an importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches

## Output Structure

```plain
output_dir/
├── model-F32.gguf           # Full precision conversion
├── model-Q4_K_M.gguf        # Standard quantisation
├── model-Q4_K_M-imat.gguf   # With importance matrix
├── model-Q4_K_L-imat.gguf   # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf  # High precision embeddings
├── model-Q4_K_XXL-imat.gguf # Maximum precision
└── imatrix.dat              # Generated importance matrix
```

## Error Handling

The tool includes comprehensive error handling for:

- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures

## Performance Considerations

- **Disk space**: Requires roughly 3x the model size in free space (e.g. a 16 GB model needs about 48 GB free for the F32 conversion plus variants)
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours depending on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
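## Worked Examples

The sketches below illustrate the behaviour described above; they are not the tool's actual implementation. Model IDs, repository names, paths, and some flags are illustrative assumptions - check them against your llama.cpp build before relying on them.

### Checking requirements

A quick preflight for the binaries and calibration data listed under Requirements (the tool performs its own checks for missing dependencies internally):

```bash
# Verify the llama.cpp binaries listed under Requirements are on PATH
for bin in llama-quantize llama-cli llama-imatrix; do
  command -v "$bin" >/dev/null 2>&1 || echo "missing binary: $bin"
done

# Verify the calibration data used for imatrix generation exists
[ -f resources/imatrix_data.txt ] || echo "missing calibration data"
```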
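### Environment variables in practice

A typical invocation combining the environment variables above; the token value and build path are placeholders:

```bash
export HF_TOKEN=hf_xxxxxxxx                       # placeholder token with write access
export LLAMA_CPP_DIR="$HOME/llama.cpp/build/bin"  # illustrative custom binary path
DEBUG=true uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```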
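### Manual workflow equivalent

The six workflow steps map roughly onto the following llama.cpp and HuggingFace CLI invocations. This is a sketch of the idea, not the tool's internals: `convert_hf_to_gguf.py` ships with llama.cpp rather than with this tool, the model and repository IDs are examples, and the upload step requires a repository you can write to.

```bash
# 1. Download the source model (illustrative model ID)
huggingface-cli download meta-llama/Llama-3.2-1B --local-dir ./Llama-3.2-1B

# 2. Convert to full-precision GGUF with llama.cpp's converter script
python convert_hf_to_gguf.py ./Llama-3.2-1B --outtype f32 --outfile model-F32.gguf

# 3. Generate the importance matrix from calibration data
llama-imatrix -m model-F32.gguf -f resources/imatrix_data.txt -o imatrix.dat

# 4. Produce the baseline Q4_K_M variant with the imatrix applied
llama-quantize --imatrix imatrix.dat model-F32.gguf model-Q4_K_M-imat.gguf Q4_K_M

# 5. Upload the result (illustrative repository; requires HF_TOKEN)
huggingface-cli upload my-user/Llama-3.2-1B-GGUF model-Q4_K_M-imat.gguf

# 6. Clean up the full-precision intermediate once all variants are built
rm model-F32.gguf
```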
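### Per-tensor overrides for the enhanced variants

The L/XL/XXL variants differ from the baseline only in the types used for the embedding and attention tensors. `--token-embedding-type` is a documented `llama-quantize` flag; the `--tensor-type` pattern overrides exist only in newer llama.cpp builds, and the exact attention tensor patterns below are assumptions, so verify with `llama-quantize --help` first:

```bash
# Q4_K_XL-style sketch: Q8_0 token embeddings, Q6_K attention tensors,
# Q4_K_M everywhere else. Flag availability varies between llama.cpp
# releases; the attn_* patterns below are assumptions, not confirmed
# to match this tool's exact command lines.
llama-quantize --imatrix imatrix.dat \
  --token-embedding-type Q8_0 \
  --tensor-type attn_q=Q6_K --tensor-type attn_k=Q6_K --tensor-type attn_v=Q6_K \
  model-F32.gguf model-Q4_K_XL-imat.gguf Q4_K_M
```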