# quantise.py - Advanced GGUF Quantisation

Advanced GGUF quantisation tool implementing Bartowski's quantisation pipeline.
## Overview

This tool automates the complete quantisation workflow for converting models to GGUF format with multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.
## Quantisation Variants

The tool produces four quantisation variants based on Bartowski's method:

- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention layers for maximum precision
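The variant table above can be captured as a small lookup, which is handy when scripting around the tool's output. The dict layout and `describe` helper below are illustrative only, not the tool's actual internals:

```python
# Illustrative mapping of each variant to its tensor-type overrides.
# The variant names come from the tool; the data layout is hypothetical.
VARIANTS = {
    "Q4_K_M": {"embeddings": None, "attention": None},  # baseline, no overrides
    "Q4_K_L": {"embeddings": "Q6_K", "attention": "Q6_K"},
    "Q4_K_XL": {"embeddings": "Q8_0", "attention": "Q6_K"},
    "Q4_K_XXL": {"embeddings": "Q8_0", "attention": "Q8_0"},
}

def describe(variant: str) -> str:
    """Return a human-readable summary of a variant's overrides."""
    cfg = VARIANTS[variant]
    if cfg["embeddings"] is None:
        return f"{variant}: standard quantisation, no overrides"
    return f"{variant}: {cfg['embeddings']} embeddings + {cfg['attention']} attention"
```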
## Features

- **Automatic model download**: Downloads models from HuggingFace automatically
- **Importance matrix generation**: Creates imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata
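The parallel upload feature can be sketched with a thread pool, since uploads are I/O-bound. This is a minimal illustration, not the tool's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel uploads: each quantisation variant is pushed in its
# own worker thread. `upload_fn` stands in for the real upload call.
def upload_all(paths, upload_fn, max_workers=4):
    """Upload every file in `paths` concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_fn, paths))
```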
## Usage

### Basic Usage

```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```
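Internally, a URL like the one above has to be reduced to a `org/model` repo ID before anything can be downloaded. A hypothetical helper (the function name is illustrative) might look like:

```python
from urllib.parse import urlparse

# Hypothetical helper: derive the repo ID ("org/model") from the
# HuggingFace URL passed on the command line.
def repo_id_from_url(url: str) -> str:
    path = urlparse(url).path.strip("/")
    org, model = path.split("/")[:2]
    return f"{org}/{model}"
```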
### Command Line Options

```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise.py <model_url> --no-upload

# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models

# Use specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```
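The options above map naturally onto an `argparse` parser. This is a minimal sketch mirroring the documented flags; the default output directory shown here is an assumption, and the real parser may define more options:

```python
import argparse

# Minimal argparse sketch mirroring the documented options.
# The "./output" default is assumed, not taken from the tool.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Advanced GGUF quantisation")
    parser.add_argument("model_url", help="HuggingFace model URL")
    parser.add_argument("--no-imatrix", action="store_true",
                        help="skip imatrix generation for faster processing")
    parser.add_argument("--no-upload", action="store_true",
                        help="quantise locally without uploading")
    parser.add_argument("--output-dir", default="./output",
                        help="where to write the GGUF files")
    parser.add_argument("--hf-token", default=None,
                        help="HuggingFace API token")
    return parser
```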
## Environment Variables

- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"
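How these variables might be read at startup (the `load_config` helper and its defaults are illustrative):

```python
import os

# Illustrative startup config: each key corresponds to one of the
# documented environment variables.
def load_config() -> dict:
    return {
        "hf_token": os.environ.get("HF_TOKEN"),            # may also come from --hf-token
        "llama_cpp_dir": os.environ.get("LLAMA_CPP_DIR"),  # None means "look on PATH"
        "debug": os.environ.get("DEBUG", "").lower() == "true",
    }
```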
## Requirements

- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)
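A pre-flight check for the required binaries can be done with `shutil.which`; this sketch assumes the binaries are on `PATH` (the tool may instead honour `LLAMA_CPP_DIR`):

```python
import shutil

# Sketch of a pre-flight check for the required llama.cpp binaries.
REQUIRED_BINARIES = ("llama-quantize", "llama-cli", "llama-imatrix")

def missing_binaries(names=REQUIRED_BINARIES):
    """Return the subset of `names` not found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```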
## Workflow

1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
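The six steps above form a linear pipeline. The step functions in this sketch are stand-in stubs that only record what would happen; the real tool shells out to llama.cpp and the HuggingFace API instead:

```python
# Stand-in stubs: each records its step name instead of doing real work.
log = []

def download(url): log.append("download"); return "model_dir"
def convert_to_gguf(d): log.append("convert"); return "model-F32.gguf"
def generate_imatrix(p): log.append("imatrix"); return "imatrix.dat"
def quantise_all(p, im): log.append("quantise"); return ["model-Q4_K_M.gguf"]
def upload_variants(v): log.append("upload")
def cleanup(d): log.append("cleanup")

# The documented workflow as a linear pipeline (function names illustrative).
def run_pipeline(model_url, *, use_imatrix=True, upload=True):
    model_dir = download(model_url)                                 # 1. Download
    f32_path = convert_to_gguf(model_dir)                           # 2. Convert to F32 GGUF
    imatrix = generate_imatrix(f32_path) if use_imatrix else None   # 3. Generate imatrix
    variants = quantise_all(f32_path, imatrix)                      # 4. Quantise variants
    if upload:
        upload_variants(variants)                                   # 5. Upload
    cleanup(model_dir)                                              # 6. Clean up
    return variants
```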
## Output Structure

```plain
output_dir/
├── model-F32.gguf            # Full precision conversion
├── model-Q4_K_M.gguf         # Standard quantisation
├── model-Q4_K_M-imat.gguf    # With importance matrix
├── model-Q4_K_L-imat.gguf    # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf   # High precision embeddings
├── model-Q4_K_XXL-imat.gguf  # Maximum precision
└── imatrix.dat               # Generated importance matrix
```
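The naming scheme above is regular enough to reproduce with a small helper (hypothetical, shown here to make the pattern explicit):

```python
from pathlib import Path

# Hypothetical helper reproducing the naming scheme shown above:
# <model>-<variant>[-imat].gguf inside the output directory.
def output_path(output_dir, model, variant, imatrix=True):
    suffix = "-imat" if imatrix else ""
    return Path(output_dir) / f"{model}-{variant}{suffix}.gguf"
```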
## Error Handling

The tool includes comprehensive error handling for:

- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures
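The disk-space case, for example, can be caught before any work starts. This sketch is illustrative; the 3x factor comes from the Performance Considerations section below:

```python
import shutil

# Illustrative pre-flight disk check. The tool needs roughly 3x the
# model size free; "3x" is the figure documented below.
def check_disk_space(path, model_size_bytes, factor=3):
    free = shutil.disk_usage(path).free
    if free < factor * model_size_bytes:
        raise RuntimeError(
            f"Need {factor * model_size_bytes} bytes free at {path}, have {free}"
        )
```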
## Performance Considerations

- **Disk space**: Requires ~3x model size in free space
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours based on model size
- **Network**: Downloads can be large (10-100+ GB for large models)