Switch to llama-cpp-python

This commit is contained in:
Tom Foster 2025-08-08 21:40:15 +01:00
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions


@@ -1,102 +1,151 @@
# quantise.py - Advanced GGUF Quantisation
# quantise_gguf.py - Advanced GGUF Quantisation
Advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.
Transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.
## Overview
1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)
This tool automates the complete quantisation workflow for converting models to GGUF format with
multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.
## The Full Picture
## Quantisation Variants
GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1), experimental
integer types (IQ2-IQ4), and full-precision F16/BF16. The key is knowing when each is worth using.
The tool produces four quantisation variants based on Bartowski's method:
Replicating Bartowski's patterns revealed an interesting limitation. The llama-cpp-python bindings provide
embedding and output layer control, but the more sophisticated `tensor_types` parameter expects a pointer
to a C++ `std::vector<tensor_quantization>`, which cannot be constructed from Python. This architectural
boundary between Python and C++ cannot be worked around without significant redesign.
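What the bindings do expose still covers the L and XL recipes. Below is a rough sketch of the low-level
route, assuming a recent llama-cpp-python release that surfaces `token_embedding_type` and
`output_tensor_type` on `llama_model_quantize_params`; constant and field names may vary between versions.
```python
# Illustrative only - not the tool's own helper. It shows the two per-tensor
# knobs reachable from Python; the per-layer tensor_types vector is C++-only.
import llama_cpp

params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M       # base recipe
params.token_embedding_type = llama_cpp.GGML_TYPE_Q8_0   # upgrade embeddings
params.output_tensor_type = llama_cpp.GGML_TYPE_Q6_K     # upgrade output head
params.nthread = 8

# llama_model_quantize expects byte paths; a return value of 0 means success.
ret = llama_cpp.llama_model_quantize(
    b"model-f16.gguf", b"model-Q4_K_L.gguf", params
)
if ret != 0:
    raise RuntimeError(f"quantisation failed with status {ret}")
```
The project's `LlamaCppPythonAPI` helper (shown under Practical Usage) wraps the same two controls at
profile level.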
- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention for maximum precision
Analysis of Bartowski's GGUF files shows this limitation doesn't matter. M variants already include
per-layer enhancements: Q4_K_M uses Q6_K for embeddings, attention V, and FFN down layers.
Bartowski's L and XL variants only tweak embeddings and output layers, which is precisely what we can
control. It's a case of working with the constraints rather than against them.
## Features
For further optimisation, importance matrix (imatrix) files guide quantisation based on usage
patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining or
generating these files; they are particularly crucial at lower bit rates.
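If you need to produce an imatrix yourself, the llama.cpp binary route still works. A minimal sketch,
assuming `llama-imatrix` is on your `PATH` and the calibration text sits at `resources/imatrix_data.txt`:
```python
# Illustrative sketch: drive llama.cpp's llama-imatrix binary from Python to
# build an importance matrix from the calibration corpus.
import subprocess

subprocess.run(
    [
        "llama-imatrix",
        "-m", "model-f16.gguf",              # F16/F32 GGUF conversion of the model
        "-f", "resources/imatrix_data.txt",  # calibration text
        "-o", "imatrix.dat",                 # importance matrix output
    ],
    check=True,  # raise CalledProcessError on a non-zero exit code
)
```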
- **Automatic model download**: Downloads models from HuggingFace automatically
- **Importance matrix generation**: Creates imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata
## Understanding the Variants
## Usage
Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't a middle
ground but optimised baselines: Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K
elsewhere, a balance proven through years of community experimentation.
### Basic Usage
L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary handling). Q3_K_L upgrades the output layer to Q5_K. Q3_K_XL combines both
strategies. No Q4_K_XL or Q5_K_XL variants exist: at those sizes, Q5_K_M's superior base quantisation wins.
```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```
Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed
architectural interactions.
## Practical Usage
The tool handles the complete workflow: fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. It's a fire-and-forget design: start it and come back to completed models.
The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):
```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None,       # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0",     # Upgrade output to maximum precision Q8_0
)
```
### Command Line Options
Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:
```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B
# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix
# Local testing without upload
uv run quantise.py <model_url> --no-upload
uv run quantise_gguf.py <model_url> --no-upload
# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models
# Use specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```
## Environment Variables
## The Architecture Behind the Magic
- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"
Based on Qwen3 4B analysis: embeddings (9.7% of parameters) critically affect vocabulary handling, and
upgrading them from Q4 to Q8 adds just 0.17GB whilst dramatically improving rare-token quality. Attention
(14.1% in total) has its V layers (4.7%) enhanced in M variants whilst Q and K stay at base for size control.
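That trade-off is easy to sanity-check with back-of-the-envelope arithmetic. A sketch assuming
Qwen3-4B-like dimensions (roughly 152k vocabulary and 2,560-wide embeddings, both illustrative) and
llama.cpp's nominal bits per weight:
```python
# Back-of-the-envelope cost of upgrading the embedding tensor.
# Nominal llama.cpp densities: Q4_K ~4.5, Q6_K ~6.5625, Q8_0 = 8.5 bits per weight.
VOCAB_SIZE = 152_000   # assumed Qwen3-style vocabulary (illustrative)
HIDDEN_SIZE = 2_560    # assumed embedding width (illustrative)

embed_params = VOCAB_SIZE * HIDDEN_SIZE  # ~0.39B weights, roughly 9.7% of a 4B model

def tensor_gb(n_weights: int, bits_per_weight: float) -> float:
    """Approximate tensor size in GB at a given quantisation density."""
    return n_weights * bits_per_weight / 8 / 1e9

print(f"Embeddings at Q4_K: {tensor_gb(embed_params, 4.5):.2f} GB")
print(f"Embeddings at Q8_0: {tensor_gb(embed_params, 8.5):.2f} GB")
# Difference is ~0.2 GB - the same ballpark as the 0.17GB figure quoted above.
print(f"Upgrade cost:       {tensor_gb(embed_params, 8.5 - 4.5):.2f} GB")
```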
## Requirements
Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at base,
as enhancing them would double the size for modest gains. Down projections (22.3%) are enhanced in M
variants for feature-transformation quality. The output layer (9.4%) gets special attention in Q3_K_L
for prediction quality.
- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)
For an 8B model: Q4_K_M baseline is ~4.5GB with Q6_K enhancements. Q4_K_L adds 753MB (5.3GB total)
for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB, at which point Q5_K_M's superior
base quantisation makes more sense.
## Workflow
## Environment and Performance
1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
Configuration is via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, and `DEBUG=true` for verbose logging. The tool uses llama-cpp-python (auto-installed via uv),
benefits from imatrix files, and requires a HuggingFace account only for uploads.
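As a purely illustrative sketch of how those variables are typically consumed (this is not the tool's
own code):
```python
# Illustrative only: reading the documented environment variables.
import os
from pathlib import Path

hf_token = os.environ.get("HF_TOKEN")            # optional - needed only for uploads
llama_cpp_dir = os.environ.get("LLAMA_CPP_DIR")  # optional custom binary location
debug = os.environ.get("DEBUG", "").lower() == "true"

binary_dir = Path(llama_cpp_dir) if llama_cpp_dir else None
print(f"uploads: {'enabled' if hf_token else 'disabled'}, debug logging: {debug}")
```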
## Output Structure
Requirements scale predictably: disk needs ~3x the model size (original, F32 conversion, outputs), while
memory tracks the model size with streaming optimisations. Processing takes minutes to hours depending on
model size, and downloads range from gigabytes to 100GB+ for the largest models.
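The ~3x figure is worth verifying before committing to a multi-hour run. An illustrative pre-flight
check, not the tool's actual implementation:
```python
# Illustrative pre-flight check: require roughly 3x the source model's size free,
# covering the original download, the F32/F16 conversion, and the quantised outputs.
import shutil
from pathlib import Path

def enough_disk(model_dir: str, work_dir: str, multiplier: float = 3.0) -> bool:
    """Return True if work_dir has at least `multiplier` x the model's size free."""
    model_bytes = sum(f.stat().st_size for f in Path(model_dir).rglob("*") if f.is_file())
    return shutil.disk_usage(work_dir).free >= model_bytes * multiplier

if not enough_disk("./downloads/Llama-3.2-1B", "."):
    raise SystemExit("Not enough free disk space for conversion and quantisation")
```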
```plain
output_dir/
├── model-F32.gguf # Full precision conversion
├── model-Q4_K_M.gguf # Standard quantisation
├── model-Q4_K_M-imat.gguf # With importance matrix
├── model-Q4_K_L-imat.gguf # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf # High precision embeddings
├── model-Q4_K_XXL-imat.gguf # Maximum precision
└── imatrix.dat # Generated importance matrix
```
Error handling is comprehensive: automatic retry with exponential backoff, early dependency detection,
disk space checks, actionable API error messages, and detailed conversion failure logs. The resilient
workflow keeps you informed whilst handling the challenges of large-model processing.
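The retry behaviour follows the standard exponential backoff pattern; a minimal generic sketch of the
idea (the tool's own upload helpers may differ in detail):
```python
# Generic retry-with-exponential-backoff wrapper, illustrating the behaviour
# described above rather than reproducing the tool's implementation.
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 2.0):
    """Call fn(), retrying on failure with delays of 2s, 4s, 8s, ..."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # in practice, catch specific network/API errors
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```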
## Error Handling
## Output and Organisation
The tool includes comprehensive error handling for:
Outputs are organised per model: F32/F16 base, quantisation variants, imatrix files, and documentation.
The naming pattern is `model-name-variant.gguf`. Successful uploads auto-clean local files; failures
preserve them for manual intervention. READMEs document variant characteristics and technical details.
- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures
## Performance Considerations
- **Disk space**: Requires ~3x model size in free space
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours based on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
Uploads include metadata, quantisation tags, and model cards explaining the trade-offs. The parallel
upload system maximises throughput with full progress visibility.
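For reference, a parallel upload loop along these lines can be sketched with `huggingface_hub` and a
thread pool; the repository name, output directory, and worker count below are assumptions, and this is
not the tool's actual upload code:
```python
# Illustrative parallel upload of quantised variants with huggingface_hub.
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from huggingface_hub import HfApi

api = HfApi()  # token resolved from HF_TOKEN or a cached login
repo_id = "your-username/Llama-3.2-1B-GGUF"  # assumed target repository

variants = sorted(Path("./output").glob("*.gguf"))

def upload(path: Path) -> str:
    api.upload_file(
        path_or_fileobj=str(path),
        path_in_repo=path.name,
        repo_id=repo_id,
        repo_type="model",
    )
    return path.name

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(upload, p): p for p in variants}
    for future in as_completed(futures):
        print(f"Uploaded {future.result()}")
```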