Switch to llama-cpp-python

parent ef7df1a8c3 · commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

# quantise_gguf.py - Advanced GGUF Quantisation

An advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.
It transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.

## Overview

1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)

## The Full Picture

GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1), experimental
integer types (IQ2-IQ4), and full precision F16/BF16. The key is understanding strategic usage.

Replicating Bartowski's patterns revealed an interesting limitation. Llama-cpp-python provides
embedding and output layer control, but the sophisticated `tensor_types` parameter expects a C++
`std::vector<tensor_quantization>` pointer – impossible to create from Python. This architectural
boundary between Python and C++ cannot be worked around without significant redesign.

Analysis of Bartowski's GGUF files shows this limitation doesn't matter. M variants already include
per-layer enhancements – Q4_K_M uses Q6_K for embeddings, attention V, and FFN down layers.
Bartowski's L and XL variants only tweak embeddings and output layers, precisely what we can control.
Working with constraints rather than against them.
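
As an illustration of where that boundary sits, the sketch below shows roughly how much control the
low-level llama-cpp-python binding exposes. It is an assumption-laden example rather than the
tool's own code: the enum and field names follow llama.cpp's `llama_model_quantize_params` and may
differ between versions.

```python
import ctypes

import llama_cpp

# Start from the library defaults, then override what Python can actually reach.
params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M      # base quantisation profile
params.token_embedding_type = llama_cpp.GGML_TYPE_Q8_0  # embeddings: controllable from Python
params.output_tensor_type = llama_cpp.GGML_TYPE_Q6_K    # output layer: controllable from Python
# params.tensor_types wants a std::vector<tensor_quantization>* built on the C++ side,
# so arbitrary per-tensor overrides remain out of reach from Python.

result = llama_cpp.llama_model_quantize(
    b"model-f16.gguf",
    b"model-Q4_K_L.gguf",
    ctypes.byref(params),
)
print("llama_model_quantize returned", result)  # 0 indicates success
```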

For further optimisation, importance matrix (imatrix) files guide quantisation based on usage
patterns, outperforming fixed rules – particularly crucial at lower bit rates. See the
[IMatrix Guide](./imatrix_data.md) for obtaining or generating these files.
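
If you need to generate a matrix yourself, the approach amounts to running llama.cpp's
`llama-imatrix` binary over the full-precision GGUF with a calibration corpus. A minimal sketch,
assuming the binary location and file names used here (flags may vary between llama.cpp releases):

```python
import subprocess
from pathlib import Path

llama_cpp_dir = Path("./llama.cpp/build/bin")      # placeholder; see LLAMA_CPP_DIR below
calibration = Path("resources/imatrix_data.txt")   # calibration text corpus

# Run the importance-matrix pass over the full-precision model.
subprocess.run(
    [
        str(llama_cpp_dir / "llama-imatrix"),
        "-m", "model-F32.gguf",   # full-precision input GGUF
        "-f", str(calibration),   # calibration data
        "-o", "imatrix.dat",      # output importance matrix
    ],
    check=True,
)
```
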
## Understanding the Variants

Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't middle
ground but optimised baselines – Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K
elsewhere, a balance proven through years of community experimentation.

L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary handling). Q3_K_L upgrades the output layer to Q5_K. Q3_K_XL combines both
strategies. No Q4_K_XL or Q5_K_XL exists – at those sizes, Q5_K_M's superior base quantisation wins.

Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed
architectural interactions.
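
Summarised as data, the profiles described above reduce to a base type plus optional embedding and
output overrides. The mapping below is illustrative rather than the tool's actual configuration
structure, and the base types for Q5_K_L and Q6_K_L are assumptions:

```python
# None means "keep whatever the base type already uses for that tensor group".
VARIANT_OVERRIDES = {
    "Q3_K_L":  {"base": "Q3_K_M", "embeddings": None,   "output": "Q5_K"},
    "Q3_K_XL": {"base": "Q3_K_M", "embeddings": "Q8_0", "output": "Q5_K"},
    "Q4_K_L":  {"base": "Q4_K_M", "embeddings": "Q8_0", "output": None},
    "Q5_K_L":  {"base": "Q5_K_M", "embeddings": "Q8_0", "output": None},
    "Q6_K_L":  {"base": "Q6_K",   "embeddings": "Q8_0", "output": None},
}
```
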
## Practical Usage

The tool handles the complete workflow: fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. Fire-and-forget design – start it and return to completed models.

The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):

```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None,       # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0",     # Upgrade output to maximum precision Q8_0
)
```

Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:

```bash
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B

# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise_gguf.py <model_url> --no-upload

# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```

## The Architecture Behind the Magic

Based on analysis of Qwen3 4B: embeddings (9.7% of parameters) critically affect vocabulary
handling – upgrading them from Q4 to Q8 adds just 0.17GB but dramatically improves rare tokens.
Attention (14.1% of parameters in total) has its V layers (4.7%) enhanced in M variants whilst Q
and K stay at the base type for size control.

Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at the
base type, as enhancing them would double the size for modest gains, while down projections (22.3%)
are enhanced in M variants for feature-transformation quality. The output layer (9.4%) gets special
attention in Q3_K_L for prediction quality.

For an 8B model, the Q4_K_M baseline is ~4.5GB with its Q6_K enhancements. Q4_K_L adds 753MB
(5.3GB total) for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB – at which point
Q5_K_M's superior base quantisation makes more sense.
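
Those figures can be sanity-checked with back-of-envelope arithmetic from parameter shares and the
nominal bits-per-weight of each block format. A rough sketch (the percentages come from the Qwen3
4B analysis above; real files differ slightly because of block overheads and per-model layouts):

```python
# Approximate bits per weight for the relevant ggml block formats.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q6_K": 6.5625, "Q8_0": 8.5}


def upgrade_cost_gb(total_params: float, layer_share: float, old: str, new: str) -> float:
    """Extra gigabytes from re-quantising one tensor group at a higher precision."""
    layer_params = total_params * layer_share
    extra_bits = BITS_PER_WEIGHT[new] - BITS_PER_WEIGHT[old]
    return layer_params * extra_bits / 8 / 1e9


# Qwen3 4B embeddings are ~9.7% of parameters; Q4_K -> Q8_0 works out to roughly 0.19 GB,
# in line with the ~0.17 GB quoted above.
print(f"{upgrade_cost_gb(4e9, 0.097, 'Q4_K', 'Q8_0'):.2f} GB")
```
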
## Environment and Performance

Configuration is via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, and `DEBUG=true` for verbose logging. The tool uses llama-cpp-python (auto-installed via
uv), benefits from imatrix files, and requires a HuggingFace account only for uploads.
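
Equivalently, in Python terms (a minimal sketch of reading the same three variables; the default
values shown are assumptions rather than the tool's own):

```python
import os
from pathlib import Path

HF_TOKEN = os.environ.get("HF_TOKEN")                       # only needed for uploads
LLAMA_CPP_DIR = Path(os.environ.get("LLAMA_CPP_DIR", "."))  # custom llama.cpp binary location
DEBUG = os.environ.get("DEBUG", "").lower() == "true"       # verbose logging toggle
```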

Requirements scale predictably: disk needs roughly 3x the model size (original, F32, and outputs),
and memory tracks model size with streaming optimisations. Processing takes minutes to hours
depending on model size, and downloads range from a few gigabytes to 100GB+ for the largest models.

Error handling is comprehensive: automatic retry with exponential backoff, early dependency
detection, disk-space checks, actionable API error messages, and detailed conversion-failure logs.
The resilient workflow keeps you informed whilst handling the challenges of large-model processing.
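
The retry behaviour is the standard exponential-backoff pattern; a generic sketch rather than the
tool's exact implementation:

```python
import time


def with_retries(operation, attempts: int = 5, base_delay: float = 2.0):
    """Run an operation, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the real error
            time.sleep(base_delay * 2**attempt)  # waits 2s, 4s, 8s, ...
```
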
## Output and Organisation

Outputs are organised per model: the F32/F16 base, quantisation variants, imatrix files, and
documentation, following the naming pattern `model-name-variant.gguf`. Successful uploads
auto-clean their local files; failures preserve them for manual intervention. Generated READMEs
document variant characteristics and technical details.

Uploads include metadata, quantisation tags, and model cards explaining the trade-offs. The
parallel upload system maximises throughput with full progress visibility.
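
The parallel upload stage amounts to fanning the variant files out over a worker pool. A sketch
using `huggingface_hub` directly, with a placeholder repository and file list:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
repo_id = "your-org/model-GGUF"  # placeholder repository
variants = [Path("model-Q4_K_M.gguf"), Path("model-Q4_K_L.gguf")]  # placeholder file list

# Upload each quantisation variant concurrently and re-raise any failure.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(
            api.upload_file,
            path_or_fileobj=str(path),
            path_in_repo=path.name,
            repo_id=repo_id,
        )
        for path in variants
    ]
    for future in futures:
        future.result()
```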