Switch to llama-cpp-python

This commit is contained in:
Tom Foster 2025-08-08 21:40:15 +01:00
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions


@@ -1,102 +1,151 @@
# quantise.py - Advanced GGUF Quantisation
# quantise_gguf.py - Advanced GGUF Quantisation
Advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.
Transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.
## Overview
1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)
This tool automates the complete quantisation workflow for converting models to GGUF format with
multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.
## The Full Picture
## Quantisation Variants
GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1), experimental
integer types (IQ2-IQ4), and full-precision F16/BF16. The key is knowing when each is worth using.
The tool produces four quantisation variants based on Bartowski's method:
Replicating Bartowski's patterns revealed an interesting limitation. The llama-cpp-python bindings provide
embedding and output layer control, but the more sophisticated `tensor_types` parameter expects a pointer
to a C++ `std::vector<tensor_quantization>`, which cannot be constructed from Python. This architectural
boundary between Python and C++ cannot be worked around without significant redesign.
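What the bindings do expose still covers the L and XL recipes. Below is a rough sketch of the low-level
route, assuming a recent llama-cpp-python release that surfaces `token_embedding_type` and
`output_tensor_type` on `llama_model_quantize_params`; constant and field names may vary between versions.
```python
# Illustrative only - not the tool's own helper. It shows the two per-tensor
# knobs reachable from Python; the per-layer tensor_types vector is C++-only.
import llama_cpp

params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M       # base recipe
params.token_embedding_type = llama_cpp.GGML_TYPE_Q8_0   # upgrade embeddings
params.output_tensor_type = llama_cpp.GGML_TYPE_Q6_K     # upgrade output head
params.nthread = 8

# llama_model_quantize expects byte paths; a return value of 0 means success.
ret = llama_cpp.llama_model_quantize(
    b"model-f16.gguf", b"model-Q4_K_L.gguf", params
)
if ret != 0:
    raise RuntimeError(f"quantisation failed with status {ret}")
```
The project's `LlamaCppPythonAPI` helper (shown under Practical Usage) wraps the same two controls at
profile level.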
- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention for maximum precision
Analysis of Bartowski's GGUF files shows this limitation doesn't matter. M variants already include
per-layer enhancements: Q4_K_M uses Q6_K for embeddings, attention V, and FFN down layers.
Bartowski's L and XL variants only tweak embeddings and output layers, which is precisely what we can
control. It's a case of working with the constraints rather than against them.
## Features
For further optimisation, importance matrix (imatrix) files guide quantisation based on usage
patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining or
generating these files; they are particularly crucial at lower bit rates.
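If you need to produce an imatrix yourself, the llama.cpp binary route still works. A minimal sketch,
assuming `llama-imatrix` is on your `PATH` and the calibration text sits at `resources/imatrix_data.txt`:
```python
# Illustrative sketch: drive llama.cpp's llama-imatrix binary from Python to
# build an importance matrix from the calibration corpus.
import subprocess

subprocess.run(
    [
        "llama-imatrix",
        "-m", "model-f16.gguf",              # F16/F32 GGUF conversion of the model
        "-f", "resources/imatrix_data.txt",  # calibration text
        "-o", "imatrix.dat",                 # importance matrix output
    ],
    check=True,  # raise CalledProcessError on a non-zero exit code
)
```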
- **Automatic model download**: Downloads models from HuggingFace automatically
- **Importance matrix generation**: Creates imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata
## Understanding the Variants
## Usage
Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't a middle
ground but optimised baselines: Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K
elsewhere, a balance proven through years of community experimentation.
### Basic Usage
L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary handling). Q3_K_L upgrades the output layer to Q5_K. Q3_K_XL combines both
strategies. No Q4_K_XL or Q5_K_XL variants exist: at those sizes, Q5_K_M's superior base quantisation wins.
```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```
Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed
architectural interactions.
## Practical Usage
The tool handles the complete workflow: fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. It's a fire-and-forget design: start it and come back to completed models.
The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):
```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None,       # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0",     # Upgrade output to maximum precision Q8_0
)
```
### Command Line Options
Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:
```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B
# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix
# Local testing without upload
uv run quantise.py <model_url> --no-upload
uv run quantise_gguf.py <model_url> --no-upload
# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models
# Use specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```
## Environment Variables
## The Architecture Behind the Magic
- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"
Based on Qwen3 4B analysis: embeddings (9.7% of parameters) critically affect vocabulary handling, and
upgrading them from Q4 to Q8 adds just 0.17GB whilst dramatically improving rare-token quality. Attention
(14.1% in total) has its V layers (4.7%) enhanced in M variants whilst Q and K stay at base for size control.
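That trade-off is easy to sanity-check with back-of-the-envelope arithmetic. A sketch assuming
Qwen3-4B-like dimensions (roughly 152k vocabulary and 2,560-wide embeddings, both illustrative) and
llama.cpp's nominal bits per weight:
```python
# Back-of-the-envelope cost of upgrading the embedding tensor.
# Nominal llama.cpp densities: Q4_K ~4.5, Q6_K ~6.5625, Q8_0 = 8.5 bits per weight.
VOCAB_SIZE = 152_000   # assumed Qwen3-style vocabulary (illustrative)
HIDDEN_SIZE = 2_560    # assumed embedding width (illustrative)

embed_params = VOCAB_SIZE * HIDDEN_SIZE  # ~0.39B weights, roughly 9.7% of a 4B model

def tensor_gb(n_weights: int, bits_per_weight: float) -> float:
    """Approximate tensor size in GB at a given quantisation density."""
    return n_weights * bits_per_weight / 8 / 1e9

print(f"Embeddings at Q4_K: {tensor_gb(embed_params, 4.5):.2f} GB")
print(f"Embeddings at Q8_0: {tensor_gb(embed_params, 8.5):.2f} GB")
# Difference is ~0.2 GB - the same ballpark as the 0.17GB figure quoted above.
print(f"Upgrade cost:       {tensor_gb(embed_params, 8.5 - 4.5):.2f} GB")
```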
## Requirements
Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at base,
as enhancing them would double the size for modest gains. Down projections (22.3%) are enhanced in M
variants for feature-transformation quality. The output layer (9.4%) gets special attention in Q3_K_L
for prediction quality.
- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)
For an 8B model: Q4_K_M baseline is ~4.5GB with Q6_K enhancements. Q4_K_L adds 753MB (5.3GB total)
for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB, at which point Q5_K_M's superior
base quantisation makes more sense.
## Workflow
## Environment and Performance
1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
Configuration is via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, and `DEBUG=true` for verbose logging. The tool uses llama-cpp-python (auto-installed via uv),
benefits from imatrix files, and requires a HuggingFace account only for uploads.
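As a purely illustrative sketch of how those variables are typically consumed (this is not the tool's
own code):
```python
# Illustrative only: reading the documented environment variables.
import os
from pathlib import Path

hf_token = os.environ.get("HF_TOKEN")            # optional - needed only for uploads
llama_cpp_dir = os.environ.get("LLAMA_CPP_DIR")  # optional custom binary location
debug = os.environ.get("DEBUG", "").lower() == "true"

binary_dir = Path(llama_cpp_dir) if llama_cpp_dir else None
print(f"uploads: {'enabled' if hf_token else 'disabled'}, debug logging: {debug}")
```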
## Output Structure
Requirements scale predictably: disk needs ~3x the model size (original, F32 conversion, outputs), while
memory tracks the model size with streaming optimisations. Processing takes minutes to hours depending on
model size, and downloads range from gigabytes to 100GB+ for the largest models.
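The ~3x figure is worth verifying before committing to a multi-hour run. An illustrative pre-flight
check, not the tool's actual implementation:
```python
# Illustrative pre-flight check: require roughly 3x the source model's size free,
# covering the original download, the F32/F16 conversion, and the quantised outputs.
import shutil
from pathlib import Path

def enough_disk(model_dir: str, work_dir: str, multiplier: float = 3.0) -> bool:
    """Return True if work_dir has at least `multiplier` x the model's size free."""
    model_bytes = sum(f.stat().st_size for f in Path(model_dir).rglob("*") if f.is_file())
    return shutil.disk_usage(work_dir).free >= model_bytes * multiplier

if not enough_disk("./downloads/Llama-3.2-1B", "."):
    raise SystemExit("Not enough free disk space for conversion and quantisation")
```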
```plain
output_dir/
├── model-F32.gguf # Full precision conversion
├── model-Q4_K_M.gguf # Standard quantisation
├── model-Q4_K_M-imat.gguf # With importance matrix
├── model-Q4_K_L-imat.gguf # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf # High precision embeddings
├── model-Q4_K_XXL-imat.gguf # Maximum precision
└── imatrix.dat # Generated importance matrix
```
Error handling is comprehensive: automatic retry with exponential backoff, early dependency detection,
disk space checks, actionable API error messages, and detailed conversion failure logs. The resilient
workflow keeps you informed whilst handling the challenges of large-model processing.
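The retry behaviour follows the standard exponential backoff pattern; a minimal generic sketch of the
idea (the tool's own upload helpers may differ in detail):
```python
# Generic retry-with-exponential-backoff wrapper, illustrating the behaviour
# described above rather than reproducing the tool's implementation.
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 2.0):
    """Call fn(), retrying on failure with delays of 2s, 4s, 8s, ..."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # in practice, catch specific network/API errors
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```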
## Error Handling
## Output and Organisation
The tool includes comprehensive error handling for:
Outputs are organised per model: F32/F16 base, quantisation variants, imatrix files, and documentation.
The naming pattern is `model-name-variant.gguf`. Successful uploads auto-clean local files; failures
preserve them for manual intervention. READMEs document variant characteristics and technical details.
- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures
## Performance Considerations
- **Disk space**: Requires ~3x model size in free space
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours based on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
Uploads include metadata, quantisation tags, and model cards explaining the trade-offs. The parallel
upload system maximises throughput with full progress visibility.
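For reference, a parallel upload loop along these lines can be sketched with `huggingface_hub` and a
thread pool; the repository name, output directory, and worker count below are assumptions, and this is
not the tool's actual upload code:
```python
# Illustrative parallel upload of quantised variants with huggingface_hub.
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from huggingface_hub import HfApi

api = HfApi()  # token resolved from HF_TOKEN or a cached login
repo_id = "your-username/Llama-3.2-1B-GGUF"  # assumed target repository

variants = sorted(Path("./output").glob("*.gguf"))

def upload(path: Path) -> str:
    api.upload_file(
        path_or_fileobj=str(path),
        path_in_repo=path.name,
        repo_id=repo_id,
        repo_type="model",
    )
    return path.name

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(upload, p): p for p in variants}
    for future in as_completed(futures):
        print(f"Uploaded {future.result()}")
```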