# quantise.py - Advanced GGUF Quantisation

Advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.

## Overview

This tool automates the complete quantisation workflow for converting models to GGUF format with
multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.

## Quantisation Variants

The tool produces four quantisation variants based on Bartowski's method (sketched below):

- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention layers for maximum precision

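
As an illustration, higher-precision embeddings can be pinned with llama.cpp's `llama-quantize` tensor-type overrides. This is a minimal sketch, not the tool's actual invocation: it assumes a recent llama.cpp build where `--imatrix` and `--token-embedding-type` are available (per-layer attention overrides vary by build), and the file names are hypothetical.

```bash
# Minimal sketch: a Q4_K_L-style variant (Q6_K embeddings on a Q4_K_M base).
# Assumes a recent llama.cpp build; flags and file names are illustrative.
llama-quantize \
    --imatrix imatrix.dat \
    --token-embedding-type q6_k \
    model-F32.gguf model-Q4_K_L-imat.gguf Q4_K_M
```
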
## Features

- **Automatic model download**: Fetches source models directly from HuggingFace
- **Importance matrix generation**: Creates an imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata

## Usage

### Basic Usage

```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```

### Command Line Options

```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise.py <model_url> --no-upload

# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models

# Use specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```

## Environment Variables

- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"

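
For example, a single run that points at a custom llama.cpp build with debug logging enabled might look like this (paths are illustrative):

```bash
# Illustrative one-off configuration via environment variables
LLAMA_CPP_DIR="$HOME/llama.cpp" DEBUG=true uv run quantise.py <model_url>
```
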
## Requirements

- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)

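
Before starting a long run, it can be worth confirming the binaries are actually reachable; a small shell check, assuming they are expected on `PATH`:

```bash
# Report any llama.cpp binaries missing from PATH
for bin in llama-quantize llama-cli llama-imatrix; do
    command -v "$bin" >/dev/null || echo "missing: $bin"
done
```
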
## Workflow

1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to the initial GGUF format (F32)
3. **Generate imatrix**: Creates an importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches

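
For orientation, steps 2-4 map roughly onto these standalone llama.cpp invocations. This is a hedged sketch of the equivalent manual workflow, not the tool's internal calls, and all paths are illustrative:

```bash
# 2. Convert the downloaded model to an F32 GGUF
#    (convert_hf_to_gguf.py ships with llama.cpp)
python convert_hf_to_gguf.py ./model-dir --outtype f32 --outfile model-F32.gguf

# 3. Build the importance matrix from calibration data
llama-imatrix -m model-F32.gguf -f resources/imatrix_data.txt -o imatrix.dat

# 4. Produce one quantised variant guided by the imatrix
llama-quantize --imatrix imatrix.dat model-F32.gguf model-Q4_K_M-imat.gguf Q4_K_M
```
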
## Output Structure

```plain
output_dir/
├── model-F32.gguf            # Full precision conversion
├── model-Q4_K_M.gguf         # Standard quantisation
├── model-Q4_K_M-imat.gguf    # With importance matrix
├── model-Q4_K_L-imat.gguf    # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf   # High precision embeddings
├── model-Q4_K_XXL-imat.gguf  # Maximum precision
└── imatrix.dat               # Generated importance matrix
```

## Error Handling

The tool includes comprehensive error handling for:

- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures

## Performance Considerations

- **Disk space**: Requires ~3x the model size in free space
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours based on model size
- **Network**: Downloads can be large (10-100+ GB for large models)