Initial commit

Tom Foster 2025-08-07 18:29:12 +01:00
commit ef7df1a8c3
28 changed files with 6829 additions and 0 deletions

docs/development.md Normal file

@@ -0,0 +1,86 @@
# Development Guide
This guide covers development setup, code quality standards, and project structure for contributors.
## Code Quality
```bash
# Run linting
uv run ruff check
# Format code
uv run ruff format
# Run with debug logging
DEBUG=true uv run <script>
```
## Project Structure
```plain
llm-gguf-tools/
├── quantise.py # Bartowski quantisation tool
├── direct_safetensors_to_gguf.py # Direct conversion tool
├── helpers/ # Shared utilities
│ ├── __init__.py
│ └── logger.py # Colour-coded logging
├── resources/ # Resource files
│ └── imatrix_data.txt # Calibration data for imatrix
├── docs/ # Detailed documentation
│   ├── quantise_gguf.md
│   ├── safetensors2gguf.md
│ └── development.md
└── pyproject.toml # Project configuration
```
## Contributing Guidelines
Contributions are welcome! Please ensure:
1. Code follows the existing style (run `uv run ruff format`)
2. All functions have Google-style docstrings (see the sketch after this list)
3. Type hints are used throughout
4. Tests pass (if applicable)
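As a reference for points 2 and 3, here is a minimal example in the expected style; the function itself is hypothetical and exists only to illustrate the docstring and type-hint conventions:
```python
from pathlib import Path


def count_shards(model_dir: Path, pattern: str = "*.safetensors") -> int:
    """Count SafeTensors shards in a model directory.

    Args:
        model_dir: Directory containing the downloaded model files.
        pattern: Glob pattern used to match shard filenames.

    Returns:
        The number of files matching the pattern.

    Raises:
        FileNotFoundError: If the directory does not exist.
    """
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model directory not found: {model_dir}")
    return len(list(model_dir.glob(pattern)))
```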
## Development Workflow
### Setting Up Development Environment
```bash
# Clone the repository
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools
# Install all dependencies including dev
uv sync --all-groups
```
### Code Style
- Follow PEP 8 with ruff enforcement
- Use UK English spelling in comments and documentation
- Maximum line length: 100 characters
- Use type hints for all function parameters and returns
### Testing
While formal tests are not yet implemented, ensure:
- Scripts run without errors on sample models
- Logger output is correctly formatted
- File I/O operations handle errors gracefully (see the sketch below)
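For the last point, "gracefully" means catching expected I/O failures, logging them, and exiting cleanly rather than surfacing a raw traceback. A minimal sketch using the standard library logger (in practice the project's own colour-coded logger from `helpers/logger.py` would be used):
```python
import logging
import sys
from pathlib import Path

logger = logging.getLogger(__name__)


def read_config(config_path: Path) -> str:
    """Read a model's config.json, reporting failures instead of crashing."""
    try:
        return config_path.read_text(encoding="utf-8")
    except FileNotFoundError:
        logger.error("Config file not found: %s", config_path)
        sys.exit(1)
    except OSError as exc:  # permissions, disk errors, etc.
        logger.error("Could not read %s: %s", config_path, exc)
        sys.exit(1)
```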
### Debugging
Enable debug logging for verbose output:
```bash
DEBUG=true uv run quantise.py <model_url>
```
This will show additional information about:
- Model download progress
- Conversion steps
- File operations
- Error details
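The snippet below sketches how a `DEBUG` flag like this is typically wired up to switch the log level; the project's actual implementation lives in `helpers/logger.py` and may differ:
```python
import logging
import os

# DEBUG=true switches the log level from INFO to DEBUG.
level = logging.DEBUG if os.environ.get("DEBUG", "").lower() == "true" else logging.INFO
logging.basicConfig(level=level, format="%(levelname)s %(name)s: %(message)s")

logging.getLogger("quantise").debug("Only visible when DEBUG=true")
```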

docs/quantise_gguf.md Normal file

@@ -0,0 +1,102 @@
# quantise.py - Advanced GGUF Quantisation
GGUF quantisation tool implementing Bartowski's quantisation pipeline.
## Overview
This tool automates the complete quantisation workflow for converting models to GGUF format with
multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.
## Quantisation Variants
The tool produces four quantisation variants based on Bartowski's method:
- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention for maximum precision
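One way to picture these variants is as a shared Q4_K_M base plus per-variant overrides for the embedding and attention tensors. The mapping below is only a restatement of the list above as a data structure, not the tool's internal representation:
```python
# Embedding/attention precision overrides applied on top of a Q4_K_M base.
VARIANTS: dict[str, dict[str, str]] = {
    "Q4_K_M": {},  # standard baseline, no overrides
    "Q4_K_L": {"embeddings": "Q6_K", "attention": "Q6_K"},
    "Q4_K_XL": {"embeddings": "Q8_0", "attention": "Q6_K"},
    "Q4_K_XXL": {"embeddings": "Q8_0", "attention": "Q8_0"},
}

for name, overrides in VARIANTS.items():
    print(name, overrides or "base Q4_K_M quantisation")
```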
## Features
- **Automatic model download**: Downloads models from HuggingFace automatically
- **Importance matrix generation**: Creates imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata
## Usage
### Basic Usage
```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```
### Command Line Options
```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix
# Local testing without upload
uv run quantise.py <model_url> --no-upload
# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models
# Use specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```
## Environment Variables
- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"
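These are read from the environment at startup; the fallbacks shown below are illustrative assumptions rather than documented defaults:
```python
import os
from pathlib import Path

hf_token = os.environ.get("HF_TOKEN")  # None means uploads are skipped or unauthenticated
llama_cpp_dir = Path(os.environ.get("LLAMA_CPP_DIR", "."))  # assumed fallback: current directory
debug = os.environ.get("DEBUG", "").lower() == "true"

print(f"upload token set: {hf_token is not None}, binaries: {llama_cpp_dir}, debug: {debug}")
```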
## Requirements
- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)
## Workflow
1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
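The same six stages, expressed as a high-level outline. Every function name here is hypothetical shorthand for a stage rather than an actual API exposed by quantise.py:
```python
from pathlib import Path


# Hypothetical stage functions; each stands in for one step of the real tool.
def download_model(url: str, dest: Path) -> Path: ...
def convert_to_f32_gguf(model_dir: Path) -> Path: ...
def generate_imatrix(f32_gguf: Path, calibration: Path) -> Path: ...
def quantise_variants(f32_gguf: Path, imatrix: Path) -> list[Path]: ...
def upload_variants(files: list[Path]) -> None: ...
def clean_up(work_dir: Path) -> None: ...


def run_pipeline(model_url: str, output_dir: Path, upload: bool = True) -> None:
    """Outline of the pipeline: download, convert, imatrix, quantise, upload, clean up."""
    model_dir = download_model(model_url, output_dir)                         # 1. Download
    f32_gguf = convert_to_f32_gguf(model_dir)                                 # 2. Convert to F32
    imatrix = generate_imatrix(f32_gguf, Path("resources/imatrix_data.txt"))  # 3. Generate imatrix
    variants = quantise_variants(f32_gguf, imatrix)                           # 4. Quantise variants
    if upload:
        upload_variants(variants)                                             # 5. Upload
    clean_up(model_dir)                                                       # 6. Clean up
```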
## Output Structure
```plain
output_dir/
├── model-F32.gguf # Full precision conversion
├── model-Q4_K_M.gguf # Standard quantisation
├── model-Q4_K_M-imat.gguf # With importance matrix
├── model-Q4_K_L-imat.gguf # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf # High precision embeddings
├── model-Q4_K_XXL-imat.gguf # Maximum precision
└── imatrix.dat # Generated importance matrix
```
## Error Handling
The tool includes comprehensive error handling for:
- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures
## Performance Considerations
- **Disk space**: Requires roughly 3x the model size in free space (see the worked example below)
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours based on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
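To make the disk figure concrete: the intermediate F32 GGUF takes about 4 bytes per parameter, on top of the original download and the quantised variants. The numbers below are purely illustrative and assume "model size" refers to the downloaded checkpoint:
```python
params = 8e9                       # illustrative: an 8B-parameter model
download_gib = params * 2 / 2**30  # a BF16 checkpoint stores 2 bytes per parameter
f32_gib = params * 4 / 2**30       # the F32 intermediate doubles that
print(f"Download: ~{download_gib:.0f} GiB, F32 intermediate: ~{f32_gib:.0f} GiB")
print(f"Plan for at least ~{3 * download_gib:.0f} GiB free (the ~3x guidance above)")
```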

docs/safetensors2gguf.md Normal file

@@ -0,0 +1,164 @@
# direct_safetensors_to_gguf.py - Direct SafeTensors Conversion
Direct SafeTensors to GGUF converter for unsupported architectures.
## Overview
This tool converts SafeTensors models directly to GGUF format without requiring specific
architecture support in llama.cpp. It's particularly useful for experimental models, custom
architectures, or when llama.cpp's standard conversion tools don't recognise your model
architecture.
## Features
- **Architecture-agnostic**: Works with unsupported model architectures
- **Automatic mapping**: Intelligently maps tensor names to GGUF conventions
- **BFloat16 support**: Handles BF16 tensors with PyTorch (optional)
- **Vision models**: Supports models with vision components
- **Tokeniser preservation**: Extracts and includes tokeniser metadata
- **Fallback mechanisms**: Provides sensible defaults for unknown architectures
## Usage
### Basic Usage
```bash
# Convert a local SafeTensors model
uv run direct_safetensors_to_gguf.py /path/to/model/directory
```
### Command Line Options
```bash
# Specify output file
uv run direct_safetensors_to_gguf.py /path/to/model -o output.gguf
# Force specific architecture mapping
uv run direct_safetensors_to_gguf.py /path/to/model --force-arch qwen2
# Convert with custom output path
uv run direct_safetensors_to_gguf.py ./my-model --output ./converted/my-model.gguf
```
## Supported Input Formats
The tool automatically detects and handles:
1. **Single file models**: `model.safetensors`
2. **Sharded models**: `model-00001-of-00005.safetensors`, etc.
3. **Custom names**: Any `*.safetensors` files in the directory
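A sketch of this discovery logic using only the standard library; the real tool's implementation may differ in detail:
```python
from pathlib import Path


def find_safetensors(model_dir: Path) -> list[Path]:
    """Return SafeTensors files in a directory, whether single-file, sharded, or custom-named."""
    single = model_dir / "model.safetensors"
    if single.is_file():
        return [single]
    # Sharded files (model-00001-of-00005.safetensors) and custom names both match here.
    shards = sorted(model_dir.glob("*.safetensors"))
    if not shards:
        raise FileNotFoundError(f"No safetensor files found in {model_dir}")
    return shards
```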
## Architecture Mapping
The tool includes built-in mappings for several architectures:
- `DotsOCRForCausalLM` → `qwen2`
- `GptOssForCausalLM` → `llama`
- Unknown architectures → `llama` (fallback)
You can override these with the `--force-arch` parameter.
## Tensor Name Mapping
The converter automatically maps common tensor patterns:
| Original Pattern | GGUF Name |
|-----------------|-----------|
| `model.embed_tokens.weight` | `token_embd.weight` |
| `model.norm.weight` | `output_norm.weight` |
| `lm_head.weight` | `output.weight` |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` |
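The table can be reproduced with a few substitution rules. The snippet below is an illustrative re-implementation of the mapping, not the converter's actual code:
```python
import re

# Exact renames, then layer-indexed patterns where N is the block number.
EXACT = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}
LAYER_RULES = [
    (r"^layers\.(\d+)\.self_attn\.q_proj", r"blk.\1.attn_q"),
    (r"^layers\.(\d+)\.self_attn\.k_proj", r"blk.\1.attn_k"),
    (r"^layers\.(\d+)\.self_attn\.v_proj", r"blk.\1.attn_v"),
    (r"^layers\.(\d+)\.mlp\.gate_proj", r"blk.\1.ffn_gate"),
    (r"^layers\.(\d+)\.mlp\.up_proj", r"blk.\1.ffn_up"),
    (r"^layers\.(\d+)\.mlp\.down_proj", r"blk.\1.ffn_down"),
]


def map_tensor_name(name: str) -> str:
    """Map a SafeTensors tensor name to its GGUF equivalent (unchanged if no rule matches)."""
    if name in EXACT:
        return EXACT[name]
    stripped = name.removeprefix("model.")  # layer tensors usually carry a "model." prefix
    for pattern, replacement in LAYER_RULES:
        new_name, count = re.subn(pattern, replacement, stripped)
        if count:
            return new_name
    return name


print(map_tensor_name("model.layers.3.self_attn.q_proj.weight"))  # blk.3.attn_q.weight
```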
## Configuration Requirements
The model directory must contain:
- **config.json**: Model configuration file (required)
- **\*.safetensors**: One or more SafeTensors files (required)
- **tokenizer_config.json**: Tokeniser configuration (optional)
- **tokenizer.json**: Tokeniser data (optional)
## Output Format
The tool produces a single GGUF file containing:
- All model weights in F32 format
- Model architecture metadata
- Tokeniser configuration (if available)
- Special token IDs (BOS, EOS, UNK, PAD)
## Error Handling
| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Ensure the model directory contains a valid `config.json` file |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Check that the directory contains `.safetensors` files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch for BF16 support: `uv add torch` |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Use `--force-arch` to specify a known compatible architecture |
## Technical Details
### Parameter Inference
The tool infers GGUF parameters from the model configuration:
- `vocab_size` → vocabulary size (default: 32000)
- `max_position_embeddings` → context length (default: 2048)
- `hidden_size` → embedding dimension (default: 4096)
- `num_hidden_layers` → number of transformer blocks (default: 32)
- `num_attention_heads` → attention head count (default: 32)
- `num_key_value_heads` → KV head count (defaults to attention heads)
- `rope_theta` → RoPE frequency base (default: 10000.0)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5)
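In practice this amounts to reading `config.json` and falling back to the defaults above when a key is missing. A minimal sketch:
```python
import json
from pathlib import Path


def infer_params(model_dir: Path) -> dict:
    """Pull GGUF-relevant hyperparameters from config.json, applying the documented fallbacks."""
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        "head_count_kv": config.get("num_key_value_heads", heads),  # defaults to attention heads
        "rope_theta": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }
```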
### Vision Model Support
For models with vision components, the tool extracts:
- Vision embedding dimensions
- Vision transformer block count
- Vision attention heads
- Vision feed-forward dimensions
- Patch size and spatial merge parameters
## Limitations
- **F32 only**: Currently outputs only full precision (F32) models
- **Architecture guessing**: May require manual architecture specification
- **Tokeniser compatibility**: Uses llama tokeniser as default fallback
- **Memory usage**: Requires loading full tensors into memory
## Examples
### Converting a custom model
```bash
# Download a model first
git clone https://huggingface.co/my-org/my-model ./my-model
# Convert to GGUF
uv run direct_safetensors_to_gguf.py ./my-model
# Output will be at ./my-model/my-model-f32.gguf
```
### Converting with specific architecture
```bash
# For a Qwen2-based model
uv run direct_safetensors_to_gguf.py ./qwen-model --force-arch qwen2
```
### Batch conversion
```bash
# Convert multiple models
for model in ./models/*; do
  uv run direct_safetensors_to_gguf.py "$model" -o "./gguf/$(basename "$model").gguf"
done
```