Initial commit

Tom Foster 2025-08-07 18:29:12 +01:00
commit ef7df1a8c3
28 changed files with 6829 additions and 0 deletions

docs/development.md Normal file

@@ -0,0 +1,86 @@
# Development Guide
This guide covers development setup, code quality standards, and project structure for contributors.
## Code Quality
```bash
# Run linting
uv run ruff check
# Format code
uv run ruff format
# Run with debug logging
DEBUG=true uv run <script>
```
## Project Structure
```plain
llm-gguf-tools/
├── quantise.py # Bartowski quantisation tool
├── direct_safetensors_to_gguf.py # Direct conversion tool
├── helpers/ # Shared utilities
│ ├── __init__.py
│ └── logger.py # Colour-coded logging
├── resources/ # Resource files
│ └── imatrix_data.txt # Calibration data for imatrix
├── docs/ # Detailed documentation
│   ├── quantise_gguf.md
│   ├── safetensors2gguf.md
│ └── development.md
└── pyproject.toml # Project configuration
```
## Contributing Guidelines
Contributions are welcome! Please ensure:
1. Code follows the existing style (run `uv run ruff format`)
2. All functions have Google-style docstrings (see the sketch after this list)
3. Type hints are used throughout
4. Tests pass (if applicable)
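As a reference for points 2 and 3, here is a minimal example in the expected style; the function itself is hypothetical and exists only to illustrate the docstring and type-hint conventions:
```python
from pathlib import Path


def count_shards(model_dir: Path, pattern: str = "*.safetensors") -> int:
    """Count SafeTensors shards in a model directory.

    Args:
        model_dir: Directory containing the downloaded model files.
        pattern: Glob pattern used to match shard filenames.

    Returns:
        The number of files matching the pattern.

    Raises:
        FileNotFoundError: If the directory does not exist.
    """
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model directory not found: {model_dir}")
    return len(list(model_dir.glob(pattern)))
```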
## Development Workflow
### Setting Up Development Environment
```bash
# Clone the repository
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools
# Install all dependencies including dev
uv sync --all-groups
```
### Code Style
- Follow PEP 8 with ruff enforcement
- Use UK English spelling in comments and documentation
- Maximum line length: 100 characters
- Use type hints for all function parameters and returns
### Testing
While formal tests are not yet implemented, ensure:
- Scripts run without errors on sample models
- Logger output is correctly formatted
- File I/O operations handle errors gracefully (see the sketch below)
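For the last point, "gracefully" means catching expected I/O failures, logging them, and exiting cleanly rather than surfacing a raw traceback. A minimal sketch using the standard library logger (in practice the project's own colour-coded logger from `helpers/logger.py` would be used):
```python
import logging
import sys
from pathlib import Path

logger = logging.getLogger(__name__)


def read_config(config_path: Path) -> str:
    """Read a model's config.json, reporting failures instead of crashing."""
    try:
        return config_path.read_text(encoding="utf-8")
    except FileNotFoundError:
        logger.error("Config file not found: %s", config_path)
        sys.exit(1)
    except OSError as exc:  # permissions, disk errors, etc.
        logger.error("Could not read %s: %s", config_path, exc)
        sys.exit(1)
```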
### Debugging
Enable debug logging for verbose output:
```bash
DEBUG=true uv run quantise.py <model_url>
```
This will show additional information about:
- Model download progress
- Conversion steps
- File operations
- Error details
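The snippet below sketches how a `DEBUG` flag like this is typically wired up to switch the log level; the project's actual implementation lives in `helpers/logger.py` and may differ:
```python
import logging
import os

# DEBUG=true switches the log level from INFO to DEBUG.
level = logging.DEBUG if os.environ.get("DEBUG", "").lower() == "true" else logging.INFO
logging.basicConfig(level=level, format="%(levelname)s %(name)s: %(message)s")

logging.getLogger("quantise").debug("Only visible when DEBUG=true")
```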

docs/quantise_gguf.md Normal file

@@ -0,0 +1,102 @@
# quantise.py - Advanced GGUF Quantisation
GGUF quantisation tool implementing Bartowski's quantisation pipeline.
## Overview
This tool automates the complete quantisation workflow for converting models to GGUF format with
multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.
## Quantisation Variants
The tool produces four quantisation variants based on Bartowski's method:
- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention for maximum precision
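One way to picture these variants is as a shared Q4_K_M base plus per-variant overrides for the embedding and attention tensors. The mapping below is only a restatement of the list above as a data structure, not the tool's internal representation:
```python
# Embedding/attention precision overrides applied on top of a Q4_K_M base.
VARIANTS: dict[str, dict[str, str]] = {
    "Q4_K_M": {},  # standard baseline, no overrides
    "Q4_K_L": {"embeddings": "Q6_K", "attention": "Q6_K"},
    "Q4_K_XL": {"embeddings": "Q8_0", "attention": "Q6_K"},
    "Q4_K_XXL": {"embeddings": "Q8_0", "attention": "Q8_0"},
}

for name, overrides in VARIANTS.items():
    print(name, overrides or "base Q4_K_M quantisation")
```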
## Features
- **Automatic model download**: Downloads models from HuggingFace automatically
- **Importance matrix generation**: Creates imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata
## Usage
### Basic Usage
```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```
### Command Line Options
```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix
# Local testing without upload
uv run quantise.py <model_url> --no-upload
# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models
# Use specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```
## Environment Variables
- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"
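These are read from the environment at startup; the fallbacks shown below are illustrative assumptions rather than documented defaults:
```python
import os
from pathlib import Path

hf_token = os.environ.get("HF_TOKEN")  # None means uploads are skipped or unauthenticated
llama_cpp_dir = Path(os.environ.get("LLAMA_CPP_DIR", "."))  # assumed fallback: current directory
debug = os.environ.get("DEBUG", "").lower() == "true"

print(f"upload token set: {hf_token is not None}, binaries: {llama_cpp_dir}, debug: {debug}")
```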
## Requirements
- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)
## Workflow
1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
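The same six stages, expressed as a high-level outline. Every function name here is hypothetical shorthand for a stage rather than an actual API exposed by quantise.py:
```python
from pathlib import Path


# Hypothetical stage functions; each stands in for one step of the real tool.
def download_model(url: str, dest: Path) -> Path: ...
def convert_to_f32_gguf(model_dir: Path) -> Path: ...
def generate_imatrix(f32_gguf: Path, calibration: Path) -> Path: ...
def quantise_variants(f32_gguf: Path, imatrix: Path) -> list[Path]: ...
def upload_variants(files: list[Path]) -> None: ...
def clean_up(work_dir: Path) -> None: ...


def run_pipeline(model_url: str, output_dir: Path, upload: bool = True) -> None:
    """Outline of the pipeline: download, convert, imatrix, quantise, upload, clean up."""
    model_dir = download_model(model_url, output_dir)                         # 1. Download
    f32_gguf = convert_to_f32_gguf(model_dir)                                 # 2. Convert to F32
    imatrix = generate_imatrix(f32_gguf, Path("resources/imatrix_data.txt"))  # 3. Generate imatrix
    variants = quantise_variants(f32_gguf, imatrix)                           # 4. Quantise variants
    if upload:
        upload_variants(variants)                                             # 5. Upload
    clean_up(model_dir)                                                       # 6. Clean up
```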
## Output Structure
```plain
output_dir/
├── model-F32.gguf # Full precision conversion
├── model-Q4_K_M.gguf # Standard quantisation
├── model-Q4_K_M-imat.gguf # With importance matrix
├── model-Q4_K_L-imat.gguf # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf # High precision embeddings
├── model-Q4_K_XXL-imat.gguf # Maximum precision
└── imatrix.dat # Generated importance matrix
```
## Error Handling
The tool includes comprehensive error handling for:
- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures
## Performance Considerations
- **Disk space**: Requires roughly 3x the model size in free space (see the worked example below)
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours based on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
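To make the disk figure concrete: the intermediate F32 GGUF takes about 4 bytes per parameter, on top of the original download and the quantised variants. The numbers below are purely illustrative and assume "model size" refers to the downloaded checkpoint:
```python
params = 8e9                       # illustrative: an 8B-parameter model
download_gib = params * 2 / 2**30  # a BF16 checkpoint stores 2 bytes per parameter
f32_gib = params * 4 / 2**30       # the F32 intermediate doubles that
print(f"Download: ~{download_gib:.0f} GiB, F32 intermediate: ~{f32_gib:.0f} GiB")
print(f"Plan for at least ~{3 * download_gib:.0f} GiB free (the ~3x guidance above)")
```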

docs/safetensors2gguf.md Normal file

@@ -0,0 +1,164 @@
# direct_safetensors_to_gguf.py - Direct SafeTensors Conversion
Direct SafeTensors to GGUF converter for unsupported architectures.
## Overview
This tool converts SafeTensors models directly to GGUF format without requiring specific
architecture support in llama.cpp. It's particularly useful for experimental models, custom
architectures, or when llama.cpp's standard conversion tools don't recognise your model
architecture.
## Features
- **Architecture-agnostic**: Works with unsupported model architectures
- **Automatic mapping**: Intelligently maps tensor names to GGUF conventions
- **BFloat16 support**: Handles BF16 tensors with PyTorch (optional)
- **Vision models**: Supports models with vision components
- **Tokeniser preservation**: Extracts and includes tokeniser metadata
- **Fallback mechanisms**: Provides sensible defaults for unknown architectures
## Usage
### Basic Usage
```bash
# Convert a local SafeTensors model
uv run direct_safetensors_to_gguf.py /path/to/model/directory
```
### Command Line Options
```bash
# Specify output file
uv run direct_safetensors_to_gguf.py /path/to/model -o output.gguf
# Force specific architecture mapping
uv run direct_safetensors_to_gguf.py /path/to/model --force-arch qwen2
# Convert with custom output path
uv run direct_safetensors_to_gguf.py ./my-model --output ./converted/my-model.gguf
```
## Supported Input Formats
The tool automatically detects and handles:
1. **Single file models**: `model.safetensors`
2. **Sharded models**: `model-00001-of-00005.safetensors`, etc.
3. **Custom names**: Any `*.safetensors` files in the directory
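A sketch of this discovery logic using only the standard library; the real tool's implementation may differ in detail:
```python
from pathlib import Path


def find_safetensors(model_dir: Path) -> list[Path]:
    """Return SafeTensors files in a directory, whether single-file, sharded, or custom-named."""
    single = model_dir / "model.safetensors"
    if single.is_file():
        return [single]
    # Sharded files (model-00001-of-00005.safetensors) and custom names both match here.
    shards = sorted(model_dir.glob("*.safetensors"))
    if not shards:
        raise FileNotFoundError(f"No safetensor files found in {model_dir}")
    return shards
```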
## Architecture Mapping
The tool includes built-in mappings for several architectures:
- `DotsOCRForCausalLM` → `qwen2`
- `GptOssForCausalLM` → `llama`
- Unknown architectures → `llama` (fallback)
You can override these with the `--force-arch` parameter.
## Tensor Name Mapping
The converter automatically maps common tensor patterns:
| Original Pattern | GGUF Name |
|-----------------|-----------|
| `model.embed_tokens.weight` | `token_embd.weight` |
| `model.norm.weight` | `output_norm.weight` |
| `lm_head.weight` | `output.weight` |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` |
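The table can be reproduced with a few substitution rules. The snippet below is an illustrative re-implementation of the mapping, not the converter's actual code:
```python
import re

# Exact renames, then layer-indexed patterns where N is the block number.
EXACT = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}
LAYER_RULES = [
    (r"^layers\.(\d+)\.self_attn\.q_proj", r"blk.\1.attn_q"),
    (r"^layers\.(\d+)\.self_attn\.k_proj", r"blk.\1.attn_k"),
    (r"^layers\.(\d+)\.self_attn\.v_proj", r"blk.\1.attn_v"),
    (r"^layers\.(\d+)\.mlp\.gate_proj", r"blk.\1.ffn_gate"),
    (r"^layers\.(\d+)\.mlp\.up_proj", r"blk.\1.ffn_up"),
    (r"^layers\.(\d+)\.mlp\.down_proj", r"blk.\1.ffn_down"),
]


def map_tensor_name(name: str) -> str:
    """Map a SafeTensors tensor name to its GGUF equivalent (unchanged if no rule matches)."""
    if name in EXACT:
        return EXACT[name]
    stripped = name.removeprefix("model.")  # layer tensors usually carry a "model." prefix
    for pattern, replacement in LAYER_RULES:
        new_name, count = re.subn(pattern, replacement, stripped)
        if count:
            return new_name
    return name


print(map_tensor_name("model.layers.3.self_attn.q_proj.weight"))  # blk.3.attn_q.weight
```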
## Configuration Requirements
The model directory must contain:
- **config.json**: Model configuration file (required)
- **\*.safetensors**: One or more SafeTensors files (required)
- **tokenizer_config.json**: Tokeniser configuration (optional)
- **tokenizer.json**: Tokeniser data (optional)
## Output Format
The tool produces a single GGUF file containing:
- All model weights in F32 format
- Model architecture metadata
- Tokeniser configuration (if available)
- Special token IDs (BOS, EOS, UNK, PAD)
## Error Handling
| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Ensure the model directory contains a valid `config.json` file |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Check that the directory contains `.safetensors` files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch for BF16 support: `uv add torch` |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Use `--force-arch` to specify a known compatible architecture |
## Technical Details
### Parameter Inference
The tool infers GGUF parameters from the model configuration:
- `vocab_size` → vocabulary size (default: 32000)
- `max_position_embeddings` → context length (default: 2048)
- `hidden_size` → embedding dimension (default: 4096)
- `num_hidden_layers` → number of transformer blocks (default: 32)
- `num_attention_heads` → attention head count (default: 32)
- `num_key_value_heads` → KV head count (defaults to attention heads)
- `rope_theta` → RoPE frequency base (default: 10000.0)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5)
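In practice this amounts to reading `config.json` and falling back to the defaults above when a key is missing. A minimal sketch:
```python
import json
from pathlib import Path


def infer_params(model_dir: Path) -> dict:
    """Pull GGUF-relevant hyperparameters from config.json, applying the documented fallbacks."""
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        "head_count_kv": config.get("num_key_value_heads", heads),  # defaults to attention heads
        "rope_theta": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }
```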
### Vision Model Support
For models with vision components, the tool extracts:
- Vision embedding dimensions
- Vision transformer block count
- Vision attention heads
- Vision feed-forward dimensions
- Patch size and spatial merge parameters
## Limitations
- **F32 only**: Currently outputs only full precision (F32) models
- **Architecture guessing**: May require manual architecture specification
- **Tokeniser compatibility**: Uses llama tokeniser as default fallback
- **Memory usage**: Requires loading full tensors into memory
## Examples
### Converting a custom model
```bash
# Download a model first
git clone https://huggingface.co/my-org/my-model ./my-model
# Convert to GGUF
uv run direct_safetensors_to_gguf.py ./my-model
# Output will be at ./my-model/my-model-f32.gguf
```
### Converting with specific architecture
```bash
# For a Qwen2-based model
uv run direct_safetensors_to_gguf.py ./qwen-model --force-arch qwen2
```
### Batch conversion
```bash
# Convert multiple models
for model in ./models/*; do
  uv run direct_safetensors_to_gguf.py "$model" -o "./gguf/$(basename "$model").gguf"
done
```