Initial commit
Commit ef7df1a8c3
28 changed files with 6829 additions and 0 deletions
docs/development.md (new file, +86)
@@ -0,0 +1,86 @@
# Development Guide

This guide covers development setup, code quality standards, and project structure for contributors.

## Code Quality

```bash
# Run linting
uv run ruff check

# Format code
uv run ruff format

# Run with debug logging
DEBUG=true uv run <script>
```

## Project Structure

```plain
llm-gguf-tools/
├── quantise.py                      # Bartowski quantisation tool
├── direct_safetensors_to_gguf.py   # Direct conversion tool
├── helpers/                         # Shared utilities
│   ├── __init__.py
│   └── logger.py                    # Colour-coded logging
├── resources/                       # Resource files
│   └── imatrix_data.txt             # Calibration data for imatrix
├── docs/                            # Detailed documentation
│   ├── quantise_gguf.md
│   ├── safetensors2gguf.md
│   └── development.md
└── pyproject.toml                   # Project configuration
```

## Contributing Guidelines

Contributions are welcome! Please ensure:

1. Code follows the existing style (run `uv run ruff format`)
2. All functions have Google-style docstrings (see the sketch after this list)
3. Type hints are used throughout
4. Tests pass (if applicable)
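
For reference, the following sketch illustrates the expected docstring and type-hint style. The function itself is hypothetical and not part of the codebase:

```python
from pathlib import Path


def convert_model(model_path: Path, output_dir: Path, force: bool = False) -> Path:
    """Convert a model directory to GGUF format.

    Args:
        model_path: Directory containing the source model files.
        output_dir: Directory to write the converted GGUF file to.
        force: Overwrite an existing output file if True.

    Returns:
        Path to the generated GGUF file.

    Raises:
        FileNotFoundError: If model_path does not contain a config.json.
    """
    ...
```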

## Development Workflow

### Setting Up Development Environment

```bash
# Clone the repository
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools

# Install all dependencies including dev
uv sync --all-groups
```

### Code Style

- Follow PEP 8 with ruff enforcement
- Use UK English spelling in comments and documentation
- Maximum line length: 100 characters
- Use type hints for all function parameters and returns

### Testing

While formal tests are not yet implemented, ensure:

- Scripts run without errors on sample models
- Logger output is correctly formatted
- File I/O operations handle errors gracefully

### Debugging

Enable debug logging for verbose output:

```bash
DEBUG=true uv run quantise.py <model_url>
```

This will show additional information about:

- Model download progress
- Conversion steps
- File operations
- Error details
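
As a rough sketch, a script might honour the `DEBUG` variable along these lines; the stdlib logging shown here is illustrative and not the actual `helpers/logger.py` interface:

```python
import logging
import os

# Assumption: scripts switch to DEBUG-level logging when DEBUG=true is set.
level = logging.DEBUG if os.environ.get("DEBUG", "").lower() == "true" else logging.INFO
logging.basicConfig(level=level, format="%(levelname)s: %(message)s")
logging.debug("Verbose output enabled")  # only emitted when DEBUG=true
```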
docs/quantise_gguf.md (new file, +102)
@@ -0,0 +1,102 @@
# quantise.py - Advanced GGUF Quantisation

Advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.

## Overview

This tool automates the complete quantisation workflow for converting models to GGUF format with
multiple precision variants, importance matrix generation, and automatic upload to HuggingFace.

## Quantisation Variants

The tool produces four quantisation variants based on Bartowski's method:

- **Q4_K_M**: Standard baseline quantisation
- **Q4_K_L**: Q6_K embeddings + Q6_K attention layers for better quality
- **Q4_K_XL**: Q8_0 embeddings + Q6_K attention layers for enhanced precision
- **Q4_K_XXL**: Q8_0 embeddings + Q8_0 attention for maximum precision
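
The variant-specific overrides above can be summarised in code. The structure below is an illustrative sketch of that table, not the tool's actual internals:

```python
# Illustrative mapping of variant names to the tensor-type overrides described above.
VARIANT_OVERRIDES: dict[str, dict[str, str]] = {
    "Q4_K_M": {},  # baseline: no overrides
    "Q4_K_L": {"embeddings": "Q6_K", "attention": "Q6_K"},
    "Q4_K_XL": {"embeddings": "Q8_0", "attention": "Q6_K"},
    "Q4_K_XXL": {"embeddings": "Q8_0", "attention": "Q8_0"},
}
```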

## Features

- **Automatic model download**: Fetches models directly from HuggingFace
- **Importance matrix generation**: Creates imatrix for improved quantisation quality
- **Parallel processing**: Uploads multiple variants simultaneously
- **Progress tracking**: Real-time status updates during conversion
- **README generation**: Automatically creates model cards with quantisation details
- **HuggingFace integration**: Direct upload to HuggingFace with proper metadata

## Usage

### Basic Usage

```bash
# Quantise a model from HuggingFace
uv run quantise.py https://huggingface.co/meta-llama/Llama-3.2-1B
```

### Command Line Options

```bash
# Skip imatrix generation for faster processing
uv run quantise.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise.py <model_url> --no-upload

# Custom output directory
uv run quantise.py <model_url> --output-dir ./my-models

# Use specific HuggingFace token
uv run quantise.py <model_url> --hf-token YOUR_TOKEN
```

## Environment Variables

- `HF_TOKEN`: HuggingFace API token for uploads
- `LLAMA_CPP_DIR`: Custom path to llama.cpp binaries
- `DEBUG`: Enable debug logging when set to "true"
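
A minimal sketch of how these variables might be read; the fallback behaviour noted in the comments is an assumption, not the tool's documented defaults:

```python
import os

# Illustrative only: how the three variables above could be consumed.
hf_token = os.environ.get("HF_TOKEN")                        # None would disable uploads
llama_cpp_dir = os.environ.get("LLAMA_CPP_DIR")              # assumption: fall back to binaries on PATH
debug_enabled = os.environ.get("DEBUG", "").lower() == "true"
```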

## Requirements

- **llama.cpp binaries**: `llama-quantize`, `llama-cli`, `llama-imatrix`
- **Calibration data**: `resources/imatrix_data.txt` for importance matrix generation
- **HuggingFace account**: For uploading quantised models (optional)

## Workflow

1. **Download**: Fetches the model from HuggingFace
2. **Convert**: Converts to initial GGUF format (F32)
3. **Generate imatrix**: Creates importance matrix using calibration data
4. **Quantise**: Produces multiple quantisation variants in parallel
5. **Upload**: Pushes quantised models to HuggingFace with metadata
6. **Clean up**: Removes temporary files and caches
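
For orientation, steps 3 and 4 roughly correspond to invoking the llama.cpp binaries listed under Requirements. The sketch below shows typical invocations rather than the tool's actual code; exact flags may differ between llama.cpp releases:

```python
import subprocess

# Step 3: generate the importance matrix from the F32 conversion and calibration data.
subprocess.run(
    ["llama-imatrix", "-m", "model-F32.gguf",
     "-f", "resources/imatrix_data.txt", "-o", "imatrix.dat"],
    check=True,
)

# Step 4: produce one quantisation variant using that imatrix.
subprocess.run(
    ["llama-quantize", "--imatrix", "imatrix.dat",
     "model-F32.gguf", "model-Q4_K_M-imat.gguf", "Q4_K_M"],
    check=True,
)
```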

## Output Structure

```plain
output_dir/
├── model-F32.gguf            # Full precision conversion
├── model-Q4_K_M.gguf         # Standard quantisation
├── model-Q4_K_M-imat.gguf    # With importance matrix
├── model-Q4_K_L-imat.gguf    # Enhanced embeddings/attention
├── model-Q4_K_XL-imat.gguf   # High precision embeddings
├── model-Q4_K_XXL-imat.gguf  # Maximum precision
└── imatrix.dat               # Generated importance matrix
```

## Error Handling

The tool includes comprehensive error handling for:

- Network failures during download
- Missing binaries or dependencies
- Insufficient disk space
- HuggingFace API errors
- Conversion failures

## Performance Considerations

- **Disk space**: Requires ~3x model size in free space
- **Memory**: Needs RAM proportional to model size
- **Processing time**: Varies from minutes to hours based on model size
- **Network**: Downloads can be large (10-100+ GB for large models)
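
A pre-flight check along these lines can guard against the disk-space requirement above; the threshold mirrors the ~3x rule of thumb and the helper is illustrative:

```python
import shutil


def check_free_space(output_dir: str, model_size_bytes: int) -> None:
    """Raise if output_dir lacks roughly 3x the model size in free space."""
    free = shutil.disk_usage(output_dir).free
    if free < 3 * model_size_bytes:
        raise RuntimeError(
            f"Need ~{3 * model_size_bytes / 1e9:.1f} GB free, only {free / 1e9:.1f} GB available"
        )
```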
docs/safetensors2gguf.md (new file, +164)
@@ -0,0 +1,164 @@
# direct_safetensors_to_gguf.py - Direct SafeTensors Conversion

Direct SafeTensors to GGUF converter for unsupported architectures.

## Overview

This tool converts SafeTensors models directly to GGUF format without requiring specific
architecture support in llama.cpp. It's particularly useful for experimental models, custom
architectures, or when llama.cpp's standard conversion tools don't recognise your model
architecture.

## Features

- **Architecture-agnostic**: Works with unsupported model architectures
- **Automatic mapping**: Intelligently maps tensor names to GGUF conventions
- **BFloat16 support**: Handles BF16 tensors with PyTorch (optional)
- **Vision models**: Supports models with vision components
- **Tokeniser preservation**: Extracts and includes tokeniser metadata
- **Fallback mechanisms**: Provides sensible defaults for unknown architectures

## Usage

### Basic Usage

```bash
# Convert a local SafeTensors model
uv run direct_safetensors_to_gguf.py /path/to/model/directory
```

### Command Line Options

```bash
# Specify output file
uv run direct_safetensors_to_gguf.py /path/to/model -o output.gguf

# Force specific architecture mapping
uv run direct_safetensors_to_gguf.py /path/to/model --force-arch qwen2

# Convert with custom output path
uv run direct_safetensors_to_gguf.py ./my-model --output ./converted/my-model.gguf
```

## Supported Input Formats

The tool automatically detects and handles:

1. **Single file models**: `model.safetensors`
2. **Sharded models**: `model-00001-of-00005.safetensors`, etc.
3. **Custom names**: Any `*.safetensors` files in the directory
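
Detection of these layouts can be as simple as a glob over the directory; the helper below is a sketch under that assumption, not the converter's actual code:

```python
from pathlib import Path


def find_safetensors(model_dir: str) -> list[Path]:
    """Return all SafeTensors files in a model directory, sorted so shards stay in order."""
    files = sorted(Path(model_dir).glob("*.safetensors"))
    if not files:
        raise FileNotFoundError(f"No safetensor files found in {model_dir}")
    return files
```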

## Architecture Mapping

The tool includes built-in mappings for several architectures:

- `DotsOCRForCausalLM` → `qwen2`
- `GptOssForCausalLM` → `llama`
- Unknown architectures → `llama` (fallback)

You can override these with the `--force-arch` parameter.
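
Conceptually, the mapping and the `--force-arch` override behave like the following sketch (the variable and function names are illustrative):

```python
# Built-in architecture mappings described above; unknown names fall back to "llama".
ARCH_MAP = {
    "DotsOCRForCausalLM": "qwen2",
    "GptOssForCausalLM": "llama",
}


def resolve_arch(config_arch: str, force_arch: str | None = None) -> str:
    """Pick the GGUF architecture name, honouring a --force-arch override."""
    return force_arch or ARCH_MAP.get(config_arch, "llama")
```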

## Tensor Name Mapping

The converter automatically maps common tensor patterns:

| Original Pattern | GGUF Name |
|-----------------|-----------|
| `model.embed_tokens.weight` | `token_embd.weight` |
| `model.norm.weight` | `output_norm.weight` |
| `lm_head.weight` | `output.weight` |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` |
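
A simplified sketch of how the rows in this table could be applied follows; the real converter's logic may differ, and the helper name is hypothetical:

```python
import re

# Simplified renaming based on the table above; layer index N is captured as a group.
STATIC_MAP = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}
LAYER_PATTERNS = [
    (re.compile(r"layers\.(\d+)\.self_attn\.([qkv])_proj"), r"blk.\1.attn_\2"),
    (re.compile(r"layers\.(\d+)\.mlp\.gate_proj"), r"blk.\1.ffn_gate"),
    (re.compile(r"layers\.(\d+)\.mlp\.up_proj"), r"blk.\1.ffn_up"),
    (re.compile(r"layers\.(\d+)\.mlp\.down_proj"), r"blk.\1.ffn_down"),
]


def map_tensor_name(name: str) -> str:
    """Translate a SafeTensors tensor name to its GGUF equivalent (illustrative)."""
    if name in STATIC_MAP:
        return STATIC_MAP[name]
    for pattern, replacement in LAYER_PATTERNS:
        if pattern.search(name):
            return pattern.sub(replacement, name)
    return name  # unmapped tensors keep their original names
```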

## Configuration Requirements

The model directory must contain:

- **config.json**: Model configuration file (required)
- **\*.safetensors**: One or more SafeTensors files (required)
- **tokenizer_config.json**: Tokeniser configuration (optional)
- **tokenizer.json**: Tokeniser data (optional)

## Output Format

The tool produces a single GGUF file containing:

- All model weights in F32 format
- Model architecture metadata
- Tokeniser configuration (if available)
- Special token IDs (BOS, EOS, UNK, PAD)

## Error Handling

| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Ensure the model directory contains a valid `config.json` file |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Check that the directory contains `.safetensors` files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch for BF16 support: `uv add torch` |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Use `--force-arch` to specify a known compatible architecture |

## Technical Details

### Parameter Inference

The tool infers GGUF parameters from the model configuration:

- `vocab_size` → vocabulary size (default: 32000)
- `max_position_embeddings` → context length (default: 2048)
- `hidden_size` → embedding dimension (default: 4096)
- `num_hidden_layers` → number of transformer blocks (default: 32)
- `num_attention_heads` → attention head count (default: 32)
- `num_key_value_heads` → KV head count (defaults to attention heads)
- `rope_theta` → RoPE frequency base (default: 10000.0)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5)
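
Condensed, that inference amounts to reading `config.json` and applying the defaults listed above. The field names come from a typical HuggingFace `config.json`; the output key names and the function itself are illustrative:

```python
import json
from pathlib import Path


def infer_params(model_dir: str) -> dict:
    """Read config.json and apply the defaults listed above (illustrative sketch)."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        "head_count_kv": config.get("num_key_value_heads", heads),
        "rope_freq_base": config.get("rope_theta", 10000.0),
        "layer_norm_rms_epsilon": config.get("rms_norm_eps", 1e-5),
    }
```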

### Vision Model Support

For models with vision components, the tool extracts:

- Vision embedding dimensions
- Vision transformer block count
- Vision attention heads
- Vision feed-forward dimensions
- Patch size and spatial merge parameters

## Limitations

- **F32 only**: Currently outputs only full precision (F32) models
- **Architecture guessing**: May require manual architecture specification
- **Tokeniser compatibility**: Uses llama tokeniser as default fallback
- **Memory usage**: Requires loading full tensors into memory

## Examples

### Converting a custom model

```bash
# Download a model first
git clone https://huggingface.co/my-org/my-model ./my-model

# Convert to GGUF
uv run direct_safetensors_to_gguf.py ./my-model

# Output will be at ./my-model/my-model-f32.gguf
```

### Converting with specific architecture

```bash
# For a Qwen2-based model
uv run direct_safetensors_to_gguf.py ./qwen-model --force-arch qwen2
```

### Batch conversion

```bash
# Convert multiple models
for model in ./models/*; do
  uv run direct_safetensors_to_gguf.py "$model" -o "./gguf/$(basename "$model").gguf"
done
```