Switch to llama-cpp-python

This commit is contained in:
Tom Foster 2025-08-08 21:40:15 +01:00
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

docs/bartowski_analysis.md Normal file

@@ -0,0 +1,127 @@
# Bartowski Quantisation Analysis
Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't
apply uniform quantisation as their names suggest.
1. [The Hidden Sophistication of M Variants](#the-hidden-sophistication-of-m-variants)
2. [The Complete Quantisation Map](#the-complete-quantisation-map)
3. [The Architecture of Intelligence](#the-architecture-of-intelligence)
4. [The Economics of Enhancement](#the-economics-of-enhancement)
5. [Why Q3\_K Gets Special Treatment](#why-q3_k-gets-special-treatment)
6. [Implementation Insights](#implementation-insights)
7. [The Deeper Pattern](#the-deeper-pattern)
## The Hidden Sophistication of M Variants
When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically
enhances critical components: embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down
projections receive the same treatment. This represents years of empirical optimisation baked
directly into the quantisation logic.
The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply
takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size
increases are modest relative to quality gains.
## The Complete Quantisation Map
Here's what's actually happening inside these models, based on analysis of real GGUF files:
| Variant | Embed | Output | Q | K | V | Gate | Up | Down |
|----------|-------|--------|-------|-------|-------|-------|-------|-------|
| Q3_K_M | Q6_K | Q4_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_L | Q6_K | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_XL | Q8_0 | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q4_K_M | Q6_K | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q4_K_L | Q8_0 | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q5_K_M | Q6_K | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q5_K_L | Q8_0 | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q6_K_L | Q8_0 | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K |
Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6),
and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL
variant as it has room for both improvements without competing with the next tier.
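These per-layer choices can be verified directly from the files. A minimal sketch using the `gguf`
Python package (shipped with llama.cpp as gguf-py; the file name here is illustrative):
```python
# List each tensor's quantisation type to confirm the per-layer map above
from gguf import GGUFReader

reader = GGUFReader("Qwen3-4B-Q4_K_M.gguf")  # illustrative file name
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum, e.g. Q6_K for token_embd.weight
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")
```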
## The Architecture of Intelligence
Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at
F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the
model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically
improves handling of technical terms and rare words.
Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical:
they're what the model retrieves when attending to context. M variants enhance V layers whilst
leaving Q and K at base quantisation for better information retrieval without excessive size increase.
Feed-forward network trade-offs: Gate and up projections (44.6% of parameters, 1,793M, 3.59GB)
stay at base quantisation as enhancement would double file sizes for modest gains. Down projections
(22.3%, 897M, 1.79GB) get enhanced in M variants as they're the final transformation affecting all
downstream processing.
The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L
targets it for enhancement as improved output precision can mean the difference between coherent
and garbled text for Q3-based models.
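These percentages translate directly into bytes. A quick sketch, assuming 2 bytes per parameter at
F16 and the parameter counts quoted above:
```python
# Approximate F16 footprint per component for the Qwen3 4B reference model
components = {
    "embeddings": 389e6,
    "attention_qkv": 566e6,
    "ffn_gate_up": 1793e6,
    "ffn_down": 897e6,
    "output": 378e6,
}
total = sum(components.values())
for name, params in components.items():
    size_gb = params * 2 / 1e9    # 2 bytes per parameter at F16
    share = params / total * 100  # share of total parameters
    print(f"{name:15s} {size_gb:5.2f} GB  ({share:4.1f}%)")
```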
## The Economics of Enhancement
Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (19%
increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst maximising
vocabulary understanding. A naive approach of upgrading everything would add gigabytes for marginal
improvements.
Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant
represents a local optimum: better quality requires jumping tiers, smaller size sacrifices key
capabilities.
## Why Q3_K Gets Special Treatment
Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room
for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides
granular control for memory-constrained environments, with each 15-20% size increase delivering
meaningful quality improvements.
Q4_K_XL or Q5_K_XL don't exist because they'd compete with the next tier. A hypothetical Q4_K_XL
at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides better
quality than selectively enhanced Q4_K layers.
The pattern is consistent: significant enhancements to Q5_K or Q6_K mean you should jump to the
next base type. Sweet spots: Q3 family for extreme memory constraints, Q4/Q5 for mainstream use,
Q6/Q8 when quality matters more than size.
## Implementation Insights
Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's
variants requires minimal configuration:
```python
# Q3_K_L: Only upgrade output from M baseline
config = {
    "base": "Q3_K_M",   # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K",   # Single surgical change
}

# Q4_K_L: Only upgrade embeddings from M baseline
config = {
    "base": "Q4_K_M",      # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0",  # Single surgical change
}

# Q3_K_XL: The only variant needing two changes
config = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K",
}
```
This minimalist approach recognises that M variants already embody years of community optimisation.
Bartowski's contribution lies in identifying where small adjustments yield outsized returns.
## The Deeper Pattern
This system evolved through countless experiments rather than top-down design. M variants encode
hard-won knowledge about critical layers. L variants build on this foundation. The absence of most
XL variants shows where diminishing returns set in.
Bartowski's quantisations work because they embody years of collective learning about what matters
in practice. They demonstrate that the best solutions often come from understanding and building
upon what already works, rather than grand redesigns.


@@ -1,86 +1,136 @@
# Development Guide
Contributing to GGUF tools requires understanding quantisation workflows and Python's modern
dependency ecosystem. This guide covers setup, standards, and architectural decisions for fixing
bugs, adding quantisation profiles, or extending conversion capabilities.
## Code Quality
Ruff replaces the traditional Black/isort/flake8 stack as both linter and formatter. Mypy provides
static type checking to catch type-related bugs before runtime. Zero tolerance for linting and type
errors catches issues early. Both tools have extensive configuration in `pyproject.toml` to enforce
only the important code quality standards we've selected. Debug logging reveals quantisation internals
when models fail.
```bash
# Run linting - catches style violations, potential bugs, and code smells
uvx ruff check
# Format code - enforces consistent style automatically
uvx ruff format
# Run type checking - ensures type safety and catches potential bugs
uv run mypy .
# Run with debug logging - reveals conversion steps and tensor processing
DEBUG=true uv run <script>
```
## Project Structure
Architecture separates concerns cleanly: top-level scripts provide interfaces, helpers encapsulate
reusable logic, resources contain community data. Structure evolved from practical needs: helpers
emerged to eliminate duplication, services to abstract external dependencies.
```plain
llm-gguf-tools/
├── quantise.py # Bartowski quantisation tool - the main workflow
├── direct_safetensors_to_gguf.py # Direct conversion for unsupported architectures
├── helpers/ # Shared utilities and abstractions
│ ├── __init__.py
│ ├── logger.py # Colour-coded logging with context awareness
│ ├── services/ # External service wrappers
│ │ ├── gguf.py # GGUF writer abstraction
│ │ └── llama_python.py # llama-cpp-python integration
│ └── utils/ # Pure utility functions
│ ├── config_parser.py # Model configuration handling
│ └── tensor_mapping.py # Architecture-specific tensor name mapping
├── resources/ # Resource files and calibration data
│ └── imatrix_data.txt # Curated calibration data from Bartowski
├── docs/ # Detailed documentation
│ ├── quantise_gguf.md # Quantisation strategies and profiles
│ ├── safetensors2gguf.md # Direct conversion documentation
│ ├── bartowski_analysis.md # Deep dive into variant strategies
│ ├── imatrix_data.md # Importance matrix guide
│ └── development.md # This guide
└── pyproject.toml # Modern Python project configuration
```
## Contributing Guidelines
The project values pragmatic solutions over theoretical perfection: working code that handles edge
cases beats elegant abstractions. Contributors should understand how quantisation profiles map to
Bartowski's discoveries and where Python-C++ boundaries limit functionality.
Essential requirements:
1. **Style consistency**: Run `uvx ruff format` before committing to keep diffs focused on logic
2. **Documentation**: Google-style docstrings explain behaviour and rationale beyond type hints
3. **Type safety**: Complete type hints for all public functions enable IDE support
4. **Practical testing**: Test with both 1B and 7B+ models to catch scaling issues
## Development Workflow
### Setting Up Development Environment
The project uses `uv` for dependency management: Rust-fast, with automatic Python version management
and upfront dependency resolution. Development dependencies include ruff, type stubs, and optional
PyTorch for BFloat16 handling.
```bash
# Clone the repository - uses Forgejo (GitLab-like) hosting
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools
# Install all dependencies including dev tools
# This installs llama-cpp-python with CUDA support if available
uv sync --all-groups
```
### Code Style
Code style reduces cognitive load by letting reviewers focus on logic rather than layout. UK English
maintains llama.cpp consistency. The 100-character line limit balances descriptive names with
readability.
Core conventions:
- **PEP 8 compliance**: Ruff catches mutable defaults, unused imports automatically
- **UK English**: "Optimise" not "optimize", matching upstream llama.cpp
- **Line length**: 100 characters maximum except URLs or unbreakable paths
- **Type annotations**: Complete hints for public functions, documentation that can't go stale
- **Import ordering**: Standard library, then third-party, then local; ruff handles this automatically
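A short illustration of these conventions in practice (a hypothetical helper, not actual project code):
```python
def select_profile(model_size_gb: float, memory_budget_gb: float) -> str:
    """Choose a quantisation profile that fits the available memory.

    Args:
        model_size_gb: Size of the F16 model in gigabytes.
        memory_budget_gb: Memory available for the quantised model.

    Returns:
        A profile name such as "Q4_K_M" or "Q3_K_M".
    """
    ratio = memory_budget_gb / model_size_gb
    return "Q4_K_M" if ratio >= 0.3 else "Q3_K_M"  # illustrative threshold
```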
### Testing
Formal tests are pending: quantisation "correctness" depends on complex interactions between model
architecture, strategy, and downstream usage. Benchmark performance doesn't guarantee production
success.
Current validation approach:
- **End-to-end testing**: Qwen 0.5B for quick iteration, Llama 3.2 1B for architecture compatibility
- **Output validation**: GGUF must load in llama.cpp and degrade gracefully, not produce gibberish
- **Error handling**: Test corrupted SafeTensors, missing configs, insufficient disk space
- **Logger consistency**: Verify colour coding across terminals, progress bars with piped output
### Debugging
Debug logging transforms black box to glass box, revealing failure points. Colour coding highlights
stages: blue (info), yellow (warnings), red (errors), green (success). Visual hierarchy enables
efficient log scanning.
```bash
# Enable comprehensive debug output
DEBUG=true uv run direct_safetensors_to_gguf.py ./model # Tensor mapping details
DEBUG=true uv run quantise.py <model_url> # Memory usage tracking
```
Debug output reveals:
- **Download progress**: Bytes transferred, retries, connection issues
- **Conversion pipeline**: SafeTensors→GGUF steps, tensor mappings, dimension changes
- **Quantisation decisions**: Layer bit depths, importance matrix effects on weight selection
- **Memory usage**: Peak consumption for predicting larger model requirements
- **File operations**: Read/write/temp patterns for disk usage analysis
- **Error context**: Stack traces with local variables at failure points
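The `DEBUG=true` toggle follows a standard environment-variable pattern; a minimal sketch of how
such a switch typically works (not the project's exact logger setup):
```python
import logging
import os

# Map the DEBUG environment variable onto the logging level
level = logging.DEBUG if os.getenv("DEBUG", "").lower() == "true" else logging.INFO
logging.basicConfig(level=level, format="%(levelname)s %(message)s")

logging.debug("Tensor mapping details only appear when DEBUG=true")
logging.info("Normal progress messages always appear")
```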

docs/imatrix_data.md Normal file

@@ -0,0 +1,115 @@
# Importance Matrix (IMatrix) Data Guide
An importance matrix guides quantisation by identifying critical weights that need protection. Like
JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix
protects parameters that most affect output quality.
At lower bit rates, imatrix-quantised models show 2-3% better perplexity scores overall, with larger
gains in specific capabilities. A Q3_K model without imatrix might lose technical vocabulary or
rare language handling, whilst with imatrix it retains these abilities: the difference between
simple size reduction and intelligent compression.
1. [The Art of Calibration Data](#the-art-of-calibration-data)
2. [Finding Pre-computed Matrices](#finding-pre-computed-matrices)
3. [Creating Your Own Matrix](#creating-your-own-matrix)
4. [Resource Requirements and Optimisation](#resource-requirements-and-optimisation)
5. [Integration and Workflow](#integration-and-workflow)
6. [Future Developments](#future-developments)
7. [Practical Tips](#practical-tips)
## The Art of Calibration Data
This repository includes `resources/imatrix_data.txt` from
[Bartowski's collection](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8),
originally compiled by Dampf building on Kalomaze's work. The dataset systematically activates
different model capabilities: technical writing for analysis, creative fiction for narrative,
multilingual text for language diversity, and factual content for knowledge accuracy.
The default calibration data works well for general models, but specialised models benefit from
targeted calibration. Code models need diverse programming languages and patterns; medical models
need technical literature and terminology. Calibration should reflect actual use cases: 50-100KB
of well-chosen text beats gigabytes of random content.
Calibration runs text through the model to observe weight activation patterns. These patterns
become the importance matrix: a heat map of crucial parameters for intended use cases, similar to
how brains strengthen frequently-used neural pathways.
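Assembling custom calibration data can be as simple as concatenating representative samples; a
sketch with illustrative file names:
```python
from pathlib import Path

# Illustrative samples covering different model capabilities
samples = [
    "samples/technical_writing.txt",
    "samples/creative_fiction.txt",
    "samples/multilingual.txt",
    "samples/code_snippets.txt",
]

# Concatenate into one calibration file, aiming for the 50-100KB sweet spot
calibration = "\n\n".join(Path(p).read_text(encoding="utf-8") for p in samples)
Path("calibration.txt").write_text(calibration, encoding="utf-8")
print(f"Wrote {len(calibration) / 1024:.0f} KB of calibration text")
```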
## Finding Pre-computed Matrices
Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at
`https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix`. These save
hours of computation and provide excellent results from high-quality calibration data.
The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to
your model's work directory as `imatrix.dat`. The quality improvement, especially at lower
quantisation levels, justifies this extra step.
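Fetching a pre-computed matrix can be scripted with `huggingface_hub` (the repository and file names
follow the pattern above; substitute your model):
```python
from huggingface_hub import hf_hub_download

# Download Bartowski's pre-computed imatrix into the model's work directory
path = hf_hub_download(
    repo_id="bartowski/MODEL-NAME-GGUF",  # substitute the actual model name
    filename="MODEL-NAME.imatrix",
    local_dir="./work/MODEL-NAME",
)
print(f"Saved imatrix to {path}")  # rename to imatrix.dat where the tool expects it
```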
## Creating Your Own Matrix
Generate your own imatrix for new models, domain-specific calibration, or experimentation.
Currently requires llama.cpp's binary tools as the functionality isn't exposed through
llama-cpp-python.
Download llama.cpp from the [official releases](https://github.com/ggerganov/llama.cpp/releases).
Windows users need `llama-bXXXX-bin-win-cuda-x64.zip` for GPU support; Linux/macOS users can use
binaries or compile from source.
Use the F16 or F32 GGUF model (found in `./work/<model-name>/` after quantisation). F16 balances
quality and computation requirements. Run from your llama.cpp directory:
```bash
./llama-imatrix -m /path/to/model-F16.gguf \
-f /path/to/calibration.txt \
-o /path/to/imatrix.dat \
--chunks 100
```
Generation runs inference whilst analysing activation patterns. The `--chunks` parameter controls
thoroughness (100 is standard, more for production, less for experiments). Expect 30 minutes to
several hours on consumer hardware. GPU acceleration helps significantly.
Generation shows perplexity calculations and progress updates after initial loading. The tool tracks
activation patterns, calculates importance scores, and builds the statistical model for guiding
quantisation.
## Resource Requirements and Optimisation
Resource requirements match full inference: 7B models need ~14GB RAM for F16. CPU-only works but
GPU acceleration reduces days to hours for large models. The process supports interruption and
resumption.
Matrix quality depends on multiple factors. More chunks improve results with diminishing returns
beyond 200-300. F16 precision is optimal: F32 doubles computation for minimal gain, whilst
quantised models create quality-degrading feedback loops.
Temperature affects generation (lower focuses on likely paths, higher explores possibilities) but
defaults are well-tuned. Good calibration data matters more than parameter tweaking.
## Integration and Workflow
Place the imatrix as `imatrix.dat` in your model's work directory. The tool auto-detects and applies
it with log confirmation. One imatrix works for all quantisation levels.
The tool acknowledges current limitations whilst providing clean workflows. Though Python generation
isn't available yet, using external matrices is trivial. This pragmatic approach delivers optimal
results whilst preparing for future improvements.
## Future Developments
Native imatrix generation is on llama-cpp-python's roadmap for immediate integration when available.
Meanwhile, this hybrid approach works well. The community shares matrices, calibration datasets
improve constantly, and algorithms grow more sophisticated.
Research continues into dynamic importance scoring, multi-modal calibration for vision-language
models, and automated calibration generation. These advances will eventually reach production tools,
but current approaches already deliver impressive results.
## Practical Tips
Key insights: Quality and diversity beat quantity in calibration data. Include specific use cases
even if uncommon. Balance languages proportionally for multilingual models. Include edge cases for
robustness. When in doubt, use Bartowski's pre-computed matrices: they're consistently excellent.
The importance matrix seems obvious in hindsight: preserve critical weights, calibrate for actual
usage. Yet it took years of experimentation to develop these techniques. Using them well transforms
quantisation from simple size reduction to intelligent preservation of what matters.


@@ -1,102 +1,151 @@
# quantise_gguf.py - Advanced GGUF Quantisation
Advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.
Transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.
1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)
## The Full Picture
GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1), experimental
integer types (IQ2-IQ4), and full precision F16/BF16. The key is understanding strategic usage.
Replicating Bartowski's patterns revealed an interesting limitation. Llama-cpp-python provides
embedding and output layer control, but the sophisticated `tensor_types` parameter expects a pointer
to a C++ `std::vector<tensor_quantization>`, which is impossible to create from Python. This
architectural boundary between Python and C++ cannot be worked around without significant redesign.
Analysis of Bartowski's GGUF files shows this limitation doesn't matter. M variants already include
per-layer enhancements: Q4_K_M uses Q6_K for embeddings, attention V, and FFN down layers.
Bartowski's L and XL variants only tweak embeddings and output layers, precisely what we can control.
This means working with constraints rather than against them.
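For context, the surface reachable from Python looks roughly like this at the binding level (a
sketch, assuming llama-cpp-python's low-level `llama_model_quantize` bindings mirror llama.cpp's C
API; the project wraps this in `helpers.services.llama_python`):
```python
import ctypes
import llama_cpp

# Start from library defaults, then apply the two knobs Python can reach
params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M      # base recipe, already layer-aware
params.token_embedding_type = llama_cpp.GGML_TYPE_Q8_0  # surgical embedding upgrade
params.nthread = 8

llama_cpp.llama_model_quantize(
    b"model-f16.gguf",
    b"model-Q4_K_L.gguf",
    ctypes.byref(params),
)
```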
For further optimisation, importance matrix (imatrix) files guide quantisation based on usage
patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining or
generating these files, which are particularly crucial at lower bit rates.
## Understanding the Variants
Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't middle
ground but optimised baselines: Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K
elsewhere, a balance proven through years of community experimentation.
L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary). Q3_K_L upgrades output to Q5_K. Q3_K_XL combines both strategies. No
Q4_K_XL or Q5_K_XL exists: at those sizes, Q5_K_M's superior base quantisation wins.
Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed
architectural interactions.
## Practical Usage
The tool handles the complete workflow: fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. Fire-and-forget by design: start it and return to completed models.
The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):
```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None,       # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0",     # Upgrade output to maximum precision Q8_0
)
```
Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:
```bash
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B
# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix
# Local testing without upload
uv run quantise_gguf.py <model_url> --no-upload
# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```
## The Architecture Behind the Magic
Based on Qwen3 4B analysis: embeddings (9.7% of parameters) critically affect vocabulary, and Q4 to
Q8 adds just 0.17GB whilst dramatically improving rare tokens. Attention (14.1% total) has V layers
(4.7%) enhanced in M variants whilst Q and K stay at base for size control.
Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at base
as enhancement would double size for modest gains. Down projections (22.3%) are enhanced in M
variants for feature transformation quality. Output layer (9.4%) gets special attention in Q3_K_L
for prediction quality.
For an 8B model: Q4_K_M baseline is ~4.5GB with Q6_K enhancements. Q4_K_L adds 753MB (5.3GB total)
for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB, at which point Q5_K_M's superior
base quantisation makes more sense.
## Environment and Performance
Configuration via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, `DEBUG=true` for verbose logging. Uses llama-cpp-python (auto-installed via uv),
benefits from imatrix files, requires HuggingFace account only for uploads.
Requirements scale predictably: disk needs ~3x model size (original, F32, outputs), memory tracks
model size with streaming optimisations. Processing takes minutes to hours depending on size.
Downloads range from gigabytes to 100GB+ for largest models.
Comprehensive error handling: automatic retry with exponential backoff, early dependency detection,
disk space checks, actionable API error messages, detailed conversion failure logs. Resilient
workflow keeps you informed whilst handling large model processing challenges.
## Output and Organisation
Outputs are organised per model: F32/F16 base, quantisation variants, imatrix files, documentation.
Naming pattern: `model-name-variant.gguf`. Successful uploads auto-clean local files; failures
preserve them for manual intervention. READMEs document variant characteristics and technical details.
Uploads include metadata, quantisation tags, and model cards explaining trade-offs. Parallel upload
system maximises throughput with full progress visibility.


@@ -1,164 +1,272 @@
# safetensors2gguf.py - Direct SafeTensors Conversion
Direct SafeTensors to GGUF converter for unsupported architectures.
When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to
GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and
custom architectures that lack official support.
Most transformer models share common tensor patterns regardless of architecture. While llama.cpp
requires explicit support for each architecture, this tool maps tensor names to GGUF conventions
and preserves metadata. Works well for models following standard transformer patterns.
## Features
The converter handles real-world models pragmatically:
- **Architecture-agnostic conversion**: Pattern matching identifies common tensor types, since
  embeddings look similar across Llama, Qwen, or custom architectures
- **Intelligent tensor mapping**: Maps standard patterns (self_attn.q_proj → attn_q) whilst
  preserving unrecognised tensors rather than dropping them
- **BFloat16 handling**: Optional PyTorch for BF16→F32 conversion, as many models ship in BF16
- **Vision model support**: Extracts vision tower parameters for multimodal models
- **Tokeniser preservation**: Copies configuration wholesale to prevent garbled output from mismatches
- **Graceful fallbacks**: Unknown architectures default to Llama structure, effective since most
  models derive from Llama
## Usage
Point at a model directory and the tool handles the rest. Most models convert with defaults, though
forcing architecture helps when autodetection fails.
### Basic Usage
```bash
# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory
```
### Command Line Options
```bash
# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf
# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2
# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf
```
## Supported Input Formats
The tool handles all packaging formats. Sharding emerged when models exceeded file system limits:
a 70B model spans dozens of files. The tool reassembles fragments transparently, whether HuggingFace
numbered shards or custom splits.
1. **Single file models**: `model.safetensors`, common for models under 10GB
2. **Sharded models**: `model-00001-of-00005.safetensors` and so on, standard for large models; the
   tool automatically finds and merges all shards in sequence
3. **Custom names**: Any `*.safetensors` files; some fine-tunes use non-standard naming, so the tool
   scans for all SafeTensors files regardless of naming convention
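Shard discovery is straightforward to reproduce; a sketch using the `safetensors` library
(illustrative, not the tool's internal code):
```python
from pathlib import Path

from safetensors import safe_open

model_dir = Path("./my-model")  # illustrative model directory

# Picks up single-file, sharded, and custom-named weights alike
shards = sorted(model_dir.glob("*.safetensors"))
print(f"Found {len(shards)} shard(s)")

tensors = {}
for shard in shards:
    with safe_open(shard, framework="np") as f:
        for name in f.keys():
            tensors[name] = f.get_tensor(name)  # merged view across all shards
print(f"Loaded {len(tensors)} tensors")
```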
## Architecture Mapping
Architecture mapping bridges naming chaos and GGUF's structured expectations. Model creators invent
their own names, but patterns remain similar underneath. A translation table covers known
architectures; unknowns default to Llama, reasonable since most models are Llama-inspired.
- `DotsOCRForCausalLM``qwen2`
- `GptOssForCausalLM``llama`
- Unknown architectures → `llama` (fallback)
Built-in mappings reflect real-world encounters:
- `DotsOCRForCausalLM` → `qwen2`: Dots OCR models are Qwen2-based despite the naming
- `GptOssForCausalLM` → `llama`: generic GPT models usually follow Llama architecture
- Unknown architectures → `llama`: a safe default that works for most transformer models
Use `--force-arch` when you know better than autodetection. Particularly useful for fine-tuned
models with custom names but standard structure.
## Tensor Name Mapping
Tensor naming diverges most between formats. HuggingFace uses verbose hierarchical names
(`model.layers.0.self_attn.q_proj.weight`), GGUF prefers terse (`blk.0.attn_q`). Mapping preserves
semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.
| Original Pattern | GGUF Name | Purpose |
|-----------------|-----------|------|
| `model.embed_tokens.weight` | `token_embd.weight` | Token embeddings: maps input IDs to vectors |
| `model.norm.weight` | `output_norm.weight` | Final layer normalisation before output |
| `lm_head.weight` | `output.weight` | Output projection to vocabulary space |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` | Query projection for attention layer N |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` | Key projection for attention layer N |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` | Value projection for attention layer N |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` | Gate projection in feedforward network |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` | Up projection expanding hidden dimension |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` | Down projection reducing to model dimension |
Pattern matching handles variations like `transformer.h.N` (GPT-style) or `model.decoder.layers.N`
(encoder-decoder) by identifying core patterns regardless of prefix.
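The mapping itself reduces to pattern substitution; a simplified sketch of the idea (not the
project's `tensor_mapping.py`):
```python
import re

# Simplified rules: rename known patterns, leave everything else untouched
RULES = [
    (r"^(?:model\.)?embed_tokens\.weight$", "token_embd.weight"),
    (r"^(?:model\.)?norm\.weight$", "output_norm.weight"),
    (r"^lm_head\.weight$", "output.weight"),
    (r"^(?:model\.)?layers\.(\d+)\.self_attn\.q_proj\.(.+)$", r"blk.\1.attn_q.\2"),
    (r"^(?:model\.)?layers\.(\d+)\.mlp\.down_proj\.(.+)$", r"blk.\1.ffn_down.\2"),
]

def map_name(name: str) -> str:
    """Return the GGUF name for a tensor, or the original name if unrecognised."""
    for pattern, replacement in RULES:
        if re.match(pattern, name):
            return re.sub(pattern, replacement, name)
    return name  # preserve unrecognised tensors rather than dropping them

print(map_name("model.layers.0.self_attn.q_proj.weight"))  # -> blk.0.attn_q.weight
```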
## Configuration Requirements
Conversion requires core files, though optional components are forgiven. HuggingFace downloads
typically include everything; manually assembled models may lack critical configuration.
Required files:
- **config.json**: Architecture name, layer counts, vocabulary size, essential for structuring the GGUF
- **\*.safetensors**: Model weights, single or sharded, handled automatically
Optional but recommended:
- **tokenizer_config.json**: Special tokens, chat templates, tokeniser behaviour; missing it often
  causes garbled output
- **tokenizer.json**: Vocabulary and merge rules; the tool extracts from other sources if missing,
  but inclusion ensures compatibility
## Output Format
GGUF bundles everything for inference in one file, unlike SafeTensors' scattered JSON configuration.
Simplifies deployment but requires careful metadata preservation during conversion.
The output file contains:
- **Model weights in F32**: Full precision, quantise later with dedicated tools
- **Architecture metadata**: Layer counts, dimensions, activations for model graph construction
- **Tokeniser configuration**: Vocabulary, special tokens, chat templates for model behaviour
- **Special token mappings**: BOS, EOS, UNK, PAD control generation and must match the training config
## Error Handling
Error messages are actionable, explaining what went wrong, why it matters, and how to fix it.
| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Download the complete model including config.json, not just weights |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Verify the model uses SafeTensors format older models might use PyTorch .bin files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch (`uv add torch`) or accept potential precision loss in BF16→F32 conversion |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Research the model's base architecture and use `--force-arch` with the appropriate type |
## Technical Details
### Parameter Inference
Parameter inference bridges naming inconsistencies: Llama's `num_attention_heads` is GPT's
`n_heads`. The translation layer provides sensible defaults for missing values.
Configuration mapping with defaults chosen from common models:
- `vocab_size` → vocabulary size (default: 32000, Llama's vocabulary)
- `max_position_embeddings` → context length (default: 2048, conservative for compatibility)
- `hidden_size` → embedding dimension (default: 4096, typical for 7B models)
- `num_hidden_layers` → transformer blocks (default: 32, standard for 7B models)
- `num_attention_heads` → attention heads (default: 32, balanced for 4096 dimension)
- `num_key_value_heads` → KV heads for GQA (defaults to attention heads, assuming MHA rather than GQA)
- `rope_theta` → RoPE frequency base (default: 10000.0, standard RoPE configuration)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5, numerical stability threshold)
Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.
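The inference step is essentially a dictionary lookup with fallbacks; a sketch mirroring the
defaults above (not the tool's exact code):
```python
import json
from pathlib import Path

config = json.loads(Path("./my-model/config.json").read_text())  # illustrative path

# Pull GGUF parameters from config.json, falling back to common defaults
n_heads = config.get("num_attention_heads", 32)
params = {
    "vocab_size": config.get("vocab_size", 32000),
    "context_length": config.get("max_position_embeddings", 2048),
    "embedding_dim": config.get("hidden_size", 4096),
    "block_count": config.get("num_hidden_layers", 32),
    "head_count": n_heads,
    "head_count_kv": config.get("num_key_value_heads", n_heads),  # assumes MHA if absent
    "rope_theta": config.get("rope_theta", 10000.0),
    "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
}
print(params)
```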
### Vision Model Support
Multimodal models are increasingly common. The tool preserves vision tower configuration, though
GGUF support remains experimental: vision parameters are extracted but may not be fully utilised.
Extracted vision parameters:
- **Vision embedding dimensions**: Hidden size, typically differs from language dimensions
- **Vision transformer blocks**: Encoder layers, fewer but wider than language
- **Vision attention heads**: Usually standard MHA rather than grouped-query
- **Feed-forward dimensions**: Different expansion ratios from language FFN
- **Patch configuration**: Size (14×14), spatial merging, position encoding
Vision support is best-effort: the tool preserves what it finds but can't guarantee the inference engine will use it.
## Limitations
Understanding limitations prevents frustration. Design favours broad compatibility over perfection.
- **F32 output only**: Quantisation requires separate tools like quantise_gguf.py for bit depth control
- **Architecture guessing**: Works for common patterns, novel architectures need manual specification
- **Tokeniser compatibility**: Falls back to the Llama tokeniser when data is missing, which may
  cause issues with special tokens
- **Memory requirements**: Loads entire tensors into RAM; a 70B model needs 140GB+, with no streaming support
- **No quantisation**: Preserves full precision, quantise separately for deployment control
- **Limited validation**: Ensures structure but can't verify output quality; test before deployment
## Examples
### Converting a custom model
Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool
handles the SafeTensors→GGUF transformation.
```bash
# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model
# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model
# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py
```
### Converting with specific architecture
Force architecture when autodetection fails or you know the model's lineage. Useful for fine-tuned
models with custom names.
```bash
# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2
# Common forced architectures:
# --force-arch llama # Most models
# --force-arch qwen2 # Qwen family
# --force-arch mistral # Mistral variants
```
### Batch conversion
Bash loops enable bulk conversion for comparing checkpoints or converting model families.
```bash
# Convert directory of models, preserving originals
for model in ./models/*; do
echo "Converting $(basename $model)..."
uv run safetensors2gguf.py "$model" \
-o "./gguf/$(basename $model).gguf" 2>&1 | \
tee "./logs/$(basename $model).log"
done
# Check results
ls -lh ./gguf/*.gguf
```
## Integration with Quantisation Pipeline
Tool produces F32 GGUF ready for quantisation. Typical pipeline:
1. **Download model**: Get SafeTensors model from HuggingFace
2. **Convert to GGUF**: Use this tool for architecture-agnostic conversion
3. **Quantise**: Apply quantise_gguf.py for Bartowski-style variants
4. **Deploy**: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines
Separation enables control at each stage. Convert once, quantise to multiple bit depths, test
configurations without repeating conversion.
## Troubleshooting
### Model produces gibberish after conversion
This usually indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are
present. Custom tokenisers may need `--force-arch`.
### Conversion succeeds but model won't load
Use a recent llama.cpp build: the GGUF format evolves, and older versions lack support for newer
metadata. Verify that any forced architecture matches the actual structure; wrong forcing creates
invalid models.
### Out of memory during conversion
Tool loads all weights simultaneously. For large models:
- Close other applications to free RAM
- Use a system with more memory (cloud instances work well)
- Consider quantising from a pre-converted F16 model if available
### Warning about unknown tensors
This is normal for custom layers. The tool preserves unknown tensors even though inference may not
use them. Harmless: it's better to include unused weights than to miss critical ones.