Switch to llama-cpp-python

commit d937f2d5fa (parent ef7df1a8c3)
25 changed files with 2957 additions and 1181 deletions

docs/bartowski_analysis.md · new file · 127 lines
# Bartowski Quantisation Analysis

Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't
apply uniform quantisation as their names suggest.

1. [The Hidden Sophistication of M Variants](#the-hidden-sophistication-of-m-variants)
2. [The Complete Quantisation Map](#the-complete-quantisation-map)
3. [The Architecture of Intelligence](#the-architecture-of-intelligence)
4. [The Economics of Enhancement](#the-economics-of-enhancement)
5. [Why Q3\_K Gets Special Treatment](#why-q3_k-gets-special-treatment)
6. [Implementation Insights](#implementation-insights)
7. [The Deeper Pattern](#the-deeper-pattern)

## The Hidden Sophistication of M Variants

When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically
enhances critical components – embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down
projections receive the same treatment. This represents years of empirical optimisation baked
directly into the quantisation logic.

The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply
takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size
increases are modest relative to quality gains.

## The Complete Quantisation Map

Here's what's actually happening inside these models, based on analysis of real GGUF files:

| Variant | Embed | Output | Q    | K    | V    | Gate | Up   | Down |
|---------|-------|--------|------|------|------|------|------|------|
| Q3_K_M  | Q6_K  | Q4_K   | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_L  | Q6_K  | Q5_K   | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_XL | Q8_0  | Q5_K   | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q4_K_M  | Q6_K  | Q4_K   | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q4_K_L  | Q8_0  | Q4_K   | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q5_K_M  | Q6_K  | Q5_K   | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q5_K_L  | Q8_0  | Q5_K   | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q6_K_L  | Q8_0  | Q6_K   | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K |

Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6),
and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL
variant as it has room for both improvements without competing with the next tier.
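The map is compact enough to use directly in scripts. A minimal sketch of the same table as a
Python dictionary (the layer labels are informal column names from the table, not llama.cpp tensor
identifiers):

```python
# Per-layer quantisation map transcribed from the table above.
VARIANT_MAP: dict[str, dict[str, str]] = {
    "Q3_K_M":  {"embed": "Q6_K", "output": "Q4_K", "q": "Q3_K", "k": "Q3_K", "v": "Q5_K",
                "gate": "Q3_K", "up": "Q3_K", "down": "Q5_K"},
    "Q3_K_L":  {"embed": "Q6_K", "output": "Q5_K", "q": "Q3_K", "k": "Q3_K", "v": "Q5_K",
                "gate": "Q3_K", "up": "Q3_K", "down": "Q5_K"},
    "Q3_K_XL": {"embed": "Q8_0", "output": "Q5_K", "q": "Q3_K", "k": "Q3_K", "v": "Q5_K",
                "gate": "Q3_K", "up": "Q3_K", "down": "Q5_K"},
    "Q4_K_M":  {"embed": "Q6_K", "output": "Q4_K", "q": "Q4_K", "k": "Q4_K", "v": "Q6_K",
                "gate": "Q4_K", "up": "Q4_K", "down": "Q6_K"},
    "Q4_K_L":  {"embed": "Q8_0", "output": "Q4_K", "q": "Q4_K", "k": "Q4_K", "v": "Q6_K",
                "gate": "Q4_K", "up": "Q4_K", "down": "Q6_K"},
    "Q5_K_M":  {"embed": "Q6_K", "output": "Q5_K", "q": "Q5_K", "k": "Q5_K", "v": "Q6_K",
                "gate": "Q5_K", "up": "Q5_K", "down": "Q6_K"},
    "Q5_K_L":  {"embed": "Q8_0", "output": "Q5_K", "q": "Q5_K", "k": "Q5_K", "v": "Q6_K",
                "gate": "Q5_K", "up": "Q5_K", "down": "Q6_K"},
    "Q6_K_L":  {"embed": "Q8_0", "output": "Q6_K", "q": "Q6_K", "k": "Q6_K", "v": "Q6_K",
                "gate": "Q6_K", "up": "Q6_K", "down": "Q6_K"},
}

# Which layers does Q4_K_L actually change relative to Q4_K_M?  Only the embeddings.
delta = {layer: q for layer, q in VARIANT_MAP["Q4_K_L"].items()
         if VARIANT_MAP["Q4_K_M"][layer] != q}
print(delta)  # {'embed': 'Q8_0'}
```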
## The Architecture of Intelligence

Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at
F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the
model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically
improves handling of technical terms and rare words.

Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical
– they're what the model retrieves when attending to context. M variants enhance V layers for
better information retrieval whilst leaving Q and K at base quantisation to avoid excessive size
increases.

Feed-forward network trade-offs: gate and up projections (44.6% of parameters, 1,793M, 3.59GB)
stay at base quantisation, as enhancement would double file sizes for modest gains. Down
projections (22.3%, 897M, 1.79GB) are enhanced in M variants because they're the final
transformation affecting all downstream processing.

The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L
targets it for enhancement because improved output precision can mean the difference between
coherent and garbled text for Q3-based models.
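The component sizes quoted above follow directly from the parameter counts at two bytes per weight
for F16; a quick sanity check (counts as given in this analysis, small differences are rounding):

```python
# Parameter counts (millions) for the Qwen3 4B reference model, as quoted above.
components_m = {
    "embeddings": 389,
    "attention_qkv": 566,
    "ffn_gate_up": 1_793,
    "ffn_down": 897,
    "output": 378,
}

total_m = sum(components_m.values())  # ~4,023M parameters in total

for name, millions in components_m.items():
    share = 100 * millions / total_m        # percentage of all parameters
    f16_gb = millions * 1e6 * 2 / 1e9       # two bytes per weight at F16
    print(f"{name:14s} {share:5.1f}%  {f16_gb:.2f} GB at F16")

# embeddings       9.7%  0.78 GB at F16
# attention_qkv   14.1%  1.13 GB at F16
# ffn_gate_up     44.6%  3.59 GB at F16
# ffn_down        22.3%  1.79 GB at F16
# output           9.4%  0.76 GB at F16
```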
## The Economics of Enhancement

Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (a
19% increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst
maximising vocabulary understanding. A naive approach of upgrading everything would add gigabytes
for marginal improvements.

Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant
represents a local optimum – better quality requires jumping tiers, smaller size sacrifices key
capabilities.

## Why Q3_K Gets Special Treatment

Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room
for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides
granular control for memory-constrained environments, with each 15-20% size increase delivering
meaningful quality improvements.

Q4_K_XL and Q5_K_XL don't exist because they'd compete with the next tier. A hypothetical Q4_K_XL
at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides better
quality than selectively enhanced Q4_K layers.

The pattern is consistent: once a variant needs significant enhancements on top of Q5_K or Q6_K,
you should jump to the next base type instead. Sweet spots: the Q3 family for extreme memory
constraints, Q4/Q5 for mainstream use, Q6/Q8 when quality matters more than size.

## Implementation Insights

Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's
variants requires minimal configuration:

```python
# Q3_K_L: only upgrade output from the M baseline
config = {
    "base": "Q3_K_M",   # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K",   # Single surgical change
}

# Q4_K_L: only upgrade embeddings from the M baseline
config = {
    "base": "Q4_K_M",       # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0",   # Single surgical change
}

# Q3_K_XL: the only variant needing two changes
config = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K",
}
```

This minimalist approach recognises that M variants already embody years of community optimisation.
Bartowski's contribution lies in identifying where small adjustments yield outsized returns.

## The Deeper Pattern

This system evolved through countless experiments rather than top-down design. M variants encode
hard-won knowledge about critical layers. L variants build on this foundation. The absence of most
XL variants shows where diminishing returns set in.

Bartowski's quantisations work because they embody years of collective learning about what matters
in practice. They demonstrate that the best solutions often come from understanding and building
upon what already works, rather than grand redesigns.
docs/development.md · changed · 86 → 136 lines
# Development Guide

Contributing to GGUF tools requires understanding quantisation workflows and Python's modern
dependency ecosystem. This guide covers setup, standards, and architectural decisions for fixing
bugs, adding quantisation profiles, or extending conversion capabilities.
## Code Quality

Ruff replaces the traditional Black/isort/flake8 stack as both linter and formatter, and mypy
provides static type checking to catch type-related bugs before runtime. A zero-tolerance policy
for linting and type errors catches issues early. Both tools have extensive configuration in
`pyproject.toml` that enforces only the code quality standards we've selected, and debug logging
reveals quantisation internals when models fail.

```bash
# Run linting - catches style violations, potential bugs, and code smells
uvx ruff check

# Format code - enforces consistent style automatically
uvx ruff format

# Run type checking - ensures type safety and catches potential bugs
uv run mypy .

# Run with debug logging - reveals conversion steps and tensor processing
DEBUG=true uv run <script>
```
## Project Structure

Architecture separates concerns cleanly: top-level scripts provide interfaces, helpers encapsulate
reusable logic, resources contain community data. Structure evolved from practical needs – helpers
emerged to eliminate duplication, services to abstract external dependencies.

```plain
llm-gguf-tools/
├── quantise.py                     # Bartowski quantisation tool - the main workflow
├── direct_safetensors_to_gguf.py   # Direct conversion for unsupported architectures
├── helpers/                        # Shared utilities and abstractions
│   ├── __init__.py
│   ├── logger.py                   # Colour-coded logging with context awareness
│   ├── services/                   # External service wrappers
│   │   ├── gguf.py                 # GGUF writer abstraction
│   │   └── llama_python.py         # llama-cpp-python integration
│   └── utils/                      # Pure utility functions
│       ├── config_parser.py        # Model configuration handling
│       └── tensor_mapping.py       # Architecture-specific tensor name mapping
├── resources/                      # Resource files and calibration data
│   └── imatrix_data.txt            # Curated calibration data from Bartowski
├── docs/                           # Detailed documentation
│   ├── quantise_gguf.md            # Quantisation strategies and profiles
│   ├── safetensors2gguf.md         # Direct conversion documentation
│   ├── bartowski_analysis.md       # Deep dive into variant strategies
│   ├── imatrix_data.md             # Importance matrix guide
│   └── development.md              # This guide
└── pyproject.toml                  # Modern Python project configuration
```
## Contributing Guidelines

The project values pragmatic solutions over theoretical perfection – working code that handles edge
cases beats elegant abstractions. Contributors should understand how quantisation profiles map to
Bartowski's discoveries and where Python-C++ boundaries limit functionality.

Essential requirements:

1. **Style consistency**: Run `uvx ruff format` before committing to keep diffs focused on logic
2. **Documentation**: Google-style docstrings explain behaviour and rationale beyond type hints
3. **Type safety**: Complete type hints for all public functions enable IDE support
4. **Practical testing**: Test with both 1B and 7B+ models to catch scaling issues
## Development Workflow
### Setting Up Development Environment

The project uses `uv` for dependency management – Rust-fast, automatic Python version management,
upfront dependency resolution. Development dependencies include ruff, type stubs, and optional
PyTorch for BFloat16 handling.

```bash
# Clone the repository - uses Forgejo (GitLab-like) hosting
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools

# Install all dependencies including dev tools
# This installs llama-cpp-python with CUDA support if available
uv sync --all-groups
```
### Code Style

Code style reduces cognitive load by letting reviewers focus on logic rather than layout. UK English
maintains llama.cpp consistency. The 100-character line limit balances descriptive names with
readability.

Core conventions:

- **PEP 8 compliance**: Ruff catches mutable defaults, unused imports automatically
- **UK English**: "Optimise" not "optimize", matching upstream llama.cpp
- **Line length**: 100 characters maximum except URLs or unbreakable paths
- **Type annotations**: Complete hints for public functions – documentation that can't go stale
- **Import ordering**: Standard library, third-party, local – ruff handles automatically
### Testing

Formal tests are pending. Quantisation "correctness" depends on complex interactions between model
architecture, strategy, and downstream usage. Benchmark performance doesn't guarantee production
success.

Current validation approach:

- **End-to-end testing**: Qwen 0.5B for quick iteration, Llama 3.2 1B for architecture compatibility
- **Output validation**: GGUF must load in llama.cpp and degrade gracefully, not produce gibberish
- **Error handling**: Test corrupted SafeTensors, missing configs, insufficient disk space
- **Logger consistency**: Verify colour coding across terminals, progress bars with piped output
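Parts of this checklist are easy to script. A minimal smoke test along the lines of the end-to-end
checks above, using llama-cpp-python (illustrative only; the repository ships no such test and the
model path is a placeholder):

```python
from llama_cpp import Llama

def smoke_test(gguf_path: str) -> None:
    """Load a freshly quantised GGUF and check it produces some plausible text."""
    llm = Llama(model_path=gguf_path, n_ctx=512, verbose=False)
    result = llm("The capital of France is", max_tokens=8, temperature=0.0)
    text = result["choices"][0]["text"]
    assert text.strip(), "model produced no output"
    print(f"{gguf_path}: {text.strip()!r}")

if __name__ == "__main__":
    smoke_test("./work/qwen-0.5b/qwen-0.5b-Q4_K_M.gguf")  # placeholder path
```

A run like this catches outright breakage (load failures, empty output) but not subtle quality
regressions, which still need the manual checks above.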
### Debugging

Debug logging transforms the black box into a glass box, revealing failure points. Colour coding
highlights stages: blue (info), yellow (warnings), red (errors), green (success). Visual hierarchy
enables efficient log scanning.

```bash
# Enable comprehensive debug output
DEBUG=true uv run direct_safetensors_to_gguf.py ./model   # Tensor mapping details
DEBUG=true uv run quantise.py <model_url>                 # Memory usage tracking
```

Debug output reveals:

- **Download progress**: Bytes transferred, retries, connection issues
- **Conversion pipeline**: SafeTensors→GGUF steps, tensor mappings, dimension changes
- **Quantisation decisions**: Layer bit depths, importance matrix effects on weight selection
- **Memory usage**: Peak consumption for predicting larger model requirements
- **File operations**: Read/write/temp patterns for disk usage analysis
- **Error context**: Stack traces with local variables at failure points
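The colour scheme is plain ANSI and needs nothing beyond the standard library. A rough sketch of
the approach (illustrative; the project's actual `helpers/logger.py` differs in detail):

```python
import logging
import os
import sys

RESET = "\033[0m"
COLOURS = {
    "DEBUG": "\033[36m",    # cyan for low-level detail
    "INFO": "\033[34m",     # blue - progress information
    "WARNING": "\033[33m",  # yellow - recoverable problems
    "ERROR": "\033[31m",    # red - failures
    "SUCCESS": "\033[32m",  # green - completed stages (custom level)
}
SUCCESS = 25  # custom level between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

class ColourFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        message = super().format(record)
        # Only colour output going to a real terminal, not piped logs.
        if sys.stderr.isatty():
            return f"{COLOURS.get(record.levelname, '')}{message}{RESET}"
        return message

def get_logger(name: str) -> logging.Logger:
    level = logging.DEBUG if os.environ.get("DEBUG", "").lower() == "true" else logging.INFO
    handler = logging.StreamHandler()
    handler.setFormatter(ColourFormatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```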
docs/imatrix_data.md · new file · 115 lines
# Importance Matrix (IMatrix) Data Guide

An importance matrix guides quantisation by identifying critical weights that need protection. Like
JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix
protects parameters that most affect output quality.

At lower bit rates, imatrix-quantised models show 2-3% better perplexity scores overall, with larger
gains in specific capabilities. A Q3_K model without imatrix might lose technical vocabulary or
rare language handling, whilst with imatrix it retains these abilities – the difference between
simple size reduction and intelligent compression.

1. [The Art of Calibration Data](#the-art-of-calibration-data)
2. [Finding Pre-computed Matrices](#finding-pre-computed-matrices)
3. [Creating Your Own Matrix](#creating-your-own-matrix)
4. [Resource Requirements and Optimisation](#resource-requirements-and-optimisation)
5. [Integration and Workflow](#integration-and-workflow)
6. [Future Developments](#future-developments)
7. [Practical Tips](#practical-tips)

## The Art of Calibration Data

This repository includes `resources/imatrix_data.txt` from
[Bartowski's collection](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8),
originally compiled by Dampf building on Kalomaze's work. The dataset systematically activates
different model capabilities: technical writing for analysis, creative fiction for narrative,
multilingual text for language diversity, and factual content for knowledge accuracy.

The default calibration data works well for general models, but specialised models benefit from
targeted calibration. Code models need diverse programming languages and patterns; medical models
need technical literature and terminology. Calibration should reflect actual use cases – 50-100KB
of well-chosen text beats gigabytes of random content.

Calibration runs text through the model to observe weight activation patterns. These patterns
become the importance matrix – a heat map of crucial parameters for intended use cases, similar to
how brains strengthen frequently-used neural pathways.
## Finding Pre-computed Matrices

Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at
`https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix`. These save
hours of computation and provide excellent results from high-quality calibration data.

The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to
your model's work directory as `imatrix.dat`. The quality improvement, especially at lower
quantisation levels, justifies this extra step.
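The check-then-download step is easy to script. A minimal sketch using `huggingface_hub` (the repo
and file names follow the pattern above; the helper itself is illustrative, not part of the tool):

```python
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

def fetch_imatrix(model_name: str, work_dir: Path) -> Path | None:
    """Fetch Bartowski's pre-computed imatrix for a model, if one has been published."""
    target = work_dir / "imatrix.dat"
    if target.exists():
        return target
    try:
        cached = hf_hub_download(
            repo_id=f"bartowski/{model_name}-GGUF",
            filename=f"{model_name}.imatrix",
        )
    except Exception:
        return None  # no repo, or no imatrix published for this model
    work_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached, target)
    return target

# Example: fetch_imatrix("Llama-3.2-1B-Instruct", Path("./work/llama-3.2-1b"))
```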
## Creating Your Own Matrix

Generate your own imatrix for new models, domain-specific calibration, or experimentation.
Currently this requires llama.cpp's binary tools, as the functionality isn't exposed through
llama-cpp-python.

Download llama.cpp from the [official releases](https://github.com/ggerganov/llama.cpp/releases).
Windows users need `llama-bXXXX-bin-win-cuda-x64.zip` for GPU support; Linux/macOS users can use
binaries or compile from source.

Use the F16 or F32 GGUF model (found in `./work/<model-name>/` after quantisation). F16 balances
quality and computation requirements. Run from your llama.cpp directory:

```bash
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100
```

Generation runs inference whilst analysing activation patterns. The `--chunks` parameter controls
thoroughness (100 is standard, more for production, less for experiments). Expect 30 minutes to
several hours on consumer hardware. GPU acceleration helps significantly.

Generation shows perplexity calculations and progress updates after initial loading. The tool tracks
activation patterns, calculates importance scores, and builds the statistical model for guiding
quantisation.
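While native generation is pending, the same invocation is simple to drive from Python. A thin
wrapper sketch around the llama.cpp binary (paths and chunk count are placeholders):

```python
import subprocess
from pathlib import Path

def generate_imatrix(
    llama_cpp_dir: Path,
    model_f16: Path,
    calibration: Path,
    output: Path,
    chunks: int = 100,
) -> None:
    """Run llama.cpp's llama-imatrix binary and wait for it to finish."""
    cmd = [
        str(llama_cpp_dir / "llama-imatrix"),
        "-m", str(model_f16),
        "-f", str(calibration),
        "-o", str(output),
        "--chunks", str(chunks),
    ]
    # Stream output straight to the terminal so progress and perplexity updates stay visible.
    subprocess.run(cmd, check=True)
```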
## Resource Requirements and Optimisation

Resource requirements match full inference: 7B models need ~14GB RAM for F16. CPU-only works, but
GPU acceleration reduces days to hours for large models. The process supports interruption and
resumption.

Matrix quality depends on multiple factors. More chunks improve results, with diminishing returns
beyond 200-300. F16 precision is optimal – F32 doubles computation for minimal gain, whilst
quantised models create quality-degrading feedback loops.

Temperature affects generation (lower focuses on likely paths, higher explores possibilities) but
defaults are well-tuned. Good calibration data matters more than parameter tweaking.

## Integration and Workflow

Place the imatrix as `imatrix.dat` in your model's work directory. The tool auto-detects and applies
it with log confirmation. One imatrix works for all quantisation levels.

The tool acknowledges current limitations whilst providing clean workflows. Though Python generation
isn't available yet, using external matrices is trivial. This pragmatic approach delivers optimal
results whilst preparing for future improvements.

## Future Developments

Native imatrix generation is on llama-cpp-python's roadmap and will be integrated as soon as it's
available. Meanwhile, this hybrid approach works well. The community shares matrices, calibration
datasets improve constantly, and algorithms grow more sophisticated.

Research continues into dynamic importance scoring, multi-modal calibration for vision-language
models, and automated calibration generation. These advances will eventually reach production tools,
but current approaches already deliver impressive results.

## Practical Tips

Key insights: quality and diversity beat quantity in calibration data. Include specific use cases
even if uncommon. Balance languages proportionally for multilingual models. Include edge cases for
robustness. When in doubt, use Bartowski's pre-computed matrices – they're consistently excellent.

The importance matrix seems obvious in hindsight – preserve critical weights, calibrate for actual
usage. Yet it took years of experimentation to develop these techniques. Using them well transforms
quantisation from simple size reduction to intelligent preservation of what matters.
docs/quantise_gguf.md (renamed from docs/quantise.md) · 102 → 151 lines
# quantise_gguf.py - Advanced GGUF Quantisation

Advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.
It transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.

1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)
## The Full Picture

GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: the K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1),
experimental integer types (IQ2-IQ4), and full precision F16/BF16. The key is understanding
strategic usage.

Replicating Bartowski's patterns revealed an interesting limitation: llama-cpp-python provides
embedding and output layer control, but the sophisticated `tensor_types` parameter expects a C++
`std::vector<tensor_quantization>` pointer – impossible to create from Python. This architectural
boundary between Python and C++ cannot be worked around without significant redesign.

Analysis of Bartowski's GGUF files shows this limitation doesn't matter. M variants already include
per-layer enhancements – Q4_K_M uses Q6_K for embeddings, attention V, and FFN down layers.
Bartowski's L and XL variants only tweak embeddings and output layers, precisely what we can
control. Working with constraints rather than against them.

For further optimisation, importance matrix (imatrix) files guide quantisation based on usage
patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining or
generating these files – particularly crucial at lower bit rates.
## Understanding the Variants

Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't middle
ground but optimised baselines – Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K
elsewhere, a balance proven through years of community experimentation.

L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary). Q3_K_L upgrades output to Q5_K. Q3_K_XL combines both strategies. No
Q4_K_XL or Q5_K_XL exists – at those sizes, Q5_K_M's superior base quantisation wins.

Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See the [Bartowski Analysis](./bartowski_analysis.md) for
detailed architectural interactions.
## Practical Usage

The tool handles the complete workflow: it fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. Fire-and-forget design – start it and return to completed models.

The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):

```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None,       # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0",     # Upgrade output to maximum precision Q8_0
)
```

Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:

```bash
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B

# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix

# Local testing without upload
uv run quantise_gguf.py <model_url> --no-upload

# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```
## The Architecture Behind the Magic

Based on Qwen3 4B analysis: embeddings (9.7% of parameters) critically affect vocabulary – Q4 to Q8
adds just 0.17GB but dramatically improves rare tokens. Attention (14.1% total) has its V layers
(4.7%) enhanced in M variants whilst Q and K stay at base for size control.

Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at base,
as enhancement would double size for modest gains. Down projections (22.3%) are enhanced in M
variants for feature transformation quality. The output layer (9.4%) gets special attention in
Q3_K_L for prediction quality.

For an 8B model: the Q4_K_M baseline is ~4.5GB with Q6_K enhancements. Q4_K_L adds 753MB (5.3GB
total) for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB – at which point Q5_K_M's
superior base quantisation makes more sense.
## Environment and Performance

Configuration via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, `DEBUG=true` for verbose logging. The tool uses llama-cpp-python (auto-installed via uv),
benefits from imatrix files, and requires a HuggingFace account only for uploads.
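A sketch of how these variables might be read at start-up (the variable names are the documented
ones above; the helper itself is illustrative rather than the tool's actual code):

```python
import os
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Settings:
    hf_token: str | None        # HF_TOKEN - only needed when uploading
    llama_cpp_dir: Path | None  # LLAMA_CPP_DIR - custom binaries, e.g. for llama-imatrix
    debug: bool                 # DEBUG=true enables verbose logging

def load_settings() -> Settings:
    llama_dir = os.environ.get("LLAMA_CPP_DIR")
    return Settings(
        hf_token=os.environ.get("HF_TOKEN"),
        llama_cpp_dir=Path(llama_dir) if llama_dir else None,
        debug=os.environ.get("DEBUG", "").lower() == "true",
    )
```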
Requirements scale predictably: disk needs ~3x the model size (original, F32, outputs), and memory
tracks model size with streaming optimisations. Processing takes minutes to hours depending on
size. Downloads range from gigabytes to 100GB+ for the largest models.

Error handling is comprehensive: automatic retry with exponential backoff, early dependency
detection, disk space checks, actionable API error messages, and detailed conversion failure logs.
The resilient workflow keeps you informed whilst handling the challenges of large model processing.
## Output and Organisation

Outputs are organised per model: the F32/F16 base, quantisation variants, imatrix files, and
documentation. The naming pattern is `model-name-variant.gguf`. Successful uploads auto-clean local
files; failures are preserved for manual intervention. READMEs document variant characteristics and
technical details.

Uploads include metadata, quantisation tags, and model cards explaining trade-offs. The parallel
upload system maximises throughput with full progress visibility.
docs/safetensors2gguf.md (renamed from docs/direct_safetensors_to_gguf.md) · 164 → 272 lines
# safetensors2gguf.py - Direct SafeTensors Conversion

When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to
GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and
custom architectures that lack official support.

## Overview

Most transformer models share common tensor patterns regardless of architecture. While llama.cpp
requires explicit support for each architecture, this tool maps tensor names to GGUF conventions
and preserves metadata. It works well for models following standard transformer patterns.

## Features

The converter handles real-world models pragmatically:

- **Architecture-agnostic conversion**: Pattern matching identifies common tensor types – embeddings
  look similar across Llama, Qwen, or custom architectures
- **Intelligent tensor mapping**: Maps standard patterns (self_attn.q_proj → attn_q) whilst
  preserving unrecognised tensors rather than dropping them
- **BFloat16 handling**: Optional PyTorch for BF16→F32 conversion, as many models ship in BF16
- **Vision model support**: Extracts vision tower parameters for multimodal models
- **Tokeniser preservation**: Copies configuration wholesale to prevent garbled output from mismatches
- **Graceful fallbacks**: Unknown architectures default to Llama structure – effective since most
  models derive from Llama
## Usage

Point the tool at a model directory and it handles the rest. Most models convert with defaults,
though forcing the architecture helps when autodetection fails.

### Basic Usage

```bash
# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory
```

### Command Line Options

```bash
# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf

# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2

# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf
```
## Supported Input Formats

The tool handles all packaging formats. Sharding emerged when models exceeded file system limits –
a 70B model spans dozens of files. The tool reassembles fragments transparently, whether they are
HuggingFace numbered shards or custom splits (see the sketch after the list below).

1. **Single file models**: `model.safetensors` – common for models under 10GB
2. **Sharded models**: `model-00001-of-00005.safetensors` – standard for large models; the tool
   automatically finds and merges all shards in sequence
3. **Custom names**: Any `*.safetensors` files – some fine-tunes use non-standard naming, and the
   tool scans for all SafeTensors files regardless of naming convention
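A minimal sketch of that discovery step (the function name and error message are illustrative, not
the tool's actual internals):

```python
from pathlib import Path

def find_safetensors_shards(model_dir: Path) -> list[Path]:
    """Collect every SafeTensors file in a model directory, shards in numeric order."""
    shards = sorted(model_dir.glob("*.safetensors"))
    if not shards:
        raise FileNotFoundError(f"No safetensor files found in {model_dir}")
    return shards

# Sharded downloads such as model-00001-of-00005.safetensors sort into sequence
# lexicographically, while single-file or custom-named models return a short list as-is.
print([p.name for p in find_safetensors_shards(Path("./my-model"))])
```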
## Architecture Mapping

Architecture mapping bridges naming chaos and GGUF's structured expectations. Model creators invent
their own names, but the patterns remain similar underneath. A translation table covers known
architectures; unknowns default to Llama – reasonable, since most models are Llama-inspired.

Built-in mappings reflect real-world encounters:

- `DotsOCRForCausalLM` → `qwen2` – Dots OCR models are Qwen2-based despite the naming
- `GptOssForCausalLM` → `llama` – Generic GPT models usually follow Llama architecture
- Unknown architectures → `llama` – Safe default that works for most transformer models

Use `--force-arch` when you know better than autodetection. It is particularly useful for
fine-tuned models with custom names but standard structure.
## Tensor Name Mapping

Tensor naming diverges most between formats. HuggingFace uses verbose hierarchical names
(`model.layers.0.self_attn.q_proj.weight`), GGUF prefers terse ones (`blk.0.attn_q`). Mapping
preserves semantics whilst adapting conventions, enabling cross-ecosystem compatibility with
llama.cpp.

| Original Pattern | GGUF Name | Purpose |
|------------------|-----------|---------|
| `model.embed_tokens.weight` | `token_embd.weight` | Token embeddings – maps input IDs to vectors |
| `model.norm.weight` | `output_norm.weight` | Final layer normalisation before output |
| `lm_head.weight` | `output.weight` | Output projection to vocabulary space |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` | Query projection for attention layer N |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` | Key projection for attention layer N |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` | Value projection for attention layer N |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` | Gate projection in feedforward network |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` | Up projection expanding hidden dimension |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` | Down projection reducing to model dimension |

Pattern matching handles variations like `transformer.h.N` (GPT-style) or `model.decoder.layers.N`
(encoder-decoder) by identifying core patterns regardless of prefix.
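A rough sketch of how such pattern-based mapping works (the regexes mirror the table above; the
project's own `tensor_mapping.py` will differ in detail):

```python
import re

# Ordered (pattern, replacement) pairs mirroring the table above.  Prefixes such as
# "model." are matched and discarded so only the GGUF-style name remains.
TENSOR_PATTERNS: list[tuple[str, str]] = [
    (r"^(?:\w+\.)*embed_tokens\.weight$", "token_embd.weight"),
    (r"^(?:\w+\.)*norm\.weight$", "output_norm.weight"),
    (r"^lm_head\.weight$", "output.weight"),
    (r"^(?:\w+\.)*layers\.(\d+)\.self_attn\.q_proj", r"blk.\1.attn_q"),
    (r"^(?:\w+\.)*layers\.(\d+)\.self_attn\.k_proj", r"blk.\1.attn_k"),
    (r"^(?:\w+\.)*layers\.(\d+)\.self_attn\.v_proj", r"blk.\1.attn_v"),
    (r"^(?:\w+\.)*layers\.(\d+)\.mlp\.gate_proj", r"blk.\1.ffn_gate"),
    (r"^(?:\w+\.)*layers\.(\d+)\.mlp\.up_proj", r"blk.\1.ffn_up"),
    (r"^(?:\w+\.)*layers\.(\d+)\.mlp\.down_proj", r"blk.\1.ffn_down"),
]

def map_tensor_name(name: str) -> str:
    """Translate a HuggingFace tensor name to its GGUF equivalent, or return it unchanged."""
    for pattern, replacement in TENSOR_PATTERNS:
        if re.match(pattern, name):
            return re.sub(pattern, replacement, name, count=1)
    return name  # unrecognised tensors are preserved rather than dropped

print(map_tensor_name("model.layers.0.self_attn.q_proj.weight"))  # blk.0.attn_q.weight
print(map_tensor_name("model.embed_tokens.weight"))               # token_embd.weight
```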
## Configuration Requirements

Conversion requires a few core files, though optional components are forgiven. HuggingFace
downloads typically include everything; manually assembled models may lack critical configuration.

Required files:

- **config.json**: Architecture name, layer counts, vocabulary size – essential for structuring GGUF
- **\*.safetensors**: Model weights, single or sharded – handled automatically

Optional but recommended:

- **tokenizer_config.json**: Special tokens, chat templates, tokeniser behaviour – missing often
  causes garbled output
- **tokenizer.json**: Vocabulary and merge rules – the tool extracts from other sources if missing,
  but inclusion ensures compatibility

## Output Format

GGUF bundles everything for inference in one file, unlike SafeTensors' scattered JSON configuration.
This simplifies deployment but requires careful metadata preservation during conversion.

The output file contains:

- **Model weights in F32**: Full precision, quantise later with dedicated tools
- **Architecture metadata**: Layer counts, dimensions, activations for model graph construction
- **Tokeniser configuration**: Vocabulary, special tokens, chat templates for model behaviour
- **Special token mappings**: BOS, EOS, UNK, PAD – control generation, must match training config

## Error Handling

Error messages are actionable – explaining what went wrong, why it matters, and how to fix it.

| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Download the complete model including config.json, not just weights |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Verify the model uses SafeTensors format – older models might use PyTorch .bin files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch (`uv add torch`) or accept potential precision loss in BF16→F32 conversion |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Research the model's base architecture and use `--force-arch` with the appropriate type |
## Technical Details

### Parameter Inference

Parameter inference bridges naming inconsistencies – Llama's `num_attention_heads` is GPT's
`n_heads`. The translation layer provides sensible defaults for missing values; a sketch of the
mapping follows the list below.

Configuration mapping with defaults chosen from common models:

- `vocab_size` → vocabulary size (default: 32000 – Llama's vocabulary)
- `max_position_embeddings` → context length (default: 2048 – conservative for compatibility)
- `hidden_size` → embedding dimension (default: 4096 – typical for 7B models)
- `num_hidden_layers` → transformer blocks (default: 32 – standard for 7B models)
- `num_attention_heads` → attention heads (default: 32 – balanced for 4096 dimension)
- `num_key_value_heads` → KV heads for GQA (defaults to attention heads – assumes MHA not GQA)
- `rope_theta` → RoPE frequency base (default: 10000.0 – standard RoPE configuration)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5 – numerical stability threshold)

Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.
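A condensed sketch of that inference step, reading `config.json` and falling back to the defaults
listed above (the helper name and structure are illustrative):

```python
import json
from pathlib import Path

# Defaults from the list above, used when config.json omits a field.
GGUF_PARAM_DEFAULTS = {
    "vocab_size": 32_000,
    "max_position_embeddings": 2_048,
    "hidden_size": 4_096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "rope_theta": 10_000.0,
    "rms_norm_eps": 1e-5,
}

def infer_gguf_params(model_dir: Path) -> dict:
    """Merge a model's config.json over the defaults; KV heads fall back to attention heads."""
    config = json.loads((model_dir / "config.json").read_text())
    params = {key: config.get(key, default) for key, default in GGUF_PARAM_DEFAULTS.items()}
    # Grouped-query attention models set num_key_value_heads; otherwise assume MHA.
    params["num_key_value_heads"] = config.get("num_key_value_heads", params["num_attention_heads"])
    return params
```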
### Vision Model Support

Multimodal models are increasingly common. The tool preserves vision tower configuration, though
GGUF support remains experimental – vision parameters are extracted but may not be fully utilised.

Extracted vision parameters:

- **Vision embedding dimensions**: Hidden size, typically differs from language dimensions
- **Vision transformer blocks**: Encoder layers, fewer but wider than language
- **Vision attention heads**: Usually standard MHA rather than grouped-query
- **Feed-forward dimensions**: Different expansion ratios from language FFN
- **Patch configuration**: Size (14×14), spatial merging, position encoding

Vision support is best-effort – it preserves what's found but can't guarantee inference engine usage.

## Limitations

Understanding the limitations prevents frustration. The design favours broad compatibility over
perfection.

- **F32 output only**: Quantisation requires separate tools like quantise_gguf.py for bit depth control
- **Architecture guessing**: Works for common patterns, novel architectures need manual specification
- **Tokeniser compatibility**: Falls back to the Llama tokeniser when data is missing – may cause
  issues with special tokens
- **Memory requirements**: Loads entire tensors into RAM – a 70B model needs 140GB+, no streaming support
- **No quantisation**: Preserves full precision, quantise separately for deployment control
- **Limited validation**: Ensures structure, can't verify output quality – test before deployment
## Examples

### Converting a custom model

Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool
handles the SafeTensors→GGUF transformation.

```bash
# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model

# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model

# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py
```

### Converting with specific architecture

Force the architecture when autodetection fails or you know the model's lineage. Useful for
fine-tuned models with custom names.

```bash
# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2

# Common forced architectures:
#   --force-arch llama    # Most models
#   --force-arch qwen2    # Qwen family
#   --force-arch mistral  # Mistral variants
```

### Batch conversion

Bash loops enable bulk conversion for comparing checkpoints or converting model families.

```bash
# Convert a directory of models, preserving originals
for model in ./models/*; do
    echo "Converting $(basename $model)..."
    uv run safetensors2gguf.py "$model" \
        -o "./gguf/$(basename $model).gguf" 2>&1 | \
        tee "./logs/$(basename $model).log"
done

# Check results
ls -lh ./gguf/*.gguf
```
## Integration with Quantisation Pipeline

The tool produces an F32 GGUF ready for quantisation. Typical pipeline:

1. **Download model**: Get the SafeTensors model from HuggingFace
2. **Convert to GGUF**: Use this tool for architecture-agnostic conversion
3. **Quantise**: Apply quantise_gguf.py for Bartowski-style variants
4. **Deploy**: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines

This separation enables control at each stage: convert once, quantise to multiple bit depths, and
test configurations without repeating conversion.

## Troubleshooting

### Model produces gibberish after conversion

This indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are present.
Custom tokenisers may need `--force-arch`.

### Conversion succeeds but model won't load

Use a recent llama.cpp – the GGUF format evolves, and older versions lack newer metadata support.
Verify that any forced architecture matches the actual structure – wrong forcing creates invalid
models.

### Out of memory during conversion

The tool loads all weights simultaneously. For large models:

- Close other applications to free RAM
- Use a system with more memory (cloud instances work well)
- Consider quantising from a pre-converted F16 model if available

### Warning about unknown tensors

This is normal for custom layers. The tool preserves unknown tensors even though inference may not
use them. Harmless – better to include unused weights than miss critical ones.