Switch to llama-cpp-python

This commit is contained in:
Tom Foster 2025-08-08 21:40:15 +01:00
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

docs/bartowski_analysis.md Normal file

@@ -0,0 +1,127 @@
# Bartowski Quantisation Analysis
Analysis of Bartowski GGUF files reveals why these models work so well: the "M" variants don't
apply uniform quantisation as their names suggest.
1. [The Hidden Sophistication of M Variants](#the-hidden-sophistication-of-m-variants)
2. [The Complete Quantisation Map](#the-complete-quantisation-map)
3. [The Architecture of Intelligence](#the-architecture-of-intelligence)
4. [The Economics of Enhancement](#the-economics-of-enhancement)
5. [Why Q3\_K Gets Special Treatment](#why-q3_k-gets-special-treatment)
6. [Implementation Insights](#implementation-insights)
7. [The Deeper Pattern](#the-deeper-pattern)
## The Hidden Sophistication of M Variants
When creating a Q4_K_M model, llama.cpp doesn't apply Q4_K throughout. Instead, it strategically
enhances critical components: embeddings jump to Q6_K, attention V layers get Q6_K, and FFN down
projections receive the same treatment. This represents years of empirical optimisation baked
directly into the quantisation logic.
The L and XL models make surgical adjustments to an already-optimised foundation. Q4_K_L simply
takes the enhanced Q4_K_M and upgrades embeddings from Q6_K to Q8_0. This explains why file size
increases are modest relative to quality gains.
## The Complete Quantisation Map
Here's what's actually happening inside these models, based on analysis of real GGUF files:
| Variant | Embed | Output | Q | K | V | Gate | Up | Down |
|----------|-------|--------|-------|-------|-------|-------|-------|-------|
| Q3_K_M | Q6_K | Q4_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_L | Q6_K | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q3_K_XL | Q8_0 | Q5_K | Q3_K | Q3_K | Q5_K | Q3_K | Q3_K | Q5_K |
| Q4_K_M | Q6_K | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q4_K_L | Q8_0 | Q4_K | Q4_K | Q4_K | Q6_K | Q4_K | Q4_K | Q6_K |
| Q5_K_M | Q6_K | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q5_K_L | Q8_0 | Q5_K | Q5_K | Q5_K | Q6_K | Q5_K | Q5_K | Q6_K |
| Q6_K_L | Q8_0 | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K | Q6_K |
Key patterns: M variants boost embeddings to Q6_K, enhance attention V layers (Q3→Q5, Q4/Q5→Q6),
and upgrade FFN down projections. L variants change just embeddings or output. Only Q3_K has an XL
variant as it has room for both improvements without competing with the next tier.
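These per-layer choices can be verified directly from the files. A minimal sketch using the `gguf`
Python package (shipped with llama.cpp as gguf-py; the file name here is illustrative):
```python
# List each tensor's quantisation type to confirm the per-layer map above
from gguf import GGUFReader

reader = GGUFReader("Qwen3-4B-Q4_K_M.gguf")  # illustrative file name
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum, e.g. Q6_K for token_embd.weight
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")
```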
## The Architecture of Intelligence
Using a Qwen3 4B model as reference: embeddings comprise just 9.7% of parameters (389M, 0.78GB at
F16) yet fundamentally determine vocabulary understanding. Poor embedding quantisation prevents the
model from distinguishing similar tokens. Upgrading from Q4 to Q8 adds only 0.17GB but dramatically
improves handling of technical terms and rare words.
Attention (Q, K, V) accounts for 14.1% of parameters (566M, 1.13GB). Value vectors (V) are critical:
they're what the model retrieves when attending to context. M variants enhance V layers whilst
leaving Q and K at base quantisation for better information retrieval without excessive size increase.
Feed-forward network trade-offs: Gate and up projections (44.6% of parameters, 1,793M, 3.59GB)
stay at base quantisation as enhancement would double file sizes for modest gains. Down projections
(22.3%, 897M, 1.79GB) get enhanced in M variants as they're the final transformation affecting all
downstream processing.
The output layer (9.4% of parameters, 378M, 0.75GB) determines final token predictions. Q3_K_L
targets it for enhancement as improved output precision can mean the difference between coherent
and garbled text for Q3-based models.
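These percentages translate directly into bytes. A quick sketch, assuming 2 bytes per parameter at
F16 and the parameter counts quoted above:
```python
# Approximate F16 footprint per component for the Qwen3 4B reference model
components = {
    "embeddings": 389e6,
    "attention_qkv": 566e6,
    "ffn_gate_up": 1793e6,
    "ffn_down": 897e6,
    "output": 378e6,
}
total = sum(components.values())
for name, params in components.items():
    size_gb = params * 2 / 1e9    # 2 bytes per parameter at F16
    share = params / total * 100  # share of total parameters
    print(f"{name:15s} {size_gb:5.2f} GB  ({share:4.1f}%)")
```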
## The Economics of Enhancement
Q4_K_M at 2.26GB already includes strategic Q6_K enhancements. The L variant adds just 0.44GB (19%
increase) by upgrading only embeddings to Q8_0, leveraging existing enhancements whilst maximising
vocabulary understanding. A naive approach of upgrading everything would add gigabytes for marginal
improvements.
Bartowski's popularity stems from carefully chosen points in the size-quality space. Each variant
represents a local optimum: better quality requires jumping tiers, smaller size sacrifices key
capabilities.
## Why Q3_K Gets Special Treatment
Q3_K uniquely has an XL variant because it starts from the lowest practical quantisation with room
for improvement. The progression from Q3_K_M (1.5GB) through L (1.6GB) to XL (1.8GB) provides
granular control for memory-constrained environments, with each 15-20% size increase delivering
meaningful quality improvements.
Q4_K_XL or Q5_K_XL don't exist because they'd compete with the next tier. A hypothetical Q4_K_XL
at 2.75GB would match Q5_K_M's size, but Q5_K_M's superior base quantisation provides better
quality than selectively enhanced Q4_K layers.
The pattern is consistent: significant enhancements to Q5_K or Q6_K mean you should jump to the
next base type. Sweet spots: Q3 family for extreme memory constraints, Q4/Q5 for mainstream use,
Q6/Q8 when quality matters more than size.
## Implementation Insights
Since llama.cpp's M variants already include sophisticated enhancements, replicating Bartowski's
variants requires minimal configuration:
```python
# Q3_K_L: Only upgrade output from M baseline
config = {
    "base": "Q3_K_M",   # Inherits Q6_K embeddings, Q5_K V/FFN-down
    "output": "Q5_K",   # Single surgical change
}

# Q4_K_L: Only upgrade embeddings from M baseline
config = {
    "base": "Q4_K_M",      # Inherits Q6_K V/FFN-down
    "embeddings": "Q8_0",  # Single surgical change
}

# Q3_K_XL: The only variant needing two changes
config = {
    "base": "Q3_K_M",
    "embeddings": "Q8_0",
    "output": "Q5_K",
}
```
This minimalist approach recognises that M variants already embody years of community optimisation.
Bartowski's contribution lies in identifying where small adjustments yield outsized returns.
## The Deeper Pattern
This system evolved through countless experiments rather than top-down design. M variants encode
hard-won knowledge about critical layers. L variants build on this foundation. The absence of most
XL variants shows where diminishing returns set in.
Bartowski's quantisations work because they embody years of collective learning about what matters
in practice. They demonstrate that the best solutions often come from understanding and building
upon what already works, rather than grand redesigns.


@@ -1,86 +1,136 @@
# Development Guide
Contributing to GGUF tools requires understanding quantisation workflows and Python's modern
dependency ecosystem. This guide covers setup, standards, and architectural decisions for fixing
bugs, adding quantisation profiles, or extending conversion capabilities.
## Code Quality
Ruff replaces the traditional Black/isort/flake8 stack as both linter and formatter. Mypy provides
static type checking to catch type-related bugs before runtime. Zero tolerance for linting and type
errors catches issues early. Both tools have extensive configuration in `pyproject.toml` to enforce
only the important code quality standards we've selected. Debug logging reveals quantisation internals
when models fail.
```bash
# Run linting - catches style violations, potential bugs, and code smells
uvx ruff check
# Format code - enforces consistent style automatically
uvx ruff format
# Run type checking - ensures type safety and catches potential bugs
uv run mypy .
# Run with debug logging - reveals conversion steps and tensor processing
DEBUG=true uv run <script>
```
## Project Structure
Architecture separates concerns cleanly: top-level scripts provide interfaces, helpers encapsulate
reusable logic, resources contain community data. Structure evolved from practical needs: helpers
emerged to eliminate duplication, services to abstract external dependencies.
```plain
llm-gguf-tools/
├── quantise.py # Bartowski quantisation tool - the main workflow
├── direct_safetensors_to_gguf.py # Direct conversion for unsupported architectures
├── helpers/ # Shared utilities and abstractions
│ ├── __init__.py
│ ├── logger.py # Colour-coded logging with context awareness
│ ├── services/ # External service wrappers
│ │ ├── gguf.py # GGUF writer abstraction
│ │ └── llama_python.py # llama-cpp-python integration
│ └── utils/ # Pure utility functions
│ ├── config_parser.py # Model configuration handling
│ └── tensor_mapping.py # Architecture-specific tensor name mapping
├── resources/ # Resource files and calibration data
│ └── imatrix_data.txt # Curated calibration data from Bartowski
├── docs/ # Detailed documentation
│ ├── quantise_gguf.md # Quantisation strategies and profiles
│ ├── safetensors2gguf.md # Direct conversion documentation
│ ├── bartowski_analysis.md # Deep dive into variant strategies
│ ├── imatrix_data.md # Importance matrix guide
│ └── development.md # This guide
└── pyproject.toml # Modern Python project configuration
```
## Contributing Guidelines
The project values pragmatic solutions over theoretical perfection: working code that handles edge
cases beats elegant abstractions. Contributors should understand how quantisation profiles map to
Bartowski's discoveries and where Python-C++ boundaries limit functionality.
Essential requirements:
1. **Style consistency**: Run `uvx ruff format` before committing to keep diffs focused on logic
2. **Documentation**: Google-style docstrings explain behaviour and rationale beyond type hints
3. **Type safety**: Complete type hints for all public functions enable IDE support
4. **Practical testing**: Test with both 1B and 7B+ models to catch scaling issues
## Development Workflow
### Setting Up Development Environment
The project uses `uv` for dependency management: Rust-fast, with automatic Python version management
and upfront dependency resolution. Development dependencies include ruff, type stubs, and optional
PyTorch for BFloat16 handling.
```bash
# Clone the repository - uses Forgejo (GitLab-like) hosting
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools
# Install all dependencies including dev tools
# This installs llama-cpp-python with CUDA support if available
uv sync --all-groups
```
### Code Style
Code style reduces cognitive load by letting reviewers focus on logic rather than layout. UK English
maintains llama.cpp consistency. The 100-character line limit balances descriptive names with
readability.
Core conventions:
- **PEP 8 compliance**: Ruff catches mutable defaults, unused imports automatically
- **UK English**: "Optimise" not "optimize", matching upstream llama.cpp
- **Line length**: 100 characters maximum except URLs or unbreakable paths
- **Type annotations**: Complete hints for public functions, documentation that can't go stale
- **Import ordering**: Standard library, then third-party, then local; ruff handles this automatically
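A short illustration of these conventions in practice (a hypothetical helper, not actual project code):
```python
def select_profile(model_size_gb: float, memory_budget_gb: float) -> str:
    """Choose a quantisation profile that fits the available memory.

    Args:
        model_size_gb: Size of the F16 model in gigabytes.
        memory_budget_gb: Memory available for the quantised model.

    Returns:
        A profile name such as "Q4_K_M" or "Q3_K_M".
    """
    ratio = memory_budget_gb / model_size_gb
    return "Q4_K_M" if ratio >= 0.3 else "Q3_K_M"  # illustrative threshold
```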
### Testing
Formal tests are pending: quantisation "correctness" depends on complex interactions between model
architecture, strategy, and downstream usage. Benchmark performance doesn't guarantee production
success.
Current validation approach:
- **End-to-end testing**: Qwen 0.5B for quick iteration, Llama 3.2 1B for architecture compatibility
- **Output validation**: GGUF must load in llama.cpp and degrade gracefully, not produce gibberish
- **Error handling**: Test corrupted SafeTensors, missing configs, insufficient disk space
- **Logger consistency**: Verify colour coding across terminals, progress bars with piped output
### Debugging
Debug logging transforms black box to glass box, revealing failure points. Colour coding highlights
stages: blue (info), yellow (warnings), red (errors), green (success). Visual hierarchy enables
efficient log scanning.
```bash
# Enable comprehensive debug output
DEBUG=true uv run direct_safetensors_to_gguf.py ./model # Tensor mapping details
DEBUG=true uv run quantise.py <model_url> # Memory usage tracking
```
Debug output reveals:
- **Download progress**: Bytes transferred, retries, connection issues
- **Conversion pipeline**: SafeTensors→GGUF steps, tensor mappings, dimension changes
- **Quantisation decisions**: Layer bit depths, importance matrix effects on weight selection
- **Memory usage**: Peak consumption for predicting larger model requirements
- **File operations**: Read/write/temp patterns for disk usage analysis
- **Error context**: Stack traces with local variables at failure points
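The `DEBUG=true` toggle follows a standard environment-variable pattern; a minimal sketch of how
such a switch typically works (not the project's exact logger setup):
```python
import logging
import os

# Map the DEBUG environment variable onto the logging level
level = logging.DEBUG if os.getenv("DEBUG", "").lower() == "true" else logging.INFO
logging.basicConfig(level=level, format="%(levelname)s %(message)s")

logging.debug("Tensor mapping details only appear when DEBUG=true")
logging.info("Normal progress messages always appear")
```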

docs/imatrix_data.md Normal file

@@ -0,0 +1,115 @@
# Importance Matrix (IMatrix) Data Guide
An importance matrix guides quantisation by identifying critical weights that need protection. Like
JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix
protects parameters that most affect output quality.
At lower bit rates, imatrix-quantised models show 2-3% better perplexity scores overall, with larger
gains in specific capabilities. A Q3_K model without imatrix might lose technical vocabulary or
rare language handling, whilst with imatrix it retains these abilities: the difference between
simple size reduction and intelligent compression.
1. [The Art of Calibration Data](#the-art-of-calibration-data)
2. [Finding Pre-computed Matrices](#finding-pre-computed-matrices)
3. [Creating Your Own Matrix](#creating-your-own-matrix)
4. [Resource Requirements and Optimisation](#resource-requirements-and-optimisation)
5. [Integration and Workflow](#integration-and-workflow)
6. [Future Developments](#future-developments)
7. [Practical Tips](#practical-tips)
## The Art of Calibration Data
This repository includes `resources/imatrix_data.txt` from
[Bartowski's collection](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8),
originally compiled by Dampf building on Kalomaze's work. The dataset systematically activates
different model capabilities: technical writing for analysis, creative fiction for narrative,
multilingual text for language diversity, and factual content for knowledge accuracy.
The default calibration data works well for general models, but specialised models benefit from
targeted calibration. Code models need diverse programming languages and patterns; medical models
need technical literature and terminology. Calibration should reflect actual use cases: 50-100KB
of well-chosen text beats gigabytes of random content.
Calibration runs text through the model to observe weight activation patterns. These patterns
become the importance matrix: a heat map of crucial parameters for intended use cases, similar to
how brains strengthen frequently-used neural pathways.
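Assembling custom calibration data can be as simple as concatenating representative samples; a
sketch with illustrative file names:
```python
from pathlib import Path

# Illustrative samples covering different model capabilities
samples = [
    "samples/technical_writing.txt",
    "samples/creative_fiction.txt",
    "samples/multilingual.txt",
    "samples/code_snippets.txt",
]

# Concatenate into one calibration file, aiming for the 50-100KB sweet spot
calibration = "\n\n".join(Path(p).read_text(encoding="utf-8") for p in samples)
Path("calibration.txt").write_text(calibration, encoding="utf-8")
print(f"Wrote {len(calibration) / 1024:.0f} KB of calibration text")
```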
## Finding Pre-computed Matrices
Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at
`https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix`. These save
hours of computation and provide excellent results from high-quality calibration data.
The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to
your model's work directory as `imatrix.dat`. The quality improvement, especially at lower
quantisation levels, justifies this extra step.
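Fetching a pre-computed matrix can be scripted with `huggingface_hub` (the repository and file names
follow the pattern above; substitute your model):
```python
from huggingface_hub import hf_hub_download

# Download Bartowski's pre-computed imatrix into the model's work directory
path = hf_hub_download(
    repo_id="bartowski/MODEL-NAME-GGUF",  # substitute the actual model name
    filename="MODEL-NAME.imatrix",
    local_dir="./work/MODEL-NAME",
)
print(f"Saved imatrix to {path}")  # rename to imatrix.dat where the tool expects it
```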
## Creating Your Own Matrix
Generate your own imatrix for new models, domain-specific calibration, or experimentation.
Currently requires llama.cpp's binary tools as the functionality isn't exposed through
llama-cpp-python.
Download llama.cpp from the [official releases](https://github.com/ggerganov/llama.cpp/releases).
Windows users need `llama-bXXXX-bin-win-cuda-x64.zip` for GPU support; Linux/macOS users can use
binaries or compile from source.
Use the F16 or F32 GGUF model (found in `./work/<model-name>/` after quantisation). F16 balances
quality and computation requirements. Run from your llama.cpp directory:
```bash
./llama-imatrix -m /path/to/model-F16.gguf \
-f /path/to/calibration.txt \
-o /path/to/imatrix.dat \
--chunks 100
```
Generation runs inference whilst analysing activation patterns. The `--chunks` parameter controls
thoroughness (100 is standard, more for production, less for experiments). Expect 30 minutes to
several hours on consumer hardware. GPU acceleration helps significantly.
Generation shows perplexity calculations and progress updates after initial loading. The tool tracks
activation patterns, calculates importance scores, and builds the statistical model for guiding
quantisation.
## Resource Requirements and Optimisation
Resource requirements match full inference: 7B models need ~14GB RAM for F16. CPU-only works but
GPU acceleration reduces days to hours for large models. The process supports interruption and
resumption.
Matrix quality depends on multiple factors. More chunks improve results with diminishing returns
beyond 200-300. F16 precision is optimal: F32 doubles computation for minimal gain, whilst
quantised models create quality-degrading feedback loops.
Temperature affects generation (lower focuses on likely paths, higher explores possibilities) but
defaults are well-tuned. Good calibration data matters more than parameter tweaking.
## Integration and Workflow
Place the imatrix as `imatrix.dat` in your model's work directory. The tool auto-detects and applies
it with log confirmation. One imatrix works for all quantisation levels.
The tool acknowledges current limitations whilst providing clean workflows. Though Python generation
isn't available yet, using external matrices is trivial. This pragmatic approach delivers optimal
results whilst preparing for future improvements.
## Future Developments
Native imatrix generation is on llama-cpp-python's roadmap for immediate integration when available.
Meanwhile, this hybrid approach works well. The community shares matrices, calibration datasets
improve constantly, and algorithms grow more sophisticated.
Research continues into dynamic importance scoring, multi-modal calibration for vision-language
models, and automated calibration generation. These advances will eventually reach production tools,
but current approaches already deliver impressive results.
## Practical Tips
Key insights: Quality and diversity beat quantity in calibration data. Include specific use cases
even if uncommon. Balance languages proportionally for multilingual models. Include edge cases for
robustness. When in doubt, use Bartowski's pre-computed matrices: they're consistently excellent.
The importance matrix seems obvious in hindsight: preserve critical weights, calibrate for actual
usage. Yet it took years of experimentation to develop these techniques. Using them well transforms
quantisation from simple size reduction to intelligent preservation of what matters.


@@ -1,102 +1,151 @@
# quantise_gguf.py - Advanced GGUF Quantisation
Advanced GGUF quantisation tool implementing Bartowski's sophisticated quantisation pipeline.
Transforms language models into optimised GGUF formats, from aggressive Q2 compression to
high-precision Q8_0. Based on analysis of community quantisation patterns, it achieves excellent
quality-to-size ratios whilst working within Python-to-C++ interop constraints.
1. [The Full Picture](#the-full-picture)
2. [Understanding the Variants](#understanding-the-variants)
3. [Practical Usage](#practical-usage)
4. [The Architecture Behind the Magic](#the-architecture-behind-the-magic)
5. [Environment and Performance](#environment-and-performance)
6. [Output and Organisation](#output-and-organisation)
## The Full Picture
GGUF quantisation isn't uniform precision reduction. The tool supports the complete llama.cpp
spectrum: K-quant series (Q3_K-Q6_K) with S/M/L variants, legacy formats (Q4_0, Q5_1), experimental
integer types (IQ2-IQ4), and full precision F16/BF16. The key is understanding strategic usage.
Replicating Bartowski's patterns revealed an interesting limitation. Llama-cpp-python provides
embedding and output layer control, but the sophisticated `tensor_types` parameter expects a pointer
to a C++ `std::vector<tensor_quantization>`, which is impossible to create from Python. This
architectural boundary between Python and C++ cannot be worked around without significant redesign.
Analysis of Bartowski's GGUF files shows this limitation doesn't matter. M variants already include
per-layer enhancements: Q4_K_M uses Q6_K for embeddings, attention V, and FFN down layers.
Bartowski's L and XL variants only tweak embeddings and output layers, precisely what we can control.
This means working with constraints rather than against them.
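For context, the surface reachable from Python looks roughly like this at the binding level (a
sketch, assuming llama-cpp-python's low-level `llama_model_quantize` bindings mirror llama.cpp's C
API; the project wraps this in `helpers.services.llama_python`):
```python
import ctypes
import llama_cpp

# Start from library defaults, then apply the two knobs Python can reach
params = llama_cpp.llama_model_quantize_default_params()
params.ftype = llama_cpp.LLAMA_FTYPE_MOSTLY_Q4_K_M      # base recipe, already layer-aware
params.token_embedding_type = llama_cpp.GGML_TYPE_Q8_0  # surgical embedding upgrade
params.nthread = 8

llama_cpp.llama_model_quantize(
    b"model-f16.gguf",
    b"model-Q4_K_L.gguf",
    ctypes.byref(params),
)
```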
For further optimisation, importance matrix (imatrix) files guide quantisation based on usage
patterns, outperforming fixed rules. See the [IMatrix Guide](./imatrix_data.md) for obtaining or
generating these files, which are particularly crucial at lower bit rates.
## Understanding the Variants
Our profiles match Bartowski's exact configurations from GGUF analysis. M variants aren't middle
ground but optimised baselines: Q4_K_M uses Q6_K for critical layers whilst maintaining Q4_K
elsewhere, a balance proven through years of community experimentation.
L variants make minimal but impactful changes. Q4_K_L upgrades embeddings from Q6_K to Q8_0 (+19%
size, better vocabulary). Q3_K_L upgrades output to Q5_K. Q3_K_XL combines both strategies. No
Q4_K_XL or Q5_K_XL exists: at those sizes, Q5_K_M's superior base quantisation wins.
Q5_K_L and Q6_K_L upgrade embeddings to Q8_0, providing stepping stones between major levels for
fine-grained size-quality control. See [Bartowski Analysis](./bartowski_analysis.md) for detailed
architectural interactions.
## Practical Usage
The tool handles the complete workflow: fetches from HuggingFace, converts to GGUF, checks for
imatrix files, processes multiple variants with parallel uploads, generates documentation, and
uploads with metadata. Fire-and-forget by design: start it and return to completed models.
The Python API enables custom configurations (limited to embedding and output layers due to
llama-cpp-python constraints):
```python
from helpers.services.llama_python import LlamaCppPythonAPI

api = LlamaCppPythonAPI()

# Example 1: Q4_K_L profile - upgrades embeddings to Q8_0
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q4_K_L.gguf",
    base_type="Q4_K_M",     # Q4_K_M uses Q6_K for embeddings, attn_v, and ffn_down (not flat Q4_K!)
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type=None,       # Keep default from base type
)

# Example 2: Q3_K_L profile - upgrades output to Q5_K
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_L.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down (not flat Q3_K!)
    embedding_type=None,    # Keep the already-enhanced Q6_K embeddings from base
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 3: Q3_K_XL profile - upgrades both embeddings and output
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-Q3_K_XL.gguf",
    base_type="Q3_K_M",     # Q3_K_M uses Q6_K embeddings, Q5_K attn_v, Q4_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q5_K",     # Upgrade output from Q4_K to Q5_K
)

# Example 4: Custom experimental configuration
api.quantise_model_flexible(
    input_path="model-f16.gguf",
    output_path="model-custom.gguf",
    base_type="Q5_K_M",     # Q5_K_M uses Q6_K embeddings, Q6_K attn_v, Q6_K ffn_down
    embedding_type="Q8_0",  # Further upgrade embeddings from Q6_K to Q8_0
    output_type="Q8_0",     # Upgrade output to maximum precision Q8_0
)
```
Command-line usage is even simpler. Just point it at a HuggingFace model and let it work:
```bash
# Basic usage
uv run quantise_gguf.py https://huggingface.co/meta-llama/Llama-3.2-1B
# Skip imatrix checking for speed
uv run quantise_gguf.py <model_url> --no-imatrix
# Local testing without upload
uv run quantise_gguf.py <model_url> --no-upload
# Custom profiles
uv run quantise_gguf.py <model_url> --profiles Q3_K_M Q4_K_L Q6_K
```
## The Architecture Behind the Magic
Based on Qwen3 4B analysis: embeddings (9.7% of parameters) critically affect vocabulary, and Q4 to
Q8 adds just 0.17GB whilst dramatically improving rare tokens. Attention (14.1% total) has V layers
(4.7%) enhanced in M variants whilst Q and K stay at base for size control.
Feed-forward layers show clear trade-offs: gate/up projections (44.6% of parameters) stay at base
as enhancement would double size for modest gains. Down projections (22.3%) are enhanced in M
variants for feature transformation quality. Output layer (9.4%) gets special attention in Q3_K_L
for prediction quality.
For an 8B model: Q4_K_M baseline is ~4.5GB with Q6_K enhancements. Q4_K_L adds 753MB (5.3GB total)
for Q8_0 embeddings. A hypothetical Q4_K_XL would reach 6.6GB, at which point Q5_K_M's superior
base quantisation makes more sense.
## Environment and Performance
Configuration via environment variables: `HF_TOKEN` for uploads, `LLAMA_CPP_DIR` for custom
binaries, `DEBUG=true` for verbose logging. Uses llama-cpp-python (auto-installed via uv),
benefits from imatrix files, requires HuggingFace account only for uploads.
Requirements scale predictably: disk needs ~3x model size (original, F32, outputs), memory tracks
model size with streaming optimisations. Processing takes minutes to hours depending on size.
Downloads range from gigabytes to 100GB+ for largest models.
Comprehensive error handling: automatic retry with exponential backoff, early dependency detection,
disk space checks, actionable API error messages, detailed conversion failure logs. Resilient
workflow keeps you informed whilst handling large model processing challenges.
## Output and Organisation
Outputs are organised per model: F32/F16 base, quantisation variants, imatrix files, documentation.
Naming pattern: `model-name-variant.gguf`. Successful uploads auto-clean local files; failures
preserve them for manual intervention. READMEs document variant characteristics and technical details.
Uploads include metadata, quantisation tags, and model cards explaining trade-offs. Parallel upload
system maximises throughput with full progress visibility.


@@ -1,164 +1,272 @@
# safetensors2gguf.py - Direct SafeTensors Conversion
Direct SafeTensors to GGUF converter for unsupported architectures.
When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to
GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and
custom architectures that lack official support.
Most transformer models share common tensor patterns regardless of architecture. While llama.cpp
requires explicit support for each architecture, this tool maps tensor names to GGUF conventions
and preserves metadata. Works well for models following standard transformer patterns.
## Features
The converter handles real-world models pragmatically:
- **Architecture-agnostic conversion**: Pattern matching identifies common tensor types, since
  embeddings look similar across Llama, Qwen, or custom architectures
- **Intelligent tensor mapping**: Maps standard patterns (self_attn.q_proj → attn_q) whilst
  preserving unrecognised tensors rather than dropping them
- **BFloat16 handling**: Optional PyTorch for BF16→F32 conversion, as many models ship in BF16
- **Vision model support**: Extracts vision tower parameters for multimodal models
- **Tokeniser preservation**: Copies configuration wholesale to prevent garbled output from mismatches
- **Graceful fallbacks**: Unknown architectures default to Llama structure, effective since most
  models derive from Llama
## Usage
Point at a model directory and the tool handles the rest. Most models convert with defaults, though
forcing architecture helps when autodetection fails.
### Basic Usage
```bash
# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory
```
### Command Line Options
```bash
# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf
# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2
# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf
```
## Supported Input Formats
The tool handles all packaging formats. Sharding emerged when models exceeded file system limits:
a 70B model spans dozens of files. The tool reassembles fragments transparently, whether HuggingFace
numbered shards or custom splits.
1. **Single file models**: `model.safetensors`, common for models under 10GB
2. **Sharded models**: `model-00001-of-00005.safetensors` and so on, standard for large models; the
   tool automatically finds and merges all shards in sequence
3. **Custom names**: Any `*.safetensors` files; some fine-tunes use non-standard naming, so the tool
   scans for all SafeTensors files regardless of naming convention
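Shard discovery is straightforward to reproduce; a sketch using the `safetensors` library
(illustrative, not the tool's internal code):
```python
from pathlib import Path

from safetensors import safe_open

model_dir = Path("./my-model")  # illustrative model directory

# Picks up single-file, sharded, and custom-named weights alike
shards = sorted(model_dir.glob("*.safetensors"))
print(f"Found {len(shards)} shard(s)")

tensors = {}
for shard in shards:
    with safe_open(shard, framework="np") as f:
        for name in f.keys():
            tensors[name] = f.get_tensor(name)  # merged view across all shards
print(f"Loaded {len(tensors)} tensors")
```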
## Architecture Mapping
Architecture mapping bridges naming chaos and GGUF's structured expectations. Model creators invent
their own names, but patterns remain similar underneath. A translation table covers known
architectures; unknowns default to Llama, reasonable since most models are Llama-inspired.
- `DotsOCRForCausalLM``qwen2`
- `GptOssForCausalLM``llama`
- Unknown architectures → `llama` (fallback)
Built-in mappings reflect real-world encounters:
- `DotsOCRForCausalLM` → `qwen2`: Dots OCR models are Qwen2-based despite the naming
- `GptOssForCausalLM` → `llama`: generic GPT models usually follow Llama architecture
- Unknown architectures → `llama`: a safe default that works for most transformer models
Use `--force-arch` when you know better than autodetection. Particularly useful for fine-tuned
models with custom names but standard structure.
## Tensor Name Mapping
Tensor naming diverges most between formats. HuggingFace uses verbose hierarchical names
(`model.layers.0.self_attn.q_proj.weight`), GGUF prefers terse (`blk.0.attn_q`). Mapping preserves
semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.
| Original Pattern | GGUF Name | Purpose |
|-----------------|-----------|------|
| `model.embed_tokens.weight` | `token_embd.weight` | Token embeddings: maps input IDs to vectors |
| `model.norm.weight` | `output_norm.weight` | Final layer normalisation before output |
| `lm_head.weight` | `output.weight` | Output projection to vocabulary space |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` | Query projection for attention layer N |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` | Key projection for attention layer N |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` | Value projection for attention layer N |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` | Gate projection in feedforward network |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` | Up projection expanding hidden dimension |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` | Down projection reducing to model dimension |
Pattern matching handles variations like `transformer.h.N` (GPT-style) or `model.decoder.layers.N`
(encoder-decoder) by identifying core patterns regardless of prefix.
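The mapping itself reduces to pattern substitution; a simplified sketch of the idea (not the
project's `tensor_mapping.py`):
```python
import re

# Simplified rules: rename known patterns, leave everything else untouched
RULES = [
    (r"^(?:model\.)?embed_tokens\.weight$", "token_embd.weight"),
    (r"^(?:model\.)?norm\.weight$", "output_norm.weight"),
    (r"^lm_head\.weight$", "output.weight"),
    (r"^(?:model\.)?layers\.(\d+)\.self_attn\.q_proj\.(.+)$", r"blk.\1.attn_q.\2"),
    (r"^(?:model\.)?layers\.(\d+)\.mlp\.down_proj\.(.+)$", r"blk.\1.ffn_down.\2"),
]

def map_name(name: str) -> str:
    """Return the GGUF name for a tensor, or the original name if unrecognised."""
    for pattern, replacement in RULES:
        if re.match(pattern, name):
            return re.sub(pattern, replacement, name)
    return name  # preserve unrecognised tensors rather than dropping them

print(map_name("model.layers.0.self_attn.q_proj.weight"))  # -> blk.0.attn_q.weight
```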
## Configuration Requirements
Conversion requires core files, though optional components are forgiven. HuggingFace downloads
typically include everything; manually assembled models may lack critical configuration.
Required files:
- **config.json**: Architecture name, layer counts, vocabulary size, essential for structuring the GGUF
- **\*.safetensors**: Model weights, single or sharded, handled automatically
Optional but recommended:
- **tokenizer_config.json**: Special tokens, chat templates, tokeniser behaviour; missing it often
  causes garbled output
- **tokenizer.json**: Vocabulary and merge rules; the tool extracts from other sources if missing,
  but inclusion ensures compatibility
## Output Format
GGUF bundles everything for inference in one file, unlike SafeTensors' scattered JSON configuration.
Simplifies deployment but requires careful metadata preservation during conversion.
The output file contains:
- **Model weights in F32**: Full precision, quantise later with dedicated tools
- **Architecture metadata**: Layer counts, dimensions, activations for model graph construction
- **Tokeniser configuration**: Vocabulary, special tokens, chat templates for model behaviour
- **Special token mappings**: BOS, EOS, UNK, PAD control generation and must match the training config
## Error Handling
Error messages are actionable, explaining what went wrong, why it matters, and how to fix it.
| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Download the complete model including config.json, not just weights |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Verify the model uses SafeTensors format older models might use PyTorch .bin files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch (`uv add torch`) or accept potential precision loss in BF16→F32 conversion |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Research the model's base architecture and use `--force-arch` with the appropriate type |
## Technical Details
### Parameter Inference
Parameter inference bridges naming inconsistencies: Llama's `num_attention_heads` is GPT's
`n_heads`. The translation layer provides sensible defaults for missing values.
Configuration mapping with defaults chosen from common models:
- `vocab_size` → vocabulary size (default: 32000, Llama's vocabulary)
- `max_position_embeddings` → context length (default: 2048, conservative for compatibility)
- `hidden_size` → embedding dimension (default: 4096, typical for 7B models)
- `num_hidden_layers` → transformer blocks (default: 32, standard for 7B models)
- `num_attention_heads` → attention heads (default: 32, balanced for 4096 dimension)
- `num_key_value_heads` → KV heads for GQA (defaults to attention heads, assuming MHA rather than GQA)
- `rope_theta` → RoPE frequency base (default: 10000.0, standard RoPE configuration)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5, numerical stability threshold)
Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.
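The inference step is essentially a dictionary lookup with fallbacks; a sketch mirroring the
defaults above (not the tool's exact code):
```python
import json
from pathlib import Path

config = json.loads(Path("./my-model/config.json").read_text())  # illustrative path

# Pull GGUF parameters from config.json, falling back to common defaults
n_heads = config.get("num_attention_heads", 32)
params = {
    "vocab_size": config.get("vocab_size", 32000),
    "context_length": config.get("max_position_embeddings", 2048),
    "embedding_dim": config.get("hidden_size", 4096),
    "block_count": config.get("num_hidden_layers", 32),
    "head_count": n_heads,
    "head_count_kv": config.get("num_key_value_heads", n_heads),  # assumes MHA if absent
    "rope_theta": config.get("rope_theta", 10000.0),
    "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
}
print(params)
```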
### Vision Model Support
Multimodal models are increasingly common. The tool preserves vision tower configuration, though
GGUF support remains experimental: vision parameters are extracted but may not be fully utilised.
Extracted vision parameters:
- **Vision embedding dimensions**: Hidden size, typically differs from language dimensions
- **Vision transformer blocks**: Encoder layers, fewer but wider than language
- **Vision attention heads**: Usually standard MHA rather than grouped-query
- **Feed-forward dimensions**: Different expansion ratios from language FFN
- **Patch configuration**: Size (14×14), spatial merging, position encoding
Vision support is best-effort: the tool preserves what it finds but can't guarantee the inference engine will use it.
## Limitations
Understanding limitations prevents frustration. Design favours broad compatibility over perfection.
- **F32 output only**: Quantisation requires separate tools like quantise_gguf.py for bit depth control
- **Architecture guessing**: Works for common patterns, novel architectures need manual specification
- **Tokeniser compatibility**: Falls back to the Llama tokeniser when data is missing, which may
  cause issues with special tokens
- **Memory requirements**: Loads entire tensors into RAM; a 70B model needs 140GB+, with no streaming support
- **No quantisation**: Preserves full precision, quantise separately for deployment control
- **Limited validation**: Ensures structure but can't verify output quality; test before deployment
## Examples
### Converting a custom model
Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool
handles the SafeTensors→GGUF transformation.
```bash
# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model
# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model
# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py
```
### Converting with specific architecture
Force architecture when autodetection fails or you know the model's lineage. Useful for fine-tuned
models with custom names.
```bash
# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2
# Common forced architectures:
# --force-arch llama # Most models
# --force-arch qwen2 # Qwen family
# --force-arch mistral # Mistral variants
```
### Batch conversion
Bash loops enable bulk conversion for comparing checkpoints or converting model families.
```bash
# Convert directory of models, preserving originals
for model in ./models/*; do
echo "Converting $(basename $model)..."
uv run safetensors2gguf.py "$model" \
-o "./gguf/$(basename $model).gguf" 2>&1 | \
tee "./logs/$(basename $model).log"
done
# Check results
ls -lh ./gguf/*.gguf
```
## Integration with Quantisation Pipeline
Tool produces F32 GGUF ready for quantisation. Typical pipeline:
1. **Download model**: Get SafeTensors model from HuggingFace
2. **Convert to GGUF**: Use this tool for architecture-agnostic conversion
3. **Quantise**: Apply quantise_gguf.py for Bartowski-style variants
4. **Deploy**: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines
Separation enables control at each stage. Convert once, quantise to multiple bit depths, test
configurations without repeating conversion.
## Troubleshooting
### Model produces gibberish after conversion
This usually indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are
present. Custom tokenisers may need `--force-arch`.
### Conversion succeeds but model won't load
Use a recent llama.cpp build: the GGUF format evolves, and older versions lack support for newer
metadata. Verify that any forced architecture matches the actual structure; wrong forcing creates
invalid models.
### Out of memory during conversion
Tool loads all weights simultaneously. For large models:
- Close other applications to free RAM
- Use a system with more memory (cloud instances work well)
- Consider quantising from a pre-converted F16 model if available
### Warning about unknown tensors
This is normal for custom layers. The tool preserves unknown tensors even though inference may not
use them. Harmless: it's better to include unused weights than to miss critical ones.