Switch to llama-cpp-python
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

# safetensors2gguf.py - Direct SafeTensors Conversion

Direct SafeTensors to GGUF converter for unsupported architectures.

When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to
GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and
custom architectures that lack official support.

## Overview

Most transformer models share common tensor patterns regardless of architecture. While llama.cpp
requires explicit support for each architecture, this tool maps tensor names to GGUF conventions
and preserves metadata. It works well for models following standard transformer patterns.

## Features

The converter handles real-world models pragmatically:

- **Architecture-agnostic conversion**: Pattern matching identifies common tensor types – embeddings
  look similar across Llama, Qwen, or custom architectures
- **Intelligent tensor mapping**: Maps standard patterns (self_attn.q_proj → attn_q) whilst
  preserving unrecognised tensors rather than dropping them
- **BFloat16 handling**: Optional PyTorch for BF16→F32 conversion, as many models ship in BF16 (see
  the sketch after this list)
- **Vision model support**: Extracts vision tower parameters for multimodal models
- **Tokeniser preservation**: Copies configuration wholesale to prevent garbled output from mismatches
- **Graceful fallbacks**: Unknown architectures default to Llama structure – effective since most
  models derive from Llama

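The BF16 step amounts to an optional upcast. A minimal sketch, assuming NumPy arrays or PyTorch
tensors as input; `to_float32` is an illustrative helper name, not the tool's actual function:

```python
# Minimal sketch of the optional BF16 -> F32 upcast. The torch import is
# guarded so conversion still works for F16/F32 models without PyTorch.
import numpy as np

try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False


def to_float32(tensor) -> np.ndarray:
    """Return an F32 NumPy array, upcasting BF16 via torch when available."""
    if HAS_TORCH and isinstance(tensor, torch.Tensor) and tensor.dtype == torch.bfloat16:
        return tensor.to(torch.float32).numpy()
    return np.asarray(tensor, dtype=np.float32)
```
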
## Usage

Point at a model directory and the tool handles the rest. Most models convert with defaults, though
forcing the architecture helps when autodetection fails.

### Basic Usage

```bash
# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory
```

### Command Line Options

```bash
# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf

# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2

# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf
```

## Supported Input Formats

The tool handles all packaging formats. Sharding emerged when models exceeded file system limits –
a 70B model spans dozens of files. The tool reassembles fragments transparently, whether the shards
are HuggingFace-numbered or custom splits.

1. **Single file models**: `model.safetensors` – common for models under 10GB
2. **Sharded models**: `model-00001-of-00005.safetensors` – standard for large models; the tool
   automatically finds and merges all shards in sequence
3. **Custom names**: Any `*.safetensors` files – some fine-tunes use non-standard naming; the tool
   scans for all SafeTensors files regardless of naming convention (see the sketch below)

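For illustration, shard discovery and merging can be sketched with the `safetensors` package;
`load_all_tensors` is a hypothetical helper, not the tool's internal API:

```python
# Sketch of shard discovery: read every *.safetensors file in the directory,
# regardless of naming convention, and merge into one tensor dict.
from pathlib import Path

from safetensors.numpy import load_file


def load_all_tensors(model_dir: str) -> dict:
    """Collect every tensor from every shard (BF16 shards need the PyTorch loader instead)."""
    shard_paths = sorted(Path(model_dir).glob("*.safetensors"))
    if not shard_paths:
        raise FileNotFoundError("No safetensor files found")
    tensors = {}
    for shard in shard_paths:
        # Shards never repeat tensor names, so a plain dict merge is safe.
        tensors.update(load_file(str(shard)))
    return tensors
```
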
## Architecture Mapping

Architecture mapping bridges naming chaos and GGUF's structured expectations. Model creators invent
their own names, but the patterns remain similar underneath. A translation table covers known
architectures; unknowns default to Llama – reasonable since most models are Llama-inspired.

Built-in mappings reflect real-world encounters:

- `DotsOCRForCausalLM` → `qwen2` – Dots OCR models are Qwen2-based despite the naming
- `GptOssForCausalLM` → `llama` – Generic GPT models usually follow Llama architecture
- Unknown architectures → `llama` – Safe default that works for most transformer models

Use `--force-arch` when you know better than autodetection. It is particularly useful for fine-tuned
models with custom names but standard structure. The fallback logic is sketched below.

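A minimal sketch of that lookup, mirroring the documented mappings; `resolve_architecture` is an
illustrative name rather than the converter's actual function:

```python
# Known HuggingFace architecture names mapped to GGUF architecture strings.
ARCH_MAP = {
    "DotsOCRForCausalLM": "qwen2",  # Dots OCR models are Qwen2-based
    "GptOssForCausalLM": "llama",
}


def resolve_architecture(config: dict, force_arch: str | None = None) -> str:
    """Pick the GGUF architecture: --force-arch wins, then the table, then the Llama fallback."""
    if force_arch:
        return force_arch
    reported = (config.get("architectures") or ["unknown"])[0]
    return ARCH_MAP.get(reported, "llama")
```
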
## Tensor Name Mapping

Tensor naming diverges most between formats. HuggingFace uses verbose hierarchical names
(`model.layers.0.self_attn.q_proj.weight`); GGUF prefers terse ones (`blk.0.attn_q`). The mapping
preserves semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.

| Original Pattern | GGUF Name | Purpose |
|------------------|-----------|---------|
| `model.embed_tokens.weight` | `token_embd.weight` | Token embeddings – maps input IDs to vectors |
| `model.norm.weight` | `output_norm.weight` | Final layer normalisation before output |
| `lm_head.weight` | `output.weight` | Output projection to vocabulary space |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` | Query projection for attention layer N |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` | Key projection for attention layer N |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` | Value projection for attention layer N |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` | Gate projection in feedforward network |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` | Up projection expanding hidden dimension |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` | Down projection reducing to model dimension |

Pattern matching handles variations like `transformer.h.N` (GPT-style) or `model.decoder.layers.N`
(encoder-decoder) by identifying core patterns regardless of prefix, as sketched below.

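The idea reduces to a small translation function. A sketch based on the table above; the regex and
dictionaries are illustrative, not the converter's exact internals:

```python
import re

# Direct one-to-one renames.
DIRECT = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}

# Per-layer names: any prefix, then `layers.N.` or `h.N.`, then a known suffix.
LAYER_RE = re.compile(r"^(?:.*\.)?(?:layers|h)\.(\d+)\.(.+)$")
SUFFIXES = {
    "self_attn.q_proj.weight": "attn_q.weight",
    "self_attn.k_proj.weight": "attn_k.weight",
    "self_attn.v_proj.weight": "attn_v.weight",
    "mlp.gate_proj.weight": "ffn_gate.weight",
    "mlp.up_proj.weight": "ffn_up.weight",
    "mlp.down_proj.weight": "ffn_down.weight",
}


def map_tensor_name(name: str) -> str:
    """Translate a HuggingFace tensor name to its GGUF equivalent."""
    if name in DIRECT:
        return DIRECT[name]
    match = LAYER_RE.match(name)
    if match and match.group(2) in SUFFIXES:
        return f"blk.{match.group(1)}.{SUFFIXES[match.group(2)]}"
    return name  # unrecognised tensors keep their original name rather than being dropped
```
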
## Configuration Requirements

Conversion requires a few core files, though optional components are forgiven. HuggingFace downloads
typically include everything; manually assembled models may lack critical configuration.

Required files:

- **config.json**: Architecture name, layer counts, vocabulary size – essential for structuring the GGUF
- **\*.safetensors**: Model weights, single or sharded – handled automatically

Optional but recommended:

- **tokenizer_config.json**: Special tokens, chat templates, tokeniser behaviour – missing it often
  causes garbled output
- **tokenizer.json**: Vocabulary and merge rules – the tool extracts from other sources if missing, but
  inclusion ensures compatibility

## Output Format

GGUF bundles everything for inference in one file, unlike SafeTensors' scattered JSON configuration.
This simplifies deployment but requires careful metadata preservation during conversion.

The output file contains:

- **Model weights in F32**: Full precision; quantise later with dedicated tools
- **Architecture metadata**: Layer counts, dimensions, activations for model graph construction
- **Tokeniser configuration**: Vocabulary, special tokens, chat templates for model behaviour
- **Special token mappings**: BOS, EOS, UNK, PAD – these control generation and must match the training config

A sketch of how such a file is assembled follows.

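As a rough illustration only (not the tool's actual code), GGUF assembly with the `gguf` Python
package looks roughly like this; method names reflect the package at time of writing and may differ
between versions, and the values shown are placeholders:

```python
# Hedged sketch of writing a GGUF file: metadata first, then tensors.
import numpy as np
import gguf

writer = gguf.GGUFWriter("model-f32.gguf", arch="llama")

# Architecture metadata inferred from config.json (illustrative values).
writer.add_context_length(2048)
writer.add_embedding_length(4096)
writer.add_block_count(32)
writer.add_head_count(32)

# Tensors: already renamed to GGUF conventions and upcast to F32.
writer.add_tensor("token_embd.weight", np.zeros((8, 4), dtype=np.float32))  # tiny placeholder

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```
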
## Error Handling

Error messages are actionable – explaining what went wrong, why it matters, and how to fix it.

| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Download the complete model including config.json, not just weights |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Verify the model uses SafeTensors format – older models might use PyTorch .bin files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch (`uv add torch`) or accept potential precision loss in BF16→F32 conversion |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Research the model's base architecture and use `--force-arch` with the appropriate type |

## Technical Details
### Parameter Inference

Parameter inference bridges naming inconsistencies. Llama's `num_attention_heads` is GPT's
`n_heads`. A translation layer provides sensible defaults for missing values.

Configuration mapping with defaults chosen from common models:

- `vocab_size` → vocabulary size (default: 32000 – Llama's vocabulary)
- `max_position_embeddings` → context length (default: 2048 – conservative for compatibility)
- `hidden_size` → embedding dimension (default: 4096 – typical for 7B models)
- `num_hidden_layers` → transformer blocks (default: 32 – standard for 7B models)
- `num_attention_heads` → attention heads (default: 32 – balanced for 4096 dimension)
- `num_key_value_heads` → KV heads for GQA (defaults to attention heads – assumes MHA not GQA)
- `rope_theta` → RoPE frequency base (default: 10000.0 – standard RoPE configuration)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5 – numerical stability threshold)

Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.

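In code, the translation is little more than dictionary lookups with the defaults listed above. A
sketch assuming HuggingFace-style config keys; `infer_params` is a hypothetical helper name:

```python
# Sketch of config.json -> GGUF parameter inference with the documented defaults.
def infer_params(config: dict) -> dict:
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        "head_count_kv": config.get("num_key_value_heads", heads),  # assume MHA when absent
        "rope_freq_base": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }
```
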
### Vision Model Support

Multimodal models are increasingly common. The tool preserves vision tower configuration, though GGUF
support remains experimental. Vision parameters are extracted but may not be fully utilised.

Extracted vision parameters:

- **Vision embedding dimensions**: Hidden size, typically differs from the language dimensions
- **Vision transformer blocks**: Encoder layers, fewer but wider than the language blocks
- **Vision attention heads**: Usually standard MHA rather than grouped-query
- **Feed-forward dimensions**: Different expansion ratios from the language FFN
- **Patch configuration**: Size (14×14), spatial merging, position encoding

Vision support is best-effort – the tool preserves what it finds but can't guarantee inference
engines will use it.

## Limitations

Understanding the limitations prevents frustration. The design favours broad compatibility over
perfection.

- **F32 output only**: Quantisation requires separate tools like quantise_gguf.py for bit depth control
- **Architecture guessing**: Works for common patterns; novel architectures need manual specification
- **Tokeniser compatibility**: Falls back to the Llama tokeniser when data is missing – may cause issues
  with special tokens
- **Memory requirements**: Loads entire tensors into RAM – a 70B model needs 140GB+, no streaming support
- **No quantisation**: Preserves full precision; quantise separately for deployment control
- **Limited validation**: Ensures structure but can't verify output quality – test before deployment

## Examples
### Converting a custom model

Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool
handles the SafeTensors→GGUF transformation.

```bash
# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model

# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model

# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py
```

### Converting with specific architecture

Force the architecture when autodetection fails or you know the model's lineage. Useful for
fine-tuned models with custom names.

```bash
# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2

# Common forced architectures:
# --force-arch llama    # Most models
# --force-arch qwen2    # Qwen family
# --force-arch mistral  # Mistral variants
```

### Batch conversion

Bash loops enable bulk conversion for comparing checkpoints or converting model families.

```bash
# Convert a directory of models, preserving originals
for model in ./models/*; do
  echo "Converting $(basename $model)..."
  uv run safetensors2gguf.py "$model" \
    -o "./gguf/$(basename $model).gguf" 2>&1 | \
    tee "./logs/$(basename $model).log"
done

# Check results
ls -lh ./gguf/*.gguf
```

## Integration with Quantisation Pipeline

The tool produces an F32 GGUF ready for quantisation. The typical pipeline:

1. **Download model**: Get the SafeTensors model from HuggingFace
2. **Convert to GGUF**: Use this tool for architecture-agnostic conversion
3. **Quantise**: Apply quantise_gguf.py for Bartowski-style variants
4. **Deploy**: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines

This separation enables control at each stage: convert once, quantise to multiple bit depths, and
test configurations without repeating the conversion.

## Troubleshooting
### Model produces gibberish after conversion

This indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are present.
Custom tokenisers may need `--force-arch`.

### Conversion succeeds but model won't load

Use a recent llama.cpp – the GGUF format evolves, and older versions lack support for newer metadata.
Verify that any forced architecture matches the model's actual structure – forcing the wrong one
creates invalid models.

### Out of memory during conversion

The tool loads all weights simultaneously. For large models:

- Close other applications to free RAM
- Use a system with more memory (cloud instances work well)
- Consider quantising from a pre-converted F16 model if available

### Warning about unknown tensors

This is normal for custom layers. The tool preserves unknown tensors even though inference may not
use them. It's harmless – better to include unused weights than miss critical ones.