# safetensors2gguf.py - Direct SafeTensors Conversion

When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and custom architectures that lack official support.

## Overview

Most transformer models share common tensor patterns regardless of architecture. While llama.cpp requires explicit support for each architecture, this tool maps tensor names to GGUF conventions and preserves metadata. It works well for models following standard transformer patterns.

## Features

The converter handles real-world models pragmatically:

- **Architecture-agnostic conversion**: Pattern matching identifies common tensor types – embeddings look similar across Llama, Qwen, or custom architectures
- **Intelligent tensor mapping**: Maps standard patterns (`self_attn.q_proj` → `attn_q`) whilst preserving unrecognised tensors rather than dropping them
- **BFloat16 handling**: Uses PyTorch, when available, for BF16→F32 conversion since many models ship in BF16
- **Vision model support**: Extracts vision tower parameters for multimodal models
- **Tokeniser preservation**: Copies tokeniser configuration wholesale to prevent garbled output from mismatches
- **Graceful fallbacks**: Unknown architectures default to Llama structure – effective since most models derive from Llama

## Usage

Point at a model directory and the tool handles the rest. Most models convert with defaults, though forcing architecture helps when autodetection fails.

### Basic Usage

```bash
# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory
```

### Command Line Options

```bash
# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf

# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2

# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf
```

## Supported Input Formats

The tool handles all common packaging formats. Sharding emerged when models outgrew what a single file could sensibly hold – a 70B model spans dozens of files. The tool reassembles the fragments transparently, whether HuggingFace-numbered shards or custom splits (see the sketch after this list).

1. **Single file models**: `model.safetensors` – common for models under 10GB
2. **Sharded models**: `model-00001-of-00005.safetensors` – standard for large models; the tool automatically finds and merges all shards in sequence
3. **Custom names**: Any `*.safetensors` files – some fine-tunes use non-standard naming, so the tool scans for all SafeTensors files regardless of naming convention

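How discovery and merging can look, as a minimal sketch assuming the `safetensors` package's `safe_open` API; `find_shards` and `load_all_tensors` are illustrative helpers, not the tool's actual functions:

```python
# Minimal sketch of shard discovery and merging; helper names are illustrative.
from pathlib import Path

from safetensors import safe_open


def find_shards(model_dir: str) -> list[Path]:
    """Return every SafeTensors file in the directory, sorted so shards merge in sequence."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    if not shards:
        raise FileNotFoundError(f"No safetensor files found in {model_dir}")
    return shards


def load_all_tensors(model_dir: str) -> dict:
    """Merge tensors from every shard into a single name-to-array mapping."""
    tensors = {}
    for shard in find_shards(model_dir):
        # framework="numpy" avoids a hard PyTorch dependency; BF16 weights still need torch
        with safe_open(shard, framework="numpy") as f:
            for name in f.keys():
                tensors[name] = f.get_tensor(name)
    return tensors
```
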
## Architecture Mapping

Architecture mapping bridges the naming chaos and GGUF's structured expectations. Model creators invent their own class names, but the patterns underneath remain similar. Known architectures go through a translation table; unknowns default to Llama – reasonable, since most models are Llama-inspired.

Built-in mappings reflect real-world encounters:

- `DotsOCRForCausalLM` → `qwen2` – Dots OCR models are Qwen2-based despite the naming
- `GptOssForCausalLM` → `llama` – Generic GPT models usually follow Llama architecture
- Unknown architectures → `llama` – Safe default that works for most transformer models

Use `--force-arch` when you know better than autodetection. It's particularly useful for fine-tuned models with custom names but standard structure. A sketch of the lookup follows.

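A minimal sketch of the fallback logic, assuming only the mappings listed above (the tool's real table is presumably larger); `ARCH_MAP` and `resolve_arch` are illustrative names:

```python
# Minimal sketch of the architecture lookup with a Llama fallback; names are illustrative.
ARCH_MAP = {
    "DotsOCRForCausalLM": "qwen2",
    "GptOssForCausalLM": "llama",
}


def resolve_arch(config: dict, force_arch: str | None = None) -> str:
    """Pick the GGUF architecture from config.json, honouring --force-arch."""
    if force_arch:
        return force_arch
    # config.json lists the model class under "architectures", e.g. ["Qwen2ForCausalLM"]
    hf_arch = (config.get("architectures") or ["unknown"])[0]
    if hf_arch not in ARCH_MAP:
        print(f"Warning: Unknown architecture {hf_arch}, using llama as fallback")
        return "llama"
    return ARCH_MAP[hf_arch]
```
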
## Tensor Name Mapping

Tensor naming diverges most between formats. HuggingFace uses verbose hierarchical names (`model.layers.0.self_attn.q_proj.weight`); GGUF prefers terse ones (`blk.0.attn_q`). The mapping preserves semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.

| Original Pattern | GGUF Name | Purpose |
|------------------|-----------|---------|
| `model.embed_tokens.weight` | `token_embd.weight` | Token embeddings – maps input IDs to vectors |
| `model.norm.weight` | `output_norm.weight` | Final layer normalisation before output |
| `lm_head.weight` | `output.weight` | Output projection to vocabulary space |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` | Query projection for attention layer N |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` | Key projection for attention layer N |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` | Value projection for attention layer N |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` | Gate projection in feedforward network |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` | Up projection expanding hidden dimension |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` | Down projection reducing to model dimension |

Pattern matching handles variations like `transformer.h.N` (GPT-style) or `model.decoder.layers.N` (encoder-decoder) by identifying the core pattern regardless of prefix, as in the sketch below.

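A minimal sketch of that pattern matching, covering only the table above; it assumes regex-based matching, and `map_tensor_name` is an illustrative helper rather than the tool's actual function:

```python
# Minimal sketch of HuggingFace -> GGUF tensor renaming; covers only the table above.
import re

# Per-layer patterns: capture the layer index and re-insert it into the GGUF name.
LAYER_PATTERNS = [
    (re.compile(r"layers\.(\d+)\.self_attn\.q_proj"), "blk.{}.attn_q"),
    (re.compile(r"layers\.(\d+)\.self_attn\.k_proj"), "blk.{}.attn_k"),
    (re.compile(r"layers\.(\d+)\.self_attn\.v_proj"), "blk.{}.attn_v"),
    (re.compile(r"layers\.(\d+)\.mlp\.gate_proj"), "blk.{}.ffn_gate"),
    (re.compile(r"layers\.(\d+)\.mlp\.up_proj"), "blk.{}.ffn_up"),
    (re.compile(r"layers\.(\d+)\.mlp\.down_proj"), "blk.{}.ffn_down"),
]

# Whole-name mappings for the non-layer tensors.
GLOBAL_PATTERNS = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}


def map_tensor_name(name: str) -> str:
    """Translate a tensor name to its GGUF equivalent; unrecognised names pass through."""
    if name in GLOBAL_PATTERNS:
        return GLOBAL_PATTERNS[name]
    for pattern, template in LAYER_PATTERNS:
        match = pattern.search(name)  # search, so leading prefixes such as "model." are ignored
        if match:
            suffix = name[match.end():]  # keep the trailing ".weight" / ".bias"
            return template.format(match.group(1)) + suffix
    return name  # preserve unknown tensors rather than dropping them
```
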
## Configuration Requirements

Conversion requires a few core files, though missing optional components are tolerated. HuggingFace downloads typically include everything; manually assembled models may lack critical configuration.

Required files:

- **config.json**: Architecture name, layer counts, vocabulary size – essential for structuring the GGUF
- **\*.safetensors**: Model weights, single or sharded – handled automatically

Optional but recommended:

- **tokenizer_config.json**: Special tokens, chat templates, tokeniser behaviour – leaving it out often causes garbled output
- **tokenizer.json**: Vocabulary and merge rules – the tool extracts what it can from other sources if this is missing, but including it ensures compatibility

A quick pre-flight check along these lines is sketched below.

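A minimal sketch of such a check; `check_model_dir` and its messages are illustrative, not part of safetensors2gguf.py itself:

```python
# Illustrative pre-flight check for a model directory before conversion.
from pathlib import Path


def check_model_dir(model_dir: str) -> None:
    path = Path(model_dir)
    if not (path / "config.json").exists():
        raise FileNotFoundError("Config file not found - download the full model, not just the weights")
    if not list(path.glob("*.safetensors")):
        raise FileNotFoundError("No safetensor files found - the model may ship PyTorch .bin weights instead")
    for optional in ("tokenizer.json", "tokenizer_config.json"):
        if not (path / optional).exists():
            print(f"Warning: {optional} missing - converted model may produce garbled output")
```
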
## Output Format

GGUF bundles everything needed for inference into one file, unlike SafeTensors' scattered JSON configuration. This simplifies deployment but requires careful metadata preservation during conversion.

The output file contains:

- **Model weights in F32**: Full precision – quantise later with dedicated tools
- **Architecture metadata**: Layer counts, dimensions, activations for model graph construction
- **Tokeniser configuration**: Vocabulary, special tokens, chat templates governing model behaviour
- **Special token mappings**: BOS, EOS, UNK, PAD – these control generation and must match the training configuration

A compressed sketch of how all of this lands in one file follows.

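Assuming the `gguf` Python package (the writer used by llama.cpp's own conversion scripts), the bundling looks roughly like this; the values and tensor shape are placeholders, not real model parameters:

```python
# Sketch only: assumes the gguf package's GGUFWriter API; values are placeholders.
import numpy as np
import gguf

writer = gguf.GGUFWriter("model-f32.gguf", arch="llama")
writer.add_architecture()

# Architecture metadata drives graph construction at load time
writer.add_context_length(2048)
writer.add_embedding_length(4096)
writer.add_block_count(32)
writer.add_head_count(32)

# Tensors are written in F32 alongside the metadata
# (tiny placeholder; real embeddings are vocab_size x hidden_size)
writer.add_tensor("token_embd.weight", np.zeros((4, 8), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```
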
## Error Handling

Error messages are actionable – they explain what went wrong, why it matters, and how to fix it.

| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Download the complete model including config.json, not just the weights |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Verify the model uses SafeTensors format – older models might use PyTorch .bin files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch (`uv add torch`) or accept potential precision loss in the BF16→F32 conversion (see the sketch below) |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Research the model's base architecture and use `--force-arch` with the appropriate type |

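On the BFloat16 case: BF16 stores the top 16 bits of an IEEE-754 float32, so one common widening route is a PyTorch dtype cast and another is a NumPy bit shift. Both are sketched here purely as illustrations – the tool's own fallback behaviour without PyTorch may differ:

```python
# Illustrative BF16 -> F32 widening; not the tool's actual conversion code.
import numpy as np


def bf16_bytes_to_f32_numpy(raw: bytes) -> np.ndarray:
    """BF16 is the top half of a float32, so shift each value into the high 16 bits."""
    as_u16 = np.frombuffer(raw, dtype=np.uint16)
    return (as_u16.astype(np.uint32) << 16).view(np.float32)


def bf16_tensor_to_f32_torch(tensor) -> np.ndarray:
    """With PyTorch installed, a simple dtype cast does the job."""
    import torch  # optional dependency

    return tensor.to(torch.float32).numpy()
```
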
## Technical Details

### Parameter Inference

Parameter inference bridges naming inconsistencies – Llama's `num_attention_heads` is GPT's `n_heads`. A translation layer provides sensible defaults for missing values.

Configuration mapping, with defaults chosen from common models (see the sketch after this list):

- `vocab_size` → vocabulary size (default: 32000 – Llama's vocabulary)
- `max_position_embeddings` → context length (default: 2048 – conservative for compatibility)
- `hidden_size` → embedding dimension (default: 4096 – typical for 7B models)
- `num_hidden_layers` → transformer blocks (default: 32 – standard for 7B models)
- `num_attention_heads` → attention heads (default: 32 – balanced for a 4096 dimension)
- `num_key_value_heads` → KV heads for GQA (defaults to the attention head count – i.e. assumes MHA rather than GQA)
- `rope_theta` → RoPE frequency base (default: 10000.0 – standard RoPE configuration)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5 – numerical stability threshold)

The defaults work for most models. Wrong parameters may not error immediately but will degrade output quality.

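A minimal sketch of the inference using those defaults; `infer_params` and the returned key names are illustrative, not the tool's actual code:

```python
# Illustrative parameter inference from config.json with the defaults listed above.
import json
from pathlib import Path


def infer_params(model_dir: str) -> dict:
    config = json.loads((Path(model_dir) / "config.json").read_text())
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        # an absent num_key_value_heads means plain multi-head attention, not GQA
        "head_count_kv": config.get("num_key_value_heads", heads),
        "rope_freq_base": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }
```
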
### Vision Model Support

Multimodal models are increasingly common. The tool preserves vision tower configuration, though GGUF support for vision remains experimental: the parameters are extracted but may not be fully utilised.

Extracted vision parameters:

- **Vision embedding dimensions**: Hidden size, which typically differs from the language model's dimensions
- **Vision transformer blocks**: Encoder layers, usually fewer but wider than the language stack
- **Vision attention heads**: Usually standard MHA rather than grouped-query attention
- **Feed-forward dimensions**: Different expansion ratios from the language FFN
- **Patch configuration**: Patch size (typically 14×14), spatial merging, position encoding

Vision support is best-effort – the tool preserves what it finds but can't guarantee the inference engine will use it. A sketch of the extraction is shown below.

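A minimal sketch of pulling those settings out of config.json; the `vision_config` key follows common HuggingFace multimodal configs, but both it and the returned names vary by model family, so treat them as assumptions:

```python
# Illustrative sketch: key names vary between multimodal model families.
def extract_vision_params(config: dict) -> dict | None:
    vision = config.get("vision_config")
    if vision is None:
        return None  # text-only model
    return {
        "embedding_length": vision.get("hidden_size"),
        "block_count": vision.get("num_hidden_layers"),
        "head_count": vision.get("num_attention_heads"),
        "feed_forward_length": vision.get("intermediate_size"),
        "patch_size": vision.get("patch_size", 14),
    }
```
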
## Limitations

Understanding the limitations prevents frustration. The design favours broad compatibility over perfection.

- **F32 output only**: Quantisation requires separate tools such as quantise_gguf.py for bit-depth control
- **Architecture guessing**: Works for common patterns; novel architectures need manual specification
- **Tokeniser compatibility**: Falls back to the Llama tokeniser when data is missing – may cause issues with special tokens
- **Memory requirements**: Loads entire tensors into RAM – a 70B model needs 140GB+ and there is no streaming support
- **No quantisation**: Preserves full precision; quantise separately for deployment control
- **Limited validation**: Ensures structure but can't verify output quality – test before deployment

## Examples

### Converting a custom model

Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool handles the SafeTensors→GGUF transformation.

```bash
# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model

# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model

# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py
```

### Converting with specific architecture

Force architecture when autodetection fails or you know the model's lineage. Useful for fine-tuned models with custom names.

```bash
# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2

# Common forced architectures:
# --force-arch llama     # Most models
# --force-arch qwen2     # Qwen family
# --force-arch mistral   # Mistral variants
```

### Batch conversion

Bash loops enable bulk conversion for comparing checkpoints or converting model families.

```bash
# Convert directory of models, preserving originals
mkdir -p ./gguf ./logs
for model in ./models/*; do
    echo "Converting $(basename "$model")..."
    uv run safetensors2gguf.py "$model" \
        -o "./gguf/$(basename "$model").gguf" 2>&1 | \
        tee "./logs/$(basename "$model").log"
done

# Check results
ls -lh ./gguf/*.gguf
```

## Integration with Quantisation Pipeline

The tool produces an F32 GGUF ready for quantisation. Typical pipeline:

1. **Download model**: Get the SafeTensors model from HuggingFace
2. **Convert to GGUF**: Use this tool for architecture-agnostic conversion
3. **Quantise**: Apply quantise_gguf.py for Bartowski-style variants
4. **Deploy**: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines

This separation gives control at each stage: convert once, quantise to multiple bit depths, and test configurations without repeating the conversion.

## Troubleshooting

### Model produces gibberish after conversion

This usually indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are present in the model directory. Custom tokenisers may also need `--force-arch`.

### Conversion succeeds but model won't load

Use a recent llama.cpp build – the GGUF format evolves, and older versions lack support for newer metadata. Also verify that any forced architecture matches the model's actual structure – forcing the wrong one creates invalid models.

### Out of memory during conversion

The tool loads all weights simultaneously. For large models:

- Close other applications to free RAM
- Use a system with more memory (cloud instances work well)
- Consider quantising from a pre-converted F16 model if available

### Warning about unknown tensors

This is normal for custom layers. The tool preserves unknown tensors even though inference may not use them. It's harmless – better to include unused weights than to miss critical ones.