
safetensors2gguf.py - Direct SafeTensors Conversion

When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and custom architectures that lack official support.

Overview

Most transformer models share common tensor patterns regardless of architecture. While llama.cpp requires explicit support for each architecture, this tool maps tensor names to GGUF conventions and preserves metadata. It works well for models that follow standard transformer patterns.

Features

The converter handles real-world models pragmatically:

  • Architecture-agnostic conversion: Pattern matching identifies common tensor types, since embeddings look much the same across Llama, Qwen, and custom architectures
  • Intelligent tensor mapping: Maps standard patterns (self_attn.q_proj → attn_q) whilst preserving unrecognised tensors rather than dropping them
  • BFloat16 handling: Uses PyTorch, when available, for BF16→F32 conversion, as many models ship in BF16 (see the sketch after this list)
  • Vision model support: Extracts vision tower parameters for multimodal models
  • Tokeniser preservation: Copies tokeniser configuration wholesale to prevent garbled output from mismatches
  • Graceful fallbacks: Unknown architectures default to the Llama structure, which works because most models derive from Llama
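
A minimal sketch of the BFloat16 handling above, assuming the optional torch dependency; the function here is illustrative rather than the tool's actual internals:

# Illustrative BF16 -> F32 handling with torch as an optional dependency
import numpy as np

try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False

def to_f32(tensor):
    """Upcast a loaded tensor to float32, using torch for BF16 when present."""
    if HAS_TORCH and isinstance(tensor, torch.Tensor):
        # torch understands bfloat16 natively; upcast, then hand back a numpy array
        return tensor.to(torch.float32).numpy()
    # numpy has no native bfloat16, so anything reaching this branch is already
    # a numpy-compatible dtype
    return np.asarray(tensor, dtype=np.float32)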

Usage

Point the tool at a model directory and it handles the rest. Most models convert with the defaults, though forcing the architecture helps when autodetection fails.

Basic Usage

# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory

Command Line Options

# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf

# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2

# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf

Supported Input Formats

The tool handles all common packaging formats. Sharding emerged when models exceeded file system limits; a 70B model can span dozens of files. Fragments are reassembled transparently, whether they are HuggingFace numbered shards or custom splits.

  1. Single-file models: model.safetensors, common for models under 10 GB
  2. Sharded models: model-00001-of-00005.safetensors, standard for large models; the tool automatically finds and merges all shards in sequence
  3. Custom names: any *.safetensors files; some fine-tunes use non-standard naming, so the tool scans for all SafeTensors files regardless of naming convention (see the sketch after this list)
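
The shard handling can be pictured roughly as below, assuming the safetensors package; the helper name is hypothetical and the real tool's discovery logic may differ:

# Hypothetical shard discovery and merging using the safetensors package
from pathlib import Path
from safetensors.numpy import load_file

def load_all_shards(model_dir: str) -> dict:
    """Collect every *.safetensors file in a directory and merge their tensors."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    if not shards:
        raise FileNotFoundError(f"No safetensor files found in {model_dir}")
    tensors = {}
    for shard in shards:
        # Each shard holds a disjoint subset of tensor names, so a plain
        # dict update reassembles the full model
        tensors.update(load_file(str(shard)))
    return tensors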

Architecture Mapping

Architecture mapping bridges the naming chaos of model configurations and GGUF's structured expectations. Model creators invent their own names, but the underlying patterns remain similar. A translation table covers known architectures, and unknowns default to Llama, which is reasonable since most models are Llama-inspired.

Built-in mappings reflect real-world encounters:

  • DotsOCRForCausalLM → qwen2: Dots OCR models are Qwen2-based despite the naming
  • GptOssForCausalLM → llama: Generic GPT models usually follow the Llama architecture
  • Unknown architectures → llama: A safe default that works for most transformer models

Use --force-arch when you know better than autodetection. Particularly useful for fine-tuned models with custom names but standard structure.
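
A sketch of how such a mapping might look; the fallback logic mirrors the description above, but the table contents and function name are illustrative assumptions rather than the tool's exact code:

# Illustrative architecture translation table with a Llama fallback
ARCH_MAP = {
    "DotsOCRForCausalLM": "qwen2",  # Qwen2-based despite the name
    "GptOssForCausalLM": "llama",   # generic GPT models usually follow Llama
}

def resolve_arch(config: dict, force_arch: str | None = None) -> str:
    """Pick the GGUF architecture from config.json, honouring --force-arch."""
    if force_arch:
        return force_arch
    # config.json lists the HuggingFace class name under "architectures"
    hf_arch = (config.get("architectures") or ["unknown"])[0]
    return ARCH_MAP.get(hf_arch, "llama")  # unknown architectures fall back to llama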

Tensor Name Mapping

Tensor naming is where formats diverge most. HuggingFace uses verbose hierarchical names (model.layers.0.self_attn.q_proj.weight) while GGUF prefers terse ones (blk.0.attn_q). The mapping preserves semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.

| Original Pattern | GGUF Name | Purpose |
| --- | --- | --- |
| model.embed_tokens.weight | token_embd.weight | Token embeddings; maps input IDs to vectors |
| model.norm.weight | output_norm.weight | Final layer normalisation before output |
| lm_head.weight | output.weight | Output projection to vocabulary space |
| layers.N.self_attn.q_proj | blk.N.attn_q | Query projection for attention layer N |
| layers.N.self_attn.k_proj | blk.N.attn_k | Key projection for attention layer N |
| layers.N.self_attn.v_proj | blk.N.attn_v | Value projection for attention layer N |
| layers.N.mlp.gate_proj | blk.N.ffn_gate | Gate projection in the feedforward network |
| layers.N.mlp.up_proj | blk.N.ffn_up | Up projection expanding the hidden dimension |
| layers.N.mlp.down_proj | blk.N.ffn_down | Down projection reducing back to the model dimension |

Pattern matching handles variations like transformer.h.N (GPT-style) or model.decoder.layers.N (encoder-decoder) by identifying core patterns regardless of prefix.
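
A rough sketch of this prefix-agnostic matching; the regular expression and suffix table are simplified assumptions, not the tool's full rule set:

# Simplified prefix-agnostic tensor-name mapping
import re

SUFFIX_MAP = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}

# Matches model.layers.N..., transformer.h.N..., model.decoder.layers.N..., etc.
LAYER_RE = re.compile(r"^(?:.*?\.)?(?:layers|h)\.(\d+)\.(.+?)(\.weight|\.bias)?$")

def map_tensor_name(hf_name: str) -> str:
    """Translate a HuggingFace tensor name to the GGUF convention, or keep it as-is."""
    if hf_name == "model.embed_tokens.weight":
        return "token_embd.weight"
    if hf_name == "model.norm.weight":
        return "output_norm.weight"
    if hf_name == "lm_head.weight":
        return "output.weight"
    match = LAYER_RE.match(hf_name)
    if match:
        layer, suffix, kind = match.group(1), match.group(2), match.group(3) or ""
        if suffix in SUFFIX_MAP:
            return f"blk.{layer}.{SUFFIX_MAP[suffix]}{kind}"
    return hf_name  # unrecognised tensors are preserved, not dropped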

Configuration Requirements

Conversion requires a few core files, though missing optional components are forgiven. HuggingFace downloads typically include everything; manually assembled models may lack critical configuration.

Required files:

  • config.json: Architecture name, layer counts, and vocabulary size, all essential for structuring the GGUF
  • *.safetensors: Model weights, single-file or sharded, handled automatically

Optional but recommended:

  • tokenizer_config.json: Special tokens, chat templates, and tokeniser behaviour; when missing, output is often garbled
  • tokenizer.json: Vocabulary and merge rules; the tool extracts these from other sources if missing, but inclusion ensures compatibility (a pre-flight check is sketched after this list)
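
A hedged sketch of the pre-flight check these requirements imply; the function name and warning wording are illustrative:

# Illustrative pre-flight check for the required and optional files
import json
from pathlib import Path

def check_model_dir(model_dir: str) -> dict:
    """Verify the required files exist and return the parsed config.json."""
    root = Path(model_dir)
    config_path = root / "config.json"
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    if not list(root.glob("*.safetensors")):
        raise FileNotFoundError(f"No safetensor files found in {root}")
    for optional in ("tokenizer.json", "tokenizer_config.json"):
        if not (root / optional).exists():
            print(f"Warning: {optional} missing; tokeniser output may be degraded")
    return json.loads(config_path.read_text())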

Output Format

GGUF bundles everything needed for inference into one file, unlike SafeTensors' scattered JSON configuration. This simplifies deployment but requires careful metadata preservation during conversion; a writing sketch follows the list below.

The output file contains:

  • Model weights in F32: Full precision, quantise later with dedicated tools
  • Architecture metadata: Layer counts, dimensions, activations for model graph construction
  • Tokeniser configuration: Vocabulary, special tokens, chat templates for model behaviour
  • Special token mappings: BOS, EOS, UNK, and PAD control generation and must match the training configuration
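
Roughly, writing this file looks like the outline below, assuming the gguf Python package published alongside llama.cpp; exact writer method names vary between package versions, so treat this as a sketch rather than the tool's code:

# Condensed outline of writing an F32 GGUF with gguf.GGUFWriter
import numpy as np
import gguf

def write_f32_gguf(out_path: str, arch: str, params: dict, tensors: dict) -> None:
    writer = gguf.GGUFWriter(out_path, arch)
    # Architecture metadata drives model-graph construction at load time
    writer.add_context_length(params["context_length"])
    writer.add_embedding_length(params["embedding_length"])
    writer.add_block_count(params["block_count"])
    writer.add_head_count(params["head_count"])
    # Weights are stored in full F32 precision; quantise afterwards
    for name, data in tensors.items():
        writer.add_tensor(name, np.ascontiguousarray(data, dtype=np.float32))
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()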

Error Handling

Error messages are actionable, explaining what went wrong, why it matters, and how to fix it.

| Error | Message | Solution |
| --- | --- | --- |
| Missing config.json | FileNotFoundError: Config file not found | Download the complete model including config.json, not just the weights |
| No SafeTensors files | FileNotFoundError: No safetensor files found | Verify the model uses the SafeTensors format; older models might use PyTorch .bin files |
| BFloat16 without PyTorch | Warning: PyTorch not available, BFloat16 models may not convert properly | Install PyTorch (uv add torch) or accept potential precision loss in the BF16→F32 conversion |
| Unknown architecture | Warning: Unknown architecture X, using llama as fallback | Research the model's base architecture and use --force-arch with the appropriate type |

Technical Details

Parameter Inference

Parameter inference bridges naming inconsistencies: Llama's num_attention_heads is GPT's n_heads. A translation layer provides sensible defaults for missing values.

Configuration mapping with defaults chosen from common models:

  • vocab_size → vocabulary size (default: 32000, Llama's vocabulary)
  • max_position_embeddings → context length (default: 2048, conservative for compatibility)
  • hidden_size → embedding dimension (default: 4096, typical for 7B models)
  • num_hidden_layers → transformer blocks (default: 32, standard for 7B models)
  • num_attention_heads → attention heads (default: 32, balanced for a 4096 dimension)
  • num_key_value_heads → KV heads for GQA (defaults to the attention head count, i.e. assumes MHA rather than GQA)
  • rope_theta → RoPE frequency base (default: 10000.0, the standard RoPE configuration)
  • rms_norm_eps → layer normalisation epsilon (default: 1e-5, a numerical stability threshold)

Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.
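
A sketch of this inference step with the defaults listed above; the output key names are illustrative, chosen to line up with GGUF metadata fields:

# Illustrative parameter inference with fallback defaults
def infer_params(config: dict) -> dict:
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        # A missing num_key_value_heads implies plain multi-head attention
        "head_count_kv": config.get("num_key_value_heads", heads),
        "rope_freq_base": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }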

Vision Model Support

Multimodal models are increasingly common. The tool preserves the vision tower configuration, though GGUF support for vision remains experimental: the parameters are extracted but may not be fully utilised.

Extracted vision parameters:

  • Vision embedding dimensions: Hidden size, typically differs from language dimensions
  • Vision transformer blocks: Encoder layers, fewer but wider than language
  • Vision attention heads: Usually standard MHA rather than grouped-query
  • Feed-forward dimensions: Different expansion ratios from language FFN
  • Patch configuration: Size (14×14), spatial merging, position encoding

Vision support is best-effort: the tool preserves what it finds but cannot guarantee that inference engines will use it.
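
As a rough illustration, the extraction amounts to reading config.json's nested vision section; the key names below are common but not universal, so treat them as assumptions:

# Hypothetical extraction of vision tower parameters from config.json
def extract_vision_params(config: dict) -> dict | None:
    vision = config.get("vision_config")
    if vision is None:
        return None  # text-only model
    return {
        "embedding_length": vision.get("hidden_size"),
        "block_count": vision.get("num_hidden_layers"),
        "head_count": vision.get("num_attention_heads"),
        "feed_forward_length": vision.get("intermediate_size"),
        "patch_size": vision.get("patch_size", 14),
    }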

Limitations

Understanding the limitations prevents frustration. The design favours broad compatibility over perfection.

  • F32 output only: Quantisation requires separate tools like quantise_gguf.py for bit depth control
  • Architecture guessing: Works for common patterns, novel architectures need manual specification
  • Tokeniser compatibility: Falls back to the Llama tokeniser when data is missing, which may cause issues with special tokens
  • Memory requirements: Loads entire tensors into RAM (a 70B model needs 140 GB+); there is no streaming support
  • No quantisation: Preserves full precision, quantise separately for deployment control
  • Limited validation: Ensures structural correctness but cannot verify output quality; test before deployment

Examples

Converting a custom model

Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool handles the SafeTensors→GGUF transformation.

# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model

# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model

# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py

Converting with specific architecture

Force architecture when autodetection fails or you know the model's lineage. Useful for fine-tuned models with custom names.

# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2

# Common forced architectures:
# --force-arch llama    # Most models
# --force-arch qwen2    # Qwen family
# --force-arch mistral  # Mistral variants

Batch conversion

Bash loops enable bulk conversion for comparing checkpoints or converting model families.

# Convert a directory of models, preserving originals
mkdir -p ./gguf ./logs
for model in ./models/*; do
    echo "Converting $(basename "$model")..."
    uv run safetensors2gguf.py "$model" \
        -o "./gguf/$(basename "$model").gguf" 2>&1 | \
        tee "./logs/$(basename "$model").log"
done

# Check results
ls -lh ./gguf/*.gguf

Integration with Quantisation Pipeline

The tool produces an F32 GGUF ready for quantisation. The typical pipeline:

  1. Download model: Get SafeTensors model from HuggingFace
  2. Convert to GGUF: Use this tool for architecture-agnostic conversion
  3. Quantise: Apply quantise_gguf.py for Bartowski-style variants
  4. Deploy: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines

This separation enables control at each stage: convert once, quantise to multiple bit depths, and test configurations without repeating the conversion.

Troubleshooting

Model produces gibberish after conversion

This usually indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are present. Custom tokenisers may need --force-arch.

Conversion succeeds but model won't load

Use a recent llama.cpp build: the GGUF format evolves, and older versions lack support for newer metadata. Also verify that any forced architecture matches the model's actual structure; forcing the wrong one creates invalid models.

Out of memory during conversion

The tool loads all weights simultaneously. For large models:

  • Close other applications to free RAM
  • Use a system with more memory (cloud instances work well)
  • Consider quantising from a pre-converted F16 model if available

Warning about unknown tensors

This is normal for custom layers. The tool preserves unknown tensors even though inference may not use them. It is harmless: better to include unused weights than to miss critical ones.