
safetensors2gguf.py - Direct SafeTensors Conversion

When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and custom architectures that lack official support.

Overview

Most transformer models share common tensor patterns regardless of architecture. While llama.cpp requires explicit support for each architecture, this tool maps tensor names to GGUF conventions and preserves metadata. It works well for models that follow standard transformer patterns.

Features

The converter handles real-world models pragmatically:

  • Architecture-agnostic conversion: Pattern matching identifies common tensor types, since embeddings look much the same across Llama, Qwen, and custom architectures
  • Intelligent tensor mapping: Maps standard patterns (self_attn.q_proj → attn_q) whilst preserving unrecognised tensors rather than dropping them
  • BFloat16 handling: Uses PyTorch, when available, for BF16→F32 conversion, as many models ship in BF16 (see the sketch after this list)
  • Vision model support: Extracts vision tower parameters for multimodal models
  • Tokeniser preservation: Copies tokeniser configuration wholesale to prevent garbled output from mismatches
  • Graceful fallbacks: Unknown architectures default to the Llama structure, which works because most models derive from Llama
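
A minimal sketch of the BFloat16 handling above, assuming the optional torch dependency; the function here is illustrative rather than the tool's actual internals:

# Illustrative BF16 -> F32 handling with torch as an optional dependency
import numpy as np

try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False

def to_f32(tensor):
    """Upcast a loaded tensor to float32, using torch for BF16 when present."""
    if HAS_TORCH and isinstance(tensor, torch.Tensor):
        # torch understands bfloat16 natively; upcast, then hand back a numpy array
        return tensor.to(torch.float32).numpy()
    # numpy has no native bfloat16, so anything reaching this branch is already
    # a numpy-compatible dtype
    return np.asarray(tensor, dtype=np.float32)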

Usage

Point the tool at a model directory and it handles the rest. Most models convert with the defaults, though forcing the architecture helps when autodetection fails.

Basic Usage

# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory

Command Line Options

# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf

# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2

# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf

Supported Input Formats

The tool handles all common packaging formats. Sharding emerged when models exceeded file system limits; a 70B model can span dozens of files. Fragments are reassembled transparently, whether they are HuggingFace numbered shards or custom splits.

  1. Single-file models: model.safetensors, common for models under 10 GB
  2. Sharded models: model-00001-of-00005.safetensors, standard for large models; the tool automatically finds and merges all shards in sequence
  3. Custom names: any *.safetensors files; some fine-tunes use non-standard naming, so the tool scans for all SafeTensors files regardless of naming convention (see the sketch after this list)
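
The shard handling can be pictured roughly as below, assuming the safetensors package; the helper name is hypothetical and the real tool's discovery logic may differ:

# Hypothetical shard discovery and merging using the safetensors package
from pathlib import Path
from safetensors.numpy import load_file

def load_all_shards(model_dir: str) -> dict:
    """Collect every *.safetensors file in a directory and merge their tensors."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    if not shards:
        raise FileNotFoundError(f"No safetensor files found in {model_dir}")
    tensors = {}
    for shard in shards:
        # Each shard holds a disjoint subset of tensor names, so a plain
        # dict update reassembles the full model
        tensors.update(load_file(str(shard)))
    return tensors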

Architecture Mapping

Architecture mapping bridges the naming chaos of model configurations and GGUF's structured expectations. Model creators invent their own names, but the underlying patterns remain similar. A translation table covers known architectures, and unknowns default to Llama, which is reasonable since most models are Llama-inspired.

Built-in mappings reflect real-world encounters:

  • DotsOCRForCausalLM → qwen2: Dots OCR models are Qwen2-based despite the naming
  • GptOssForCausalLM → llama: Generic GPT models usually follow the Llama architecture
  • Unknown architectures → llama: A safe default that works for most transformer models

Use --force-arch when you know better than autodetection. Particularly useful for fine-tuned models with custom names but standard structure.
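
A sketch of how such a mapping might look; the fallback logic mirrors the description above, but the table contents and function name are illustrative assumptions rather than the tool's exact code:

# Illustrative architecture translation table with a Llama fallback
ARCH_MAP = {
    "DotsOCRForCausalLM": "qwen2",  # Qwen2-based despite the name
    "GptOssForCausalLM": "llama",   # generic GPT models usually follow Llama
}

def resolve_arch(config: dict, force_arch: str | None = None) -> str:
    """Pick the GGUF architecture from config.json, honouring --force-arch."""
    if force_arch:
        return force_arch
    # config.json lists the HuggingFace class name under "architectures"
    hf_arch = (config.get("architectures") or ["unknown"])[0]
    return ARCH_MAP.get(hf_arch, "llama")  # unknown architectures fall back to llama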

Tensor Name Mapping

Tensor naming is where formats diverge most. HuggingFace uses verbose hierarchical names (model.layers.0.self_attn.q_proj.weight) while GGUF prefers terse ones (blk.0.attn_q). The mapping preserves semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.

| Original Pattern | GGUF Name | Purpose |
| --- | --- | --- |
| model.embed_tokens.weight | token_embd.weight | Token embeddings; maps input IDs to vectors |
| model.norm.weight | output_norm.weight | Final layer normalisation before output |
| lm_head.weight | output.weight | Output projection to vocabulary space |
| layers.N.self_attn.q_proj | blk.N.attn_q | Query projection for attention layer N |
| layers.N.self_attn.k_proj | blk.N.attn_k | Key projection for attention layer N |
| layers.N.self_attn.v_proj | blk.N.attn_v | Value projection for attention layer N |
| layers.N.mlp.gate_proj | blk.N.ffn_gate | Gate projection in the feedforward network |
| layers.N.mlp.up_proj | blk.N.ffn_up | Up projection expanding the hidden dimension |
| layers.N.mlp.down_proj | blk.N.ffn_down | Down projection reducing back to the model dimension |

Pattern matching handles variations like transformer.h.N (GPT-style) or model.decoder.layers.N (encoder-decoder) by identifying core patterns regardless of prefix.
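
A rough sketch of this prefix-agnostic matching; the regular expression and suffix table are simplified assumptions, not the tool's full rule set:

# Simplified prefix-agnostic tensor-name mapping
import re

SUFFIX_MAP = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}

# Matches model.layers.N..., transformer.h.N..., model.decoder.layers.N..., etc.
LAYER_RE = re.compile(r"^(?:.*?\.)?(?:layers|h)\.(\d+)\.(.+?)(\.weight|\.bias)?$")

def map_tensor_name(hf_name: str) -> str:
    """Translate a HuggingFace tensor name to the GGUF convention, or keep it as-is."""
    if hf_name == "model.embed_tokens.weight":
        return "token_embd.weight"
    if hf_name == "model.norm.weight":
        return "output_norm.weight"
    if hf_name == "lm_head.weight":
        return "output.weight"
    match = LAYER_RE.match(hf_name)
    if match:
        layer, suffix, kind = match.group(1), match.group(2), match.group(3) or ""
        if suffix in SUFFIX_MAP:
            return f"blk.{layer}.{SUFFIX_MAP[suffix]}{kind}"
    return hf_name  # unrecognised tensors are preserved, not dropped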

Configuration Requirements

Conversion requires a few core files, though missing optional components are forgiven. HuggingFace downloads typically include everything; manually assembled models may lack critical configuration.

Required files:

  • config.json: Architecture name, layer counts, and vocabulary size, all essential for structuring the GGUF
  • *.safetensors: Model weights, single-file or sharded, handled automatically

Optional but recommended:

  • tokenizer_config.json: Special tokens, chat templates, and tokeniser behaviour; when missing, output is often garbled
  • tokenizer.json: Vocabulary and merge rules; the tool extracts these from other sources if missing, but inclusion ensures compatibility (a pre-flight check is sketched after this list)
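
A hedged sketch of the pre-flight check these requirements imply; the function name and warning wording are illustrative:

# Illustrative pre-flight check for the required and optional files
import json
from pathlib import Path

def check_model_dir(model_dir: str) -> dict:
    """Verify the required files exist and return the parsed config.json."""
    root = Path(model_dir)
    config_path = root / "config.json"
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    if not list(root.glob("*.safetensors")):
        raise FileNotFoundError(f"No safetensor files found in {root}")
    for optional in ("tokenizer.json", "tokenizer_config.json"):
        if not (root / optional).exists():
            print(f"Warning: {optional} missing; tokeniser output may be degraded")
    return json.loads(config_path.read_text())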

Output Format

GGUF bundles everything needed for inference into one file, unlike SafeTensors' scattered JSON configuration. This simplifies deployment but requires careful metadata preservation during conversion; a writing sketch follows the list below.

The output file contains:

  • Model weights in F32: Full precision, quantise later with dedicated tools
  • Architecture metadata: Layer counts, dimensions, activations for model graph construction
  • Tokeniser configuration: Vocabulary, special tokens, chat templates for model behaviour
  • Special token mappings: BOS, EOS, UNK, and PAD control generation and must match the training configuration
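
Roughly, writing this file looks like the outline below, assuming the gguf Python package published alongside llama.cpp; exact writer method names vary between package versions, so treat this as a sketch rather than the tool's code:

# Condensed outline of writing an F32 GGUF with gguf.GGUFWriter
import numpy as np
import gguf

def write_f32_gguf(out_path: str, arch: str, params: dict, tensors: dict) -> None:
    writer = gguf.GGUFWriter(out_path, arch)
    # Architecture metadata drives model-graph construction at load time
    writer.add_context_length(params["context_length"])
    writer.add_embedding_length(params["embedding_length"])
    writer.add_block_count(params["block_count"])
    writer.add_head_count(params["head_count"])
    # Weights are stored in full F32 precision; quantise afterwards
    for name, data in tensors.items():
        writer.add_tensor(name, np.ascontiguousarray(data, dtype=np.float32))
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()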

Error Handling

Error messages are actionable, explaining what went wrong, why it matters, and how to fix it.

| Error | Message | Solution |
| --- | --- | --- |
| Missing config.json | FileNotFoundError: Config file not found | Download the complete model including config.json, not just the weights |
| No SafeTensors files | FileNotFoundError: No safetensor files found | Verify the model uses the SafeTensors format; older models might use PyTorch .bin files |
| BFloat16 without PyTorch | Warning: PyTorch not available, BFloat16 models may not convert properly | Install PyTorch (uv add torch) or accept potential precision loss in the BF16→F32 conversion |
| Unknown architecture | Warning: Unknown architecture X, using llama as fallback | Research the model's base architecture and use --force-arch with the appropriate type |

Technical Details

Parameter Inference

Parameter inference bridges naming inconsistencies: Llama's num_attention_heads is GPT's n_heads. A translation layer provides sensible defaults for missing values.

Configuration mapping with defaults chosen from common models:

  • vocab_size → vocabulary size (default: 32000, Llama's vocabulary)
  • max_position_embeddings → context length (default: 2048, conservative for compatibility)
  • hidden_size → embedding dimension (default: 4096, typical for 7B models)
  • num_hidden_layers → transformer blocks (default: 32, standard for 7B models)
  • num_attention_heads → attention heads (default: 32, balanced for a 4096 dimension)
  • num_key_value_heads → KV heads for GQA (defaults to the attention head count, i.e. assumes MHA rather than GQA)
  • rope_theta → RoPE frequency base (default: 10000.0, the standard RoPE configuration)
  • rms_norm_eps → layer normalisation epsilon (default: 1e-5, a numerical stability threshold)

Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.
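
A sketch of this inference step with the defaults listed above; the output key names are illustrative, chosen to line up with GGUF metadata fields:

# Illustrative parameter inference with fallback defaults
def infer_params(config: dict) -> dict:
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        # A missing num_key_value_heads implies plain multi-head attention
        "head_count_kv": config.get("num_key_value_heads", heads),
        "rope_freq_base": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }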

Vision Model Support

Multimodal models are increasingly common. The tool preserves the vision tower configuration, though GGUF support for vision remains experimental: the parameters are extracted but may not be fully utilised.

Extracted vision parameters:

  • Vision embedding dimensions: Hidden size, typically differs from language dimensions
  • Vision transformer blocks: Encoder layers, fewer but wider than language
  • Vision attention heads: Usually standard MHA rather than grouped-query
  • Feed-forward dimensions: Different expansion ratios from language FFN
  • Patch configuration: Size (14×14), spatial merging, position encoding

Vision support is best-effort: the tool preserves what it finds but cannot guarantee that inference engines will use it.
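
As a rough illustration, the extraction amounts to reading config.json's nested vision section; the key names below are common but not universal, so treat them as assumptions:

# Hypothetical extraction of vision tower parameters from config.json
def extract_vision_params(config: dict) -> dict | None:
    vision = config.get("vision_config")
    if vision is None:
        return None  # text-only model
    return {
        "embedding_length": vision.get("hidden_size"),
        "block_count": vision.get("num_hidden_layers"),
        "head_count": vision.get("num_attention_heads"),
        "feed_forward_length": vision.get("intermediate_size"),
        "patch_size": vision.get("patch_size", 14),
    }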

Limitations

Understanding the limitations prevents frustration. The design favours broad compatibility over perfection.

  • F32 output only: Quantisation requires separate tools like quantise_gguf.py for bit depth control
  • Architecture guessing: Works for common patterns, novel architectures need manual specification
  • Tokeniser compatibility: Falls back to the Llama tokeniser when data is missing, which may cause issues with special tokens
  • Memory requirements: Loads entire tensors into RAM (a 70B model needs 140 GB+); there is no streaming support
  • No quantisation: Preserves full precision, quantise separately for deployment control
  • Limited validation: Ensures structural correctness but cannot verify output quality; test before deployment

Examples

Converting a custom model

Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool handles the SafeTensors→GGUF transformation.

# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model

# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model

# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py

Converting with specific architecture

Force architecture when autodetection fails or you know the model's lineage. Useful for fine-tuned models with custom names.

# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2

# Common forced architectures:
# --force-arch llama    # Most models
# --force-arch qwen2    # Qwen family
# --force-arch mistral  # Mistral variants

Batch conversion

Bash loops enable bulk conversion for comparing checkpoints or converting model families.

# Convert a directory of models, preserving originals
mkdir -p ./gguf ./logs
for model in ./models/*; do
    echo "Converting $(basename "$model")..."
    uv run safetensors2gguf.py "$model" \
        -o "./gguf/$(basename "$model").gguf" 2>&1 | \
        tee "./logs/$(basename "$model").log"
done

# Check results
ls -lh ./gguf/*.gguf

Integration with Quantisation Pipeline

The tool produces an F32 GGUF ready for quantisation. The typical pipeline:

  1. Download model: Get SafeTensors model from HuggingFace
  2. Convert to GGUF: Use this tool for architecture-agnostic conversion
  3. Quantise: Apply quantise_gguf.py for Bartowski-style variants
  4. Deploy: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines

This separation enables control at each stage: convert once, quantise to multiple bit depths, and test configurations without repeating the conversion.

Troubleshooting

Model produces gibberish after conversion

This usually indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are present. Custom tokenisers may need --force-arch.

Conversion succeeds but model won't load

Use a recent llama.cpp build: the GGUF format evolves, and older versions lack support for newer metadata. Also verify that any forced architecture matches the model's actual structure; forcing the wrong one creates invalid models.

Out of memory during conversion

The tool loads all weights simultaneously. For large models:

  • Close other applications to free RAM
  • Use a system with more memory (cloud instances work well)
  • Consider quantising from a pre-converted F16 model if available

Warning about unknown tensors

This is normal for custom layers. The tool preserves unknown tensors even though inference may not use them. It is harmless: better to include unused weights than to miss critical ones.