# safetensors2gguf.py - Direct SafeTensors Conversion
When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to
GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and
custom architectures that lack official support.
## Overview
Most transformer models share common tensor patterns regardless of architecture. While llama.cpp
requires explicit support for each architecture, this tool maps tensor names to GGUF conventions
and preserves metadata. It works well for models following standard transformer patterns.
## Features
The converter handles real-world models pragmatically:
- **Architecture-agnostic conversion**: Pattern matching identifies common tensor types, since
embeddings look similar across Llama, Qwen, or custom architectures
- **Intelligent tensor mapping**: Maps standard patterns (self_attn.q_proj → attn_q) whilst
preserving unrecognised tensors rather than dropping them
- **BFloat16 handling**: Optional PyTorch support for BF16→F32 conversion, as many models ship in
BF16 (see the sketch after this list)
- **Vision model support**: Extracts vision tower parameters for multimodal models
- **Tokeniser preservation**: Copies configuration wholesale to prevent garbled output from mismatches
- **Graceful fallbacks**: Unknown architectures default to the Llama structure, effective since most
models derive from Llama
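As a rough illustration of that BF16 path, here is a minimal sketch assuming the `safetensors` and
`torch` packages, with a hypothetical tensor name; the tool's own code may differ:
```python
# Minimal BF16 -> F32 sketch using safetensors' PyTorch backend.
# The file and tensor names are hypothetical examples.
from safetensors import safe_open
import torch

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    tensor = f.get_tensor("model.embed_tokens.weight")  # hypothetical name
    if tensor.dtype == torch.bfloat16:
        # NumPy has no native bfloat16, so upcast before export
        tensor = tensor.to(torch.float32)
    array = tensor.numpy()  # ready for the GGUF writer
```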
## Usage
Point the tool at a model directory and it handles the rest. Most models convert with defaults, though
forcing the architecture helps when autodetection fails.
### Basic Usage
```bash
# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory
```
### Command Line Options
```bash
# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf
# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2
# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf
```
## Supported Input Formats
The tool handles all common packaging formats. Sharding emerged when models outgrew file system
limits: a 70B model can span dozens of files. Fragments are reassembled transparently, whether they
are HuggingFace numbered shards or custom splits (see the sketch after the list below).
1. **Single file models**: `model.safetensors`, common for models under 10GB
2. **Sharded models**: `model-00001-of-00005.safetensors`, standard for large models; the tool
automatically finds and merges all shards in sequence
3. **Custom names**: Any `*.safetensors` files; some fine-tunes use non-standard naming, so the tool
scans for all SafeTensors files regardless of naming convention
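A rough sketch of the discovery step (not the tool's exact code), assuming the `safetensors` package;
a simple glob picks up single files, numbered shards, and custom names alike:
```python
# Rough sketch of shard discovery: collect every *.safetensors file and
# iterate tensors shard by shard in sorted order.
from pathlib import Path
from safetensors import safe_open

def iter_tensor_names(model_dir: str):
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    if not shards:
        raise FileNotFoundError("No safetensor files found")
    for shard in shards:
        with safe_open(str(shard), framework="pt", device="cpu") as f:
            for name in f.keys():
                yield shard.name, name
```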
## Architecture Mapping
Architecture mapping bridges naming chaos and GGUF's structured expectations. Model creators invent
their own names, but the patterns remain similar underneath. A translation table covers known
architectures; unknowns default to Llama, reasonable since most models are Llama-inspired.
Built-in mappings reflect real-world encounters:
- `DotsOCRForCausalLM` → `qwen2`: Dots OCR models are Qwen2-based despite the naming
- `GptOssForCausalLM` → `llama`: generic GPT models usually follow the Llama architecture
- Unknown architectures → `llama`: a safe default that works for most transformer models
Use `--force-arch` when you know better than autodetection. Particularly useful for fine-tuned
models with custom names but standard structure.
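A simplified sketch of that fallback logic (illustrative only; the tool's own table is larger and its
detection may differ):
```python
# Known architecture names map to GGUF architectures; anything unrecognised
# falls back to "llama". A --force-arch value overrides detection entirely.
ARCH_MAP = {
    "DotsOCRForCausalLM": "qwen2",  # Qwen2-based despite the naming
    "GptOssForCausalLM": "llama",   # generic GPT models usually follow Llama
}

def resolve_arch(config: dict, force_arch: str | None = None) -> str:
    if force_arch:
        return force_arch
    name = (config.get("architectures") or ["unknown"])[0]
    return ARCH_MAP.get(name, "llama")  # safe default for most transformers
```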
## Tensor Name Mapping
Tensor naming diverges most between formats. HuggingFace uses verbose hierarchical names
(`model.layers.0.self_attn.q_proj.weight`) while GGUF prefers terse ones (`blk.0.attn_q`). The mapping
preserves semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.
| Original Pattern | GGUF Name | Purpose |
|-----------------|-----------|------|
| `model.embed_tokens.weight` | `token_embd.weight` | Token embeddings mapping input IDs to vectors |
| `model.norm.weight` | `output_norm.weight` | Final layer normalisation before output |
| `lm_head.weight` | `output.weight` | Output projection to vocabulary space |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` | Query projection for attention layer N |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` | Key projection for attention layer N |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` | Value projection for attention layer N |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` | Gate projection in feedforward network |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` | Up projection expanding hidden dimension |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` | Down projection reducing to model dimension |
Pattern matching handles variations like `transformer.h.N` (GPT-style) or `model.decoder.layers.N`
(encoder-decoder) by identifying core patterns regardless of prefix.
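A minimal sketch of this mapping, based on the table above (the tool's real pattern matching is
broader; the names here are illustrative):
```python
# Map HuggingFace-style tensor names to GGUF names. Layer-level patterns are
# matched by the suffix after the numbered layer index.
import re

DIRECT = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}
LAYER = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}

def map_tensor_name(name: str) -> str:
    if name in DIRECT:
        return DIRECT[name]
    match = re.search(r"layers\.(\d+)\.(.+?)\.weight$", name)
    if match and match.group(2) in LAYER:
        return f"blk.{match.group(1)}.{LAYER[match.group(2)]}.weight"
    return name  # unrecognised tensors are preserved, not dropped
```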
## Configuration Requirements
Conversion requires a few core files, though missing optional components are forgiven. HuggingFace
downloads typically include everything; manually assembled models may lack critical configuration
(a preflight sketch follows the lists below).
Required files:
- **config.json**: Architecture name, layer counts, and vocabulary size; essential for structuring the GGUF
- **\*.safetensors**: Model weights; single or sharded files are handled automatically
Optional but recommended:
- **tokenizer_config.json**: Special tokens, chat templates, and tokeniser behaviour; when missing,
output is often garbled
- **tokenizer.json**: Vocabulary and merge rules; the tool extracts them from other sources if missing,
but inclusion ensures compatibility
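A quick preflight helper along these lines can catch missing files before conversion (a rough sketch,
not part of the tool itself):
```python
# Check required and recommended files before attempting a conversion.
from pathlib import Path

def preflight(model_dir: str) -> None:
    root = Path(model_dir)
    if not (root / "config.json").exists():
        raise FileNotFoundError("Config file not found")
    if not list(root.glob("*.safetensors")):
        raise FileNotFoundError("No safetensor files found")
    for optional in ("tokenizer.json", "tokenizer_config.json"):
        if not (root / optional).exists():
            print(f"Warning: {optional} missing - output may be garbled")
```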
## Output Format
GGUF bundles everything needed for inference in one file, unlike SafeTensors' scattered JSON
configuration. This simplifies deployment but requires careful metadata preservation during conversion.
The output file contains:
- **Model weights in F32**: Full precision, quantise later with dedicated tools
- **Architecture metadata**: Layer counts, dimensions, activations for model graph construction
- **Tokeniser configuration**: Vocabulary, special tokens, chat templates for model behaviour
- **Special token mappings**: BOS, EOS, UNK, and PAD tokens control generation and must match the training config
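To sanity-check what ended up in the file, the `gguf` Python package (the reader shipped alongside
llama.cpp's gguf-py) can list metadata and tensors; a minimal sketch with a hypothetical path:
```python
# Inspect a converted file: print metadata keys and the tensor inventory.
from gguf import GGUFReader

reader = GGUFReader("my-model-f32.gguf")  # hypothetical output path

for key in reader.fields:          # architecture, dimensions, tokeniser data
    print(key)

for tensor in reader.tensors:      # names, shapes, and types (F32 here)
    print(tensor.name, list(tensor.shape), tensor.tensor_type)
```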
## Error Handling
Error messages are actionable, explaining what went wrong, why it matters, and how to fix it.
| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Download the complete model including config.json, not just weights |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Verify the model uses the SafeTensors format; older models might use PyTorch .bin files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch (`uv add torch`) or accept potential precision loss in BF16→F32 conversion |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Research the model's base architecture and use `--force-arch` with the appropriate type |
## Technical Details
### Parameter Inference
Parameter inference bridges naming inconsistencies: Llama's `num_attention_heads` is GPT's
`n_heads`. A translation layer provides sensible defaults for missing values.
Configuration mapping with defaults chosen from common models:
- `vocab_size` → vocabulary size (default: 32000, Llama's vocabulary)
- `max_position_embeddings` → context length (default: 2048, conservative for compatibility)
- `hidden_size` → embedding dimension (default: 4096, typical for 7B models)
- `num_hidden_layers` → transformer blocks (default: 32, standard for 7B models)
- `num_attention_heads` → attention heads (default: 32, balanced for a 4096 dimension)
- `num_key_value_heads` → KV heads for GQA (defaults to the attention head count, which assumes MHA rather than GQA)
- `rope_theta` → RoPE frequency base (default: 10000.0, the standard RoPE configuration)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5, a numerical stability threshold)
Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.
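The defaults above amount to simple config lookups; a sketch of the idea (the output key names are
illustrative, not the tool's exact metadata keys):
```python
# Read known config keys with the documented fallbacks.
def infer_params(config: dict) -> dict:
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        "head_count_kv": config.get("num_key_value_heads", heads),  # MHA assumed
        "rope_theta": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }
```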
### Vision Model Support
Multimodal models are increasingly common. The tool preserves vision tower configuration, though GGUF
support remains experimental: vision parameters are extracted but may not be fully utilised.
Extracted vision parameters:
- **Vision embedding dimensions**: Hidden size, typically differs from language dimensions
- **Vision transformer blocks**: Encoder layers, fewer but wider than language
- **Vision attention heads**: Usually standard MHA rather than grouped-query
- **Feed-forward dimensions**: Different expansion ratios from language FFN
- **Patch configuration**: Size (14×14), spatial merging, position encoding
Vision support is best-effort: the tool preserves what it finds but can't guarantee the inference
engine will use it.
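As a rough illustration, vision parameters typically live in a nested block of config.json; the
`vision_config` key and field names below follow common HuggingFace conventions and vary per model:
```python
# Pull vision tower parameters from config.json, if present.
import json

with open("config.json") as fh:
    config = json.load(fh)

vision = config.get("vision_config", {})  # key name varies per model family
vision_params = {
    "embedding_length": vision.get("hidden_size"),
    "block_count": vision.get("num_hidden_layers"),
    "head_count": vision.get("num_attention_heads"),
    "feed_forward_length": vision.get("intermediate_size"),
    "patch_size": vision.get("patch_size", 14),
}
```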
## Limitations
Understanding the limitations prevents frustration. The design favours broad compatibility over perfection.
- **F32 output only**: Quantisation requires separate tools like quantise_gguf.py for bit depth control
- **Architecture guessing**: Works for common patterns, novel architectures need manual specification
- **Tokeniser compatibility**: Falls back to the Llama tokeniser when data is missing, which may cause
issues with special tokens
- **Memory requirements**: Loads entire tensors into RAM (a 70B model needs 140GB+); no streaming support
- **No quantisation**: Preserves full precision, quantise separately for deployment control
- **Limited validation**: Ensures structural validity but can't verify output quality; test before deployment
## Examples
### Converting a custom model
Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool
handles the SafeTensors→GGUF transformation.
```bash
# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model
# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model
# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py
```
### Converting with specific architecture
Force architecture when autodetection fails or you know the model's lineage. Useful for fine-tuned
models with custom names.
```bash
# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2
# Common forced architectures:
# --force-arch llama # Most models
# --force-arch qwen2 # Qwen family
# --force-arch mistral # Mistral variants
```
### Batch conversion
Bash loops enable bulk conversion for comparing checkpoints or converting model families.
```bash
# Convert a directory of models, preserving originals
mkdir -p ./gguf ./logs
for model in ./models/*; do
  name=$(basename "$model")
  echo "Converting $name..."
  uv run safetensors2gguf.py "$model" \
    -o "./gguf/$name.gguf" 2>&1 | \
    tee "./logs/$name.log"
done
# Check results
ls -lh ./gguf/*.gguf
```
## Integration with Quantisation Pipeline
Tool produces F32 GGUF ready for quantisation. Typical pipeline:
1. **Download model**: Get SafeTensors model from HuggingFace
2. **Convert to GGUF**: Use this tool for architecture-agnostic conversion
3. **Quantise**: Apply quantise_gguf.py for Bartowski-style variants
4. **Deploy**: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines
Separation enables control at each stage. Convert once, quantise to multiple bit depths, test
configurations without repeating conversion.
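For scripted pipelines, the conversion step can also be driven from Python; a minimal sketch using
only the documented `-o` flag (paths hypothetical, quantisation left to quantise_gguf.py's own options):
```python
# Step 2 of the pipeline: convert SafeTensors to an F32 GGUF.
import subprocess

subprocess.run(
    ["uv", "run", "safetensors2gguf.py", "./my-model",
     "-o", "./gguf/my-model-f32.gguf"],
    check=True,
)
# Step 3: quantise ./gguf/my-model-f32.gguf with quantise_gguf.py, then deploy.
```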
## Troubleshooting
### Model produces gibberish after conversion
This usually indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are
present. Custom tokenisers may need `--force-arch`.
### Conversion succeeds but model won't load
Use a recent llama.cpp build: the GGUF format evolves, and older versions lack support for newer
metadata. Also verify that a forced architecture matches the actual structure; forcing the wrong one
creates invalid models.
### Out of memory during conversion
The tool loads all weights simultaneously. For large models:
- Close other applications to free RAM
- Use a system with more memory (cloud instances work well)
- Consider quantising from a pre-converted F16 model if available
### Warning about unknown tensors
This is normal for custom layers. The tool preserves unknown tensors even though inference may not
use them. Harmless: better to include unused weights than to miss critical ones.