Switch to llama-cpp-python
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

# safetensors2gguf.py - Direct SafeTensors Conversion

Direct SafeTensors to GGUF converter for unsupported architectures.

When llama.cpp doesn't recognise your model architecture, this tool provides direct SafeTensors to
GGUF conversion. It bypasses llama.cpp's architecture-specific logic for experimental models and
custom architectures that lack official support.

## Overview

Most transformer models share common tensor patterns regardless of architecture. While llama.cpp
requires explicit support for each architecture, this tool maps tensor names to GGUF conventions
and preserves metadata. It works well for models following standard transformer patterns.

## Features

The converter handles real-world models pragmatically:

- **Architecture-agnostic conversion**: Pattern matching identifies common tensor types – embeddings
  look similar across Llama, Qwen, or custom architectures
- **Intelligent tensor mapping**: Maps standard patterns (self_attn.q_proj → attn_q) whilst
  preserving unrecognised tensors rather than dropping them
- **BFloat16 handling**: Optional PyTorch for BF16→F32 conversion, as many models ship in BF16 (see
  the sketch after this list)
- **Vision model support**: Extracts vision tower parameters for multimodal models
- **Tokeniser preservation**: Copies configuration wholesale to prevent garbled output from mismatches
- **Graceful fallbacks**: Unknown architectures default to Llama structure – effective since most
  models derive from Llama

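The BF16 step amounts to an optional upcast. A minimal sketch, assuming NumPy arrays or PyTorch
tensors as input; `to_float32` is an illustrative helper name, not the tool's actual function:

```python
# Minimal sketch of the optional BF16 -> F32 upcast. The torch import is
# guarded so conversion still works for F16/F32 models without PyTorch.
import numpy as np

try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False


def to_float32(tensor) -> np.ndarray:
    """Return an F32 NumPy array, upcasting BF16 via torch when available."""
    if HAS_TORCH and isinstance(tensor, torch.Tensor) and tensor.dtype == torch.bfloat16:
        return tensor.to(torch.float32).numpy()
    return np.asarray(tensor, dtype=np.float32)
```
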
## Usage

Point at a model directory and the tool handles the rest. Most models convert with defaults, though
forcing the architecture helps when autodetection fails.

### Basic Usage

```bash
# Convert a local SafeTensors model - autodetects architecture
uv run safetensors2gguf.py /path/to/model/directory
```

### Command Line Options

```bash
# Specify output location - useful for organising converted models
uv run safetensors2gguf.py /path/to/model -o output.gguf

# Force architecture when autodetection fails or for better compatibility
uv run safetensors2gguf.py /path/to/model --force-arch qwen2

# Convert with full path control - keeps originals safe
uv run safetensors2gguf.py ./my-model --output ./converted/my-model.gguf
```

## Supported Input Formats

The tool handles all packaging formats. Sharding emerged when models exceeded file system limits –
a 70B model spans dozens of files. The tool reassembles fragments transparently, whether the shards
are HuggingFace-numbered or custom splits.

1. **Single file models**: `model.safetensors` – common for models under 10GB
2. **Sharded models**: `model-00001-of-00005.safetensors` – standard for large models; the tool
   automatically finds and merges all shards in sequence
3. **Custom names**: Any `*.safetensors` files – some fine-tunes use non-standard naming; the tool
   scans for all SafeTensors files regardless of naming convention (see the sketch below)

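For illustration, shard discovery and merging can be sketched with the `safetensors` package;
`load_all_tensors` is a hypothetical helper, not the tool's internal API:

```python
# Sketch of shard discovery: read every *.safetensors file in the directory,
# regardless of naming convention, and merge into one tensor dict.
from pathlib import Path

from safetensors.numpy import load_file


def load_all_tensors(model_dir: str) -> dict:
    """Collect every tensor from every shard (BF16 shards need the PyTorch loader instead)."""
    shard_paths = sorted(Path(model_dir).glob("*.safetensors"))
    if not shard_paths:
        raise FileNotFoundError("No safetensor files found")
    tensors = {}
    for shard in shard_paths:
        # Shards never repeat tensor names, so a plain dict merge is safe.
        tensors.update(load_file(str(shard)))
    return tensors
```
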
## Architecture Mapping

Architecture mapping bridges naming chaos and GGUF's structured expectations. Model creators invent
their own names, but the patterns remain similar underneath. A translation table covers known
architectures; unknowns default to Llama – reasonable since most models are Llama-inspired.

Built-in mappings reflect real-world encounters:

- `DotsOCRForCausalLM` → `qwen2` – Dots OCR models are Qwen2-based despite the naming
- `GptOssForCausalLM` → `llama` – Generic GPT models usually follow Llama architecture
- Unknown architectures → `llama` – Safe default that works for most transformer models

Use `--force-arch` when you know better than autodetection. It is particularly useful for fine-tuned
models with custom names but standard structure. The fallback logic is sketched below.

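A minimal sketch of that lookup, mirroring the documented mappings; `resolve_architecture` is an
illustrative name rather than the converter's actual function:

```python
# Known HuggingFace architecture names mapped to GGUF architecture strings.
ARCH_MAP = {
    "DotsOCRForCausalLM": "qwen2",  # Dots OCR models are Qwen2-based
    "GptOssForCausalLM": "llama",
}


def resolve_architecture(config: dict, force_arch: str | None = None) -> str:
    """Pick the GGUF architecture: --force-arch wins, then the table, then the Llama fallback."""
    if force_arch:
        return force_arch
    reported = (config.get("architectures") or ["unknown"])[0]
    return ARCH_MAP.get(reported, "llama")
```
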
## Tensor Name Mapping

Tensor naming diverges most between formats. HuggingFace uses verbose hierarchical names
(`model.layers.0.self_attn.q_proj.weight`); GGUF prefers terse ones (`blk.0.attn_q`). The mapping
preserves semantics whilst adapting conventions, enabling cross-ecosystem compatibility with llama.cpp.

| Original Pattern | GGUF Name | Purpose |
|------------------|-----------|---------|
| `model.embed_tokens.weight` | `token_embd.weight` | Token embeddings – maps input IDs to vectors |
| `model.norm.weight` | `output_norm.weight` | Final layer normalisation before output |
| `lm_head.weight` | `output.weight` | Output projection to vocabulary space |
| `layers.N.self_attn.q_proj` | `blk.N.attn_q` | Query projection for attention layer N |
| `layers.N.self_attn.k_proj` | `blk.N.attn_k` | Key projection for attention layer N |
| `layers.N.self_attn.v_proj` | `blk.N.attn_v` | Value projection for attention layer N |
| `layers.N.mlp.gate_proj` | `blk.N.ffn_gate` | Gate projection in feedforward network |
| `layers.N.mlp.up_proj` | `blk.N.ffn_up` | Up projection expanding hidden dimension |
| `layers.N.mlp.down_proj` | `blk.N.ffn_down` | Down projection reducing to model dimension |

Pattern matching handles variations like `transformer.h.N` (GPT-style) or `model.decoder.layers.N`
(encoder-decoder) by identifying core patterns regardless of prefix, as sketched below.

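The idea reduces to a small translation function. A sketch based on the table above; the regex and
dictionaries are illustrative, not the converter's exact internals:

```python
import re

# Direct one-to-one renames.
DIRECT = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}

# Per-layer names: any prefix, then `layers.N.` or `h.N.`, then a known suffix.
LAYER_RE = re.compile(r"^(?:.*\.)?(?:layers|h)\.(\d+)\.(.+)$")
SUFFIXES = {
    "self_attn.q_proj.weight": "attn_q.weight",
    "self_attn.k_proj.weight": "attn_k.weight",
    "self_attn.v_proj.weight": "attn_v.weight",
    "mlp.gate_proj.weight": "ffn_gate.weight",
    "mlp.up_proj.weight": "ffn_up.weight",
    "mlp.down_proj.weight": "ffn_down.weight",
}


def map_tensor_name(name: str) -> str:
    """Translate a HuggingFace tensor name to its GGUF equivalent."""
    if name in DIRECT:
        return DIRECT[name]
    match = LAYER_RE.match(name)
    if match and match.group(2) in SUFFIXES:
        return f"blk.{match.group(1)}.{SUFFIXES[match.group(2)]}"
    return name  # unrecognised tensors keep their original name rather than being dropped
```
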
## Configuration Requirements

Conversion requires a few core files, though optional components are forgiven. HuggingFace downloads
typically include everything; manually assembled models may lack critical configuration.

Required files:

- **config.json**: Architecture name, layer counts, vocabulary size – essential for structuring the GGUF
- **\*.safetensors**: Model weights, single or sharded – handled automatically

Optional but recommended:

- **tokenizer_config.json**: Special tokens, chat templates, tokeniser behaviour – missing it often
  causes garbled output
- **tokenizer.json**: Vocabulary and merge rules – the tool extracts from other sources if missing, but
  inclusion ensures compatibility

## Output Format

GGUF bundles everything for inference in one file, unlike SafeTensors' scattered JSON configuration.
This simplifies deployment but requires careful metadata preservation during conversion.

The output file contains:

- **Model weights in F32**: Full precision; quantise later with dedicated tools
- **Architecture metadata**: Layer counts, dimensions, activations for model graph construction
- **Tokeniser configuration**: Vocabulary, special tokens, chat templates for model behaviour
- **Special token mappings**: BOS, EOS, UNK, PAD – these control generation and must match the training config

A sketch of how such a file is assembled follows.

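As a rough illustration only (not the tool's actual code), GGUF assembly with the `gguf` Python
package looks roughly like this; method names reflect the package at time of writing and may differ
between versions, and the values shown are placeholders:

```python
# Hedged sketch of writing a GGUF file: metadata first, then tensors.
import numpy as np
import gguf

writer = gguf.GGUFWriter("model-f32.gguf", arch="llama")

# Architecture metadata inferred from config.json (illustrative values).
writer.add_context_length(2048)
writer.add_embedding_length(4096)
writer.add_block_count(32)
writer.add_head_count(32)

# Tensors: already renamed to GGUF conventions and upcast to F32.
writer.add_tensor("token_embd.weight", np.zeros((8, 4), dtype=np.float32))  # tiny placeholder

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```
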
## Error Handling

Error messages are actionable – explaining what went wrong, why it matters, and how to fix it.

| Error | Message | Solution |
|-------|---------|----------|
| Missing config.json | `FileNotFoundError: Config file not found` | Download the complete model including config.json, not just weights |
| No SafeTensors files | `FileNotFoundError: No safetensor files found` | Verify the model uses SafeTensors format – older models might use PyTorch .bin files |
| BFloat16 without PyTorch | `Warning: PyTorch not available, BFloat16 models may not convert properly` | Install PyTorch (`uv add torch`) or accept potential precision loss in BF16→F32 conversion |
| Unknown architecture | `Warning: Unknown architecture X, using llama as fallback` | Research the model's base architecture and use `--force-arch` with the appropriate type |

## Technical Details
### Parameter Inference

Parameter inference bridges naming inconsistencies. Llama's `num_attention_heads` is GPT's
`n_heads`. A translation layer provides sensible defaults for missing values.

Configuration mapping with defaults chosen from common models:

- `vocab_size` → vocabulary size (default: 32000 – Llama's vocabulary)
- `max_position_embeddings` → context length (default: 2048 – conservative for compatibility)
- `hidden_size` → embedding dimension (default: 4096 – typical for 7B models)
- `num_hidden_layers` → transformer blocks (default: 32 – standard for 7B models)
- `num_attention_heads` → attention heads (default: 32 – balanced for 4096 dimension)
- `num_key_value_heads` → KV heads for GQA (defaults to attention heads – assumes MHA not GQA)
- `rope_theta` → RoPE frequency base (default: 10000.0 – standard RoPE configuration)
- `rms_norm_eps` → layer normalisation epsilon (default: 1e-5 – numerical stability threshold)

Defaults work for most models. Wrong parameters may not error immediately but degrade output quality.

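In code, the translation is little more than dictionary lookups with the defaults listed above. A
sketch assuming HuggingFace-style config keys; `infer_params` is a hypothetical helper name:

```python
# Sketch of config.json -> GGUF parameter inference with the documented defaults.
def infer_params(config: dict) -> dict:
    heads = config.get("num_attention_heads", 32)
    return {
        "vocab_size": config.get("vocab_size", 32000),
        "context_length": config.get("max_position_embeddings", 2048),
        "embedding_length": config.get("hidden_size", 4096),
        "block_count": config.get("num_hidden_layers", 32),
        "head_count": heads,
        "head_count_kv": config.get("num_key_value_heads", heads),  # assume MHA when absent
        "rope_freq_base": config.get("rope_theta", 10000.0),
        "rms_norm_eps": config.get("rms_norm_eps", 1e-5),
    }
```
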
### Vision Model Support

Multimodal models are increasingly common. The tool preserves vision tower configuration, though GGUF
support remains experimental. Vision parameters are extracted but may not be fully utilised.

Extracted vision parameters:

- **Vision embedding dimensions**: Hidden size, typically differs from the language dimensions
- **Vision transformer blocks**: Encoder layers, fewer but wider than the language blocks
- **Vision attention heads**: Usually standard MHA rather than grouped-query
- **Feed-forward dimensions**: Different expansion ratios from the language FFN
- **Patch configuration**: Size (14×14), spatial merging, position encoding

Vision support is best-effort – the tool preserves what it finds but can't guarantee inference
engines will use it.

## Limitations

Understanding the limitations prevents frustration. The design favours broad compatibility over
perfection.

- **F32 output only**: Quantisation requires separate tools like quantise_gguf.py for bit depth control
- **Architecture guessing**: Works for common patterns; novel architectures need manual specification
- **Tokeniser compatibility**: Falls back to the Llama tokeniser when data is missing – may cause issues
  with special tokens
- **Memory requirements**: Loads entire tensors into RAM – a 70B model needs 140GB+, no streaming support
- **No quantisation**: Preserves full precision; quantise separately for deployment control
- **Limited validation**: Ensures structure but can't verify output quality – test before deployment

## Examples
### Converting a custom model

Typical workflow: download from HuggingFace, convert to GGUF, quantise for deployment. This tool
handles the SafeTensors→GGUF transformation.

```bash
# Download complete model with all configuration files
git clone https://huggingface.co/my-org/my-model ./my-model

# Convert to GGUF - automatic architecture detection
uv run safetensors2gguf.py ./my-model

# Output appears at ./my-model/my-model-f32.gguf
# Now ready for quantisation with quantise_gguf.py
```

### Converting with specific architecture

Force the architecture when autodetection fails or you know the model's lineage. Useful for
fine-tuned models with custom names.

```bash
# Force Qwen2 architecture for a model you know is Qwen2-based
uv run safetensors2gguf.py ./qwen-model --force-arch qwen2

# Common forced architectures:
# --force-arch llama    # Most models
# --force-arch qwen2    # Qwen family
# --force-arch mistral  # Mistral variants
```

### Batch conversion

Bash loops enable bulk conversion for comparing checkpoints or converting model families.

```bash
# Convert a directory of models, preserving originals
for model in ./models/*; do
  echo "Converting $(basename $model)..."
  uv run safetensors2gguf.py "$model" \
    -o "./gguf/$(basename $model).gguf" 2>&1 | \
    tee "./logs/$(basename $model).log"
done

# Check results
ls -lh ./gguf/*.gguf
```

## Integration with Quantisation Pipeline

The tool produces an F32 GGUF ready for quantisation. The typical pipeline:

1. **Download model**: Get the SafeTensors model from HuggingFace
2. **Convert to GGUF**: Use this tool for architecture-agnostic conversion
3. **Quantise**: Apply quantise_gguf.py for Bartowski-style variants
4. **Deploy**: Use with llama.cpp, Ollama, or other GGUF-compatible inference engines

This separation enables control at each stage: convert once, quantise to multiple bit depths, and
test configurations without repeating the conversion.

## Troubleshooting
### Model produces gibberish after conversion

This indicates a tokeniser mismatch. Ensure tokenizer.json and tokenizer_config.json are present.
Custom tokenisers may need `--force-arch`.

### Conversion succeeds but model won't load

Use a recent llama.cpp – the GGUF format evolves, and older versions lack support for newer metadata.
Verify that any forced architecture matches the model's actual structure – forcing the wrong one
creates invalid models.

### Out of memory during conversion

The tool loads all weights simultaneously. For large models:

- Close other applications to free RAM
- Use a system with more memory (cloud instances work well)
- Consider quantising from a pre-converted F16 model if available

### Warning about unknown tensors

This is normal for custom layers. The tool preserves unknown tensors even though inference may not
use them. It's harmless – better to include unused weights than miss critical ones.