
Development Guide

Contributing to GGUF tools requires understanding quantisation workflows and Python's modern dependency ecosystem. This guide covers setup, standards, and architectural decisions for fixing bugs, adding quantisation profiles, or extending conversion capabilities.

Code Quality

Ruff replaces the traditional Black/isort/flake8 stack as both linter and formatter, while mypy provides static type checking to catch type-related bugs before runtime. The project has zero tolerance for linting and type errors, which catches issues early. Both tools are configured extensively in pyproject.toml to enforce only the code quality standards the project has selected. Debug logging reveals quantisation internals when models fail.

# Run linting - catches style violations, potential bugs, and code smells
uvx ruff check

# Format code - enforces consistent style automatically
uvx ruff format

# Run type checking - ensures type safety and catches potential bugs
uv run mypy .

# Run with debug logging - reveals conversion steps and tensor processing
DEBUG=true uv run <script>

Project Structure

Architecture separates concerns cleanly: top-level scripts provide interfaces, helpers encapsulate reusable logic, and resources contain community data. The structure evolved from practical needs: helpers emerged to eliminate duplication, and services to abstract external dependencies.

llm-gguf-tools/
├── quantise.py                    # Bartowski quantisation tool - the main workflow
├── direct_safetensors_to_gguf.py  # Direct conversion for unsupported architectures
├── helpers/                       # Shared utilities and abstractions
│   ├── __init__.py
│   ├── logger.py                  # Colour-coded logging with context awareness
│   ├── services/                  # External service wrappers
│   │   ├── gguf.py                # GGUF writer abstraction
│   │   └── llama_python.py        # llama-cpp-python integration
│   └── utils/                     # Pure utility functions
│       ├── config_parser.py       # Model configuration handling
│       └── tensor_mapping.py      # Architecture-specific tensor name mapping
├── resources/                     # Resource files and calibration data
│   └── imatrix_data.txt           # Curated calibration data from Bartowski
├── docs/                          # Detailed documentation
│   ├── quantise_gguf.md           # Quantisation strategies and profiles
│   ├── safetensors2gguf.md        # Direct conversion documentation
│   ├── bartowski_analysis.md      # Deep dive into variant strategies
│   ├── imatrix_data.md            # Importance matrix guide
│   └── development.md             # This guide
└── pyproject.toml                 # Modern Python project configuration

Contributing Guidelines

The project values pragmatic solutions over theoretical perfection: working code that handles edge cases beats elegant abstractions. Contributors should understand how quantisation profiles map to Bartowski's discoveries and where Python-C++ boundaries limit functionality.

Essential requirements:

  1. Style consistency: Run uvx ruff format before committing to keep diffs focused on logic
  2. Documentation: Google-style docstrings explain behaviour and rationale beyond type hints
  3. Type safety: Complete type hints for all public functions enable IDE support
  4. Practical testing: Test with both 1B and 7B+ models to catch scaling issues
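To illustrate the docstring and type-hint requirements together, here is a minimal sketch in the expected style. The function `quantise_layer` is hypothetical and not part of this repository; only the Google-style docstring layout and complete type hints are the point.

```python
def quantise_layer(weights: list[float], bits: int = 4) -> list[int]:
    """Quantise a layer's weights to the given bit depth.

    Args:
        weights: Raw floating-point weights for a single tensor.
        bits: Target bit depth; lower values trade accuracy for size.

    Returns:
        Integer codes in the range [0, 2**bits - 1].

    Raises:
        ValueError: If weights is empty.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    lo, hi = min(weights), max(weights)
    # Guard against a zero scale when all weights are identical.
    scale = (hi - lo) / (2**bits - 1) or 1.0
    return [round((w - lo) / scale) for w in weights]
```

The docstring explains behaviour and failure modes that the signature alone cannot, which is exactly the rationale behind requirement 2.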

Development Workflow

Setting Up Development Environment

The project uses uv for dependency management: it is Rust-fast, manages Python versions automatically, and resolves dependencies upfront. Development dependencies include ruff, type stubs, and optional PyTorch for BFloat16 handling.

# Clone the repository - uses Forgejo (GitLab-like) hosting
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools

# Install all dependencies including dev tools
# This installs llama-cpp-python with CUDA support if available
uv sync --all-groups

Code Style

Code style reduces cognitive load by letting reviewers focus on logic rather than layout. UK English maintains llama.cpp consistency. The 100-character line limit balances descriptive names with readability.

Core conventions:

  • PEP 8 compliance: Ruff catches mutable defaults, unused imports automatically
  • UK English: "Optimise" not "optimize", matching upstream llama.cpp
  • Line length: 100 characters maximum except URLs or unbreakable paths
  • Type annotations: Complete hints for public functions act as documentation that can't go stale
  • Import ordering: Standard library, then third-party, then local; ruff handles this automatically
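The conventions above combine naturally in practice. This sketch is illustrative only (`optimise_output_path` does not exist in the codebase); note the import grouping, the UK English spelling in the identifier, and the complete type hints.

```python
# Standard library imports first, then third-party, then local modules.
from pathlib import Path


def optimise_output_path(model_name: str, suffix: str = "gguf") -> Path:
    """Build an output path, using UK English naming to match upstream llama.cpp."""
    # Flatten HuggingFace-style "org/model" names into a single filename.
    safe_name = model_name.replace("/", "_")
    return Path("output") / f"{safe_name}.{suffix}"
```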

Testing

Formal tests are pending. Quantisation "correctness" depends on complex interactions between model architecture, strategy, and downstream usage, so benchmark performance doesn't guarantee production success.

Current validation approach:

  • End-to-end testing: Qwen 0.5B for quick iteration, Llama 3.2 1B for architecture compatibility
  • Output validation: GGUF must load in llama.cpp and degrade gracefully, not produce gibberish
  • Error handling: Test corrupted SafeTensors, missing configs, insufficient disk space
  • Logger consistency: Verify colour coding across terminals, progress bars with piped output
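The corrupted-SafeTensors case above can be exercised without a real model, because the SafeTensors container format is simple: an 8-byte little-endian header length followed by a JSON header. This sketch is not the project's actual validation code, just one way to detect the most common corruptions early.

```python
import json
import struct


def safetensors_header(raw: bytes) -> dict:
    """Parse and sanity-check a SafeTensors header, raising ValueError on corruption."""
    if len(raw) < 8:
        raise ValueError("file too short for a SafeTensors header")
    # First 8 bytes: unsigned little-endian 64-bit header length.
    (header_len,) = struct.unpack("<Q", raw[:8])
    if header_len > len(raw) - 8:
        raise ValueError("declared header length exceeds file size")
    try:
        header = json.loads(raw[8 : 8 + header_len])
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("header is not valid JSON") from exc
    if not isinstance(header, dict):
        raise ValueError("header must be a JSON object")
    return header
```

Feeding truncated or bit-flipped files through a check like this turns silent downstream failures into immediate, explainable errors.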

Debugging

Debug logging transforms the black box into a glass box, revealing failure points. Colour coding highlights stages: blue (info), yellow (warnings), red (errors), green (success). The visual hierarchy enables efficient log scanning.

# Enable comprehensive debug output
DEBUG=true uv run direct_safetensors_to_gguf.py ./model  # Tensor mapping details
DEBUG=true uv run quantise.py <model_url>                # Memory usage tracking
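The colour coding and DEBUG gating described above follow a common pattern, sketched below. This is an assumption about the approach, not the actual contents of helpers/logger.py; the real implementation may differ.

```python
import os
import sys

# ANSI colour codes matching the palette described above.
COLOURS = {"info": "\033[34m", "warning": "\033[33m", "error": "\033[31m", "success": "\033[32m"}
RESET = "\033[0m"
DEBUG = os.environ.get("DEBUG", "").lower() == "true"


def log(level: str, message: str) -> str:
    """Format a log line, colouring only when writing to a real terminal."""
    if sys.stdout.isatty():
        return f"{COLOURS.get(level, '')}{message}{RESET}"
    return message  # plain text when piped, keeping logs grep-friendly


def debug(message: str) -> None:
    """Emit extra detail only when DEBUG=true is set in the environment."""
    if DEBUG:
        print(log("info", message))
```

The isatty check matters for the "progress bars with piped output" validation mentioned earlier: escape codes that look fine in a terminal become noise in redirected logs.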

Debug output reveals:

  • Download progress: Bytes transferred, retries, connection issues
  • Conversion pipeline: SafeTensors→GGUF steps, tensor mappings, dimension changes
  • Quantisation decisions: Layer bit depths, importance matrix effects on weight selection
  • Memory usage: Peak consumption for predicting larger model requirements
  • File operations: Read/write/temp patterns for disk usage analysis
  • Error context: Stack traces with local variables at failure points