Switch to llama-cpp-python

Tom Foster 2025-08-08 21:40:15 +01:00
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

# Development Guide
Contributing to GGUF tools requires understanding quantisation workflows and Python's modern
dependency ecosystem. This guide covers setup, standards, and architectural decisions for fixing
bugs, adding quantisation profiles, or extending conversion capabilities.
## Code Quality
Ruff replaces the traditional Black/isort/flake8 stack as both linter and formatter, while mypy
provides static type checking to catch type-related bugs before runtime. A zero-tolerance policy
for linting and type errors catches issues early. Both tools are configured extensively in
`pyproject.toml` to enforce only the code quality standards we've selected. Debug logging reveals
quantisation internals when models fail.
```bash
# Run linting - catches style violations, potential bugs, and code smells
uvx ruff check

# Format code - enforces consistent style automatically
uvx ruff format

# Run type checking - ensures type safety and catches potential bugs
uv run mypy .

# Run with debug logging - reveals conversion steps and tensor processing
DEBUG=true uv run <script>
```
## Project Structure
The architecture separates concerns cleanly: top-level scripts provide interfaces, helpers
encapsulate reusable logic, and resources contain community data. The structure evolved from
practical needs: helpers emerged to eliminate duplication, services to abstract external
dependencies.
```plain
llm-gguf-tools/
├── quantise.py                    # Bartowski quantisation tool - the main workflow
├── direct_safetensors_to_gguf.py # Direct conversion for unsupported architectures
├── helpers/                       # Shared utilities and abstractions
│   ├── __init__.py
│   ├── logger.py                  # Colour-coded logging with context awareness
│   ├── services/                  # External service wrappers
│   │   ├── gguf.py                # GGUF writer abstraction
│   │   └── llama_python.py        # llama-cpp-python integration
│   └── utils/                     # Pure utility functions
│       ├── config_parser.py       # Model configuration handling
│       └── tensor_mapping.py      # Architecture-specific tensor name mapping
├── resources/                     # Resource files and calibration data
│   └── imatrix_data.txt           # Curated calibration data from Bartowski
├── docs/                          # Detailed documentation
│   ├── quantise_gguf.md           # Quantisation strategies and profiles
│   ├── safetensors2gguf.md        # Direct conversion documentation
│   ├── bartowski_analysis.md      # Deep dive into variant strategies
│   ├── imatrix_data.md            # Importance matrix guide
│   └── development.md             # This guide
└── pyproject.toml                 # Modern Python project configuration
```
## Contributing Guidelines
The project values pragmatic solutions over theoretical perfection: working code that handles edge
cases beats elegant abstractions. Contributors should understand how quantisation profiles map to
Bartowski's discoveries and where Python-C++ boundaries limit functionality.
Essential requirements:
1. **Style consistency**: Run `uvx ruff format` before committing to keep diffs focused on logic
2. **Documentation**: Google-style docstrings explain behaviour and rationale beyond type hints
3. **Type safety**: Complete type hints for all public functions enable IDE support
4. **Practical testing**: Test with both 1B and 7B+ models to catch scaling issues
## Development Workflow
### Setting Up Development Environment
The project uses `uv` for dependency management: it's Rust-fast, manages Python versions
automatically, and resolves dependencies upfront. Development dependencies include ruff, type
stubs, and optional PyTorch for BFloat16 handling.
```bash
# Clone the repository - uses Forgejo (GitLab-like) hosting
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools

# Install all dependencies including dev tools
# This installs llama-cpp-python with CUDA support if available
uv sync --all-groups
```
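To confirm the native bindings actually built and load in your environment, a minimal sanity check (not part of the project's tooling, just a quick assumption-free probe):

```python
# Minimal post-install check: if this import succeeds, the compiled
# llama.cpp bindings loaded correctly in this environment
import llama_cpp

print(llama_cpp.__version__)  # prints the installed package version
```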
### Code Style
Code style reduces cognitive load by letting reviewers focus on logic rather than layout. UK
English maintains consistency with upstream llama.cpp. The 100-character line limit balances
descriptive names with readability.
Core conventions:
- **PEP 8 compliance**: Ruff catches mutable defaults, unused imports, and similar issues automatically
- **UK English**: "Optimise" not "optimize", matching upstream llama.cpp
- **Line length**: 100 characters maximum, except URLs or unbreakable paths
- **Type annotations**: Complete hints for all public functions act as documentation that can't go stale (see the sketch after this list)
- **Import ordering**: Standard library, then third-party, then local; ruff handles this automatically
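Taken together, a compliant function might look like this (a hypothetical example; the name and behaviour are illustrative, not taken from the codebase):

```python
from pathlib import Path


def estimate_output_size(model_path: Path, bits_per_weight: float = 4.5) -> int:
    """Estimate the size of a quantised GGUF file in bytes.

    Uses the source file size as a proxy for parameter count, assuming
    16-bit source weights, so the result is a rough guide only.

    Args:
        model_path: Path to the source SafeTensors or GGUF file.
        bits_per_weight: Average bits per weight for the target profile.

    Returns:
        Estimated output size in bytes.
    """
    source_bytes = model_path.stat().st_size
    return int(source_bytes * bits_per_weight / 16)
```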
### Testing
Formal tests are still pending: quantisation "correctness" depends on complex interactions between
model architecture, quantisation strategy, and downstream usage, and benchmark performance alone
doesn't guarantee production success.
Current validation approach:
- **End-to-end testing**: Qwen 0.5B for quick iteration, Llama 3.2 1B for architecture compatibility
- **Output validation**: GGUF must load in llama.cpp and degrade gracefully, not produce gibberish (see the sketch after this list)
- **Error handling**: Test corrupted SafeTensors, missing configs, insufficient disk space
- **Logger consistency**: Verify colour coding across terminals, progress bars with piped output
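A minimal end-to-end smoke test using llama-cpp-python's high-level API might look like the following (the model path and expected output are placeholders, not the project's actual test harness):

```python
# Smoke test: load a freshly quantised GGUF and check it produces
# coherent text rather than gibberish or a loader error
from llama_cpp import Llama

llm = Llama(model_path="./output/model-Q4_K_M.gguf", n_ctx=512, verbose=False)
result = llm("The capital of France is", max_tokens=8)
print(result["choices"][0]["text"])  # expect a mention of "Paris"
```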
### Debugging
Debug logging turns the pipeline from a black box into a glass box, revealing exactly where
failures occur. Colour coding highlights stages: blue (info), yellow (warnings), red (errors),
green (success). The visual hierarchy enables efficient log scanning.
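A sketch of how such DEBUG-aware, colour-coded logging can be wired up with the standard library (hypothetical; the real `helpers/logger.py` may differ):

```python
import logging
import os

# ANSI colours per level: blue info, yellow warnings, red errors
# (green "success" would need a custom level or helper)
COLOURS = {"INFO": "\033[34m", "WARNING": "\033[33m", "ERROR": "\033[31m"}
RESET = "\033[0m"


class ColourFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        colour = COLOURS.get(record.levelname, "")
        return f"{colour}{super().format(record)}{RESET}"


def get_logger(name: str) -> logging.Logger:
    # DEBUG=true in the environment switches on verbose output
    verbose = os.getenv("DEBUG", "").lower() == "true"
    handler = logging.StreamHandler()
    handler.setFormatter(ColourFormatter("%(levelname)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.addHandler(handler)
    return logger
```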
```bash
# Enable comprehensive debug output
DEBUG=true uv run direct_safetensors_to_gguf.py ./model # Tensor mapping details
DEBUG=true uv run quantise.py <model_url> # Memory usage tracking
```
Debug output reveals:
- **Download progress**: Bytes transferred, retries, connection issues
- **Conversion pipeline**: SafeTensors→GGUF steps, tensor mappings, dimension changes
- **Quantisation decisions**: Layer bit depths, importance matrix effects on weight selection
- **Memory usage**: Peak consumption for predicting larger model requirements
- **File operations**: Read/write/temp patterns for disk usage analysis
- **Error context**: Stack traces with local variables at failure points