Switch to llama-cpp-python

Tom Foster 2025-08-08 21:40:15 +01:00
parent ef7df1a8c3
commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions

# Development Guide
Contributing to GGUF tools requires understanding quantisation workflows and Python's modern
dependency ecosystem. This guide covers setup, standards, and architectural decisions for fixing
bugs, adding quantisation profiles, or extending conversion capabilities.
## Code Quality
Ruff replaces the traditional Black/isort/flake8 stack as both linter and formatter, while mypy
provides static type checking to catch type-related bugs before runtime. A zero-tolerance policy
for linting and type errors catches issues early. Both tools are configured extensively in
`pyproject.toml` to enforce only the code quality standards we've selected. Debug logging reveals
quantisation internals when models fail.
```bash
# Run linting - catches style violations, potential bugs, and code smells
uvx ruff check

# Format code - enforces consistent style automatically
uvx ruff format

# Run type checking - ensures type safety and catches potential bugs
uv run mypy .

# Run with debug logging - reveals conversion steps and tensor processing
DEBUG=true uv run <script>
```
## Project Structure
The architecture separates concerns cleanly: top-level scripts provide interfaces, helpers
encapsulate reusable logic, and resources contain community data. The structure evolved from
practical needs: helpers emerged to eliminate duplication, services to abstract external
dependencies.
```plain
llm-gguf-tools/
├── quantise.py                    # Bartowski quantisation tool - the main workflow
├── direct_safetensors_to_gguf.py # Direct conversion for unsupported architectures
├── helpers/                       # Shared utilities and abstractions
│   ├── __init__.py
│   ├── logger.py                  # Colour-coded logging with context awareness
│   ├── services/                  # External service wrappers
│   │   ├── gguf.py                # GGUF writer abstraction
│   │   └── llama_python.py        # llama-cpp-python integration
│   └── utils/                     # Pure utility functions
│       ├── config_parser.py       # Model configuration handling
│       └── tensor_mapping.py      # Architecture-specific tensor name mapping
├── resources/                     # Resource files and calibration data
│   └── imatrix_data.txt           # Curated calibration data from Bartowski
├── docs/                          # Detailed documentation
│   ├── quantise_gguf.md           # Quantisation strategies and profiles
│   ├── safetensors2gguf.md        # Direct conversion documentation
│   ├── bartowski_analysis.md      # Deep dive into variant strategies
│   ├── imatrix_data.md            # Importance matrix guide
│   └── development.md             # This guide
└── pyproject.toml                 # Modern Python project configuration
```
## Contributing Guidelines
The project values pragmatic solutions over theoretical perfection: working code that handles edge
cases beats elegant abstractions. Contributors should understand how quantisation profiles map to
Bartowski's discoveries and where Python-C++ boundaries limit functionality.
Essential requirements:
1. **Style consistency**: Run `uvx ruff format` before committing to keep diffs focused on logic
2. **Documentation**: Google-style docstrings explain behaviour and rationale beyond type hints
3. **Type safety**: Complete type hints for all public functions enable IDE support
4. **Practical testing**: Test with both 1B and 7B+ models to catch scaling issues
## Development Workflow
### Setting Up Development Environment
The project uses `uv` for dependency management: it's Rust-fast, manages Python versions
automatically, and resolves dependencies upfront. Development dependencies include ruff, type
stubs, and optional PyTorch for BFloat16 handling.
```bash
# Clone the repository - uses Forgejo (GitLab-like) hosting
git clone https://git.tomfos.tr/tom/llm-gguf-tools.git
cd llm-gguf-tools

# Install all dependencies including dev tools
# This installs llama-cpp-python with CUDA support if available
uv sync --all-groups
```
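To confirm the native bindings actually built and load in your environment, a minimal sanity check (not part of the project's tooling, just a quick assumption-free probe):

```python
# Minimal post-install check: if this import succeeds, the compiled
# llama.cpp bindings loaded correctly in this environment
import llama_cpp

print(llama_cpp.__version__)  # prints the installed package version
```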
### Code Style
Code style reduces cognitive load by letting reviewers focus on logic rather than layout. UK
English maintains consistency with upstream llama.cpp. The 100-character line limit balances
descriptive names with readability.
Core conventions:
- **PEP 8 compliance**: Ruff catches mutable defaults, unused imports, and similar issues automatically
- **UK English**: "Optimise" not "optimize", matching upstream llama.cpp
- **Line length**: 100 characters maximum, except URLs or unbreakable paths
- **Type annotations**: Complete hints for all public functions act as documentation that can't go stale (see the sketch after this list)
- **Import ordering**: Standard library, then third-party, then local; ruff handles this automatically
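Taken together, a compliant function might look like this (a hypothetical example; the name and behaviour are illustrative, not taken from the codebase):

```python
from pathlib import Path


def estimate_output_size(model_path: Path, bits_per_weight: float = 4.5) -> int:
    """Estimate the size of a quantised GGUF file in bytes.

    Uses the source file size as a proxy for parameter count, assuming
    16-bit source weights, so the result is a rough guide only.

    Args:
        model_path: Path to the source SafeTensors or GGUF file.
        bits_per_weight: Average bits per weight for the target profile.

    Returns:
        Estimated output size in bytes.
    """
    source_bytes = model_path.stat().st_size
    return int(source_bytes * bits_per_weight / 16)
```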
### Testing
Formal tests are still pending: quantisation "correctness" depends on complex interactions between
model architecture, quantisation strategy, and downstream usage, and benchmark performance alone
doesn't guarantee production success.
Current validation approach:
- **End-to-end testing**: Qwen 0.5B for quick iteration, Llama 3.2 1B for architecture compatibility
- **Output validation**: GGUF must load in llama.cpp and degrade gracefully, not produce gibberish (see the sketch after this list)
- **Error handling**: Test corrupted SafeTensors, missing configs, insufficient disk space
- **Logger consistency**: Verify colour coding across terminals, progress bars with piped output
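A minimal end-to-end smoke test using llama-cpp-python's high-level API might look like the following (the model path and expected output are placeholders, not the project's actual test harness):

```python
# Smoke test: load a freshly quantised GGUF and check it produces
# coherent text rather than gibberish or a loader error
from llama_cpp import Llama

llm = Llama(model_path="./output/model-Q4_K_M.gguf", n_ctx=512, verbose=False)
result = llm("The capital of France is", max_tokens=8)
print(result["choices"][0]["text"])  # expect a mention of "Paris"
```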
### Debugging
Debug logging turns the pipeline from a black box into a glass box, revealing exactly where
failures occur. Colour coding highlights stages: blue (info), yellow (warnings), red (errors),
green (success). The visual hierarchy enables efficient log scanning.
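A sketch of how such DEBUG-aware, colour-coded logging can be wired up with the standard library (hypothetical; the real `helpers/logger.py` may differ):

```python
import logging
import os

# ANSI colours per level: blue info, yellow warnings, red errors
# (green "success" would need a custom level or helper)
COLOURS = {"INFO": "\033[34m", "WARNING": "\033[33m", "ERROR": "\033[31m"}
RESET = "\033[0m"


class ColourFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        colour = COLOURS.get(record.levelname, "")
        return f"{colour}{super().format(record)}{RESET}"


def get_logger(name: str) -> logging.Logger:
    # DEBUG=true in the environment switches on verbose output
    verbose = os.getenv("DEBUG", "").lower() == "true"
    handler = logging.StreamHandler()
    handler.setFormatter(ColourFormatter("%(levelname)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.addHandler(handler)
    return logger
```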
```bash
# Enable comprehensive debug output
DEBUG=true uv run direct_safetensors_to_gguf.py ./model # Tensor mapping details
DEBUG=true uv run quantise.py <model_url> # Memory usage tracking
```
Debug output reveals:
- **Download progress**: Bytes transferred, retries, connection issues
- **Conversion pipeline**: SafeTensors→GGUF steps, tensor mappings, dimension changes
- **Quantisation decisions**: Layer bit depths, importance matrix effects on weight selection
- **Memory usage**: Peak consumption for predicting larger model requirements
- **File operations**: Read/write/temp patterns for disk usage analysis
- **Error context**: Stack traces with local variables at failure points