Switch to llama-cpp-python
This commit is contained in:
parent ef7df1a8c3, commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions
docs/imatrix_data.md (new file, 115 lines)

# Importance Matrix (IMatrix) Data Guide

An importance matrix guides quantisation by identifying critical weights that need protection. Like JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix protects parameters that most affect output quality.

At lower bit rates, imatrix-quantised models show 2-3% better perplexity scores overall, with larger gains in specific capabilities. A Q3_K model without imatrix might lose technical vocabulary or rare language handling, whilst with imatrix it retains these abilities – the difference between simple size reduction and intelligent compression.

1. [The Art of Calibration Data](#the-art-of-calibration-data)
2. [Finding Pre-computed Matrices](#finding-pre-computed-matrices)
3. [Creating Your Own Matrix](#creating-your-own-matrix)
4. [Resource Requirements and Optimisation](#resource-requirements-and-optimisation)
5. [Integration and Workflow](#integration-and-workflow)
6. [Future Developments](#future-developments)
7. [Practical Tips](#practical-tips)

## The Art of Calibration Data

This repository includes `resources/imatrix_data.txt` from [Bartowski's collection](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8), originally compiled by Dampf building on Kalomaze's work. The dataset systematically activates different model capabilities: technical writing for analysis, creative fiction for narrative, multilingual text for language diversity, and factual content for knowledge accuracy.

The default calibration data works well for general models, but specialised models benefit from targeted calibration. Code models need diverse programming languages and patterns; medical models need technical literature and terminology. Calibration should reflect actual use cases – 50-100KB of well-chosen text beats gigabytes of random content.
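
As a sketch, a targeted calibration file can be assembled by concatenating small samples from each capability you want the matrix to protect (the file names below are hypothetical):

```bash
# Hypothetical per-domain samples; substitute text matching your actual use cases.
cat samples/technical.txt \
    samples/creative.txt \
    samples/multilingual.txt \
    samples/factual.txt > calibration.txt

wc -c calibration.txt  # aim for roughly 50-100KB in total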

Calibration runs text through the model to observe weight activation patterns. These patterns become the importance matrix – a heat map of crucial parameters for intended use cases, similar to how brains strengthen frequently-used neural pathways.

## Finding Pre-computed Matrices

Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at `https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix`. These save hours of computation and provide excellent results from high-quality calibration data.

The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to your model's work directory as `imatrix.dat`. The quality improvement, especially at lower quantisation levels, justifies this extra step.
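
Following the URL pattern above, fetching a matrix into the work directory is a one-liner (`MODEL-NAME` is a placeholder, as in the pattern):

```bash
MODEL=MODEL-NAME  # placeholder: substitute the actual model name
curl -L -o "./work/${MODEL}/imatrix.dat" \
  "https://huggingface.co/bartowski/${MODEL}-GGUF/resolve/main/${MODEL}.imatrix"
```
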
## Creating Your Own Matrix

Generate your own imatrix for new models, domain-specific calibration, or experimentation. This currently requires llama.cpp's binary tools, as the functionality isn't exposed through llama-cpp-python.

Download llama.cpp from the [official releases](https://github.com/ggerganov/llama.cpp/releases). Windows users need `llama-bXXXX-bin-win-cuda-x64.zip` for GPU support; Linux/macOS users can use pre-built binaries or compile from source.
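
For reference, a source build currently looks roughly like this; the CUDA option has been renamed across llama.cpp releases, so treat the flag as an assumption and check the repository's build documentation for your version:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # drop the flag for a CPU-only build
cmake --build build --config Release -j
# Binaries, including llama-imatrix, land in build/bin/
```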

Use the F16 or F32 GGUF model (found in `./work/<model-name>/` after quantisation). F16 balances quality and computation requirements. Run from your llama.cpp directory:

```bash
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100
```

Generation runs inference whilst analysing activation patterns. The `--chunks` parameter controls thoroughness (100 is standard; use more for production runs, fewer for quick experiments). Expect 30 minutes to several hours on consumer hardware. GPU acceleration helps significantly.
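
Because `llama-imatrix` accepts llama.cpp's common runtime flags, GPU offload should just be the usual `-ngl` option; the layer count here is illustrative:

```bash
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100 \
    -ngl 99   # offload all layers; lower this if VRAM runs short
```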

Generation shows perplexity calculations and progress updates after initial loading. The tool tracks activation patterns, calculates importance scores, and builds the statistical model for guiding quantisation.

## Resource Requirements and Optimisation

Resource requirements match full inference: 7B models need ~14GB RAM for F16. CPU-only works, but GPU acceleration reduces days to hours for large models. The process supports interruption and resumption.
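
A sketch of an interruption-friendly run, assuming a reasonably recent llama.cpp build: `--output-frequency` writes intermediate matrices during the run, and `--in-file` folds a saved partial matrix into a fresh one.

```bash
# Save the matrix every 10 chunks so an interrupted run loses little work.
./llama-imatrix -m model-F16.gguf -f calibration.txt -o imatrix.dat \
    --output-frequency 10

# Continue later by seeding a new run with the partial result.
./llama-imatrix -m model-F16.gguf -f calibration.txt -o imatrix-final.dat \
    --in-file imatrix.dat --chunks 100
```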

Matrix quality depends on multiple factors. More chunks improve results, with diminishing returns beyond 200-300. F16 precision is optimal – F32 doubles computation for minimal gain, whilst quantised models create quality-degrading feedback loops.

Temperature affects generation (lower focuses on likely paths, higher explores possibilities), but the defaults are well-tuned. Good calibration data matters more than parameter tweaking.

## Integration and Workflow

Place the imatrix as `imatrix.dat` in your model's work directory. The tool auto-detects and applies it with log confirmation. One imatrix works for all quantisation levels.
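
Concretely, with a hypothetical model directory, placement looks like this:

```bash
cp imatrix.dat ./work/my-model/imatrix.dat
# work/
# └── my-model/
#     ├── my-model-F16.gguf
#     └── imatrix.dat    <- auto-detected when quantising
```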

The tool acknowledges current limitations whilst providing clean workflows. Though generating matrices from Python isn't available yet, using external ones is trivial. This pragmatic approach delivers optimal results whilst preparing for future improvements.

## Future Developments

Native imatrix generation is on llama-cpp-python's roadmap and will be integrated as soon as it becomes available. Meanwhile, this hybrid approach works well. The community shares matrices, calibration datasets improve constantly, and algorithms grow more sophisticated.

Research continues into dynamic importance scoring, multi-modal calibration for vision-language models, and automated calibration generation. These advances will eventually reach production tools, but current approaches already deliver impressive results.

## Practical Tips

Key insights: quality and diversity beat quantity in calibration data. Include specific use cases even if uncommon. Balance languages proportionally for multilingual models. Include edge cases for robustness. When in doubt, use Bartowski's pre-computed matrices – they're consistently excellent.

The importance matrix seems obvious in hindsight – preserve critical weights, calibrate for actual usage. Yet it took years of experimentation to develop these techniques. Using them well transforms quantisation from simple size reduction to intelligent preservation of what matters.