Switch to llama-cpp-python
This commit is contained in:
parent ef7df1a8c3, commit d937f2d5fa
25 changed files with 2957 additions and 1181 deletions
docs/imatrix_data.md (new file, 115 lines)

# Importance Matrix (IMatrix) Data Guide

An importance matrix guides quantisation by identifying critical weights that need protection. Like JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix protects parameters that most affect output quality.

At lower bit rates, imatrix-quantised models show 2-3% better perplexity scores overall, with larger gains in specific capabilities. A Q3_K model without imatrix might lose technical vocabulary or rare language handling, whilst with imatrix it retains these abilities – the difference between simple size reduction and intelligent compression.

1. [The Art of Calibration Data](#the-art-of-calibration-data)
2. [Finding Pre-computed Matrices](#finding-pre-computed-matrices)
3. [Creating Your Own Matrix](#creating-your-own-matrix)
4. [Resource Requirements and Optimisation](#resource-requirements-and-optimisation)
5. [Integration and Workflow](#integration-and-workflow)
6. [Future Developments](#future-developments)
7. [Practical Tips](#practical-tips)

## The Art of Calibration Data

This repository includes `resources/imatrix_data.txt` from [Bartowski's collection](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8), originally compiled by Dampf building on Kalomaze's work. The dataset systematically activates different model capabilities: technical writing for analysis, creative fiction for narrative, multilingual text for language diversity, and factual content for knowledge accuracy.

The default calibration data works well for general models, but specialised models benefit from targeted calibration. Code models need diverse programming languages and patterns; medical models need technical literature and terminology. Calibration should reflect actual use cases – 50-100KB of well-chosen text beats gigabytes of random content.
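
As a sketch, a targeted calibration file can be assembled by concatenating small samples from each capability you want the matrix to protect (the file names below are hypothetical):

```bash
# Hypothetical per-domain samples; substitute text matching your actual use cases.
cat samples/technical.txt \
    samples/creative.txt \
    samples/multilingual.txt \
    samples/factual.txt > calibration.txt

wc -c calibration.txt  # aim for roughly 50-100KB in total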

Calibration runs text through the model to observe weight activation patterns. These patterns become the importance matrix – a heat map of crucial parameters for intended use cases, similar to how brains strengthen frequently-used neural pathways.

## Finding Pre-computed Matrices

Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at `https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix`. These save hours of computation and provide excellent results from high-quality calibration data.

The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to your model's work directory as `imatrix.dat`. The quality improvement, especially at lower quantisation levels, justifies this extra step.
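
Following the URL pattern above, fetching a matrix into the work directory is a one-liner (`MODEL-NAME` is a placeholder, as in the pattern):

```bash
MODEL=MODEL-NAME  # placeholder: substitute the actual model name
curl -L -o "./work/${MODEL}/imatrix.dat" \
  "https://huggingface.co/bartowski/${MODEL}-GGUF/resolve/main/${MODEL}.imatrix"
```
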
## Creating Your Own Matrix

Generate your own imatrix for new models, domain-specific calibration, or experimentation. This currently requires llama.cpp's binary tools, as the functionality isn't exposed through llama-cpp-python.

Download llama.cpp from the [official releases](https://github.com/ggerganov/llama.cpp/releases). Windows users need `llama-bXXXX-bin-win-cuda-x64.zip` for GPU support; Linux/macOS users can use pre-built binaries or compile from source.
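
For reference, a source build currently looks roughly like this; the CUDA option has been renamed across llama.cpp releases, so treat the flag as an assumption and check the repository's build documentation for your version:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # drop the flag for a CPU-only build
cmake --build build --config Release -j
# Binaries, including llama-imatrix, land in build/bin/
```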

Use the F16 or F32 GGUF model (found in `./work/<model-name>/` after quantisation). F16 balances quality and computation requirements. Run from your llama.cpp directory:

```bash
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100
```

Generation runs inference whilst analysing activation patterns. The `--chunks` parameter controls thoroughness (100 is standard; use more for production runs, fewer for quick experiments). Expect 30 minutes to several hours on consumer hardware. GPU acceleration helps significantly.
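
Because `llama-imatrix` accepts llama.cpp's common runtime flags, GPU offload should just be the usual `-ngl` option; the layer count here is illustrative:

```bash
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100 \
    -ngl 99   # offload all layers; lower this if VRAM runs short
```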

Generation shows perplexity calculations and progress updates after initial loading. The tool tracks activation patterns, calculates importance scores, and builds the statistical model for guiding quantisation.

## Resource Requirements and Optimisation

Resource requirements match full inference: 7B models need ~14GB RAM for F16. CPU-only works, but GPU acceleration reduces days to hours for large models. The process supports interruption and resumption.
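
A sketch of an interruption-friendly run, assuming a reasonably recent llama.cpp build: `--output-frequency` writes intermediate matrices during the run, and `--in-file` folds a saved partial matrix into a fresh one.

```bash
# Save the matrix every 10 chunks so an interrupted run loses little work.
./llama-imatrix -m model-F16.gguf -f calibration.txt -o imatrix.dat \
    --output-frequency 10

# Continue later by seeding a new run with the partial result.
./llama-imatrix -m model-F16.gguf -f calibration.txt -o imatrix-final.dat \
    --in-file imatrix.dat --chunks 100
```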

Matrix quality depends on multiple factors. More chunks improve results, with diminishing returns beyond 200-300. F16 precision is optimal – F32 doubles computation for minimal gain, whilst quantised models create quality-degrading feedback loops.

Temperature affects generation (lower focuses on likely paths, higher explores possibilities), but the defaults are well-tuned. Good calibration data matters more than parameter tweaking.

## Integration and Workflow

Place the imatrix as `imatrix.dat` in your model's work directory. The tool auto-detects and applies it with log confirmation. One imatrix works for all quantisation levels.
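
Concretely, with a hypothetical model directory, placement looks like this:

```bash
cp imatrix.dat ./work/my-model/imatrix.dat
# work/
# └── my-model/
#     ├── my-model-F16.gguf
#     └── imatrix.dat    <- auto-detected when quantising
```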

The tool acknowledges current limitations whilst providing clean workflows. Though generating matrices from Python isn't available yet, using external ones is trivial. This pragmatic approach delivers optimal results whilst preparing for future improvements.

## Future Developments

Native imatrix generation is on llama-cpp-python's roadmap and will be integrated as soon as it becomes available. Meanwhile, this hybrid approach works well. The community shares matrices, calibration datasets improve constantly, and algorithms grow more sophisticated.

Research continues into dynamic importance scoring, multi-modal calibration for vision-language models, and automated calibration generation. These advances will eventually reach production tools, but current approaches already deliver impressive results.

## Practical Tips

Key insights: quality and diversity beat quantity in calibration data. Include specific use cases even if uncommon. Balance languages proportionally for multilingual models. Include edge cases for robustness. When in doubt, use Bartowski's pre-computed matrices – they're consistently excellent.

The importance matrix seems obvious in hindsight – preserve critical weights, calibrate for actual usage. Yet it took years of experimentation to develop these techniques. Using them well transforms quantisation from simple size reduction to intelligent preservation of what matters.