# Importance Matrix (IMatrix) Data Guide
An importance matrix guides quantisation by identifying the critical weights that need protection. Like
JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix
protects the parameters that most affect output quality.

At lower bit rates, imatrix-quantised models show 2-3% lower perplexity overall, with larger
gains in specific capabilities. A Q3_K model without an imatrix might lose technical vocabulary or
rare-language handling, whilst with an imatrix it retains these abilities – the difference between
simple size reduction and intelligent compression.

1. [The Art of Calibration Data](#the-art-of-calibration-data)
2. [Finding Pre-computed Matrices](#finding-pre-computed-matrices)
3. [Creating Your Own Matrix](#creating-your-own-matrix)
4. [Resource Requirements and Optimisation](#resource-requirements-and-optimisation)
5. [Integration and Workflow](#integration-and-workflow)
6. [Future Developments](#future-developments)
7. [Practical Tips](#practical-tips)
## The Art of Calibration Data
This repository includes `resources/imatrix_data.txt` from
[Bartowski's collection](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8),
originally compiled by Dampf building on Kalomaze's work. The dataset systematically activates
different model capabilities: technical writing for analysis, creative fiction for narrative,
multilingual text for language diversity, and factual content for knowledge accuracy.

The default calibration data works well for general models, but specialised models benefit from
targeted calibration. Code models need diverse programming languages and patterns; medical models
need technical literature and terminology. Calibration should reflect actual use cases – 50-100KB
of well-chosen text beats gigabytes of random content.
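
As a minimal sketch, assuming a code-focused model and a few representative source files (the file names here are hypothetical placeholders), a targeted calibration file can be assembled and size-checked like this:

```bash
# Combine representative samples of each capability the model should retain.
# These input files are hypothetical - substitute your own domain text.
cat code_samples.txt api_docs.txt bug_reports.txt > calibration.txt

# Aim for roughly 50-100KB of well-chosen text, not gigabytes of filler.
wc -c calibration.txt
```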
Calibration runs text through the model to observe weight activation patterns. These patterns
become the importance matrix – a heat map of crucial parameters for intended use cases, similar to
how brains strengthen frequently-used neural pathways.
## Finding Pre-computed Matrices
Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at
`https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix`. These save
hours of computation and provide excellent results from high-quality calibration data.

The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to
your model's work directory as `imatrix.dat`. The quality improvement, especially at lower
quantisation levels, justifies this extra step.
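
As a sketch, using the URL pattern above with a hypothetical model name (substitute the model you are actually quantising), downloading and placing a pre-computed matrix looks like this:

```bash
# Hypothetical model name - replace with your own; the URL pattern
# follows Bartowski's GGUF repositories.
MODEL="Llama-3.1-8B-Instruct"

# Fetch the pre-computed matrix and save it where the tool looks for it.
curl -L "https://huggingface.co/bartowski/${MODEL}-GGUF/resolve/main/${MODEL}.imatrix" \
    -o "./work/${MODEL}/imatrix.dat"
```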
## Creating Your Own Matrix
Generate your own imatrix for new models, domain-specific calibration, or experimentation. This
currently requires llama.cpp's binary tools, as the functionality isn't exposed through
llama-cpp-python.

Download llama.cpp from the [official releases](https://github.com/ggerganov/llama.cpp/releases).
Windows users need `llama-bXXXX-bin-win-cuda-x64.zip` for GPU support; Linux/macOS users can use
the prebuilt binaries or compile from source.
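
If you compile from source, a typical CMake build looks like the sketch below; note that the GPU flag has changed between llama.cpp versions, so check the repository's build documentation for your release:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Add -DGGML_CUDA=ON (on recent releases) for NVIDIA GPU support.
cmake -B build
cmake --build build --config Release

# Binaries, including llama-imatrix, land in build/bin/.
```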
Use the F16 or F32 GGUF model (found in `./work/<model-name>/` after quantisation). F16 balances
quality and computation requirements. Run from your llama.cpp directory:

```bash
# -m: source model, -f: calibration text, -o: output matrix file
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100
```
Generation runs inference whilst analysing activation patterns. The `--chunks` parameter controls
thoroughness (100 is standard; use more for production, fewer for quick experiments). Expect 30
minutes to several hours on consumer hardware; GPU acceleration helps significantly.
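
With a GPU build, offloading layers speeds generation considerably. A sketch of the same command with offloading (the `-ngl` value just needs to cover the model's layer count):

```bash
# -ngl offloads up to 99 layers to the GPU; lower it if VRAM runs out.
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100 -ngl 99
```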
Generation shows perplexity calculations and progress updates after initial loading. The tool tracks
activation patterns, calculates importance scores, and builds the statistical model for guiding
quantisation.
## Resource Requirements and Optimisation
Resource requirements match full inference: a 7B model needs ~14GB of RAM at F16. CPU-only
generation works, but GPU acceleration reduces days to hours for large models. The process supports
interruption and resumption.
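
This works because intermediate results can be written out periodically as the run progresses; recent builds also accept `--in-file` to load and combine existing matrices. A hedged sketch, since flag names vary between llama.cpp versions (confirm against `./llama-imatrix --help`):

```bash
# Write the matrix out every 10 chunks so an interrupted run still
# leaves a usable (if partial) imatrix.dat behind.
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --output-frequency 10
```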
Matrix quality depends on multiple factors. More chunks improve results, with diminishing returns
beyond 200-300. F16 precision is optimal – F32 doubles the computation for minimal gain, whilst
starting from an already-quantised model creates a quality-degrading feedback loop.

Temperature affects generation (lower values focus on likely paths, higher values explore more
possibilities), but the defaults are well-tuned. Good calibration data matters more than parameter
tweaking.
## Integration and Workflow
Place the imatrix as `imatrix.dat` in your model's work directory. The tool auto-detects and applies
it, confirming this in the log. One imatrix works for all quantisation levels.
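
If you invoke llama.cpp's quantiser by hand rather than through this tool, the same file is passed explicitly. A sketch with hypothetical paths, run from the llama.cpp directory:

```bash
# The same imatrix.dat guides any target quantisation level (Q4_K_M here).
./llama-quantize --imatrix ./work/my-model/imatrix.dat \
    ./work/my-model/model-F16.gguf \
    ./work/my-model/model-Q4_K_M.gguf Q4_K_M
```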
The tool acknowledges current limitations whilst providing clean workflows. Though Python generation
isn't available yet, using external matrices is trivial. This pragmatic approach delivers optimal
results whilst preparing for future improvements.
## Future Developments
Native imatrix generation is on llama-cpp-python's roadmap and will be integrated as soon as it is
available. Meanwhile, this hybrid approach works well. The community shares matrices, calibration
datasets improve constantly, and algorithms grow more sophisticated.

Research continues into dynamic importance scoring, multi-modal calibration for vision-language
models, and automated calibration generation. These advances will eventually reach production tools,
but current approaches already deliver impressive results.
## Practical Tips
Key insights: quality and diversity beat quantity in calibration data. Include specific use cases
even if uncommon. Balance languages proportionally for multilingual models. Include edge cases for
robustness. When in doubt, use Bartowski's pre-computed matrices – they're consistently excellent.

The importance matrix seems obvious in hindsight – preserve critical weights, calibrate for actual
usage. Yet it took years of experimentation to develop these techniques. Used well, they transform
quantisation from simple size reduction into intelligent preservation of what matters.