# Importance Matrix (IMatrix) Data Guide
An importance matrix guides quantisation by identifying critical weights that need protection. Like
JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix
protects parameters that most affect output quality.

At lower bit rates, imatrix-quantised models show 2-3% better perplexity scores overall, with larger
gains in specific capabilities. A Q3_K model without imatrix might lose technical vocabulary or
rare language handling, whilst with imatrix it retains these abilities: the difference between
simple size reduction and intelligent compression.
1. [The Art of Calibration Data](#the-art-of-calibration-data)
2. [Finding Pre-computed Matrices](#finding-pre-computed-matrices)
3. [Creating Your Own Matrix](#creating-your-own-matrix)
4. [Resource Requirements and Optimisation](#resource-requirements-and-optimisation)
5. [Integration and Workflow](#integration-and-workflow)
6. [Future Developments](#future-developments)
7. [Practical Tips](#practical-tips)
## The Art of Calibration Data
This repository includes `resources/imatrix_data.txt` from
[Bartowski's collection](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8),
originally compiled by Dampf building on Kalomaze's work. The dataset systematically activates
different model capabilities: technical writing for analysis, creative fiction for narrative,
multilingual text for language diversity, and factual content for knowledge accuracy.

The default calibration data works well for general models, but specialised models benefit from
targeted calibration. Code models need diverse programming languages and patterns; medical models
need technical literature and terminology. Calibration should reflect actual use cases: 50-100KB
of well-chosen text beats gigabytes of random content.
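If you assemble your own calibration file, simple concatenation of domain samples is enough. A
minimal sketch (the sample file names are hypothetical):

```bash
# Combine domain-specific samples into a single calibration file,
# then confirm it lands in the 50-100KB sweet spot
cat samples/code.txt samples/prose.txt samples/multilingual.txt samples/facts.txt \
    > calibration.txt
du -h calibration.txt
```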
Calibration runs text through the model to observe weight activation patterns. These patterns
become the importance matrix: a heat map of crucial parameters for intended use cases, similar to
how brains strengthen frequently-used neural pathways.
## Finding Pre-computed Matrices
Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at
`https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix`. These save
hours of computation and provide excellent results from high-quality calibration data.

The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to
your model's work directory as `imatrix.dat`. The quality improvement, especially at lower
quantisation levels, justifies this extra step.
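For example, a sketch of the download step with `curl` (substitute the real model name for
`MODEL-NAME`; this assumes your work directory matches the Hugging Face repository name):

```bash
# Fetch Bartowski's pre-computed matrix and drop it where the tool looks for it
MODEL=MODEL-NAME
curl -L "https://huggingface.co/bartowski/${MODEL}-GGUF/resolve/main/${MODEL}.imatrix" \
    -o "./work/${MODEL}/imatrix.dat"
```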
## Creating Your Own Matrix
Generate your own imatrix for new models, domain-specific calibration, or experimentation. This
currently requires llama.cpp's binary tools, as the functionality isn't exposed through
llama-cpp-python.

Download llama.cpp from the [official releases](https://github.com/ggerganov/llama.cpp/releases).
Windows users need `llama-bXXXX-bin-win-cuda-x64.zip` for GPU support; Linux/macOS users can use
binaries or compile from source.
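Compiling from source is straightforward with CMake; a sketch for Linux/macOS (assumes CMake and a
C++ toolchain, with the `GGML_CUDA` flag enabling GPU support in recent builds):

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # omit -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release
# binaries, including llama-imatrix, end up under build/bin/
```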
Use the F16 or F32 GGUF model (found in `./work/<model-name>/` after quantisation). F16 balances
quality and computation requirements. Run from your llama.cpp directory:
```bash
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 100
```
Generation runs inference whilst analysing activation patterns. The `--chunks` parameter controls
thoroughness (100 is standard, more for production, less for experiments). Expect 30 minutes to
several hours on consumer hardware. GPU acceleration helps significantly.
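For a longer, GPU-accelerated run, the same command extends naturally; a sketch assuming a CUDA
build of llama.cpp (`-ngl` offloads model layers to the GPU):

```bash
# More chunks for a production matrix; offload all layers to the GPU
./llama-imatrix -m /path/to/model-F16.gguf \
    -f /path/to/calibration.txt \
    -o /path/to/imatrix.dat \
    --chunks 250 \
    -ngl 99
```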
Generation shows perplexity calculations and progress updates after initial loading. The tool tracks
activation patterns, calculates importance scores, and builds the statistical model for guiding
quantisation.
## Resource Requirements and Optimisation
Resource requirements match full inference: 7B models need ~14GB RAM for F16. CPU-only works but
GPU acceleration reduces days to hours for large models. The process supports interruption and
resumption.
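The ~14GB figure is simply the parameter count times two bytes per F16 weight; a quick
back-of-envelope check (runtime overhead for activations and context comes on top):

```bash
# F16 stores 2 bytes per parameter, so a 7B model needs roughly 7 * 2 = 14 GB
PARAMS_BILLION=7
echo "$(( PARAMS_BILLION * 2 )) GB"   # prints: 14 GB
```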
Matrix quality depends on multiple factors. More chunks improve results, with diminishing returns
beyond 200-300. F16 precision is optimal: F32 doubles computation for minimal gain, whilst
quantised models create quality-degrading feedback loops.

Temperature affects generation (lower focuses on likely paths, higher explores possibilities), but
defaults are well-tuned. Good calibration data matters more than parameter tweaking.
## Integration and Workflow
Place the imatrix as `imatrix.dat` in your model's work directory. The tool auto-detects and applies
it with log confirmation. One imatrix works for all quantisation levels.
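If you prefer to drive llama.cpp's quantiser by hand rather than through this tool, the matrix is
passed explicitly; a sketch (paths and the Q4_K_M target are illustrative):

```bash
./llama-quantize --imatrix ./work/<model-name>/imatrix.dat \
    ./work/<model-name>/model-F16.gguf \
    ./work/<model-name>/model-Q4_K_M.gguf Q4_K_M
```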
The tool acknowledges current limitations whilst providing clean workflows. Though Python generation
isn't available yet, using external matrices is trivial. This pragmatic approach delivers optimal
results whilst preparing for future improvements.
## Future Developments
Native imatrix generation is on llama-cpp-python's roadmap and will be integrated here as soon as
it becomes available. Meanwhile, this hybrid approach works well: the community shares matrices,
calibration datasets improve constantly, and algorithms grow more sophisticated.

Research continues into dynamic importance scoring, multi-modal calibration for vision-language
models, and automated calibration generation. These advances will eventually reach production tools,
but current approaches already deliver impressive results.
## Practical Tips
Key insights:

- Quality and diversity beat quantity in calibration data.
- Include specific use cases, even if uncommon.
- Balance languages proportionally for multilingual models.
- Include edge cases for robustness.
- When in doubt, use Bartowski's pre-computed matrices: they're consistently excellent.

The importance matrix seems obvious in hindsight: preserve critical weights, calibrate for actual
usage. Yet it took years of experimentation to develop these techniques. Using them well transforms
quantisation from simple size reduction to intelligent preservation of what matters.