Importance Matrix (IMatrix) Data Guide

An importance matrix guides quantisation by identifying critical weights that need protection. Like JPEG compression preserving detail in faces whilst compressing uniform backgrounds, the imatrix protects parameters that most affect output quality.

At lower bit rates, imatrix-quantised models show 2-3% better perplexity scores overall, with larger gains in specific capabilities. A Q3_K model without an imatrix might lose technical vocabulary or rare language handling, whilst with one it retains these abilities: the difference between simple size reduction and intelligent compression.

  1. The Art of Calibration Data
  2. Finding Pre-computed Matrices
  3. Creating Your Own Matrix
  4. Resource Requirements and Optimisation
  5. Integration and Workflow
  6. Future Developments
  7. Practical Tips

The Art of Calibration Data

This repository includes resources/imatrix_data.txt from Bartowski's collection, originally compiled by Dampf, building on Kalomaze's work. The dataset systematically activates different model capabilities: technical writing for analysis, creative fiction for narrative, multilingual text for language diversity, and factual content for knowledge accuracy.

The default calibration data works well for general models, but specialised models benefit from targeted calibration. Code models need diverse programming languages and patterns; medical models need technical literature and terminology. Calibration should reflect actual use cases: 50-100KB of well-chosen text beats gigabytes of random content.
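As a rough sketch, a targeted calibration set can be assembled by concatenating representative samples and checking the total size (the file names below are hypothetical placeholders for your own domain material):

```bash
# Hypothetical domain samples; substitute text that reflects your model's actual usage
cat code_snippets.txt api_docs.txt conversations.txt > calibration.txt

# Aim for roughly 50-100KB of well-chosen text
wc -c calibration.txt
```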

Calibration runs text through the model to observe weight activation patterns. These patterns become the importance matrix: a heat map of crucial parameters for intended use cases, similar to how brains strengthen frequently-used neural pathways.

Finding Pre-computed Matrices

Check for existing matrices before generating your own. Bartowski shares pre-computed matrices at https://huggingface.co/bartowski/MODEL-NAME-GGUF/resolve/main/MODEL-NAME.imatrix. These save hours of computation and provide excellent results from high-quality calibration data.

The tool automatically checks for imatrix files. If missing, download the appropriate imatrix to your model's work directory as imatrix.dat. The quality improvement, especially at lower quantisation levels, justifies this extra step.
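As an illustration, a pre-computed matrix can be fetched with a single request following that URL pattern (the model name below is a hypothetical placeholder, and ./work/<model-name>/ follows the work-directory convention used later in this guide):

```bash
# Hypothetical model name; the URL follows Bartowski's pattern shown above
MODEL="example-model-7b"
curl -L -o "./work/${MODEL}/imatrix.dat" \
    "https://huggingface.co/bartowski/${MODEL}-GGUF/resolve/main/${MODEL}.imatrix"
```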

Creating Your Own Matrix

Generate your own imatrix for new models, domain-specific calibration, or experimentation. This currently requires llama.cpp's binary tools, as the functionality isn't exposed through llama-cpp-python.

Download llama.cpp from the official releases. Windows users need llama-bXXXX-bin-win-cuda-x64.zip for GPU support; Linux/macOS users can use binaries or compile from source.
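For those building from source on Linux or macOS, the workflow looks roughly like this (standard llama.cpp CMake build; the CUDA flag is optional and exact options may vary by release):

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # omit -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release
# The llama-imatrix binary is placed under build/bin/
```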

Use the F16 or F32 GGUF model (found in ./work/<model-name>/ after quantisation). F16 balances quality and computation requirements. Run from your llama.cpp directory:

```bash
./llama-imatrix -m /path/to/model-F16.gguf \
                -f /path/to/calibration.txt \
                -o /path/to/imatrix.dat \
                --chunks 100
```

Generation runs inference whilst analysing activation patterns. The --chunks parameter controls thoroughness (100 is standard, more for production, less for experiments). Expect 30 minutes to several hours on consumer hardware. GPU acceleration helps significantly.

Generation shows perplexity calculations and progress updates after initial loading. The tool tracks activation patterns, calculates importance scores, and builds the statistical model for guiding quantisation.

Resource Requirements and Optimisation

Resource requirements match full inference: 7B models need ~14GB RAM for F16. CPU-only generation works, but GPU acceleration reduces days to hours for large models. The process supports interruption and resumption.
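The memory figure follows from F16 storage needing roughly two bytes per parameter, with context buffers and other overhead on top; a quick back-of-the-envelope check:

```bash
# Rough F16 estimate: ~2 bytes per parameter, excluding context and overhead
PARAMS_BILLIONS=7
echo "~$((PARAMS_BILLIONS * 2)) GB RAM for a ${PARAMS_BILLIONS}B model in F16"
```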

Matrix quality depends on multiple factors. More chunks improve results, with diminishing returns beyond 200-300. F16 precision is optimal: F32 doubles computation for minimal gain, whilst quantised models create quality-degrading feedback loops.

Temperature affects generation (lower focuses on likely paths, higher explores possibilities) but defaults are well-tuned. Good calibration data matters more than parameter tweaking.

Integration and Workflow

Place the imatrix as imatrix.dat in your model's work directory. The tool auto-detects and applies it with log confirmation. One imatrix works for all quantisation levels.
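If you generated the matrix yourself, dropping it into place is all that's needed (replace <model-name> with your model's actual work directory):

```bash
# One imatrix.dat per model; it is reused across all quantisation levels
cp /path/to/generated/imatrix.dat ./work/<model-name>/imatrix.dat
```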

The tool acknowledges current limitations whilst providing clean workflows. Though Python generation isn't available yet, using external matrices is trivial. This pragmatic approach delivers optimal results whilst preparing for future improvements.

Future Developments

Native imatrix generation is on llama-cpp-python's roadmap and will be integrated as soon as it becomes available. Meanwhile, this hybrid approach works well. The community shares matrices, calibration datasets improve constantly, and algorithms grow more sophisticated.

Research continues into dynamic importance scoring, multi-modal calibration for vision-language models, and automated calibration generation. These advances will eventually reach production tools, but current approaches already deliver impressive results.

Practical Tips

Key insights:

  - Quality and diversity beat quantity in calibration data.
  - Include specific use cases, even if uncommon.
  - Balance languages proportionally for multilingual models.
  - Include edge cases for robustness.
  - When in doubt, use Bartowski's pre-computed matrices: they're consistently excellent.

The importance matrix seems obvious in hindsight: preserve critical weights, calibrate for actual usage. Yet it took years of experimentation to develop these techniques. Using them well transforms quantisation from simple size reduction to intelligent preservation of what matters.