# GPU LLM Benchmarking
Comprehensive benchmarking tool for testing Large Language Model inference performance across different context window sizes, designed to evaluate GPU performance for informed upgrade decisions.
## Features
- **Dual Scenario Testing**: Tests both short prompts (allocation overhead) and half-context prompts (realistic usage); a sketch of the idea follows this list
- **Comprehensive Metrics**: GPU utilization, memory bandwidth, VRAM usage, temperature, power consumption, and generation speed
- **Configurable Context Windows**: Test from 8K to 128K contexts with geometric progression
- **Detailed Monitoring**: Real-time GPU metrics captured at 0.5s intervals
- **Statistical Analysis**: Mean and standard deviation calculations across multiple runs
- **Export Ready**: CSV outputs for community analysis and comparison
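As a rough illustration of the two scenarios: the short prompt can be a fixed question, while the half-context prompt is padded toward `num_ctx / 2` tokens. The sketch below is illustrative only; the `build_prompts` helper and the words-per-token padding heuristic are assumptions, not the script's actual logic.

```python
# Hypothetical sketch of the two test scenarios (not the script's exact code).

def build_prompts(num_ctx):
    """Return a short prompt and a prompt padded to roughly half the context."""
    short_prompt = "Explain the difference between VRAM and system RAM."
    # Assumed heuristic: one token is roughly 0.75 English words.
    target_tokens = num_ctx // 2
    filler_words = int(target_tokens * 0.75)
    half_context_prompt = ("lorem " * filler_words) + "\nSummarise the text above."
    return {"short": short_prompt, "half_context": half_context_prompt}

# Example: prompts for an 8K context window
prompts = build_prompts(8192)
```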
## Quick Start
### Using Docker (Recommended)
Pull and run the pre-configured Docker image with Ollama:
```bash
docker pull ghcr.io/tomfoster/gpu-llm-benchmark:latest
docker run --gpus all -v ./results:/app/results ghcr.io/tomfoster/gpu-llm-benchmark:latest
```
### Manual Installation
```bash
# Clone the repository
git clone https://git.tomfos.tr/tom/gpu-llm-benchmarking.git
cd gpu-llm-benchmarking

# Install dependencies
pip install requests

# Run quick validation
python scripts/llm_benchmark_python.py --quick

# Run full benchmark
python scripts/llm_benchmark_python.py
```
## Configuration
### Environment Variables
All configuration can be overridden using environment variables:
```bash
# Model configuration
export HF_MODEL="hf.co/unsloth/gemma-3-12b-it-GGUF:Q5_K_XL"
export TEST_ITERATIONS=3

# Context window configuration
export CONTEXT_START=8192
export CONTEXT_END=131072
export CONTEXT_MULTIPLIER=2

# Output directory
export OUTPUT_PATH="./benchmark_results"

# Ollama configuration
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KEEP_ALIVE="10m"
export OLLAMA_KV_CACHE_TYPE="q8_0"
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_GPU=1
export OLLAMA_NUM_PARALLEL=1
```
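For reference, the tested context sizes follow a geometric progression derived from `CONTEXT_START`, `CONTEXT_END`, and `CONTEXT_MULTIPLIER`. This is a minimal sketch of that derivation; the helper itself is illustrative, and the real logic lives in `scripts/llm_benchmark_python.py`:

```python
import os

def context_lengths():
    """Return the geometric progression of context sizes, e.g. 8192..131072."""
    start = int(os.environ.get("CONTEXT_START", 8192))
    end = int(os.environ.get("CONTEXT_END", 131072))
    multiplier = int(os.environ.get("CONTEXT_MULTIPLIER", 2))
    sizes = []
    ctx = start
    while ctx <= end:
        sizes.append(ctx)
        ctx *= multiplier
    return sizes

print(context_lengths())  # [8192, 16384, 32768, 65536, 131072]
```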
### Docker Environment Variables
When using Docker, pass environment variables with `-e`:
```bash
docker run --gpus all \
  -e HF_MODEL="your-model:tag" \
  -e TEST_ITERATIONS=5 \
  -e CONTEXT_END=65536 \
  -v ./results:/app/results \
  ghcr.io/tomfoster/gpu-llm-benchmark:latest
```
## Benchmark Modes
### Quick Mode
Tests only the starting context length (8K by default) for rapid validation:
```bash
python scripts/llm_benchmark_python.py --quick

# Or with Docker:
docker run --gpus all -v ./results:/app/results ghcr.io/tomfoster/gpu-llm-benchmark:latest --quick
```
### Full Mode
Complete benchmark suite testing all context lengths:
```bash
python scripts/llm_benchmark_python.py

# Or with Docker:
docker run --gpus all -v ./results:/app/results ghcr.io/tomfoster/gpu-llm-benchmark:latest
```
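Under the hood, each iteration boils down to a generation request against the local Ollama API. The snippet below is a hedged sketch of that call and of how tokens/second falls out of the response; the endpoint, model tag, and prompt shown are assumptions, and the actual implementation is in `scripts/llm_benchmark_python.py`.

```python
import requests

# Assumed defaults: Ollama listening on localhost:11434, model already pulled.
OLLAMA_URL = "http://localhost:11434/api/generate"

def run_single_test(model, prompt, num_ctx):
    """Send one non-streaming generation request and return tokens/second."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    data = requests.post(OLLAMA_URL, json=payload, timeout=600).json()
    # Ollama reports eval_count (tokens generated) and eval_duration (nanoseconds).
    return data["eval_count"] / (data["eval_duration"] / 1e9)

tps = run_single_test("hf.co/unsloth/gemma-3-12b-it-GGUF:Q5_K_XL",
                      "Explain KV cache quantization.", 8192)
print(f"{tps:.1f} tokens/s")
```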
## Output Structure
Results are saved in timestamped directories:
```
benchmark_results_20250127_143022/
├── system_info.txt           # GPU specs, CPU info, software versions
├── individual_results.csv    # Raw test data
├── summary_stats.csv         # Statistical analysis
├── benchmark.log             # Execution log
├── prompt_*.txt              # Test prompts
├── response_*.txt            # Generated responses
├── ollama_response_*.json    # Full API responses
└── gpu_monitor_*.csv         # Real-time GPU metrics
```
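To sanity-check a run, the CSVs can be read with the standard library. A hedged sketch follows; the `context_length` and `tokens_per_second` column names are assumptions, so verify them against the header row of your own `individual_results.csv`:

```python
import csv
import statistics
from collections import defaultdict

# Group tokens/second by context length (column names are assumed; check
# the header row of your own individual_results.csv).
by_context = defaultdict(list)
with open("benchmark_results_20250127_143022/individual_results.csv") as f:
    for row in csv.DictReader(f):
        by_context[int(row["context_length"])].append(float(row["tokens_per_second"]))

# statistics.stdev needs at least two samples per group (3 iterations by default).
for ctx, speeds in sorted(by_context.items()):
    print(f"{ctx:>7} ctx: {statistics.mean(speeds):6.1f} ± {statistics.stdev(speeds):.1f} tok/s")
```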
## Metrics Collected
- **Performance**: Tokens/second, evaluation time, total duration
- **GPU Utilization**: Core usage percentage, memory bandwidth utilization
- **Memory**: VRAM used/free in GB
- **Thermal**: Temperature, power consumption, fan speed
- **Clocks**: Graphics and memory clock speeds
- **States**: GPU performance state (P0-P8)
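The GPU-side numbers come from polling `nvidia-smi` at 0.5 s intervals. Below is a minimal sketch of such a sampler; the query fields are standard `nvidia-smi --query-gpu` names, but the loop itself is illustrative rather than the script's exact code:

```python
import subprocess
import time

QUERY = ("utilization.gpu,utilization.memory,memory.used,memory.free,"
         "temperature.gpu,power.draw,fan.speed,clocks.gr,clocks.mem,pstate")

def sample_gpu():
    """Return one CSV line of GPU metrics from nvidia-smi."""
    return subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    ).strip()

# Poll at 0.5 s intervals, mirroring the gpu_monitor_*.csv files.
with open("gpu_monitor_sample.csv", "w") as f:
    for _ in range(10):
        f.write(sample_gpu() + "\n")
        time.sleep(0.5)
```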
## Repository Structure
- `scripts/`: Benchmarking scripts and utilities
  - `llm_benchmark_python.py`: Main benchmarking script
  - Additional analysis tools (coming soon)
- `results/`: Community benchmark results
  - Organised by GPU model and configuration
  - Includes raw data and analysis summaries
  - [View all results](results/)
## Default Test Configuration
- **Model**: Gemma 3 12B Instruct (Q5_K_XL quantization)
- **Context Lengths**: 8K → 16K → 32K → 64K → 128K
- **Iterations**: 3 runs per context length
- **Scenarios**: Short prompts + half-context prompts
## Requirements
### Hardware
- NVIDIA GPU with CUDA support
- Sufficient VRAM for target model and context lengths
- `nvidia-smi` installed and accessible
### Software
- Python 3.8+
- Docker (for containerized usage)
- Ollama (automatically installed if not present)
- NVIDIA drivers with `nvidia-smi`
## Community Contribution
Share your benchmark results to help the community:
1. Run the full benchmark suite
2. Compress the results directory
3. Share on the community forums with your GPU model and system specs
4. Compare results at gpu-llm-benchmarks.com
## CI/CD
This repository includes GitHub Actions workflows that:
- Build and publish Docker images on each release
- Run validation tests on pull requests
- Ensure benchmark consistency across platforms
## License
Apache License 2.0. See the [LICENSE](LICENSE) file for details.