llama-go: Run LLMs locally with Go https://github.com/tcpipuk/llama-go
  • C++ 42.8%
  • Go 38.1%
  • C 17.8%
  • Makefile 1.3%
Tom Foster 8864514fe3
All checks were successful
Build CUDA Image / build (push) Successful in 1m10s
CI/CD Pipeline / test (push) Successful in 49m24s
CI/CD Pipeline / libs-image (push) Successful in 1m49s
feat(deps): update llama.cpp to b9744 and Go to 1.26.4
Update the llama.cpp submodule from b9106 to b9744 (638 commits). Port the
speculative decoding path in wrapper.cpp to the refactored common_speculative
API: stateful per-sequence draft params (get_draft_params/draft/accept),
explicit prompt seeding and verify-decode on the draft context, and
caller-managed KV trimming of both contexts. Guard common_speculative_accept
behind a non-empty draft, since impl_last[seq_id] is only set when an
implementation actually drafts.

Bump Go to 1.26.4 for the latest security fixes across go.mod, Dockerfile.cuda
and the build-cuda image workflow, and re-vendor cgo_headers from b9744 (adds
common/imatrix-loader.h).

Static and shared CUDA suites pass (387/410, 0 failures) on an RTX 3090.

Upstream release: https://github.com/ggml-org/llama.cpp/releases/tag/b9744

Claude-Session: https://claude.ai/code/session_01ECuWNq4fUxATyuWufGy3H1
2026-06-21 15:11:29 +01:00
.forgejo/workflows feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
cgo_headers/llama.cpp feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
docs feat(deps): update llama.cpp to b9002 with static-default linkage redesign 2026-05-02 11:42:19 +01:00
examples refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
internal/exampleui feat(api): add channel-based generation and native chat support 2025-10-09 20:56:26 +01:00
llama.cpp@063d9c156e feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
.gitignore feat(cgo): vendor llama.cpp headers into the module 2026-05-03 10:07:24 +01:00
.gitmodules chore(deps): update llama.cpp submodule URL to ggml-org 2026-04-02 21:31:35 +01:00
.markdownlint.yaml feat(fork): establish active fork with modern development workflow 2025-09-27 12:48:35 +01:00
.pre-commit-config.yaml feat(cgo): vendor llama.cpp headers into the module 2026-05-03 10:07:24 +01:00
channel_test.go fix(tests): resolve channel error propagation race condition 2026-04-03 08:07:19 +01:00
chat.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
chat_options.go feat(api): add channel-based generation and native chat support 2025-10-09 20:56:26 +01:00
chat_test.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
chat_tools.go feat(api): add channel-based generation and native chat support 2025-10-09 20:56:26 +01:00
chat_types.go feat(api): add channel-based generation and native chat support 2025-10-09 20:56:26 +01:00
context.go fix(deps): adapt speculative decoding for llama.cpp b8635 API changes 2026-04-03 07:39:22 +01:00
doc.go refactor(bindings): reorganise model.go into focused files with comprehensive godocs 2025-10-07 20:19:08 +01:00
Dockerfile.build refactor(bindings): rewrite to use llama.cpp API directly 2025-09-29 21:00:17 +01:00
Dockerfile.cuda feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
Dockerfile.libs-cuda feat(cgo): vendor llama.cpp headers into the module 2026-05-03 10:07:24 +01:00
embeddings_test.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
error_handling_test.go test(tests): fix 7 failing tests after segfault fixes 2026-01-29 18:13:58 +00:00
generation_test.go fix(deps): adapt speculative decoding for llama.cpp b8635 API changes 2026-04-03 07:39:22 +01:00
go.mod feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
go.sum feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
gpu_layers_test.go test(ci): add Serial decorator and --flake-attempts 3 for timing-sensitive tests 2026-05-02 13:14:37 +01:00
LICENSE feat(fork): establish active fork with modern development workflow 2025-09-27 12:48:35 +01:00
linkage_shared.go feat(cgo): vendor llama.cpp headers into the module 2026-05-03 10:07:24 +01:00
linkage_static.go feat(cgo): vendor llama.cpp headers into the module 2026-05-03 10:07:24 +01:00
llama_cublas.go feat(deps): update llama.cpp to b9002 with static-default linkage redesign 2026-05-02 11:42:19 +01:00
llama_cublas_shared.go fix(cuda): add stubs path so -lcuda resolves without GPU at build time 2026-05-03 10:33:22 +01:00
llama_cublas_static.go fix(cuda): add stubs path so -lcuda resolves without GPU at build time 2026-05-03 10:33:22 +01:00
llama_hipblas.go feat(sampling): add 26 advanced sampling parameters 2025-10-08 20:51:11 +01:00
llama_metal.go feat(sampling): add 26 advanced sampling parameters 2025-10-08 20:51:11 +01:00
llama_openblas.go feat(deps): update llama.cpp to b9002 with static-default linkage redesign 2026-05-02 11:42:19 +01:00
llama_opencl.go feat(sampling): add 26 advanced sampling parameters 2025-10-08 20:51:11 +01:00
llama_rpc.go feat(sampling): add 26 advanced sampling parameters 2025-10-08 20:51:11 +01:00
llama_suite_test.go refactor(project): rename from go-llama.cpp to llama-go 2025-09-29 23:03:15 +01:00
llama_sycl.go feat(sampling): add 26 advanced sampling parameters 2025-10-08 20:51:11 +01:00
llama_vulkan.go feat(sampling): add 26 advanced sampling parameters 2025-10-08 20:51:11 +01:00
Makefile feat(cgo): vendor llama.cpp headers into the module 2026-05-03 10:07:24 +01:00
model.go fix(context): add WithUBatch for encoder-only embedding models 2026-05-02 18:15:34 +01:00
model_loading_test.go test(tests): fix 7 failing tests after segfault fixes 2026-01-29 18:13:58 +00:00
options_context.go fix(context): add WithUBatch for encoder-only embedding models 2026-05-02 18:15:34 +01:00
options_generate.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
options_model.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
prefix_caching_test.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
progress_callback.go feat(loading): add progress callback control (WithSilentLoading, WithProgressCallback) 2025-10-23 13:11:18 +01:00
README.md feat(deps): update llama.cpp to b9002 with static-default linkage redesign 2026-05-02 11:42:19 +01:00
RELEASE.md feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
renovate.json Add renovate.json 2023-04-24 12:05:59 +00:00
speculative_test.go test(tests): fix 7 failing tests after segfault fixes 2026-01-29 18:13:58 +00:00
stats.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
streaming_test.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
thread_config_test.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
tokenisation_test.go refactor(api): separate model weights from execution contexts 2025-10-25 08:39:15 +01:00
types.go fix(context): add WithUBatch for encoder-only embedding models 2026-05-02 18:15:34 +01:00
wrapper.cpp feat(deps): update llama.cpp to b9744 and Go to 1.26.4 2026-06-21 15:11:29 +01:00
wrapper.h fix(context): add WithUBatch for encoder-only embedding models 2026-05-02 18:15:34 +01:00

llama-go: Run LLMs locally with Go

Go bindings for llama.cpp that let you run large language models locally with GPU acceleration. The library is production-ready, with thread-safe concurrent inference and comprehensive test coverage, and it integrates LLM inference directly into Go applications through a clean, idiomatic API.

This is an active fork of go-skynet/go-llama.cpp, which hasn't been maintained since October 2023. The goal is to keep Go developers up to date with llama.cpp whilst offering a lighter, more performant alternative to Python-based ML stacks such as PyTorch or vLLM.

Quick start

# Clone with submodules
git clone --recurse-submodules https://github.com/tcpipuk/llama-go
cd llama-go

# Build the library (default: static linkage, single-binary friendly)
make libbinding.a

# Download a test model
wget https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

# Run an example
export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD
go run ./examples/simple -m Qwen3-0.6B-Q8_0.gguf -p "Hello world" -n 50

Basic usage

package main

import (
    "context"
    "fmt"
    llama "github.com/tcpipuk/llama-go"
)

func main() {
    // Load model weights (ModelOption: WithGPULayers, WithMLock, etc.)
    model, err := llama.LoadModel(
        "/path/to/model.gguf",
        llama.WithGPULayers(-1), // Offload all layers to GPU
    )
    if err != nil {
        panic(err)
    }
    defer model.Close()

    // Create execution context (ContextOption: WithContext, WithBatch, etc.)
    ctx, err := model.NewContext(
        llama.WithContext(2048),
        llama.WithF16Memory(),
    )
    if err != nil {
        panic(err)
    }
    defer ctx.Close()

    // Chat completion (uses model's chat template)
    messages := []llama.ChatMessage{
        {Role: "system", Content: "You are a helpful assistant."},
        {Role: "user", Content: "What is the capital of France?"},
    }
    response, err := ctx.Chat(context.Background(), messages, llama.ChatOptions{
        MaxTokens: llama.Int(100),
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(response.Content)

    // Or raw text generation
    text, err := ctx.Generate("Hello world", llama.WithMaxTokens(50))
    if err != nil {
        panic(err)
    }
    fmt.Println(text)
}

When building, set these environment variables:

export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD

The default static linkage links every llama.cpp library directly into your Go binary, so there is no LD_LIBRARY_PATH setup and no shared libraries to ship alongside the executable. If you'd rather use shared libraries, build with BUILD_LINKAGE=shared make libbinding.a and pass -tags shared_lib to go build; shared mode bakes -Wl,-rpath,$ORIGIN into the binary, so the .so files only need to sit next to the executable.

Key capabilities

Text generation and chat: Generate text with LLMs using native chat completion (with automatic chat template formatting) or raw text generation. Extract embeddings for semantic search, clustering, and similarity tasks.

GPU acceleration: Supports NVIDIA (CUDA), AMD (ROCm), Apple Silicon (Metal), Intel (SYCL), and cross-platform acceleration (Vulkan, OpenCL). Eight backend options cover virtually all modern GPU hardware, plus distributed inference via RPC.

Production ready: A comprehensive test suite with almost 400 test cases, plus CI validation that includes CUDA builds. Development actively tracks llama.cpp releases; the library is maintained for production use and is not a demo project.

Advanced features: Model/Context separation enables efficient VRAM usage - load model weights once, create multiple contexts with different configurations. Cache common prompt prefixes to avoid recomputing system prompts across thousands of generations. Serve multiple concurrent requests with a single model loaded in VRAM (no weight duplication). Stream tokens via callbacks or buffered channels (decouples GPU inference from slow processing). Speculative decoding for 2-3× generation speedup.
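The buffered-channel streaming idea can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: produceTokens and collectSlowly are hypothetical names, and the token slice stands in for real inference output. It shows how a buffer lets the producer run ahead of a slow consumer instead of waiting on every token.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// produceTokens is a hypothetical stand-in for an inference loop. It sends
// each token on a buffered channel and closes the channel when it is done.
func produceTokens(tokens []string, buffer int) <-chan string {
	out := make(chan string, buffer)
	go func() {
		defer close(out)
		for _, tok := range tokens {
			out <- tok // blocks only when the buffer is full
		}
	}()
	return out
}

// collectSlowly simulates slow downstream processing (for example, writing
// to a network client) and returns the concatenated tokens.
func collectSlowly(in <-chan string, delay time.Duration) string {
	var sb strings.Builder
	for tok := range in {
		time.Sleep(delay)
		sb.WriteString(tok)
	}
	return sb.String()
}

func main() {
	tokens := []string{"Paris", " is", " the", " capital", "."}
	// With a buffer of 8, the producer finishes without waiting for the consumer.
	stream := produceTokens(tokens, 8)
	fmt.Println(collectSlowly(stream, 10*time.Millisecond))
}
```

A larger buffer trades memory for more slack between producer and consumer; an unbuffered channel would make the producer wait on every token.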

Architecture

The library bridges Go and C++ using CGO, keeping the heavy computation in llama.cpp's optimised C++ code whilst providing a clean Go API. This minimises CGO overhead whilst maximising performance.

Model/Context separation: The API separates model weights (Model) from execution state (Context). Load model weights once, create multiple contexts with different configurations. Each context maintains its own KV cache and state for independent inference operations.
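As a rough sketch of this pattern, the following uses stand-in Model and Context types rather than the real llama-go ones, so it runs without a model file. The types, their fields, and runConcurrently are all assumptions made for illustration; only the shape (one shared model, one context per goroutine) follows the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// Model is a hypothetical stand-in for shared, read-only model weights.
type Model struct{ name string }

// Context is a hypothetical stand-in for per-request execution state.
// The history slice plays the role of a private KV cache.
type Context struct {
	model   *Model
	history []string
}

// NewContext creates independent state that refers back to the shared model.
func (m *Model) NewContext() *Context { return &Context{model: m} }

// Generate records the prompt in this context only and returns a fake reply.
func (c *Context) Generate(prompt string) string {
	c.history = append(c.history, prompt)
	return fmt.Sprintf("[%s] reply to %q", c.model.name, prompt)
}

// runConcurrently serves each prompt on its own goroutine with its own
// context, while every goroutine shares the single loaded model.
func runConcurrently(m *Model, prompts []string) []string {
	results := make([]string, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			ctx := m.NewContext() // one context per goroutine
			results[i] = ctx.Generate(p)
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	m := &Model{name: "demo"}
	for _, r := range runConcurrently(m, []string{"first", "second"}) {
		fmt.Println(r)
	}
}
```

Each goroutine writes to its own slice index and its own context, so no locking is needed around the per-request state in this sketch.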

Key components:

  • wrapper.cpp/wrapper.h - CGO interface to llama.cpp
  • model.go - Model loading and weight management (thread-safe)
  • context.go - Execution contexts for inference (one per goroutine)
  • Clean Go API with comprehensive godoc comments
  • llama.cpp/ - Git submodule tracking upstream releases

The design uses functional options for configuration (ModelOption vs ContextOption), explicit context creation for thread safety, automatic KV cache prefix reuse for performance, resource management with finalizers, and streaming callbacks via cgo.Handle for safe Go-C interaction.
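To illustrate why two separate option types help, here is a minimal functional-options sketch. The option names mirror WithGPULayers and WithContext from the usage example, but the config structs, defaults, and constructor functions are assumptions for illustration and are not taken from the library's source.

```go
package main

import "fmt"

// modelConfig and contextConfig are hypothetical internal settings structs.
type modelConfig struct {
	gpuLayers int
}

type contextConfig struct {
	contextSize int
	batch       int
}

// Distinct option types mean a model option cannot be passed where a
// context option is expected; the compiler rejects the mix-up.
type ModelOption func(*modelConfig)
type ContextOption func(*contextConfig)

// WithGPULayers sets how many layers to offload (stand-in implementation).
func WithGPULayers(n int) ModelOption {
	return func(c *modelConfig) { c.gpuLayers = n }
}

// WithContext sets the context window size (stand-in implementation).
func WithContext(n int) ContextOption {
	return func(c *contextConfig) { c.contextSize = n }
}

// newModelConfig applies options over assumed defaults.
func newModelConfig(opts ...ModelOption) modelConfig {
	cfg := modelConfig{gpuLayers: 0}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

// newContextConfig applies options over assumed defaults.
func newContextConfig(opts ...ContextOption) contextConfig {
	cfg := contextConfig{contextSize: 512, batch: 512}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", newModelConfig(WithGPULayers(-1)))
	fmt.Printf("%+v\n", newContextConfig(WithContext(2048)))
}
```

Passing WithContext(2048) to newModelConfig would fail to compile, which is the practical benefit of keeping ModelOption and ContextOption as separate types.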

Licence

MIT