llama-go: Run LLMs locally with Go https://github.com/tcpipuk/llama-go

Latest commit: Tom Foster, 4bfd26c86d, 2025-11-07 19:43:05 +00:00
CI status: CI/CD Pipeline / test (push) failing after 23m59s

feat(deps): update llama.cpp to b6974

Update llama.cpp submodule from b6836 to b6974 (138 commits). This release includes critical CUDA fixes (crash on uneven context without Flash Attention), improved Metal4 tensor API support, and optimizations across multiple backends (CUDA bandwidth, RVV kernels, SYCL operators).

All test examples (inference and embedding) pass successfully with the updated bindings. Library builds cleanly with CUDA backend enabled.

Upstream release: https://github.com/ggerganov/llama.cpp/releases/tag/b6974

.forgejo/workflows	feat(embeddings): add batch processing with multi-sequence support	2025-10-16 17:32:30 +01:00
docs	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
examples	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
internal/exampleui	feat(api): add channel-based generation and native chat support	2025-10-09 20:56:26 +01:00
llama.cpp@9eb9a1331d	feat(deps): update llama.cpp to b6974	2025-11-07 19:43:05 +00:00
.gitignore	refactor(bindings): rewrite to use llama.cpp API directly	2025-09-29 21:00:17 +01:00
.gitmodules	First import	2023-04-04 20:58:16 +02:00
.markdownlint.yaml	feat(fork): establish active fork with modern development workflow	2025-09-27 12:48:35 +01:00
.pre-commit-config.yaml	refactor(project): rename from go-llama.cpp to llama-go	2025-09-29 23:03:15 +01:00
channel_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
chat.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
chat_options.go	feat(api): add channel-based generation and native chat support	2025-10-09 20:56:26 +01:00
chat_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
chat_tools.go	feat(api): add channel-based generation and native chat support	2025-10-09 20:56:26 +01:00
chat_types.go	feat(api): add channel-based generation and native chat support	2025-10-09 20:56:26 +01:00
context.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
doc.go	refactor(bindings): reorganise model.go into focused files with comprehensive godocs	2025-10-07 20:19:08 +01:00
Dockerfile.build	refactor(bindings): rewrite to use llama.cpp API directly	2025-09-29 21:00:17 +01:00
Dockerfile.cuda	feat(cuda): add Flash Attention support for quantized KV cache	2025-10-24 15:25:44 +01:00
embeddings_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
error_handling_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
generation_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
go.mod	feat(api): add channel-based generation and native chat support	2025-10-09 20:56:26 +01:00
go.sum	feat(api): add channel-based generation and native chat support	2025-10-09 20:56:26 +01:00
gpu_layers_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
LICENSE	feat(fork): establish active fork with modern development workflow	2025-09-27 12:48:35 +01:00
llama_cublas.go	feat(api): add channel-based generation and native chat support	2025-10-09 20:56:26 +01:00
llama_hipblas.go	feat(sampling): add 26 advanced sampling parameters	2025-10-08 20:51:11 +01:00
llama_metal.go	feat(sampling): add 26 advanced sampling parameters	2025-10-08 20:51:11 +01:00
llama_openblas.go	feat(bindings): expand GPU acceleration backend support to 8 platforms	2025-10-07 20:23:41 +01:00
llama_opencl.go	feat(sampling): add 26 advanced sampling parameters	2025-10-08 20:51:11 +01:00
llama_rpc.go	feat(sampling): add 26 advanced sampling parameters	2025-10-08 20:51:11 +01:00
llama_suite_test.go	refactor(project): rename from go-llama.cpp to llama-go	2025-09-29 23:03:15 +01:00
llama_sycl.go	feat(sampling): add 26 advanced sampling parameters	2025-10-08 20:51:11 +01:00
llama_vulkan.go	feat(sampling): add 26 advanced sampling parameters	2025-10-08 20:51:11 +01:00
Makefile	feat(cuda): add Flash Attention support for quantized KV cache	2025-10-24 15:25:44 +01:00
model.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
model_loading_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
options_context.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
options_generate.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
options_model.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
prefix_caching_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
progress_callback.go	feat(loading): add progress callback control (WithSilentLoading, WithProgressCallback)	2025-10-23 13:11:18 +01:00
README.md	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
RELEASE.md	feat(cuda): add Flash Attention support for quantized KV cache	2025-10-24 15:25:44 +01:00
renovate.json	Add renovate.json	2023-04-24 12:05:59 +00:00
speculative_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
stats.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
streaming_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
thread_config_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
thread_safety_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
tokenisation_test.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
types.go	refactor(api): separate model weights from execution contexts	2025-10-25 08:39:15 +01:00
wrapper.cpp	feat(cuda): add Flash Attention support for quantized KV cache	2025-10-24 15:25:44 +01:00
wrapper.h	feat(cuda): add Flash Attention support for quantized KV cache	2025-10-24 15:25:44 +01:00

README.md

llama-go: Run LLMs locally with Go

Go bindings for llama.cpp, enabling you to run large language models locally with GPU acceleration. Production-ready library with thread-safe concurrent inference and comprehensive test coverage. Integrate LLM inference directly into Go applications with a clean, idiomatic API.

This is an active fork of go-skynet/go-llama.cpp, which hasn't been maintained since October 2023. The goal is to keep Go developers up to date with llama.cpp whilst offering a lighter, more performant alternative to Python-based ML stacks such as PyTorch or vLLM.

Documentation:

Getting started: Installation guide | API guide | Build options
Migration: v1 to v2 migration guide for upgrading from the old API
API reference: pkg.go.dev (complete godoc with examples)
Examples: Working code examples for chat, streaming, embeddings, speculative decoding
Upstream: llama.cpp for model formats and engine details

Quick start

# Clone with submodules
git clone --recurse-submodules https://github.com/tcpipuk/llama-go
cd llama-go

# Build the library
make libbinding.a

# Download a test model
wget https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

# Run an example
export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD
go run ./examples/simple -m Qwen3-0.6B-Q8_0.gguf -p "Hello world" -n 50

Basic usage

package main

import (
    "context"
    "fmt"
    llama "github.com/tcpipuk/llama-go"
)

func main() {
    // Load model weights (ModelOption: WithGPULayers, WithMLock, etc.)
    model, err := llama.LoadModel(
        "/path/to/model.gguf",
        llama.WithGPULayers(-1), // Offload all layers to GPU
    )
    if err != nil {
        panic(err)
    }
    defer model.Close()

    // Create execution context (ContextOption: WithContext, WithBatch, etc.)
    ctx, err := model.NewContext(
        llama.WithContext(2048),
        llama.WithF16Memory(),
    )
    if err != nil {
        panic(err)
    }
    defer ctx.Close()

    // Chat completion (uses model's chat template)
    messages := []llama.ChatMessage{
        {Role: "system", Content: "You are a helpful assistant."},
        {Role: "user", Content: "What is the capital of France?"},
    }
    response, err := ctx.Chat(context.Background(), messages, llama.ChatOptions{
        MaxTokens: llama.Int(100),
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(response.Content)

    // Or raw text generation
    text, err := ctx.Generate("Hello world", llama.WithMaxTokens(50))
    if err != nil {
        panic(err)
    }
    fmt.Println(text)
}

When building, set these environment variables:

export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD

Key capabilities

Text generation, chat, and embeddings: Generate text with LLMs using native chat completion (with automatic chat template formatting) or raw text generation. Extract embeddings for semantic search, clustering, and similarity tasks.

GPU acceleration: Supports NVIDIA (CUDA), AMD (ROCm), Apple Silicon (Metal), Intel (SYCL), and cross-platform acceleration (Vulkan, OpenCL). Eight backend options cover virtually all modern GPU hardware, plus distributed inference via RPC.

Production ready: Comprehensive test suite with almost 400 test cases and CI validation including CUDA builds. Development actively tracks llama.cpp releases; the library is maintained for production use, not as a demo project.

Advanced features: Model/Context separation enables efficient VRAM usage: load model weights once, then create multiple contexts with different configurations. Cache common prompt prefixes to avoid recomputing system prompts across thousands of generations. Serve multiple concurrent requests with a single model loaded in VRAM (no weight duplication). Stream tokens via callbacks or buffered channels, which decouple GPU inference from slow downstream processing. Speculative decoding can speed up generation by 2-3×.
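
As a rough illustration of the buffered-channel point above, the following self-contained sketch shows a producer that keeps emitting tokens whilst a slower consumer catches up. It does not use llama-go: `streamTokens` and `collect` are hypothetical stand-ins, and the buffer size of 4 is arbitrary.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// streamTokens is a hypothetical stand-in for a token generator; it is not
// part of llama-go. It sends tokens on a buffered channel and closes the
// channel once every token has been sent.
func streamTokens(tokens []string, buffer int) <-chan string {
	out := make(chan string, buffer)
	go func() {
		defer close(out)
		for _, tok := range tokens {
			out <- tok // blocks only when the buffer is full
		}
	}()
	return out
}

// collect drains the channel, sleeping for delay after receiving each token
// to simulate a slow consumer (for example, a network write).
func collect(in <-chan string, delay time.Duration) string {
	var sb strings.Builder
	for tok := range in {
		time.Sleep(delay)
		sb.WriteString(tok)
	}
	return sb.String()
}

func main() {
	tokens := []string{"The", " capital", " of", " France", " is", " Paris", "."}
	fmt.Println(collect(streamTokens(tokens, 4), 10*time.Millisecond))
}
```

With a buffer, the producer can run up to `buffer` tokens ahead of the consumer before it has to wait. An unbuffered channel would force the two sides into lock-step on every token.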

Architecture

The library bridges Go and C++ using CGO, keeping the heavy computation in llama.cpp's optimised C++ code and exposing a clean Go API on top. This keeps CGO overhead low without sacrificing performance.

Model/Context separation: The API separates model weights (Model) from execution state (Context). Load model weights once, create multiple contexts with different configurations. Each context maintains its own KV cache and state for independent inference operations.
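
The ownership pattern described above can be sketched without the library. In this minimal example, `sharedWeights` and `session` are hypothetical stand-ins for Model and Context. They only mimic the idea of read-only shared weights with private per-goroutine state, and the `history` slice is an assumed placeholder for a KV cache.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedWeights is a hypothetical stand-in for loaded model weights.
// It is read-only after creation, so goroutines can share it safely.
type sharedWeights struct {
	name string
}

// session is a hypothetical stand-in for an execution context. Each one
// holds private state (history stands in for a KV cache).
type session struct {
	weights *sharedWeights // shared pointer, never copied
	history []string       // private to this session
}

func (w *sharedWeights) newSession() *session {
	return &session{weights: w}
}

func (s *session) generate(prompt string) string {
	s.history = append(s.history, prompt)
	return fmt.Sprintf("[%s] reply %d to %q", s.weights.name, len(s.history), prompt)
}

// serve handles each prompt in its own goroutine with its own session and
// returns the replies in prompt order.
func serve(w *sharedWeights, prompts []string) []string {
	replies := make([]string, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			s := w.newSession()        // one session per goroutine
			replies[i] = s.generate(p) // each goroutine writes a distinct index
		}(i, p)
	}
	wg.Wait()
	return replies
}

func main() {
	w := &sharedWeights{name: "demo"}
	for _, r := range serve(w, []string{"hello", "bonjour"}) {
		fmt.Println(r)
	}
}
```

Because no session is ever shared between goroutines, the per-session state needs no locking. Only the weights are shared, and they are never written after loading.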

Key components:

wrapper.cpp/wrapper.h - CGO interface to llama.cpp
model.go - Model loading and weight management (thread-safe)
context.go - Execution contexts for inference (one per goroutine)
Clean Go API with comprehensive godoc comments
llama.cpp/ - Git submodule tracking upstream releases

The design uses functional options for configuration (ModelOption vs ContextOption), explicit context creation for thread safety, automatic KV cache prefix reuse for performance, resource management with finalizers, and streaming callbacks via cgo.Handle for safe Go-C interaction.
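
For readers unfamiliar with functional options, here is a minimal sketch of how two distinct option types keep model-level and context-level settings apart. Every name, field, and default below is a hypothetical illustration, not llama-go's actual definitions.

```go
package main

import "fmt"

// Hypothetical configuration structs; the fields and defaults are
// illustrative assumptions, not the library's real ones.
type modelConfig struct {
	gpuLayers int
	mlock     bool
}

type contextConfig struct {
	contextSize int
	batchSize   int
}

// Two distinct function types: the compiler rejects a contextOption passed
// where a modelOption is expected, and vice versa.
type modelOption func(*modelConfig)
type contextOption func(*contextConfig)

func withGPULayers(n int) modelOption {
	return func(c *modelConfig) { c.gpuLayers = n }
}

func withMLock() modelOption {
	return func(c *modelConfig) { c.mlock = true }
}

func withContextSize(n int) contextOption {
	return func(c *contextConfig) { c.contextSize = n }
}

func withBatchSize(n int) contextOption {
	return func(c *contextConfig) { c.batchSize = n }
}

// newModelConfig starts from defaults and applies each option in order.
func newModelConfig(opts ...modelOption) modelConfig {
	cfg := modelConfig{}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

// newContextConfig does the same for context-level settings.
func newContextConfig(opts ...contextOption) contextConfig {
	cfg := contextConfig{contextSize: 2048, batchSize: 512}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	m := newModelConfig(withGPULayers(-1), withMLock())
	c := newContextConfig(withContextSize(4096))
	fmt.Printf("%+v\n%+v\n", m, c)
}
```

Under this sketch, a call such as `newModelConfig(withContextSize(4096))` would fail to compile, so a setting applied at the wrong level is caught at build time instead of being silently ignored at run time.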

Licence

MIT
