llama-go: Run LLMs locally with Go

Go bindings for llama.cpp, enabling you to run large language models locally with GPU acceleration. A production-ready library with thread-safe concurrent inference and comprehensive test coverage, it lets you integrate LLM inference directly into Go applications through a clean, idiomatic API.

This is an active fork of go-skynet/go-llama.cpp, which hasn't been maintained since October 2023. The goal is to keep Go developers up to date with llama.cpp whilst offering a lighter, more performant alternative to Python-based ML stacks such as PyTorch and vLLM.

Documentation: See getting started guide, building guide, API guide, examples, Go package docs, and llama.cpp for model format and engine details.

Quick start

# Clone with submodules
git clone --recurse-submodules https://github.com/tcpipuk/llama-go
cd llama-go

# Build the library
make libbinding.a

# Download a test model
wget https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

# Run an example
export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD
go run ./examples/simple -m Qwen3-0.6B-Q8_0.gguf -p "Hello world" -n 50

Basic usage

package main

import (
    "fmt"
    llama "github.com/tcpipuk/llama-go"
)

func main() {
    model, err := llama.LoadModel(
        "/path/to/model.gguf",
        llama.WithF16Memory(),
        llama.WithContext(512),
    )
    if err != nil {
        panic(err)
    }
    defer model.Close()

    response, err := model.Generate("Hello world", llama.WithMaxTokens(50))
    if err != nil {
        panic(err)
    }

    fmt.Println(response)
}

When building, set these environment variables:

export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD

Key capabilities

Text generation and embeddings: Generate text with LLMs or extract embeddings for semantic search, clustering, and similarity tasks.
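
Embedding extraction follows the same pattern as text generation: load a model, then request a vector for the input text. The sketch below (continuing from the basic usage example above) assumes a hypothetical Embeddings method returning []float32; the actual method name and options are documented in the Go package docs:

// Sketch only: the Embeddings method name is an assumption, not
// confirmed against this repository's API.
vec, err := model.Embeddings("semantic search query")
if err != nil {
    panic(err)
}
fmt.Printf("embedding dimension: %d\n", len(vec))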

GPU acceleration: Supports NVIDIA (CUDA), AMD (ROCm), Apple Silicon (Metal), Intel (SYCL), and cross-platform acceleration (Vulkan, OpenCL). Eight backend options cover virtually all modern GPU hardware, plus distributed inference via RPC.
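
GPU offload is normally controlled per model load. The sketch below assumes a hypothetical WithGPULayers option for choosing how many layers to offload; the real option names live in options_model.go, and the building guide covers compiling each backend:

model, err := llama.LoadModel(
    "/path/to/model.gguf",
    llama.WithContext(2048),
    // Assumed option name: offload 32 transformer layers to the GPU
    // when the library is built with a GPU backend.
    llama.WithGPULayers(32),
)
if err != nil {
    panic(err)
}
defer model.Close()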

Production ready: Comprehensive test suite with almost 400 test cases and CI validation including CUDA builds. Active development tracks llama.cpp releases: this library is maintained for production use, not a demo project.

Advanced features: Cache common prompt prefixes to avoid recomputing system prompts across thousands of generations. Serve multiple concurrent requests from a single model loaded in VRAM (no weight duplication). Stream tokens as they're generated for ChatGPT-style typing effects. Use speculative decoding for a 2-3× generation speedup.
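
Streaming typically takes the form of a callback passed as a generation option. This is a minimal sketch, continuing from the basic usage example and assuming a hypothetical WithStreamCallback option; the real option name and callback signature are in the API guide and package docs:

// Sketch only: WithStreamCallback is an assumed name, not confirmed
// against this repository's API. Returning false would stop generation early.
_, err = model.Generate("Tell me a story",
    llama.WithMaxTokens(200),
    llama.WithStreamCallback(func(token string) bool {
        fmt.Print(token) // emit tokens as they arrive
        return true
    }),
)
if err != nil {
    panic(err)
}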

Architecture

The library bridges Go and C++ using CGO, keeping the heavy computation in llama.cpp's optimised C++ code whilst providing a clean Go API. This minimises CGO overhead whilst maximising performance.

Key components:

  • wrapper.cpp/wrapper.h - CGO interface to llama.cpp
  • Clean Go API with comprehensive godoc comments
  • llama.cpp/ - Git submodule tracking upstream releases

The design uses functional options for configuration, dynamic context pooling for thread safety, automatic KV cache prefix reuse for performance, resource management with finalizers, and streaming callbacks via cgo.Handle for safe Go-C interaction.
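
As an illustration of the functional options pattern (a generic sketch, not the library's actual types), each With* function returns a closure that mutates a config struct before the call runs:

package llama

// generateConfig holds per-call settings; field names here are illustrative.
type generateConfig struct {
    maxTokens int
    topP      float64
}

// GenerateOption configures a single generation call.
type GenerateOption func(*generateConfig)

// WithMaxTokens caps the number of tokens produced.
func WithMaxTokens(n int) GenerateOption {
    return func(c *generateConfig) { c.maxTokens = n }
}

// WithTopP sets the nucleus sampling probability mass.
func WithTopP(p float64) GenerateOption {
    return func(c *generateConfig) { c.topP = p }
}

// newGenerateConfig applies caller options over sensible defaults.
func newGenerateConfig(opts ...GenerateOption) generateConfig {
    cfg := generateConfig{maxTokens: 128, topP: 0.95}
    for _, opt := range opts {
        opt(&cfg)
    }
    return cfg
}

New options can be added without breaking existing callers, which is why the 26 new sampling parameters arrive as additional With* functions rather than a changed Generate signature.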

Licence

MIT