adam bien's blog

lightmetal: GPU LLM Inference From a Single Java 25 JAR 📎

GPU LLM inference on Apple Silicon, packaged as one Java 25 executable JAR, zero dependencies. lightmetal binds a Metal-enabled libllama.dylib through the Foreign Function & Memory API and runs Mistral- and Gemma-architecture GGUF models locally.

Build it with zb, point it at a GGUF, prompt it:

zb build
java --enable-native-access=ALL-UNNAMED -jar zbo/lightmetal.jar \
     -model ~/models/Mistral-Medium-3.5-128B-UD-Q5_K_XL-00001-of-00003.gguf \
     -prompt "What is Java?"

Add -serve and the same JAR exposes an Anthropic-compatible POST /v1/messages and an OpenAI-compatible POST /v1/chat/completions. xisting clients (zsmith, vibe) only need a base URL switch — the loaded GGUF wins, the model field is accepted and ignored.

Embedding into another Java app needs no compile-time dependency. lightmetal.jar registers a BinaryOperator via META-INF/services:

var generator = ServiceLoader.load(BinaryOperator.class).iterator().next();
var response  = generator.apply("/path/to/model.gguf", "What is Java?");

Just Java 25, llama.cpp, FFM, Metal — and a GGUF on disk.