lightmetal: GPU LLM Inference From a Single Java 25 JAR
GPU LLM inference on Apple Silicon, packaged as one Java 25 executable JAR, zero dependencies. lightmetal binds a Metal-enabled libllama.dylib through the Foreign Function & Memory API and runs Mistral- and Gemma-architecture GGUF models locally.
Build it with zb, point it at a GGUF, prompt it:
zb build
java --enable-native-access=ALL-UNNAMED -jar zbo/lightmetal.jar \
-model ~/models/Mistral-Medium-3.5-128B-UD-Q5_K_XL-00001-of-00003.gguf \
-prompt "What is Java?"
Add -serve and the same JAR exposes an Anthropic-compatible POST /v1/messages and an OpenAI-compatible POST /v1/chat/completions.
Existing clients (zsmith, vibe) only need a base URL switch: the model field is accepted and ignored, so the loaded GGUF always wins.
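As a rough illustration of the base URL switch, here is a minimal sketch of a plain JDK client for the OpenAI-compatible endpoint. The host and port in `BASE_URL` are placeholders (the address `-serve` binds to is not stated here), and the class and method names are made up for this example. The request body follows the usual OpenAI chat completions shape, with a `model` value that the server is expected to ignore.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ChatRequestSketch {

    // Assumption: placeholder address; replace with wherever -serve actually listens.
    static final String BASE_URL = "http://localhost:8080";

    // Minimal JSON string escaping: quotes, backslashes, and control characters.
    static String escapeJson(String text) {
        var sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
                }
            }
        }
        return sb.toString();
    }

    // Single-turn chat body; the "model" value is arbitrary because the loaded GGUF wins.
    static String chatBody(String prompt) {
        return "{\"model\":\"ignored\",\"messages\":[{\"role\":\"user\",\"content\":\""
                + escapeJson(prompt) + "\"}]}";
    }

    // Builds (but does not send) the POST /v1/chat/completions request.
    static HttpRequest chatRequest(String prompt) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/v1/chat/completions"))
                .timeout(Duration.ofMinutes(5)) // large models can take a while to answer
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chatBody(prompt)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        var client = HttpClient.newHttpClient();
        var response = client.send(chatRequest("What is Java?"),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // raw JSON, left unparsed in this sketch
    }
}
```

The sketch prints the raw JSON response rather than parsing it, which keeps it free of third-party libraries.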
Embedding into another Java app needs no compile-time dependency. lightmetal.jar registers a BinaryOperator via META-INF/services:
var generator = ServiceLoader.load(BinaryOperator.class).iterator().next();
var response = generator.apply("/path/to/model.gguf", "What is Java?");
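The two-liner above assumes a provider is present and yields an untyped result. A slightly more defensive variant might look like the sketch below. The `GeneratorLookup` class and its method names are hypothetical, and the `String` typing is an assumption inferred from the example call (model path and prompt in, response text out).

```java
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.function.BinaryOperator;

public class GeneratorLookup {

    // Returns the first registered BinaryOperator, or empty if no provider is visible.
    // The cast to BinaryOperator<String> is unchecked: it assumes the provider
    // takes (modelPath, prompt) as Strings and returns a String.
    @SuppressWarnings("unchecked")
    static Optional<BinaryOperator<String>> findGenerator() {
        return ServiceLoader.load(BinaryOperator.class)
                .findFirst()
                .map(op -> (BinaryOperator<String>) op);
    }

    // Runs one prompt against the model, failing with a readable message if no provider is found.
    static String generate(String modelPath, String prompt) {
        return findGenerator()
                .map(generator -> generator.apply(modelPath, prompt))
                .orElseThrow(() -> new IllegalStateException(
                        "No BinaryOperator provider found; is lightmetal.jar on the classpath?"));
    }

    public static void main(String[] args) {
        // Usage: java -cp <your-app>:lightmetal.jar GeneratorLookup /path/to/model.gguf "What is Java?"
        System.out.println(generate(args[0], args[1]));
    }
}
```

Because `BinaryOperator` is a general-purpose JDK interface, any other JAR on the classpath that registers one would also be discovered. `findFirst()` simply takes whichever provider comes first.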
Just Java 25, llama.cpp, FFM, Metal — and a GGUF on disk.