How to Use
Run any Lemma model locally in minutes via Ollama, Docker, llama.cpp, MLX, Python libraries, or an OpenAI-compatible API server.
Quick Start
Every model works with the same tools. Swap the model name to switch between variants.
Examples below use lemma; replace with lemer, lemmy, or lemrd as needed.
Ollama
ollama run hf.co/lthn/lemma:Q4_K_M
Also available: Q5_K_M, Q6_K, Q8_0, BF16
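The pulled model is also reachable programmatically. A minimal sketch using the official ollama Python client (pip install ollama); the model name must match the tag pulled above:
import ollama

# Chat with the locally pulled model via Ollama's API (localhost:11434).
response = ollama.chat(
    model="hf.co/lthn/lemma:Q4_K_M",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response["message"]["content"])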
Docker
# From HuggingFace
docker model run hf.co/lthn/lemma
# From Docker Hub
docker model run lthn/lemma
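Docker Model Runner also exposes an OpenAI-compatible endpoint. A hedged sketch with the OpenAI Python SDK; the base URL below assumes host-side TCP access is enabled (a Docker Desktop setting), and the port and path may differ by Docker version:
from openai import OpenAI

# Assumption: Docker Model Runner's host-side TCP access is enabled;
# 12434 and /engines/v1 are its defaults but may vary by setup.
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="none")
resp = client.chat.completions.create(
    model="lthn/lemma",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(resp.choices[0].message.content)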
llama.cpp
# Install
brew install llama.cpp # macOS/Linux
winget install llama.cpp # Windows
# Server with web UI
llama-server -hf lthn/lemma:Q4_K_M
# CLI inference
llama-cli -hf lthn/lemma:Q4_K_M
MLX (Apple Silicon)
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemma
mlx_lm.generate --model lthn/lemma \
--prompt "Hello, how are you?"
Native Apple Silicon — no emulation, no quantisation overhead on M-series
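The same models can be driven from Python. A short text-only sketch with the mlx_lm API (installed above), following the mlx-lm README pattern; for images, use mlx-vlm below:
from mlx_lm import load, generate

model, tokenizer = load("lthn/lemma")

# Format the conversation with the model's chat template, then generate.
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)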
Python Libraries
llama-cpp-python
uv pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)

# Text
llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "Hello, how are you?",
    }]
)

# Vision (multimodal)
llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "image.jpg"}},
        ],
    }]
)
mlx-vlm (vision)
uv tool install mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model, processor, and config once.
model, processor = load("lthn/lemma")
config = load_config("lthn/lemma")

image = ["photo.jpg"]
prompt = "Describe this image."

# Apply the chat template, then run multimodal generation.
formatted = apply_chat_template(
    processor, config, prompt, num_images=1
)
output = generate(
    model, processor, formatted, image
)
print(output.text)
mlx-vlm handles vision tensor routing for Gemma 4. Use this for multimodal inference on Apple Silicon.
Servers (OpenAI-compatible API)
Serve any Lemma model as an OpenAI-compatible API endpoint: a drop-in replacement for api.openai.com at localhost:8080/v1.
MLX Server
Lemma models are multimodal, so use mlx_vlm.server — the vision-aware variant.
The text-only mlx_lm.server does not correctly route multimodal tensors for Gemma 4.
mlx_vlm.server --model lthn/lemma
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "lthn/lemma",
    "messages": [{
      "role": "user",
      "content": "Hello, how are you?"
    }],
    "max_tokens": 200
  }'
vLLM
vLLM requires the original (non-quantised) safetensors weights from the LetheanNetwork/ repos, plus Linux and an NVIDIA GPU with adequate VRAM.
uv pip install vllm
vllm serve "LetheanNetwork/lemma"
Serves at http://localhost:8000/v1 by default. Compatible with the OpenAI Python SDK.
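Any OpenAI client then works against the local endpoint. For example, with the OpenAI Python SDK (the api_key value is a placeholder; vLLM does not require one by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="LetheanNetwork/lemma",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)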
llama.cpp Server
The built-in llama-server includes a web UI and serves an OpenAI-compatible API.
Works on macOS, Linux, and Windows.
llama-server -hf lthn/lemma:Q4_K_M
Serves at http://localhost:8080 with web UI and /v1/chat/completions endpoint.
Recommended Sampling
Gemma 4 is calibrated for temperature 1.0; this is not the same as the typical 0.7 default for other models. Lower values reduce diversity without improving quality. These defaults are pre-configured in the params file (Ollama) and generation_config.json (transformers/MLX).
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| stop | <end_of_turn>, <eos> |
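When a client does not read those config files, pass the values explicitly. For example, with llama-cpp-python from the section above:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)
llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=1.0,  # Gemma 4 calibration point
    top_p=0.95,
    top_k=64,
)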
Variable Image Resolution
Gemma 4 supports a configurable visual token budget. Higher = more detail, lower = faster inference. Place image content before text for best results.
| Token Budget | Use Case |
|---|---|
| 70 | Classification, captioning, video frames |
| 140 | General image understanding |
| 280 | Default — balanced quality and speed |
| 560 | OCR, document parsing, fine-grained detail |
| 1120 | Maximum detail (small text, complex documents) |
The default budget (280) is set in processor_config.json via image_seq_length and max_soft_tokens.
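A sketch of raising the budget from Python, assuming AutoProcessor.from_pretrained forwards these fields (named after the processor_config.json entries) as config overrides; verify the behaviour against your transformers version:
from transformers import AutoProcessor

# Assumption: keyword overrides are forwarded to the processor config;
# the field names mirror processor_config.json.
processor = AutoProcessor.from_pretrained(
    "LetheanNetwork/lemma",
    image_seq_length=560,   # higher budget for OCR / document parsing
    max_soft_tokens=560,
)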
The Lemma Family
| Name | Architecture | Parameters | Context | Consumer (GGUF + MLX) | Base (BF16) |
|---|---|---|---|---|---|
| Lemer | Gemma 4 E2B | 2.3B eff | 128K | lthn/lemer | LetheanNetwork/lemer |
| Lemma | Gemma 4 E4B | 4.5B eff | 128K | lthn/lemma | LetheanNetwork/lemma |
| Lemmy | Gemma 4 26B A4B MoE | 3.8B active / 26B total | 256K | lthn/lemmy | LetheanNetwork/lemmy |
| Lemrd | Gemma 4 31B Dense | 30.7B | 256K | lthn/lemrd | LetheanNetwork/lemrd |
All models share the same chat template, tokeniser, and inference pipeline. Swap the model name in any command above to use a different variant. See the model showcase for detailed specs and descriptions.
Why EUPL-1.2
Lemma models are licensed under the European Union Public Licence v1.2 — not Apache 2.0 or MIT. This is a deliberate choice.
23 Official Languages
The only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.
Copyleft with Compatibility
Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.
No Proprietary Capture
Anyone can use Lemma commercially — but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.
Built for Institutions
Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.
Download from HuggingFace
All Lemma models are available under EUPL-1.2. Consumer models (GGUF + MLX) at lthn/, base models (BF16 safetensors) at LetheanNetwork/.