Usage Guide · Lemma Model Family

How to Use

Run any Lemma model locally in minutes with Ollama, Docker, llama.cpp, MLX, Python, or an OpenAI-compatible API server.

Quick Start

Every model works with the same tools. Swap the model name to switch between variants. Examples below use lemma — replace with lemer, lemmy, or lemrd as needed.

Ollama

ollama run hf.co/lthn/lemma:Q4_K_M

Also available: Q5_K_M, Q6_K, Q8_0, BF16
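
If you would rather drive the pulled model from Python, the ollama client library talks to the same local instance. A minimal sketch, assuming the Q4_K_M tag above has already been pulled and the Ollama service is running:

import ollama  # uv pip install ollama

# Chat against the locally pulled model; the name must match the pulled tag
response = ollama.chat(
    model="hf.co/lthn/lemma:Q4_K_M",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response["message"]["content"])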

Docker

# From HuggingFace
docker model run hf.co/lthn/lemma

# From Docker Hub
docker model run lthn/lemma

llama.cpp

# Install
brew install llama.cpp        # macOS/Linux
winget install llama.cpp      # Windows

# Server with web UI
llama-server -hf lthn/lemma:Q4_K_M

# CLI inference
llama-cli -hf lthn/lemma:Q4_K_M

MLX (Apple Silicon)

uv tool install mlx-lm

mlx_lm.chat --model lthn/lemma
mlx_lm.generate --model lthn/lemma \
  --prompt "Hello, how are you?"

Native Apple Silicon — no emulation, no quantisation overhead on M-series
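
The same weights can also be driven from Python through the mlx-lm API. A minimal sketch, assuming mlx-lm is installed in the active environment (e.g. uv pip install mlx-lm):

from mlx_lm import load, generate

# Load weights and tokenizer from the HuggingFace repo
model, tokenizer = load("lthn/lemma")

# Format the prompt with the model's chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=200))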

Python Libraries

llama-cpp-python

uv pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)

# Text
llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "Hello, how are you?"
    }]
)

# Vision (multimodal)
llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "image.jpg"}}
        ]
    }]
)
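
Note that llama-cpp-python fetches whatever is behind image_url, so a bare relative path like image.jpg may not resolve; embedding the file as a base64 data URI is the dependable route. The helper below (local_image_to_data_uri) is illustrative, not part of the library:

import base64
import mimetypes

def local_image_to_data_uri(path: str) -> str:
    # Encode a local image file as a base64 data URI
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Drop-in for the plain filename in the vision example above:
# {"type": "image_url", "image_url": {"url": local_image_to_data_uri("image.jpg")}}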

mlx-vlm (vision)

uv pip install mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemma")
config = load_config("lthn/lemma")

image = ["photo.jpg"]
prompt = "Describe this image."

formatted = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(
    model, processor, formatted, image
)
print(output.text)

mlx-vlm handles vision tensor routing for Gemma 4. Use this for multimodal inference on Apple Silicon.

Servers (OpenAI-compatible API)

Serve any Lemma model as an OpenAI-compatible API endpoint: a drop-in replacement for api.openai.com running locally (localhost:8080/v1 for MLX and llama.cpp, localhost:8000/v1 for vLLM).

MLX Server

Lemma models are multimodal, so use mlx_vlm.server — the vision-aware variant. The text-only mlx_lm.server does not correctly route multimodal tensors for Gemma 4.

# Start the vision-aware server
mlx_vlm.server --model lthn/lemma

# Query it from another terminal
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "lthn/lemma",
    "messages": [{
      "role": "user",
      "content": "Hello, how are you?"
    }],
    "max_tokens": 200
  }'

vLLM

vLLM requires the original (non-quantised) safetensors weights from the LetheanNetwork/ repos, and a Linux host with an NVIDIA GPU and adequate VRAM.

uv pip install vllm
vllm serve "LetheanNetwork/lemma"

Serves at http://localhost:8000/v1 by default. Compatible with the OpenAI Python SDK.
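
Once the server is up, the standard OpenAI Python client can point at it. A minimal sketch; the api_key value is a placeholder, since a default vLLM server does not check it:

from openai import OpenAI

# Point the client at the local vLLM endpoint instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="LetheanNetwork/lemma",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)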

llama.cpp Server

The built-in llama-server includes a web UI and serves an OpenAI-compatible API. Works on macOS, Linux, and Windows.

llama-server -hf lthn/lemma:Q4_K_M

Serves at http://localhost:8080 with web UI and /v1/chat/completions endpoint.
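
The same OpenAI client works against llama-server. A minimal streaming sketch; the model field is illustrative, as the server hosts a single model and typically accepts any name here:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Stream tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="lthn/lemma",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)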

Recommended Sampling

Gemma 4 is calibrated for a temperature of 1.0, not the 0.7 default typical of other models; lower values reduce diversity without improving quality. These defaults are pre-configured in the params file (Ollama) and in generation_config.json (transformers/MLX). The sketch after the table shows how to pass them explicitly per request.

Parameter     Value
temperature   1.0
top_p         0.95
top_k         64
stop          <end_of_turn>, <eos>
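
These values can also be passed explicitly per request. A sketch using llama-cpp-python with the same model as the Python example above:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)

# Recommended sampling parameters, passed per request
llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)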

Variable Image Resolution

Gemma 4 supports a configurable visual token budget. Higher = more detail, lower = faster inference. Place image content before text for best results.

Token budget   Use case
70             Classification, captioning, video frames
140            General image understanding
280            Default: balanced quality and speed
560            OCR, document parsing, fine-grained detail
1120           Maximum detail (small text, complex documents)

Default budget (280) is set in processor_config.json via image_seq_length and max_soft_tokens.
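
For a local copy of the weights, the two keys named above can be edited directly. A minimal sketch; the ./lemma path is a placeholder for wherever the repo is checked out:

import json
from pathlib import Path

# Placeholder path: point at your local copy of the model repo
config_path = Path("./lemma/processor_config.json")

config = json.loads(config_path.read_text())
# Raise the visual token budget, e.g. for OCR and document parsing
config["image_seq_length"] = 560
config["max_soft_tokens"] = 560
config_path.write_text(json.dumps(config, indent=2))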

The Lemma Family

Name    Architecture          Parameters                Context   Consumer (GGUF + MLX)   Base (BF16)
Lemer   Gemma 4 E2B           2.3B eff                  128K      lthn/lemer              LetheanNetwork/lemer
Lemma   Gemma 4 E4B           4.5B eff                  128K      lthn/lemma              LetheanNetwork/lemma
Lemmy   Gemma 4 26B A4B MoE   3.8B active / 26B total   256K      lthn/lemmy              LetheanNetwork/lemmy
Lemrd   Gemma 4 31B Dense     30.7B                     256K      lthn/lemrd              LetheanNetwork/lemrd

All models share the same chat template, tokeniser, and inference pipeline. Swap the model name in any command above to use a different variant. See the model showcase for detailed specs and descriptions.

Why EUPL-1.2

Lemma models are licensed under the European Union Public Licence v1.2 — not Apache 2.0 or MIT. This is a deliberate choice.

23 Official Languages

The only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.

Copyleft with Compatibility

Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.

No Proprietary Capture

Anyone can use Lemma commercially — but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.

Built for Institutions

Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.

Download from HuggingFace

All Lemma models are available under EUPL-1.2. Consumer models (GGUF + MLX) at lthn/, base models (BF16 safetensors) at LetheanNetwork/.