How to Use
Run any Lemma model locally in minutes via Ollama, Docker, llama.cpp, MLX, Python libraries, or an OpenAI-compatible API server.
Quick Start
Every model works with the same tools. Swap the model name to switch between variants.
Examples below use lemma; replace with lemer, lemmy, or lemrd as needed.
Ollama
ollama run hf.co/lthn/lemma:Q4_K_M
Also available: Q5_K_M, Q6_K, Q8_0, BF16
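The pulled model is also reachable programmatically. A minimal sketch using the official ollama Python client (pip install ollama); the model name must match the tag pulled above:
import ollama

# Chat with the locally pulled model via Ollama's API (localhost:11434).
response = ollama.chat(
    model="hf.co/lthn/lemma:Q4_K_M",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response["message"]["content"])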
Docker
# From HuggingFace
docker model run hf.co/lthn/lemma
# From Docker Hub
docker model run lthn/lemma
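Docker Model Runner also exposes an OpenAI-compatible endpoint. A hedged sketch with the OpenAI Python SDK; the base URL below assumes host-side TCP access is enabled (a Docker Desktop setting), and the port and path may differ by Docker version:
from openai import OpenAI

# Assumption: Docker Model Runner's host-side TCP access is enabled;
# 12434 and /engines/v1 are its defaults but may vary by setup.
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="none")
resp = client.chat.completions.create(
    model="lthn/lemma",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(resp.choices[0].message.content)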
llama.cpp
# Install
brew install llama.cpp # macOS/Linux
winget install llama.cpp # Windows
# Server with web UI
llama-server -hf lthn/lemma:Q4_K_M
# CLI inference
llama-cli -hf lthn/lemma:Q4_K_M
MLX (Apple Silicon)
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemma
mlx_lm.generate --model lthn/lemma \
--prompt "Hello, how are you?"
Native Apple Silicon — no emulation, no quantisation overhead on M-series
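The same models can be driven from Python. A short text-only sketch with the mlx_lm API (installed above), following the mlx-lm README pattern; for images, use mlx-vlm below:
from mlx_lm import load, generate

model, tokenizer = load("lthn/lemma")

# Format the conversation with the model's chat template, then generate.
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)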
Python Libraries
llama-cpp-python
uv pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)

# Text
llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "Hello, how are you?",
    }]
)

# Vision (multimodal)
llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "image.jpg"}},
        ],
    }]
)
mlx-vlm (vision)
uv tool install mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model, processor, and config once.
model, processor = load("lthn/lemma")
config = load_config("lthn/lemma")

image = ["photo.jpg"]
prompt = "Describe this image."

# Apply the chat template, then run multimodal generation.
formatted = apply_chat_template(
    processor, config, prompt, num_images=1
)
output = generate(
    model, processor, formatted, image
)
print(output.text)
mlx-vlm handles vision tensor routing for Gemma 4. Use this for multimodal inference on Apple Silicon.
Servers (OpenAI-compatible API)
Serve any Lemma model as an OpenAI-compatible API endpoint: a drop-in replacement for api.openai.com at localhost:8080/v1.
MLX Server
Lemma models are multimodal, so use mlx_vlm.server — the vision-aware variant.
The text-only mlx_lm.server does not correctly route multimodal tensors for Gemma 4.
mlx_vlm.server --model lthn/lemma
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "lthn/lemma",
    "messages": [{
      "role": "user",
      "content": "Hello, how are you?"
    }],
    "max_tokens": 200
  }'
vLLM
vLLM requires the original (non-quantised) safetensors weights from the LetheanNetwork/ repos, plus Linux and an NVIDIA GPU with adequate VRAM.
uv pip install vllm
vllm serve "LetheanNetwork/lemma"
Serves at http://localhost:8000/v1 by default. Compatible with the OpenAI Python SDK.
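Any OpenAI client then works against the local endpoint. For example, with the OpenAI Python SDK (the api_key value is a placeholder; vLLM does not require one by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="LetheanNetwork/lemma",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)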
llama.cpp Server
The built-in llama-server includes a web UI and serves an OpenAI-compatible API.
Works on macOS, Linux, and Windows.
llama-server -hf lthn/lemma:Q4_K_M
Serves at http://localhost:8080 with web UI and /v1/chat/completions endpoint.
Recommended Sampling
Gemma 4 is calibrated for temperature 1.0; this is not the same as the typical 0.7 default for other models. Lower values reduce diversity without improving quality. These defaults are pre-configured in the params file (Ollama) and generation_config.json (transformers/MLX).
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| stop | <end_of_turn>, <eos> |
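When a client does not read those config files, pass the values explicitly. For example, with llama-cpp-python from the section above:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemma",
    filename="lemma-q4_k_m.gguf",
)
llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=1.0,  # Gemma 4 calibration point
    top_p=0.95,
    top_k=64,
)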
Variable Image Resolution
Gemma 4 supports a configurable visual token budget. Higher = more detail, lower = faster inference. Place image content before text for best results.
| Token Budget | Use Case |
|---|---|
| 70 | Classification, captioning, video frames |
| 140 | General image understanding |
| 280 | Default — balanced quality and speed |
| 560 | OCR, document parsing, fine-grained detail |
| 1120 | Maximum detail (small text, complex documents) |
The default budget (280) is set in processor_config.json via image_seq_length and max_soft_tokens.
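A sketch of raising the budget from Python, assuming AutoProcessor.from_pretrained forwards these fields (named after the processor_config.json entries) as config overrides; verify the behaviour against your transformers version:
from transformers import AutoProcessor

# Assumption: keyword overrides are forwarded to the processor config;
# the field names mirror processor_config.json.
processor = AutoProcessor.from_pretrained(
    "LetheanNetwork/lemma",
    image_seq_length=560,   # higher budget for OCR / document parsing
    max_soft_tokens=560,
)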
The Lemma Family
| Name | Architecture | Parameters | Context | Consumer (GGUF + MLX) | Base (BF16) |
|---|---|---|---|---|---|
| Lemer | Gemma 4 E2B | 2.3B eff | 128K | lthn/lemer | LetheanNetwork/lemer |
| Lemma | Gemma 4 E4B | 4.5B eff | 128K | lthn/lemma | LetheanNetwork/lemma |
| Lemmy | Gemma 4 26B A4B MoE | 3.8B active / 26B total | 256K | lthn/lemmy | LetheanNetwork/lemmy |
| Lemrd | Gemma 4 31B Dense | 30.7B | 256K | lthn/lemrd | LetheanNetwork/lemrd |
All models share the same chat template, tokeniser, and inference pipeline. Swap the model name in any command above to use a different variant. See the model showcase for detailed specs and descriptions.
Why EUPL-1.2
Lemma models are licensed under the European Union Public Licence v1.2 — not Apache 2.0 or MIT. This is a deliberate choice.
23 Official Languages
The only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.
Copyleft with Compatibility
Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.
No Proprietary Capture
Anyone can use Lemma commercially — but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.
Built for Institutions
Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.
Download from HuggingFace
All Lemma models are available under EUPL-1.2. Consumer models (GGUF + MLX) at lthn/, base models (BF16 safetensors) at LetheanNetwork/.