Prerequisites: Quick Start complete,
pip install mellea, Ollama running locally, pytest installed.
Contributing to Mellea itself? See the Contributing Guide
for Mellea’s own test markers, fixtures, and CI setup.
Testing generative code requires you to separate concerns: some assertions are
always deterministic (the output is the right type), while others depend on model
behaviour and are inherently qualitative. This page shows you how to structure
both categories, configure the right pytest markers, and make your CI pipeline
fast and reliable.
Three levels of assertion
Every test for a @generative function falls into one of three levels:
| Level | What you assert | Deterministic? |
|---|---|---|
| Type check | isinstance(result, bool) | Yes — constrained decoding always returns the declared type |
| Structural check | result in ["positive", "negative"] or field names present | Yes — schema enforcement is deterministic |
| Qualitative check | assert result is True | No — depends on the model and prompt |
Type and structural checks run in CI. Qualitative checks carry
@pytest.mark.qualitative and are skipped when the CI environment variable is set (CI=true).
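A single test can touch all three levels at once. The sketch below uses a hypothetical grade_essay function and the session fixture defined in the next section; note that a bare int return type carries no enforced range, so the bound is a qualitative check and the test is marked accordingly:

```python
import pytest

from mellea import generative


@generative
def grade_essay(text: str) -> int:
    """Grade the essay on a scale from 1 to 10."""


@pytest.mark.qualitative
def test_grade_essay_levels(session):
    result = grade_essay(session, text="A thoughtful essay about testing.")
    assert isinstance(result, int)  # Level 1: type check (deterministic)
    # Level 2 would need a constrained type (Literal, Pydantic model);
    # a bare int has no enforced range, so the bound below is Level 3:
    assert 1 <= result <= 10  # Qualitative: model-dependent
```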
Setting up a test session fixture
Use a backend fixture to handle CI versus local configuration, and a
function-scoped session fixture to give each test a clean slate:
```python
import os

import pytest

from mellea import MelleaSession
from mellea.backends.ollama import OllamaModelBackend

_MODEL_ID = "granite4:micro"


@pytest.fixture(scope="module")
def backend():
    """Ollama backend — swap for any backend your app uses."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    return OllamaModelBackend(model_id=_MODEL_ID, host=host)


@pytest.fixture(scope="function")
def session(backend):
    """Fresh MelleaSession for each test."""
    m = MelleaSession(backend=backend)
    yield m
    m.reset()
```
Note: Scoping backend to module and session to function strikes a
balance between setup cost and test isolation. Each test gets a clean context,
but the backend connection is created once per module.
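If you would rather have tests skip than error when no Ollama server is running, the backend fixture can probe the server first. A sketch of that variant, using only the standard library (Ollama answers a plain GET on its root URL):

```python
import os
import urllib.request

import pytest

from mellea.backends.ollama import OllamaModelBackend

_MODEL_ID = "granite4:micro"


@pytest.fixture(scope="module")
def backend():
    """Skip dependent tests when Ollama is unreachable."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    try:
        urllib.request.urlopen(host, timeout=2)
    except OSError:
        pytest.skip(f"Ollama not reachable at {host}")
    return OllamaModelBackend(model_id=_MODEL_ID, host=host)
```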
Module-level markers
Declare markers at the top of your test file with pytestmark so they apply to
every test in the module without repetition. Register custom markers under the
markers key of [tool.pytest.ini_options] in pyproject.toml to avoid
unknown-marker warnings:
```toml
[tool.pytest.ini_options]
markers = [
    "qualitative: tests that assert on LLM output content (skipped in CI)",
    "requires_ollama: tests that need Ollama running locally",
]
```
Then, at the top of each test module:

```python
import pytest

pytestmark = [pytest.mark.requires_ollama]
```
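Registered markers also let you slice the suite from the command line, independently of any CI environment variable:

```bash
pytest -m "not qualitative"      # type and structural checks only
pytest -m "not requires_ollama"  # skip tests that need a live backend
```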
Testing @generative functions
Type assertions — always deterministic
The return type of a @generative function is enforced by constrained decoding
or output parsing. An isinstance check never depends on model behaviour:
```python
from typing import Literal

from mellea import generative


@generative
def classify_sentiment(text: str) -> Literal["positive", "negative"]:
    """Classify the sentiment of the provided text."""


def test_classify_sentiment_type(session):
    result = classify_sentiment(session, text="I love this product!")
    # Type check: always passes regardless of which value the model chose.
    assert isinstance(result, str)
```
Structural assertions — always deterministic
For Literal return types, membership in the allowed values is enforced before
your test sees the result. The assertion is still deterministic:
```python
def test_classify_sentiment_structure(session):
    result = classify_sentiment(session, text="I love this product!")
    assert result in ["positive", "negative"]
```
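Because membership is guaranteed by the schema, structural checks parametrize safely across inputs. A sketch reusing classify_sentiment from above:

```python
import pytest


@pytest.mark.parametrize(
    "text",
    [
        "I love this product!",
        "This is the worst purchase I have ever made.",
        "It arrived on a Tuesday.",
    ],
)
def test_classify_sentiment_membership(session, text):
    result = classify_sentiment(session, text=text)
    # Deterministic for every input: the Literal type constrains the output.
    assert result in ["positive", "negative"]
```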
For Pydantic model return types, assert that the required fields are present and
have the right types:
```python
from pydantic import BaseModel

from mellea import generative


class Review(BaseModel):
    summary: str
    score: int
    tags: list[str]


@generative
def extract_review(raw: str) -> Review:
    """Extract a structured review from raw text."""


def test_extract_review_structure(session):
    result = extract_review(
        session,
        raw="Excellent build quality. I rate it 9 out of 10. #durable #premium",
    )
    assert isinstance(result, Review)
    assert isinstance(result.summary, str)
    assert isinstance(result.score, int)
    assert isinstance(result.tags, list)
```
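Where a field has a known valid range, you can often promote a would-be qualitative check into a structural one by declaring the bound in the schema. A sketch using Pydantic's Field: whether the backend enforces the bound during decoding depends on the backend, but Pydantic validates it when the output is parsed, so an out-of-range value raises instead of silently reaching your assertions:

```python
from pydantic import BaseModel, Field


class BoundedReview(BaseModel):
    summary: str
    # The bound now lives in the schema: an out-of-range score fails
    # Pydantic validation rather than passing through to the test.
    score: int = Field(ge=1, le=10)
    tags: list[str]
```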
Qualitative assertions — mark and skip in CI
When you want to assert on the content of a response, add
@pytest.mark.qualitative. These tests are skipped automatically in CI
(CI=true) and are intended to run locally or in a dedicated quality gate:
```python
import pytest

from mellea import generative


@generative
def is_happy(text: str) -> bool:
    """Determine if the text has a happy mood."""


@pytest.mark.qualitative
def test_is_happy_positive(session):
    result = is_happy(session, text="I'm enjoying life.")
    assert isinstance(result, bool)
    # Qualitative: the correct answer is True, but this is model-dependent.
    assert result is True


@pytest.mark.qualitative
def test_classify_sentiment_positive(session):
    result = classify_sentiment(session, text="I love this product!")
    assert result == "positive"
```
Warning: Do not assert on qualitative behaviour without @pytest.mark.qualitative.
A deterministic-looking assertion like assert score > 5 can flake across
model versions, temperatures, and quantisation levels.
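Even locally, a qualitative test can be flaky on borderline inputs. One common mitigation, sketched below as a continuation of the is_happy example, is a best-of-n vote (the choice of n=3 is arbitrary):

```python
@pytest.mark.qualitative
def test_is_happy_positive_best_of_three(session):
    # Pass if the model answers True on any of three attempts;
    # trades strictness for stability on borderline phrasing.
    results = [is_happy(session, text="I'm enjoying life.") for _ in range(3)]
    assert any(results)
```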
Testing instruct() calls
Tests of instruct() calls stay deterministic as long as they assert on
structure rather than content. Check that the call returns a value and that
the value has the right type:
```python
from mellea.stdlib.sampling import RejectionSamplingStrategy


def test_instruct_returns_string(session):
    res = session.instruct(
        "Write an email to the interns.",
        requirements=["be funny"],
        strategy=RejectionSamplingStrategy(loop_budget=3),
    )
    assert res is not None
    assert isinstance(res.value, str)
```
Inspecting logged model options
_generate_log.model_options lets you confirm that options you passed were
forwarded to the model. This is useful when testing custom model option handling:
```python
from mellea.backends import ModelOption


def test_model_options_forwarded(session):
    model_options = {
        ModelOption.TEMPERATURE: 0.5,
        ModelOption.MAX_NEW_TOKENS: 100,
        "custom_param": "should_pass_through",
    }
    res = session.instruct(
        "Write a one-sentence summary.",
        model_options=model_options,
    )
    assert "custom_param" in res._generate_log.model_options
```
Note: _generate_log is an internal attribute. Its structure may change
between Mellea versions. Use it for debugging and option-forwarding tests, not
as a primary correctness check.
Using simple_validate for deterministic checks
simple_validate wraps a plain function into a validation callable that
Requirement accepts. Use it to assert deterministic structural constraints
inside the IVR loop, or directly in tests to verify that your validator logic
behaves correctly:
```python
from mellea.stdlib.requirements import simple_validate


def not_empty(text: str):
    """Plain validator: returns (passed, reason)."""
    return (len(text) > 0, "Output must not be empty.")


def test_simple_validate_logic():
    """Unit-test validator logic without making any LLM calls."""
    # simple_validate wraps not_empty into the Context -> ValidationResult
    # callable that Requirement accepts.
    validator = simple_validate(not_empty)
    assert callable(validator)

    # The underlying function can be tested directly:
    ok, _ = not_empty("hello")
    assert ok is True

    empty_ok, reason = not_empty("")
    assert empty_ok is False
    assert "empty" in reason
```
When you attach simple_validate to a Requirement, it checks the last model
output as a string, regardless of how the output was parsed:
```python
from mellea.stdlib.requirements import Requirement, simple_validate
from mellea.stdlib.sampling import RejectionSamplingStrategy


def test_with_simple_validate_requirement(session):
    res = session.instruct(
        "Reply with a number between 1 and 10.",
        requirements=[
            Requirement(
                "Reply with a number between 1 and 10.",
                validation_fn=simple_validate(
                    lambda x: (x.strip().isdigit(), "Expected a digit.")
                ),
            )
        ],
        strategy=RejectionSamplingStrategy(loop_budget=5),
    )
    assert res is not None
    assert isinstance(res.value, str)
```
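Any deterministic property of the raw output string can be packaged the same way. For example, a hypothetical reusable word-count requirement:

```python
from mellea.stdlib.requirements import Requirement, simple_validate

# Hypothetical requirement: cap the response at 50 words.
max_50_words = Requirement(
    "Use at most 50 words.",
    validation_fn=simple_validate(
        lambda x: (len(x.split()) <= 50, "Response exceeds 50 words.")
    ),
)
```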
The unit_test_eval component
mellea.stdlib.components.unit_test_eval provides TestBasedEval, a
Component that formats an LLM-as-a-judge evaluation task. You load test cases
from a JSON file and pass them to a judge session. This is useful for offline
evaluation pipelines, not for individual pytest assertions.
Each entry in the JSON array defines one test:
```json
[
  {
    "source": "email-classifier",
    "name": "positive_case_001",
    "instructions": "Evaluate whether the prediction correctly identifies the category.",
    "id": "tc-001",
    "examples": [
      {
        "input_id": "ex-001",
        "input": [{"role": "user", "content": "Is this email spam?"}],
        "targets": [{"role": "assistant", "content": "no"}]
      }
    ]
  }
]
```
Loading and running evaluations
```python
from mellea import start_session
from mellea.stdlib.components.unit_test_eval import TestBasedEval

# Load one TestBasedEval per test definition in the file.
test_evals = TestBasedEval.from_json_file("tests/eval_data/email_classifier.json")

judge_session = start_session()

for eval_case in test_evals:
    for idx, input_text in enumerate(eval_case.inputs):
        # Generate the prediction from the system under test.
        prediction = "no"  # replace with your actual model call
        targets = eval_case.targets[idx] if eval_case.targets else []
        eval_case.set_judge_context(input_text, prediction, targets)
        verdict = judge_session.instruct(eval_case)
        print(f"{eval_case.name}: {verdict.value}")
```
Note: TestBasedEval calls the judge model once per input. For large
evaluation sets, consider batching or running evaluations asynchronously.
CI strategy
A simple conftest.py that skips qualitative tests in CI:
```python
# conftest.py
import os

import pytest


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "qualitative: assert on LLM output content — skip in CI"
    )


def pytest_collection_modifyitems(config, items):
    if os.environ.get("CI"):
        skip = pytest.mark.skip(reason="qualitative tests skipped in CI")
        for item in items:
            if "qualitative" in item.keywords:
                item.add_marker(skip)
```
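The same hook extends naturally to other markers. A sketch of a variant that also gates requires_ollama, using a hypothetical OLLAMA_IN_CI variable to signal that your pipeline provides an Ollama service (imports as in the conftest.py above):

```python
def pytest_collection_modifyitems(config, items):
    in_ci = bool(os.environ.get("CI"))
    has_ollama = bool(os.environ.get("OLLAMA_IN_CI"))  # hypothetical flag
    skip_qualitative = pytest.mark.skip(reason="qualitative tests skipped in CI")
    skip_ollama = pytest.mark.skip(reason="no Ollama service in this CI run")
    for item in items:
        if in_ci and "qualitative" in item.keywords:
            item.add_marker(skip_qualitative)
        if in_ci and not has_ollama and "requires_ollama" in item.keywords:
            item.add_marker(skip_ollama)
```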
Then in your GitHub Actions workflow:
```yaml
- name: Run tests
  run: pytest
  env:
    CI: "true"  # qualitative tests are automatically skipped
```
To run the full suite, including qualitative tests, run pytest locally with CI unset:
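```bash
pytest            # CI unset, so qualitative tests run as well
CI=true pytest    # reproduce the CI selection locally
```

The marker strategy at a glance: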
| Test category | Marker | Runs in CI? |
|---|---|---|
| Type and structural checks | (none needed) | Yes |
| Qualitative content checks | @pytest.mark.qualitative | No — skipped when CI=true |
| Tests needing a running backend | @pytest.mark.requires_ollama | Only if Ollama is in CI |
| Long-running tests | @pytest.mark.slow | Optionally excluded |
Next steps