LLM Evaluation with Unit Tests in Mellea

Classes

CLASS Message

Schema for a message in the test data. Attributes:
  • role: The role of the message sender (e.g. "user" or "assistant").
  • content: The text content of the message.

CLASS Example

Schema for an example in the test data. Attributes:
  • input: The input messages for this example.
  • targets: The expected target messages for scoring.
  • input_id: An optional identifier for this input example.

CLASS TestData

Schema for test data loaded from json. Attributes:
  • source: Origin identifier for this test dataset.
  • name: Human-readable name for this test dataset.
  • instructions: Evaluation guidelines used by the judge model.
  • examples: The individual input/target example pairs.
  • id: Unique identifier for this test dataset.
Methods:

FUNC validate_examples

validate_examples(cls, v: list[Example]) -> list[Example]
Validate that the examples list is not empty. Args:
  • v: The value of the examples field being validated.
Returns:
  • list[Example]: The validated examples list, unchanged.
Raises:
  • ValueError: If the examples list is empty.
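A minimal sketch of a JSON test-data object matching the attributes documented above, with the non-empty check that validate_examples performs. The exact JSON nesting (messages as role/content objects inside each example) is an assumption based on the Message and Example schemas; the field names come from this reference.

```python
import json

# Hypothetical test-data object matching the TestData schema above;
# the exact nesting is an assumption, not taken from Mellea source.
raw = """
{
  "source": "local",
  "name": "greeting-test",
  "instructions": "Judge whether the reply greets the user politely.",
  "id": "test-001",
  "examples": [
    {
      "input": [{"role": "user", "content": "Hello!"}],
      "targets": [{"role": "assistant", "content": "Hi there!"}],
      "input_id": "ex-1"
    }
  ]
}
"""

data = json.loads(raw)

# Mirror of the validate_examples check: an empty examples list is rejected.
if not data["examples"]:
    raise ValueError("examples list must not be empty")

print(len(data["examples"]))  # number of input/target example pairs
```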

CLASS TestBasedEval

Each TestBasedEval represents a single unit test. Args:
  • source: Origin identifier for this test dataset.
  • name: Human-readable name for this test.
  • instructions: Evaluation guidelines used by the judge model.
  • inputs: The input texts for each example.
  • targets: Expected target strings for each input. None is treated as an empty list.
  • test_id: Optional unique identifier for this test.
  • input_ids: Optional identifiers for each input.
Methods:

FUNC parts

parts(self) -> list[Component | CBlock]
Return the constituent parts of this component. Returns:
  • list[Component | CBlock]: Always an empty list; the component renders entirely via format_for_llm.

FUNC format_for_llm

format_for_llm(self) -> TemplateRepresentation
Format this test for judge evaluation. Returns:
  • TemplateRepresentation: A template representation containing the judge context (input, prediction, target, guidelines) set by set_judge_context, or an empty args dict if no context has been set yet.

FUNC set_judge_context

set_judge_context(self, input_text: str, prediction: str, targets_for_input: list[str]) -> None
Set the context dictionary used when formatting this test for judge evaluation. Args:
  • input_text: The original input text shown to the model.
  • prediction: The model’s generated output to evaluate.
  • targets_for_input: Reference target strings for this input. An empty list results in "N/A" as the target text.
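The context assembly described here can be sketched as a plain function, including the documented "N/A" fallback for an empty targets list. This is a standalone illustration; the dictionary keys and the join of multiple targets are assumptions, not Mellea's actual internals.

```python
# Illustrative sketch of the judge-context assembly described for
# set_judge_context; the dict keys are assumptions, not Mellea's API.
def build_judge_context(input_text: str, prediction: str,
                        targets_for_input: list[str]) -> dict[str, str]:
    # An empty targets list is rendered as "N/A", per the documented behavior.
    target_text = "\n".join(targets_for_input) if targets_for_input else "N/A"
    return {
        "input": input_text,
        "prediction": prediction,
        "target": target_text,
    }

ctx = build_judge_context("Hello!", "Hi there!", [])
print(ctx["target"])  # empty targets fall back to "N/A"
```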

FUNC from_json_file

from_json_file(cls, filepath: str) -> list['TestBasedEval']
Load test evaluations from a JSON file, returning one TestBasedEval per unit test. Args:
  • filepath: Path to a JSON file containing one test-data object or a JSON array of test-data objects.
Returns:
  • list[TestBasedEval]: A list of TestBasedEval instances, one for each object found in the file.
Raises:
  • ValueError: If any test-data object in the file does not conform to the TestData schema.
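from_json_file accepts either a single test-data object or a JSON array of them. A stdlib-only sketch of that normalization, assuming the file shapes described above; the loader function and file handling here are illustrative, not Mellea code.

```python
import json
import tempfile

# Sketch of the single-object-or-array handling described for
# from_json_file; stdlib-only illustration, not the Mellea implementation.
def load_test_objects(filepath: str) -> list[dict]:
    with open(filepath) as f:
        data = json.load(f)
    # A single test-data object is wrapped so callers always get a list,
    # one entry per unit test.
    return data if isinstance(data, list) else [data]

# Write a one-object file to show the single-object shape.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"name": "single"}, f)
    single_path = f.name

print(len(load_test_objects(single_path)))  # one object yields a one-item list
```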