LLM Evaluation with Unit Tests in Mellea

Classes

CLASS Message

Schema for a message in the test data. Attributes:
  • role: The role of the message sender (e.g. "user" or "assistant").
  • content: The text content of the message.

CLASS Example

Schema for an example in the test data. Attributes:
  • input: The input messages for this example.
  • targets: The expected target messages for scoring.
  • input_id: An optional identifier for this input example.

CLASS TestData

Schema for test data loaded from json. Attributes:
  • source: Origin identifier for this test dataset.
  • name: Human-readable name for this test dataset.
  • instructions: Evaluation guidelines used by the judge model.
  • examples: The individual input/target example pairs.
  • id: Unique identifier for this test dataset.
Methods:

FUNC validate_examples

validate_examples(cls, v: list[Example]) -> list[Example]
Validate that the examples list is not empty. Args:
  • v: The value of the examples field being validated.
Returns:
  • list[Example]: The validated examples list, unchanged.
Raises:
  • ValueError: If the examples list is empty.
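A minimal sketch of a JSON test-data object matching the attributes documented above, with the non-empty check that validate_examples performs. The exact JSON nesting (messages as role/content objects inside each example) is an assumption based on the Message and Example schemas; the field names come from this reference.

```python
import json

# Hypothetical test-data object matching the TestData schema above;
# the exact nesting is an assumption, not taken from Mellea source.
raw = """
{
  "source": "local",
  "name": "greeting-test",
  "instructions": "Judge whether the reply greets the user politely.",
  "id": "test-001",
  "examples": [
    {
      "input": [{"role": "user", "content": "Hello!"}],
      "targets": [{"role": "assistant", "content": "Hi there!"}],
      "input_id": "ex-1"
    }
  ]
}
"""

data = json.loads(raw)

# Mirror of the validate_examples check: an empty examples list is rejected.
if not data["examples"]:
    raise ValueError("examples list must not be empty")

print(len(data["examples"]))  # number of input/target example pairs
```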

CLASS TestBasedEval

Each TestBasedEval represents a single unit test. Args:
  • source: Origin identifier for this test dataset.
  • name: Human-readable name for this test.
  • instructions: Evaluation guidelines used by the judge model.
  • inputs: The input texts for each example.
  • targets: Expected target strings for each input. None is treated as an empty list.
  • test_id: Optional unique identifier for this test.
  • input_ids: Optional identifiers for each input.
Methods:

FUNC parts

parts(self) -> list[Component | CBlock]
Return the constituent parts of this component. Returns:
  • list[Component | CBlock]: Always an empty list; the component renders entirely via format_for_llm.

FUNC format_for_llm

format_for_llm(self) -> TemplateRepresentation
Format this test for judge evaluation. Returns:
  • TemplateRepresentation: A template representation containing the judge context (input, prediction, target, guidelines) set by set_judge_context, or an empty args dict if no context has been set yet.

FUNC set_judge_context

set_judge_context(self, input_text: str, prediction: str, targets_for_input: list[str]) -> None
Set the context dictionary used when formatting this test for judge evaluation. Args:
  • input_text: The original input text shown to the model.
  • prediction: The model’s generated output to evaluate.
  • targets_for_input: Reference target strings for this input. An empty list results in "N/A" as the target text.
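The context assembly described here can be sketched as a plain function, including the documented "N/A" fallback for an empty targets list. This is a standalone illustration; the dictionary keys and the join of multiple targets are assumptions, not Mellea's actual internals.

```python
# Illustrative sketch of the judge-context assembly described for
# set_judge_context; the dict keys are assumptions, not Mellea's API.
def build_judge_context(input_text: str, prediction: str,
                        targets_for_input: list[str]) -> dict[str, str]:
    # An empty targets list is rendered as "N/A", per the documented behavior.
    target_text = "\n".join(targets_for_input) if targets_for_input else "N/A"
    return {
        "input": input_text,
        "prediction": prediction,
        "target": target_text,
    }

ctx = build_judge_context("Hello!", "Hi there!", [])
print(ctx["target"])  # empty targets fall back to "N/A"
```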

FUNC from_json_file

from_json_file(cls, filepath: str) -> list['TestBasedEval']
Load test evaluations from a JSON file, returning one TestBasedEval per unit test. Args:
  • filepath: Path to a JSON file containing one test-data object or a JSON array of test-data objects.
Returns:
  • list[TestBasedEval]: A list of TestBasedEval instances, one for each object found in the file.
Raises:
  • ValueError: If any test-data object in the file does not conform to the TestData schema.
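from_json_file accepts either a single test-data object or a JSON array of them. A stdlib-only sketch of that normalization, assuming the file shapes described above; the loader function and file handling here are illustrative, not Mellea code.

```python
import json
import tempfile

# Sketch of the single-object-or-array handling described for
# from_json_file; stdlib-only illustration, not the Mellea implementation.
def load_test_objects(filepath: str) -> list[dict]:
    with open(filepath) as f:
        data = json.load(f)
    # A single test-data object is wrapped so callers always get a list,
    # one entry per unit test.
    return data if isinstance(data, list) else [data]

# Write a one-object file to show the single-object shape.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"name": "single"}, f)
    single_path = f.name

print(len(load_test_objects(single_path)))  # one object yields a one-item list
```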