Execution engine for the test-based LLM evaluation pipeline. Loads JSON test files into TestBasedEval objects and, for each test, runs a generator model to produce responses and a separate judge model to score them. Parses the judge output for a `{"score": ..., "justification": ...}` JSON fragment, aggregates per-input pass/fail counts, and saves the full results to JSON or JSONL.

Functions

FUNC create_session

create_session(backend: str, model: str | None, max_tokens: int | None) -> mellea.MelleaSession
Create a mellea session with the specified backend and model. Args:
  • backend: Backend name: "ollama", "openai", "hf", "watsonx", or "litellm".
  • model: Model ID or [ModelIdentifier](../../mellea/backends/model_ids#class-modelidentifier) attribute name, or None to use the default model.
  • max_tokens: Maximum number of tokens to generate, or None for the backend default.
Returns:
  • A configured MelleaSession ready for generation.
Raises:
  • ValueError: If backend is not one of the supported backend names.
  • Exception: Re-raised from backend or session construction if initialisation fails.
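The ValueError behaviour can be pictured with this minimal sketch (names such as `check_backend` and `SUPPORTED_BACKENDS` are hypothetical; the real `create_session` also constructs the backend and session objects):

```python
# Hypothetical sketch of the backend-name check behind the ValueError above.
SUPPORTED_BACKENDS = {"ollama", "openai", "hf", "watsonx", "litellm"}

def check_backend(backend: str) -> str:
    """Return the backend name if supported, else raise ValueError."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported backend: {backend!r}")
    return backend
```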

FUNC run_evaluations

run_evaluations(test_files: list[str], backend: str, model: str | None, max_gen_tokens: int | None, judge_backend: str | None, judge_model: str | None, max_judge_tokens: int | None, output_path: str, output_format: str, continue_on_error: bool)
Run all unit-test evaluations against a generation model and a judge model. Args:
  • test_files: List of paths to JSON test files. Each file should contain "id", "source", "name", "instructions", and "examples" fields.
  • backend: Backend name for the generation model.
  • model: Model ID for the generator, or None for the default.
  • max_gen_tokens: Maximum tokens for the generator, or None for the backend default.
  • judge_backend: Backend name for the judge model, or None to reuse the generation backend.
  • judge_model: Model ID for the judge, or None for the default.
  • max_judge_tokens: Maximum tokens for the judge, or None for the backend default.
  • output_path: File path prefix for saving results.
  • output_format: Output format: "json" or "jsonl".
  • continue_on_error: If True, skip failed test evaluations instead of raising.
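The `continue_on_error` flag can be illustrated with a hypothetical driver loop, where `evaluate` stands in for the real per-file evaluation:

```python
def run_evaluations_sketch(test_files, evaluate, continue_on_error):
    """Hypothetical sketch: run evaluate() per file, optionally skipping failures."""
    results = []
    for path in test_files:
        try:
            results.append(evaluate(path))
        except Exception as exc:
            if not continue_on_error:
                raise  # fail fast when the flag is False
            print(f"Skipping {path}: {exc}")
    return results
```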

FUNC execute_test_eval

execute_test_eval(test_eval: TestBasedEval, generation_session: mellea.MelleaSession, judge_session: mellea.MelleaSession) -> TestEvalResult
Execute a single test evaluation. For each input in the test, generates a response using generation_session, then validates using judge_session. Args:
  • test_eval: The TestBasedEval object containing inputs and targets.
  • generation_session: MelleaSession used to produce model responses.
  • judge_session: MelleaSession used to score model responses.
Returns:
  • A TestEvalResult with per-input pass/fail outcomes.
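A minimal sketch of the per-input generate-then-judge loop, with `generate` and `judge` as hypothetical callables standing in for the two sessions:

```python
def execute_test_eval_sketch(inputs, targets, generate, judge):
    """Hypothetical sketch: generate a response per input, then score it."""
    results = []
    for text, target in zip(inputs, targets):
        output = generate(text)
        verdict = judge(output, target)  # e.g. {"score": 1, "justification": "..."}
        results.append({
            "input": text,
            "model_output": output,
            "passed": verdict.get("score") == 1,
            "score": verdict.get("score"),
            "justification": verdict.get("justification", ""),
        })
    return results
```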

FUNC parse_judge_output

parse_judge_output(judge_output: str)
Parse score and justification from a judge model’s output string. Args:
  • judge_output: Raw text output from the judge model.
Returns:
  • A (score, justification) tuple where score is an integer (or None if parsing failed) and justification is an explanatory string.
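One way to extract the JSON fragment described in the module overview, sketched here as a hypothetical re-implementation (it assumes the fragment contains no nested braces):

```python
import json
import re

def parse_judge_output_sketch(judge_output: str):
    """Hypothetical sketch: pull a flat {"score": ...} fragment out of free text."""
    match = re.search(r'\{[^{}]*"score"[^{}]*\}', judge_output, re.DOTALL)
    if match is None:
        return None, "Could not parse judge output"
    try:
        data = json.loads(match.group(0))
        return int(data["score"]), str(data.get("justification", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None, "Could not parse judge output"
```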

FUNC save_results

save_results(results: list[TestEvalResult], output_path: str, output_format: str)
Persist evaluation results to disk in JSON or JSONL format. Args:
  • results: List of TestEvalResult objects to serialise.
  • output_path: Destination file path (extension may be appended if it does not match output_format).
  • output_format: Format string: "json" or "jsonl".
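The JSON-vs-JSONL split and the extension handling can be sketched as follows (the helper name and the exact extension rule are assumptions, not the library's code):

```python
import json

def save_results_sketch(records, output_path, output_format):
    """Hypothetical sketch: write dicts as one JSON array or as JSON Lines."""
    if not output_path.endswith(f".{output_format}"):
        output_path += f".{output_format}"  # append extension if missing
    with open(output_path, "w", encoding="utf-8") as f:
        if output_format == "json":
            json.dump(records, f, indent=2)
        else:  # "jsonl": one JSON record per line
            for record in records:
                f.write(json.dumps(record) + "\n")
    return output_path
```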

FUNC summary_stats

summary_stats(results: list[TestEvalResult])
Print aggregated pass-rate statistics for a set of evaluation results. Args:
  • results: List of TestEvalResult objects to summarise.
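The kind of aggregation involved can be sketched over bare pass rates (the real function takes TestEvalResult objects; this helper and its output format are assumptions):

```python
def summary_stats_sketch(pass_rates):
    """Hypothetical sketch: print and return the mean of per-test pass rates."""
    total = len(pass_rates)
    mean = sum(pass_rates) / total if total else 0.0
    print(f"tests: {total}, mean pass rate: {mean:.1%}")
    return mean
```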

Classes

CLASS InputEvalResult

Store results of a single input evaluation (within a unit test). Args:
  • input_text: The raw input text sent to the generation model.
  • model_output: The text response produced by the generation model.
  • validation_passed: Whether the judge scored this response as passing.
  • score: Numeric score assigned by the judge (1 for pass, 0 for fail).
  • validation_reason: Justification text returned by the judge model.
Methods:

FUNC to_dict

to_dict(self)
Serialise the input evaluation result to a plain dictionary. Returns:
  • A dictionary with keys "input", "model_output", "passed", "score", and "justification".

CLASS TestEvalResult

Store results of a single test evaluation. Args:
  • test_eval: The unit test specification containing the test ID, name, instructions, inputs, and expected targets.
  • input_results: Per-input evaluation outcomes produced by running the generation and judge models.
Attributes:
  • passed_count: Number of inputs that received a passing score.
  • total_count: Total number of inputs evaluated.
  • pass_rate: Fraction of inputs that passed (passed_count / total_count).
Methods:

FUNC to_dict

to_dict(self)
Serialise the test evaluation result to a plain dictionary. Returns:
  • A dictionary containing the test metadata ("test_id", "source", "name", "instructions"), per-input results under "input_results", expected targets under "expected_targets", and summary counts ("passed", "total_count", "pass_rate").

FUNC passed_count

passed_count(self) -> int
Number of inputs that received a passing score.

FUNC total_count

total_count(self) -> int
Total number of inputs evaluated.

FUNC pass_rate

pass_rate(self) -> float
Fraction of inputs that passed (passed_count / total_count).
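The three derived attributes above can be pictured as properties over the per-input outcomes (a hypothetical mirror of TestEvalResult; `input_passed` stands in for the real input_results):

```python
from dataclasses import dataclass

@dataclass
class TestEvalResultSketch:
    """Hypothetical mirror of TestEvalResult's derived counts."""
    input_passed: list[bool]  # stand-in for the per-input pass/fail outcomes

    @property
    def passed_count(self) -> int:
        return sum(self.input_passed)

    @property
    def total_count(self) -> int:
        return len(self.input_passed)

    @property
    def pass_rate(self) -> float:
        # Guard against division by zero when no inputs were evaluated.
        return self.passed_count / self.total_count if self.total_count else 0.0
```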