Loads TestBasedEval objects and, for each test, runs a generator model to produce responses and a separate judge model to score them. Parses the judge output for a `{"score": ..., "justification": ...}` JSON fragment, aggregates per-input pass/fail counts, and saves the full results to JSON or JSONL.
Functions
FUNC create_session
Args:
- backend: Backend name: "ollama", "openai", "hf", "watsonx", or "litellm".
- model: Model ID or [ModelIdentifier](../../mellea/backends/model_ids#class-modelidentifier) attribute name, or None to use the default model.
- max_tokens: Maximum number of tokens to generate, or None for the backend default.

Returns:
- A configured MelleaSession ready for generation.

Raises:
- ValueError: If backend is not one of the supported backend names.
- Exception: Re-raised from backend or session construction if initialisation fails.
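A minimal sketch of the backend-name validation described above. The backend names and the ValueError behaviour come from this doc; the function name, signature details, and returned dict are illustrative stand-ins, not the real create_session or MelleaSession:

```python
# Supported backend names, as listed in the docs above.
SUPPORTED_BACKENDS = {"ollama", "openai", "hf", "watsonx", "litellm"}

def create_session_sketch(backend, model=None, max_tokens=None):
    """Illustrative stand-in for create_session: validates the backend
    name and returns a plain dict instead of a real MelleaSession."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {backend!r}; "
            f"expected one of {sorted(SUPPORTED_BACKENDS)}"
        )
    # A real implementation would construct the backend and session here.
    return {"backend": backend, "model": model, "max_tokens": max_tokens}
```

Passing None for model or max_tokens mirrors the documented "use the default" behaviour.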
FUNC run_evaluations
Args:
- test_files: List of paths to JSON test files. Each file should contain "id", "source", "name", "instructions", and "examples" fields.
- backend: Backend name for the generation model.
- model: Model ID for the generator, or None for the default.
- max_gen_tokens: Maximum tokens for the generator, or None for the backend default.
- judge_backend: Backend name for the judge model, or None to reuse the generation backend.
- judge_model: Model ID for the judge, or None for the default.
- max_judge_tokens: Maximum tokens for the judge, or None for the backend default.
- output_path: File path prefix for saving results.
- output_format: Output format: "json" or "jsonl".
- continue_on_error: If True, skip failed test evaluations instead of raising.
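A sketch of what a test file with the required fields might look like. Only the top-level field names ("id", "source", "name", "instructions", "examples") come from the doc; the contents of each field, and in particular the shape of the "examples" entries, are assumptions for illustration:

```python
import json
import os
import tempfile

# Hypothetical test file; the nested structure of "examples" is an
# assumption, not taken from the library's documentation.
test_case = {
    "id": "t-001",
    "source": "docs",
    "name": "greeting-politeness",
    "instructions": "Respond politely to the user's greeting.",
    "examples": [
        {"input": "hi there", "target": "a polite greeting"},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "t-001.json")
with open(path, "w") as f:
    json.dump(test_case, f, indent=2)
```

A list of such paths would then be passed as test_files.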
FUNC execute_test_eval
Produces responses for a test using generation_session, then validates them using judge_session.
Args:
- test_eval: The TestBasedEval object containing inputs and targets.
- generation_session: MelleaSession used to produce model responses.
- judge_session: MelleaSession used to score model responses.
Returns:
- A TestEvalResult with per-input pass/fail outcomes.
FUNC parse_judge_output
Args:
- judge_output: Raw text output from the judge model.
Returns:
- A (score, justification) tuple where score is an integer (or None if parsing failed) and justification is an explanatory string.
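A sketch of how the JSON fragment described in the module summary could be pulled out of free-form judge text. This is a plausible re-implementation under assumptions (a single non-nested JSON object containing "score", and a fallback justification string of my own invention), not the library's actual parser:

```python
import json
import re

def parse_judge_output_sketch(judge_output):
    """Illustrative parser: find the first {"score": ..., "justification": ...}
    JSON object in the judge's free-form text. Returns (score, justification);
    score is None when no parseable fragment is found."""
    # Assumes the fragment contains no nested braces.
    match = re.search(r'\{[^{}]*"score"[^{}]*\}', judge_output, re.DOTALL)
    if not match:
        return None, "Could not parse judge output"  # fallback text is assumed
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None, "Could not parse judge output"
    return data.get("score"), data.get("justification", "")
```

Any surrounding prose the judge emits around the JSON fragment is ignored.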
FUNC save_results
Args:
- results: List of TestEvalResult objects to serialise.
- output_path: Destination file path (an extension may be appended if it does not match output_format).
- output_format: Format string: "json" or "jsonl".
FUNC summary_stats
Args:
- results: List of TestEvalResult objects to summarise.
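The doc does not state what summary_stats returns. A plausible aggregation over the passed_count and total_count fields documented on TestEvalResult might look like this (the returned keys are assumptions):

```python
def summary_stats_sketch(results):
    """Hypothetical aggregation: roll up per-test pass counts into
    overall totals. Assumes each result exposes passed_count and
    total_count, as documented for TestEvalResult."""
    passed = sum(r["passed_count"] for r in results)
    total = sum(r["total_count"] for r in results)
    return {
        "tests": len(results),
        "inputs_passed": passed,
        "inputs_total": total,
        "overall_pass_rate": passed / total if total else 0.0,
    }
```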
Classes
CLASS InputEvalResult
Store results of a single input evaluation (within a unit test).
Args:
- input_text: The raw input text sent to the generation model.
- model_output: The text response produced by the generation model.
- validation_passed: Whether the judge scored this response as passing.
- score: Numeric score assigned by the judge (1 for pass, 0 for fail).
- validation_reason: Justification text returned by the judge model.
FUNC to_dict
Returns:
- A dictionary with keys "input", "model_output", "passed", "score", and "justification".
CLASS TestEvalResult
Store results of a single test evaluation.
Args:
- test_eval: The unit test specification containing the test ID, name, instructions, inputs, and expected targets.
- input_results: Per-input evaluation outcomes produced by running the generation and judge models.
- passed_count: Number of inputs that received a passing score.
- total_count: Total number of inputs evaluated.
- pass_rate: Fraction of inputs that passed (passed_count / total_count).
FUNC to_dict
Returns:
- A dictionary containing the test metadata ("test_id", "source", "name", "instructions"), per-input results under "input_results", expected targets under "expected_targets", and summary counts ("passed", "total_count", "pass_rate").