Debug with Plugins

Prerequisites: familiarity with The Requirements System, and the hooks extra installed with uv add 'mellea[hooks]'.

Mellea's plugin system provides debug hooks that trace the full lifecycle of generation, validation, and sampling. Use these plugins to understand:

  • What prompts are sent to the LLM
  • Model latency and token usage
  • Which requirements pass/fail and why
  • When repair strategies trigger and what feedback they provide
  • End-to-end flow through the sampling loop

Built-in debug plugins

Mellea ships with three categories of debug plugins in mellea.plugins.builtin_debug:

Generation pipeline plugins

Trace all LLM backend calls with request/response inspection, latency, and tokens.

from mellea.plugins.builtin_debug.generation import (
    log_generation_pre_call,
    log_generation_post_call,
)
from mellea.plugins import register

register([
    log_generation_pre_call,
    log_generation_post_call,
])

Output:

[📤 GEN-PRE-CALL gen_id=abc123...] model=granite4.1:3b | prompt=Write a thank you note
[📥 GEN-POST-CALL gen_id=abc123...] model=granite4.1:3b | latency=397ms | tokens=(47+19=66) | response=hello there thank you...

Logs:

  • Generation ID for correlation
  • Model being called
  • Request: prompt preview (first 100 chars)
  • Response: preview, latency, token counts
  • Repair feedback when present (shows guidance the model receives during repair)
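Because the post-call line has a regular shape, latency and token counts can be pulled out of captured logs for quick analysis. The following is a minimal sketch, assuming the line format matches the sample output above; that format is not a documented contract and may change. `parse_post_call` is a hypothetical helper, not part of Mellea, and reading the two token numbers as prompt and completion counts is also an assumption.

```python
import re
from typing import Optional

# Pattern modelled on the sample GEN-POST-CALL line; the format is an
# assumption and may differ between Mellea versions.
POST_CALL = re.compile(
    r"GEN-POST-CALL gen_id=(?P<gen_id>[^\]\s]+)\]"
    r".*?model=(?P<model>\S+)"
    r".*?latency=(?P<latency_ms>\d+)ms"
    r".*?tokens=\((?P<prompt_tokens>\d+)\+(?P<completion_tokens>\d+)=(?P<total_tokens>\d+)\)"
)


def parse_post_call(line: str) -> Optional[dict]:
    """Return the fields of a GEN-POST-CALL line, or None if it does not match."""
    match = POST_CALL.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("latency_ms", "prompt_tokens", "completion_tokens", "total_tokens"):
        fields[key] = int(fields[key])
    return fields


sample = (
    "[GEN-POST-CALL gen_id=abc123...] model=granite4.1:3b | latency=397ms "
    "| tokens=(47+19=66) | response=hello there thank you..."
)
parsed = parse_post_call(sample)
print(parsed)
```

The pattern uses `search` rather than `match`, so a leading emoji or timestamp in front of the tag does not prevent a match.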

Validation pipeline plugins

Trace requirement validation with pre-check setup and per-requirement results.

from mellea.plugins.builtin_debug.validation import (
    log_validation_pre_check,
    log_validation_post_check,
)
from mellea.plugins import register

register([
    log_validation_pre_check,
    log_validation_post_check,
])

Output:

[🔍 VALIDATION-PRE-CHECK] requirements=3 | target=ModelOutputThunk

[❌ VALIDATION-POST-CHECK] MIXED RESULTS: 2/3 passed, 1/3 failed
✓ Use only lowercase letters
✓ Include the phrase 'thank you'
❌ Start with a greeting
└─ validated as "no"

Logs:

  • Pre-check: how many requirements, what's being validated
  • Post-check: overall pass/fail counts
  • Per-requirement status with reasons for failures

Sampling pipeline plugins

Trace the sampling strategy lifecycle including iterations, validation results, and repair events.

from mellea.plugins.builtin_debug.sampling import (
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
)
from mellea.plugins import register

register([
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
])

Output:

[🎯 SAMPLING-START] strategy=RepairTemplateStrategy | loop_budget=3 | requirements=3

[❌ SAMPLING-ITER 1] FAILED: 2/3 validations passed
❌ Start with a greeting

[🔧 REPAIR-TRIGGERED] at iteration 1
repair_type=template
failed_validations:
• Start with a greeting

[✅ SAMPLING-ITER 2] SUCCESS: 3/3 validations passed

[🎉 SAMPLING-END] SUCCESS in 2 iteration(s) using RepairTemplateStrategy
total_attempts=2
best_validation_score=3/3

Logs:

  • Loop start: strategy, budget, requirement count
  • Each iteration: pass/fail count, failed requirement names
  • Repair events: when triggered, repair type, failed requirements
  • Loop end: success/failure, iterations used, final statistics
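To make the order of these events concrete, here is a toy model of a generate, validate, repair loop. It is a sketch for illustration only and is not Mellea's implementation: `sample_with_repair`, `fake_model`, the feedback wording, and the choice to skip repair on the final iteration are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# A requirement is a (description, check function) pair in this toy model.
Requirement = Tuple[str, Callable[[str], bool]]


def sample_with_repair(
    generate: Callable[[str], str],
    prompt: str,
    requirements: List[Requirement],
    loop_budget: int = 3,
) -> Tuple[bool, int, List[str]]:
    """Toy generate -> validate -> repair loop; returns (success, iterations, events)."""
    total = len(requirements)
    events = [f"SAMPLING-START loop_budget={loop_budget} requirements={total}"]
    current_prompt = prompt
    for iteration in range(1, loop_budget + 1):
        output = generate(current_prompt)
        failed = [name for name, check in requirements if not check(output)]
        passed = total - len(failed)
        if not failed:
            events.append(f"SAMPLING-ITER {iteration} SUCCESS {passed}/{total}")
            events.append(f"SAMPLING-END SUCCESS in {iteration} iteration(s)")
            return True, iteration, events
        events.append(f"SAMPLING-ITER {iteration} FAILED {passed}/{total}")
        if iteration < loop_budget:
            # Feed the failed requirements back into the next prompt.
            events.append(f"REPAIR-TRIGGERED at iteration {iteration}: {failed}")
            current_prompt = prompt + "\nPlease fix: " + "; ".join(failed)
    events.append(f"SAMPLING-END FAILURE after {loop_budget} iteration(s)")
    return False, loop_budget, events


def fake_model(prompt: str) -> str:
    # Stand-in for an LLM: it only adds a greeting once it sees repair feedback.
    if "Please fix" in prompt:
        return "hello there, thank you for the gift"
    return "thank you for the gift"


requirements = [
    ("Use only lowercase letters", lambda text: text == text.lower()),
    ("Include the phrase 'thank you'", lambda text: "thank you" in text),
    ("Start with a greeting", lambda text: text.startswith("hello")),
]

success, iterations, events = sample_with_repair(
    fake_model, "Write a thank you note", requirements
)
print("\n".join(events))
```

With this stand-in model, the first attempt fails the greeting requirement, one repair is triggered, and the second attempt passes all three checks, mirroring the sample output above.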

Enabling multiple plugins together

Combine plugins for complete end-to-end visibility:

from mellea.plugins.builtin_debug.generation import (
    log_generation_pre_call,
    log_generation_post_call,
)
from mellea.plugins.builtin_debug.validation import (
    log_validation_pre_check,
    log_validation_post_check,
)
from mellea.plugins.builtin_debug.sampling import (
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
)
from mellea.plugins import register

register([
    # Generation hooks
    log_generation_pre_call,
    log_generation_post_call,
    # Validation hooks
    log_validation_pre_check,
    log_validation_post_check,
    # Sampling hooks
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
])

This reveals the complete flow:

[🎯 SAMPLING-START] strategy=... | loop_budget=... | requirements=...

[📤 GEN-PRE-CALL] prompt=...
[📥 GEN-POST-CALL] response=... | latency=... | tokens=...

[🔍 VALIDATION-PRE-CHECK] requirements=... | target=...
[📤 GEN-PRE-CALL] prompt=Start with a greeting (validation check)
[📥 GEN-POST-CALL] response=no
[❌ VALIDATION-POST-CHECK] MIXED RESULTS: 2/3 passed, 1/3 failed

[❌ SAMPLING-ITER 1] FAILED: 2/3 validations passed

[🔧 REPAIR-TRIGGERED] at iteration 1
failed_validations: Start with a greeting

[📤 GEN-PRE-CALL] prompt=Write a thank you note
[⭐ REPAIR ATTEMPT] Repair feedback provided: ...
[📥 GEN-POST-CALL] response=... | latency=... | tokens=...

[🔍 VALIDATION-PRE-CHECK] requirements=... | target=...
[📤 GEN-PRE-CALL] prompt=Start with a greeting
[📥 GEN-POST-CALL] response=yes
[✅ VALIDATION-POST-CHECK] ALL PASSED: 3/3 requirements

[✅ SAMPLING-ITER 2] SUCCESS: 3/3 validations passed

[🎉 SAMPLING-END] SUCCESS in 2 iteration(s)

Example scripts

Ready-to-run examples are available in docs/examples/plugins/:

| Script | Plugins | Purpose |
| --- | --- | --- |
| builtin_generation_tracing.py | Generation | Basic model call tracing |
| builtin_validation_tracing.py | Validation | Requirement validation |
| builtin_validation_failures.py | Validation | Show validation failures |
| builtin_sampling_diagnostics.py | Sampling | Strategy iterations |
| builtin_full_pipeline_tracing.py | Generation + Sampling | End-to-end with model visibility |
| builtin_complete_diagnostics.py | All 3 | Complete pipeline with validation |

Run any example:

uv run python docs/examples/plugins/builtin_generation_tracing.py
uv run python docs/examples/plugins/builtin_validation_failures.py
uv run python docs/examples/plugins/builtin_complete_diagnostics.py

Common debugging scenarios

"Why is the model generating a different response than I expected?"

Enable generation tracing to see:

  • Exactly what prompt was sent
  • Model's latency and token usage
  • Response preview
  • When repair feedback is provided (if using RepairTemplateStrategy)

This shows whether the issue is in the prompt, model behavior, or repair strategy.

"Why are my requirements failing?"

Enable validation tracing to see:

  • Each requirement being checked
  • Pass/fail status per requirement
  • Failure reason (e.g., "validated as 'no'")
  • Pass/fail counts

This pinpoints which requirements are problematic and why.

"Why isn't the repair strategy helping?"

Enable all three plugin categories to see:

  • Initial attempt (generation + validation)
  • What failed (validation results)
  • Repair feedback provided (in generation pre-call logs)
  • Second attempt with feedback (generation + validation)
  • Whether the repair improved the results

This reveals whether the repair strategy is receiving the right feedback and whether the model is responding to it appropriately.

"Why is sampling taking so long?"

Enable sampling tracing to see:

  • How many iterations ran
  • Validation results per iteration
  • When repairs were triggered
  • Total attempts before success/failure

This identifies whether the issue is budget exhaustion, frequent failures, or ineffective repair.
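When a trace is long, counting events is easier than reading it line by line. The sketch below tallies iterations, repairs, and LLM calls in a captured trace and sums the reported latency. It assumes the bracketed tags and the `latency=...ms` field look like the sample output in this page; `summarize` is a hypothetical helper, and the trace is a hand-written sample rather than real output.

```python
import re
from collections import Counter

# Matches the tag right after "[", skipping leading non-word characters such
# as an emoji and a space. The tag names are taken from the sample output and
# are an assumption about the log format, not a stable interface.
TAG = re.compile(r"\[[^\]\w]*(?P<tag>[A-Z]+(?:-[A-Z]+)*)")


def summarize(trace: str) -> dict:
    """Count sampling events and add up generation latency in a captured trace."""
    tags = Counter(match.group("tag") for match in TAG.finditer(trace))
    latencies = [int(ms) for ms in re.findall(r"latency=(\d+)ms", trace)]
    return {
        "iterations": tags["SAMPLING-ITER"],
        "repairs": tags["REPAIR-TRIGGERED"],
        "llm_calls": tags["GEN-POST-CALL"],
        "total_latency_ms": sum(latencies),
    }


# Hand-written sample trace for illustration; the values are made up.
trace = """\
[SAMPLING-START] strategy=RepairTemplateStrategy | loop_budget=3 | requirements=3
[GEN-POST-CALL gen_id=a1] model=granite4.1:3b | latency=397ms | tokens=(47+19=66)
[SAMPLING-ITER 1] FAILED: 2/3 validations passed
[REPAIR-TRIGGERED] at iteration 1
[GEN-POST-CALL gen_id=a2] model=granite4.1:3b | latency=412ms | tokens=(61+21=82)
[SAMPLING-ITER 2] SUCCESS: 3/3 validations passed
[SAMPLING-END] SUCCESS in 2 iteration(s)
"""
summary = summarize(trace)
print(summary)
```

Many iterations with few repairs suggests repeated plain retries, while a high total latency with few iterations points at slow individual model calls.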

Controlling log output

By default, debug plugins log at INFO level for important events and DEBUG level for details. Control verbosity:

import logging

# Pick one level. basicConfig only takes effect the first time it is called
# (unless force=True is passed), so don't call it twice.

# Show only failures and key events
logging.basicConfig(level=logging.INFO)

# Or: show all details, including passed requirements
# logging.basicConfig(level=logging.DEBUG)

# Silence noisy third-party loggers
logging.getLogger("httpx").setLevel(logging.ERROR)
logging.getLogger("ollama").setLevel(logging.ERROR)
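Log levels control how much is shown, but not which pipeline stage it comes from. One way to narrow output further is a standard `logging.Filter` that keeps only messages carrying certain tags. This is a sketch using only the Python standard library; it assumes the debug messages contain the bracketed tags shown earlier, and `"demo.debug"` is a placeholder logger name, since the logger the plugins write to is not stated here.

```python
import logging


class TagFilter(logging.Filter):
    """Pass only records whose message contains one of the given tags."""

    def __init__(self, *tags: str) -> None:
        super().__init__()
        self.tags = tags

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        return any(tag in message for tag in self.tags)


class ListHandler(logging.Handler):
    """Collect messages in memory (handy for tests and notebooks)."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.addFilter(TagFilter("SAMPLING-", "REPAIR-"))

# Placeholder logger name (an assumption): attach the handler to whichever
# logger the debug plugins actually use, or to the root logger.
logger = logging.getLogger("demo.debug")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("[GEN-PRE-CALL] prompt=Write a thank you note")
logger.info("[SAMPLING-ITER 1] FAILED: 2/3 validations passed")
logger.info("[REPAIR-TRIGGERED] at iteration 1")

print(handler.messages)
```

Attaching the filter to a handler rather than to a logger means other handlers on the same logger still receive the full, unfiltered stream.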

Performance notes

Debug plugins have minimal overhead:

  • Pre-hooks check whether plugins are registered before building payloads
  • Logging is formatted efficiently
  • No plugins fire in the hot path when not registered

For production use, you can safely leave plugins registered; they only log when enabled. For maximum performance, simply don't register them.

Next steps