Debug with Plugins
Prerequisites: familiarity with The Requirements System, and the hooks extra installed with uv add 'mellea[hooks]'.
Mellea's plugin system provides debug hooks that trace the full lifecycle of generation, validation, and sampling. Use these plugins to understand:
- What prompts are sent to the LLM
- Model latency and token usage
- Which requirements pass/fail and why
- When repair strategies trigger and what feedback they provide
- End-to-end flow through the sampling loop
Built-in debug plugins
Mellea ships with three categories of debug plugins in mellea.plugins.builtin_debug:
Generation pipeline plugins
Trace all LLM backend calls with request/response inspection, latency, and tokens.
from mellea.plugins.builtin_debug.generation import (
    log_generation_pre_call,
    log_generation_post_call,
)
from mellea.plugins import register

register([
    log_generation_pre_call,
    log_generation_post_call,
])
Output:
[GEN-PRE-CALL gen_id=abc123...] model=granite4.1:3b | prompt=Write a thank you note
[GEN-POST-CALL gen_id=abc123...] model=granite4.1:3b | latency=397ms | tokens=(47+19=66) | response=hello there thank you...
Logs:
- Generation ID for correlation
- Model being called
- Request: prompt preview (first 100 chars)
- Response: preview, latency, token counts
- Repair feedback when present (shows guidance the model receives during repair)
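The post-call lines can also be mined for numbers after a run. The sketch below parses the sample line shown above with a regular expression. The field layout (`latency=...ms`, `tokens=(a+b=total)`) is taken from that sample only, and reading the two addends as prompt and completion tokens is an assumption. The helper name `parse_post_call` is hypothetical and not part of Mellea.

```python
import re
from typing import Optional

# Pattern derived from the sample GEN-POST-CALL line; the real format may differ.
POST_CALL = re.compile(
    r"GEN-POST-CALL gen_id=(?P<gen_id>[^\]\s]+)\]"
    r".*?model=(?P<model>\S+)"
    r".*?latency=(?P<latency_ms>\d+)ms"
    r".*?tokens=\((?P<prompt>\d+)\+(?P<completion>\d+)=(?P<total>\d+)\)"
)

def parse_post_call(line: str) -> Optional[dict]:
    """Extract generation ID, model, latency, and token counts from one log line."""
    match = POST_CALL.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields arrive as strings; convert them for arithmetic.
    for key in ("latency_ms", "prompt", "completion", "total"):
        fields[key] = int(fields[key])
    return fields

sample = (
    "[GEN-POST-CALL gen_id=abc123...] model=granite4.1:3b | latency=397ms "
    "| tokens=(47+19=66) | response=hello there thank you..."
)
print(parse_post_call(sample))
```

Because the generation ID appears in both the pre-call and post-call lines, the same approach can pair a prompt with its response when several calls interleave.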
Validation pipeline plugins
Trace requirement validation with pre-check setup and per-requirement results.
from mellea.plugins.builtin_debug.validation import (
    log_validation_pre_check,
    log_validation_post_check,
)
from mellea.plugins import register

register([
    log_validation_pre_check,
    log_validation_post_check,
])
Output:
[VALIDATION-PRE-CHECK] requirements=3 | target=ModelOutputThunk
[VALIDATION-POST-CHECK] MIXED RESULTS: 2/3 passed, 1/3 failed
  ✓ Use only lowercase letters
  ✓ Include the phrase 'thank you'
  ✗ Start with a greeting
    └─ validated as "no"
Logs:
- Pre-check: how many requirements, what's being validated
- Post-check: overall pass/fail counts
- Per-requirement status with reasons for failures
Sampling pipeline plugins
Trace the sampling strategy lifecycle including iterations, validation results, and repair events.
from mellea.plugins.builtin_debug.sampling import (
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
)
from mellea.plugins import register

register([
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
])
Output:
[SAMPLING-START] strategy=RepairTemplateStrategy | loop_budget=3 | requirements=3
[SAMPLING-ITER 1] FAILED: 2/3 validations passed
  ✗ Start with a greeting
[REPAIR-TRIGGERED] at iteration 1
  repair_type=template
  failed_validations:
    • Start with a greeting
[SAMPLING-ITER 2] SUCCESS: 3/3 validations passed
[SAMPLING-END] SUCCESS in 2 iteration(s) using RepairTemplateStrategy
  total_attempts=2
  best_validation_score=3/3
Logs:
- Loop start: strategy, budget, requirement count
- Each iteration: pass/fail count, failed requirement names
- Repair events: when triggered, repair type, failed requirements
- Loop end: success/failure, iterations used, final statistics
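To make the event order concrete, here is a toy model of a generate, validate, repair loop in plain Python. It is not Mellea's implementation: `run_sampling_loop`, `fake_generate`, and the requirement checks are hypothetical stand-ins, and only the event names mirror the log tags above.

```python
from typing import Callable, Dict, List, Optional, Tuple

def run_sampling_loop(
    generate: Callable[[Optional[List[str]]], str],
    requirements: Dict[str, Callable[[str], bool]],
    loop_budget: int = 3,
) -> Tuple[bool, List[str]]:
    """Toy loop: generate, validate, and retry with feedback until the budget runs out."""
    total = len(requirements)
    events = [f"SAMPLING-START loop_budget={loop_budget} requirements={total}"]
    feedback: Optional[List[str]] = None
    for iteration in range(1, loop_budget + 1):
        output = generate(feedback)
        failed = [name for name, check in requirements.items() if not check(output)]
        status = "SUCCESS" if not failed else "FAILED"
        events.append(f"SAMPLING-ITER {iteration} {status}: {total - len(failed)}/{total}")
        if not failed:
            events.append(f"SAMPLING-END SUCCESS in {iteration} iteration(s)")
            return True, events
        if iteration < loop_budget:
            # Repair: names of failed requirements become feedback for the next attempt.
            events.append(f"REPAIR-TRIGGERED at iteration {iteration}")
            feedback = failed
    events.append(f"SAMPLING-END FAILURE after {loop_budget} iteration(s)")
    return False, events

# Stand-in "model": adds a greeting only once it has been told that one is missing.
def fake_generate(feedback: Optional[List[str]]) -> str:
    note = "thank you for the gift"
    return f"hello, {note}" if feedback else note

requirements = {
    "Use only lowercase letters": lambda text: text == text.lower(),
    "Include the phrase 'thank you'": lambda text: "thank you" in text,
    "Start with a greeting": lambda text: text.startswith("hello"),
}

ok, events = run_sampling_loop(fake_generate, requirements)
print("\n".join(events))
```

The toy keeps feedback as a bare list of failed requirement names. A real repair strategy presumably builds richer guidance, which is what the repair log lines are meant to expose.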
Enabling multiple plugins together
Combine plugins for complete end-to-end visibility:
from mellea.plugins.builtin_debug.generation import (
    log_generation_pre_call,
    log_generation_post_call,
)
from mellea.plugins.builtin_debug.validation import (
    log_validation_pre_check,
    log_validation_post_check,
)
from mellea.plugins.builtin_debug.sampling import (
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
)
from mellea.plugins import register

register([
    # Generation hooks
    log_generation_pre_call,
    log_generation_post_call,
    # Validation hooks
    log_validation_pre_check,
    log_validation_post_check,
    # Sampling hooks
    log_sampling_loop_start,
    log_sampling_iteration,
    log_sampling_repair,
    log_sampling_loop_end,
])
This reveals the complete flow:
[SAMPLING-START] strategy=... | loop_budget=... | requirements=...
[GEN-PRE-CALL] prompt=...
[GEN-POST-CALL] response=... | latency=... | tokens=...
[VALIDATION-PRE-CHECK] requirements=... | target=...
[GEN-PRE-CALL] prompt=Start with a greeting (validation check)
[GEN-POST-CALL] response=no
[VALIDATION-POST-CHECK] MIXED RESULTS: 2/3 passed, 1/3 failed
[SAMPLING-ITER 1] FAILED: 2/3 validations passed
[REPAIR-TRIGGERED] at iteration 1
  failed_validations: Start with a greeting
[GEN-PRE-CALL] prompt=Write a thank you note
[REPAIR ATTEMPT] Repair feedback provided: ...
[GEN-POST-CALL] response=... | latency=... | tokens=...
[VALIDATION-PRE-CHECK] requirements=... | target=...
[GEN-PRE-CALL] prompt=Start with a greeting
[GEN-POST-CALL] response=yes
[VALIDATION-POST-CHECK] ALL PASSED: 3/3 requirements
[SAMPLING-ITER 2] SUCCESS: 3/3 validations passed
[SAMPLING-END] SUCCESS in 2 iteration(s)
Example scripts
Ready-to-run examples are available in docs/examples/plugins/:
| Script | Plugins | Purpose |
|---|---|---|
| builtin_generation_tracing.py | Generation | Basic model call tracing |
| builtin_validation_tracing.py | Validation | Requirement validation |
| builtin_validation_failures.py | Validation | Show validation failures |
| builtin_sampling_diagnostics.py | Sampling | Strategy iterations |
| builtin_full_pipeline_tracing.py | Generation + Sampling | End-to-end with model visibility |
| builtin_complete_diagnostics.py | All 3 | Complete pipeline with validation |
Run any example:
uv run python docs/examples/plugins/builtin_generation_tracing.py
uv run python docs/examples/plugins/builtin_validation_failures.py
uv run python docs/examples/plugins/builtin_complete_diagnostics.py
Common debugging scenarios
"Why is the model generating a different response than I expected?"
Enable generation tracing to see:
- Exactly what prompt was sent
- Model's latency and token usage
- Response preview
- When repair feedback is provided (if using RepairTemplateStrategy)
This shows whether the issue is in the prompt, model behavior, or repair strategy.
"Why are my requirements failing?"
Enable validation tracing to see:
- Each requirement being checked
- Pass/fail status per requirement
- Failure reason (e.g., "validated as 'no'")
- Pass/fail counts
This pinpoints which requirements are problematic and why.
"Why isn't the repair strategy helping?"
Enable all three plugin categories to see:
- Initial attempt (generation + validation)
- What failed (validation results)
- Repair feedback provided (in generation pre-call logs)
- Second attempt with feedback (generation + validation)
- Whether the repair improved the results
This reveals whether the repair strategy is receiving the right feedback and whether the model is acting on it.
"Why is sampling taking so long?"
Enable sampling tracing to see:
- How many iterations ran
- Validation results per iteration
- When repairs were triggered
- Total attempts before success/failure
This identifies whether the issue is budget exhaustion, frequent failures, or ineffective repair.
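When the log is long, counting events can be quicker than reading it. The sketch below uses only the standard `logging` module: a handler tallies records whose message contains the tags shown in the sampling output. It assumes the debug plugins emit through Python's `logging` and that their records propagate to the root logger. The `EventCounter` class and the demo logger name are hypothetical.

```python
import logging
from collections import Counter

class EventCounter(logging.Handler):
    """Tally log records by the sampling tag found in their message."""

    TAGS = ("SAMPLING-START", "SAMPLING-ITER", "REPAIR-TRIGGERED", "SAMPLING-END")

    def __init__(self) -> None:
        super().__init__(level=logging.DEBUG)
        self.counts: Counter = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        message = record.getMessage()
        for tag in self.TAGS:
            if tag in message:
                self.counts[tag] += 1

counter = EventCounter()
logging.getLogger().addHandler(counter)  # attach to the root logger

# Demo with hand-written records; in a real run the plugins would emit these.
demo = logging.getLogger("demo.sampling")  # hypothetical logger name
demo.setLevel(logging.INFO)
demo.info("[SAMPLING-START] strategy=RepairTemplateStrategy | loop_budget=3")
demo.info("[SAMPLING-ITER 1] FAILED: 2/3 validations passed")
demo.info("[REPAIR-TRIGGERED] at iteration 1")
demo.info("[SAMPLING-ITER 2] SUCCESS: 3/3 validations passed")
demo.info("[SAMPLING-END] SUCCESS in 2 iteration(s)")

print(dict(counter.counts))
```

A high iteration count with few repairs would point toward budget exhaustion, while many repairs with no improvement would point toward ineffective feedback.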
Controlling log output
By default, debug plugins log at INFO level for important events and DEBUG level for details. Control verbosity:
import logging

# Pick ONE basicConfig call: a second call is a no-op unless force=True is passed.
# Show only failures and key events
logging.basicConfig(level=logging.INFO)
# Show all details, including passed requirements
# logging.basicConfig(level=logging.DEBUG)

# Quiet noisy third-party loggers (only errors get through)
logging.getLogger("httpx").setLevel(logging.ERROR)
logging.getLogger("ollama").setLevel(logging.ERROR)
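For longer sessions it can help to keep the console terse while capturing full detail in a file. A minimal sketch using only the standard library follows. It assumes the debug plugins log through Python's `logging` hierarchy, and the file name `mellea_debug.log` is an arbitrary choice.

```python
import logging

root = logging.getLogger()
root.setLevel(logging.DEBUG)  # let every record through; the handlers filter below

# Console: key events only.
console = logging.StreamHandler()
console.setLevel(logging.INFO)
console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

# File: everything, with timestamps for lining up against model latency.
to_file = logging.FileHandler("mellea_debug.log", mode="w", encoding="utf-8")
to_file.setLevel(logging.DEBUG)
to_file.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)

root.addHandler(console)
root.addHandler(to_file)

# Demo records from a hypothetical logger name.
logging.getLogger("demo").debug("detail line, file only")
logging.getLogger("demo").info("key event, console and file")
```

Setting levels on the handlers rather than on the root logger is what allows the two destinations to differ in verbosity.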
Performance notes
Debug plugins have minimal overhead:
- Pre-hooks check whether plugins are registered before building payloads
- Logging is formatted efficiently
- No plugins fire in the hot path when not registered
For production use, you can safely leave plugins registered; they only log when enabled. For maximum performance, simply don't register them.
Next steps
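The first bullet above describes a common guard pattern: skip building an expensive payload when nothing will consume it. The sketch below illustrates the idea in plain Python. `HookRegistry`, `fire`, and `build_payload` are hypothetical names and do not reflect Mellea's internals.

```python
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]

class HookRegistry:
    """Toy registry that only builds a payload when a hook is registered."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Callable[[Payload], None]]] = {}

    def register(self, event: str, hook: Callable[[Payload], None]) -> None:
        self._hooks.setdefault(event, []).append(hook)

    def fire(self, event: str, build_payload: Callable[[], Payload]) -> bool:
        hooks = self._hooks.get(event)
        if not hooks:
            return False  # fast path: the payload is never built
        payload = build_payload()
        for hook in hooks:
            hook(payload)
        return True

build_calls: List[str] = []

def build_payload() -> Payload:
    # Stands in for expensive work such as rendering a prompt preview.
    build_calls.append("built")
    return {"prompt": "Write a thank you note"}

registry = HookRegistry()
fired_without_hooks = registry.fire("generation_pre_call", build_payload)

seen: List[Payload] = []
registry.register("generation_pre_call", seen.append)
fired_with_hook = registry.fire("generation_pre_call", build_payload)

print(fired_without_hooks, fired_with_hook, len(build_calls))
```

Passing a callable instead of a ready-made payload is the design choice that matters here: the cost is deferred until a registered hook exists.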
- Observability: Tracing: export traces to Jaeger or Grafana
- Handling Exceptions and Failures: work with sampling failures
- The Requirements System: understand validation in depth