A backend that runs vLLM in the current process. The purpose of the vLLM backend is to provide a fast, locally running inference engine.

Classes

CLASS LocalVLLMBackend

The LocalVLLMBackend uses vLLM’s Python interface for inference and uses a Formatter to convert Components into prompts. Support for [Activated LoRAs (ALoras)](https://arxiv.org/pdf/2504.12397) is planned. This backend is designed for running a Hugging Face model for small-scale inference locally on your machine. Its throughput is generally higher than that of LocalHFBackend; however, it takes longer to load the weights during instantiation, and submitting requests one at a time can be slower. Note: vLLM defaults to ~16 max new tokens. Always set ModelOption.MAX_NEW_TOKENS explicitly (100-1000+); structured output needs 200-500+ tokens. Args:
  • model_id: HuggingFace model ID used to load model weights via vLLM.
  • formatter: Formatter for rendering components into prompts. Defaults to a [TemplateFormatter](../formatters/template_formatter#class-templateformatter) for the given model_id.
  • model_options: Default model options for generation requests.
Attributes:
  • to_mellea_model_opts_map: Mapping from backend-specific option names to Mellea [ModelOption](model_options#class-modeloption) sentinel keys.
  • from_mellea_model_opts_map: Mapping from Mellea [ModelOption](model_options#class-modeloption) sentinel keys to backend-specific option names.
  • engine_args: vLLM engine arguments used at instantiation; retained so the engine can be restarted when the event loop changes.
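The two option maps translate between Mellea's generic ModelOption keys and vLLM's native option names. A minimal sketch of how such a mapping might be applied; the specific key names below are illustrative assumptions, not the backend's actual tables:

```python
# Illustrative only: rename generic option keys to backend-specific names
# using a mapping dict, passing unmapped keys through unchanged.
def remap_options(opts: dict, opts_map: dict) -> dict:
    """Return a copy of `opts` with keys renamed per `opts_map`."""
    return {opts_map.get(k, k): v for k, v in opts.items()}

# Hypothetical mapping from Mellea sentinel keys to vLLM sampling params.
from_mellea = {"@max_new_tokens": "max_tokens", "@temperature": "temperature"}

vllm_opts = remap_options({"@max_new_tokens": 512, "top_p": 0.9}, from_mellea)
```

Keeping both directions as plain dicts lets per-call options override the backend defaults with a simple merge before generation.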
Methods:

FUNC processing

processing(self, mot: ModelOutputThunk, chunk: vllm.RequestOutput)
Accumulate text from a single vLLM output chunk into the model output thunk. Called during streaming or final generation to add each incremental result to mot._underlying_value. Args:
  • mot: The output thunk being populated.
  • chunk: A single output from the vLLM generate stream.
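The accumulation step can be sketched as appending each chunk's new text to the thunk's underlying value. The classes below are simplified stand-ins for ModelOutputThunk and vllm.RequestOutput, not the real types:

```python
from dataclasses import dataclass

@dataclass
class FakeThunk:
    # Stand-in for ModelOutputThunk; only the accumulated-text field.
    _underlying_value: str = ""

@dataclass
class FakeChunk:
    # Stand-in for vllm.RequestOutput; carries one incremental result.
    text: str = ""

def processing(mot: FakeThunk, chunk: FakeChunk) -> None:
    """Append the chunk's text to the thunk, as during streaming."""
    mot._underlying_value += chunk.text

thunk = FakeThunk()
for piece in ("Hello", ", ", "world"):
    processing(thunk, FakeChunk(text=piece))
# thunk._underlying_value == "Hello, world"
```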

FUNC post_processing

post_processing(self, mot: ModelOutputThunk, conversation: list[dict], _format: type[BaseModelSubclass] | None, tool_calls: bool, tools: dict[str, AbstractMelleaTool], seed)
Finalize the model output thunk after generation completes. Parses any tool calls from the raw output, attaches the generate log, and records metadata needed for telemetry. Args:
  • mot: The output thunk to finalize.
  • conversation: The chat conversation sent to the model, used for logging.
  • _format: The structured output format class used during generation, if any.
  • tool_calls: Whether tool calling was enabled for this request.
  • tools: Available tools, keyed by name.
  • seed: The random seed used during generation, or None.

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C]], ctx: Context) -> list[ModelOutputThunk[C]]

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk[C | str]]

FUNC generate_from_raw

generate_from_raw(self, actions: Sequence[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk]
Generate completions for multiple actions without chat templating. Passes the formatted prompt strings directly to vLLM’s completion endpoint. Tool calling is not supported by this method. Args:
  • actions: Actions to generate completions for.
  • ctx: The current generation context.
  • format: Optional Pydantic model for structured output decoding.
  • model_options: Per-call model options.
  • tool_calls: Ignored; tool calling is not supported on this endpoint.
Returns:
  • list[ModelOutputThunk]: A list of model output thunks, one per action.
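The contract above is one completion per action, with formatted prompt strings passed to the completion endpoint verbatim rather than wrapped in a chat template. A toy sketch of that shape; the formatter and completion callables are stand-ins, not Mellea or vLLM APIs:

```python
def generate_from_raw(actions, format_action, complete):
    """Format each action into a raw prompt string and collect one
    completion per action, preserving input order."""
    prompts = [format_action(a) for a in actions]
    return [complete(p) for p in prompts]

outs = generate_from_raw(
    ["2+2=", "The capital of France is"],
    format_action=str,                      # stand-in Formatter
    complete=lambda p: p + " <completion>"  # stand-in completion call
)
# len(outs) == 2, one result per action, in order
```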