A backend that runs vLLM in the current process. The purpose of the vLLM backend is to provide a fast, locally running inference engine.

Classes

CLASS LocalVLLMBackend

The LocalVLLMBackend uses vLLM’s Python interface for inference and uses a Formatter to convert Components into prompts. Support for [Activated LoRAs (ALoras)](https://arxiv.org/pdf/2504.12397) is planned. This backend is designed for running a Hugging Face model for small-scale inference locally on your machine. Its throughput is generally higher than that of LocalHFBackend; however, it takes longer to load the weights during instantiation, and submitting requests one at a time can be slower. Note: vLLM defaults to ~16 max new tokens. Always set ModelOption.MAX_NEW_TOKENS explicitly (100-1000+); structured output needs 200-500+ tokens. Args:
  • model_id: HuggingFace model ID used to load model weights via vLLM.
  • formatter: Formatter for rendering components into prompts. Defaults to a [TemplateFormatter](../formatters/template_formatter#class-templateformatter) for the given model_id.
  • model_options: Default model options for generation requests.
Attributes:
  • to_mellea_model_opts_map: Mapping from backend-specific option names to Mellea [ModelOption](model_options#class-modeloption) sentinel keys.
  • from_mellea_model_opts_map: Mapping from Mellea [ModelOption](model_options#class-modeloption) sentinel keys to backend-specific option names.
  • engine_args: vLLM engine arguments used at instantiation; retained so the engine can be restarted when the event loop changes.
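The two option maps translate between Mellea's generic ModelOption keys and vLLM's native option names. A minimal sketch of how such a mapping might be applied; the specific key names below are illustrative assumptions, not the backend's actual tables:

```python
# Illustrative only: rename generic option keys to backend-specific names
# using a mapping dict, passing unmapped keys through unchanged.
def remap_options(opts: dict, opts_map: dict) -> dict:
    """Return a copy of `opts` with keys renamed per `opts_map`."""
    return {opts_map.get(k, k): v for k, v in opts.items()}

# Hypothetical mapping from Mellea sentinel keys to vLLM sampling params.
from_mellea = {"@max_new_tokens": "max_tokens", "@temperature": "temperature"}

vllm_opts = remap_options({"@max_new_tokens": 512, "top_p": 0.9}, from_mellea)
```

Keeping both directions as plain dicts lets per-call options override the backend defaults with a simple merge before generation.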
Methods:

FUNC processing

processing(self, mot: ModelOutputThunk, chunk: vllm.RequestOutput)
Accumulate text from a single vLLM output chunk into the model output thunk. Called during streaming or final generation to add each incremental result to mot._underlying_value. Args:
  • mot: The output thunk being populated.
  • chunk: A single output from the vLLM generate stream.
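The accumulation step can be sketched as appending each chunk's new text to the thunk's underlying value. The classes below are simplified stand-ins for ModelOutputThunk and vllm.RequestOutput, not the real types:

```python
from dataclasses import dataclass

@dataclass
class FakeThunk:
    # Stand-in for ModelOutputThunk; only the accumulated-text field.
    _underlying_value: str = ""

@dataclass
class FakeChunk:
    # Stand-in for vllm.RequestOutput; carries one incremental result.
    text: str = ""

def processing(mot: FakeThunk, chunk: FakeChunk) -> None:
    """Append the chunk's text to the thunk, as during streaming."""
    mot._underlying_value += chunk.text

thunk = FakeThunk()
for piece in ("Hello", ", ", "world"):
    processing(thunk, FakeChunk(text=piece))
# thunk._underlying_value == "Hello, world"
```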

FUNC post_processing

post_processing(self, mot: ModelOutputThunk, conversation: list[dict], _format: type[BaseModelSubclass] | None, tool_calls: bool, tools: dict[str, AbstractMelleaTool], seed)
Finalize the model output thunk after generation completes. Parses any tool calls from the raw output, attaches the generate log, and records metadata needed for telemetry. Args:
  • mot: The output thunk to finalize.
  • conversation: The chat conversation sent to the model, used for logging.
  • _format: The structured output format class used during generation, if any.
  • tool_calls: Whether tool calling was enabled for this request.
  • tools: Available tools, keyed by name.
  • seed: The random seed used during generation, or None.

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C]], ctx: Context) -> list[ModelOutputThunk[C]]

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk[C | str]]

FUNC generate_from_raw

generate_from_raw(self, actions: Sequence[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk]
Generate completions for multiple actions without chat templating. Passes the formatted prompt strings directly to vLLM’s completion endpoint. Tool calling is not supported by this method. Args:
  • actions: Actions to generate completions for.
  • ctx: The current generation context.
  • format: Optional Pydantic model for structured output decoding.
  • model_options: Per-call model options.
  • tool_calls: Ignored; tool calling is not supported on this endpoint.
Returns:
  • list[ModelOutputThunk]: A list of model output thunks, one per action.
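The contract above is one completion per action, with formatted prompt strings passed to the completion endpoint verbatim rather than wrapped in a chat template. A toy sketch of that shape; the formatter and completion callables are stand-ins, not Mellea or vLLM APIs:

```python
def generate_from_raw(actions, format_action, complete):
    """Format each action into a raw prompt string and collect one
    completion per action, preserving input order."""
    prompts = [format_action(a) for a in actions]
    return [complete(p) for p in prompts]

outs = generate_from_raw(
    ["2+2=", "The capital of France is"],
    format_action=str,                      # stand-in Formatter
    complete=lambda p: p + " <completion>"  # stand-in completion call
)
# len(outs) == 2, one result per action, in order
```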