Classes
CLASS LocalVLLMBackend
The LocalVLLMBackend uses vLLM's Python interface for inference and a Formatter to convert Components into prompts.
Support for [Activated LoRAs (ALoras)](https://arxiv.org/pdf/2504.12397) is planned.
This backend is designed for small-scale local inference with a Hugging Face model on your machine.
Its throughput is generally higher than that of LocalHFBackend.
However, it takes longer to load the model weights during instantiation, and submitting requests one at a time can be slower.
Note: vLLM defaults to generating only ~16 new tokens. Always set ModelOption.MAX_NEW_TOKENS explicitly (100-1000+).
Structured output usually needs 200-500+ tokens.
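As a rough sketch of why the explicit option matters (the key name and fallback below are illustrative assumptions, not Mellea's actual internals):

```python
# Illustrative sketch only: the option key and fallback are assumptions,
# not Mellea's real option-resolution code.
VLLM_DEFAULT_MAX_NEW_TOKENS = 16  # vLLM's effective default is very small

def resolve_max_new_tokens(model_options: dict) -> int:
    # Prefer an explicit per-request setting; otherwise fall back to
    # vLLM's tiny default, which truncates most structured output.
    return model_options.get("max_new_tokens", VLLM_DEFAULT_MAX_NEW_TOKENS)

print(resolve_max_new_tokens({}))                      # falls back to 16
print(resolve_max_new_tokens({"max_new_tokens": 512})) # explicit override
```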
Args:
- model_id: HuggingFace model ID used to load model weights via vLLM.
- formatter: Formatter for rendering components into prompts. Defaults to a [TemplateFormatter](../formatters/template_formatter#class-templateformatter) for the given model_id.
- model_options: Default model options for generation requests.
- to_mellea_model_opts_map: Mapping from backend-specific option names to Mellea [ModelOption](model_options#class-modeloption) sentinel keys.
- from_mellea_model_opts_map: Mapping from Mellea [ModelOption](model_options#class-modeloption) sentinel keys to backend-specific option names.
- engine_args: vLLM engine arguments used at instantiation; retained so the engine can be restarted when the event loop changes.
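The two option-name maps amount to a key-renaming pass between Mellea sentinel keys and vLLM's native option names. A minimal sketch (the concrete key names here are assumptions for illustration, not the library's real mappings):

```python
# Illustrative only: concrete key names are assumptions.
from_mellea_map = {
    "@max_new_tokens@": "max_tokens",  # Mellea sentinel -> vLLM name
    "@temperature@": "temperature",
}

def to_backend_opts(opts: dict, rename: dict) -> dict:
    # Rename any Mellea sentinel keys to backend-specific names;
    # keys without a mapping pass through unchanged.
    return {rename.get(k, k): v for k, v in opts.items()}

backend_opts = to_backend_opts({"@max_new_tokens@": 256, "seed": 7}, from_mellea_map)
print(backend_opts)  # {"max_tokens": 256, "seed": 7}
```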
FUNC processing
Processes a single streamed output chunk, accumulating its text into mot._underlying_value.
Args:
- mot: The output thunk being populated.
- chunk: A single output from the vLLM generate stream.
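The chunk-accumulation step can be sketched with a stand-in thunk class (illustrative names only; not Mellea's actual types or implementation):

```python
# Illustrative stand-in for ModelOutputThunk; not Mellea's real class.
class OutputThunk:
    def __init__(self):
        self._underlying_value = None  # populated as chunks stream in

def processing(mot: OutputThunk, chunk: str) -> None:
    # Append each streamed chunk's text to the thunk being populated.
    mot._underlying_value = (mot._underlying_value or "") + chunk

mot = OutputThunk()
for chunk in ["Hel", "lo, ", "world"]:
    processing(mot, chunk)
print(mot._underlying_value)  # "Hello, world"
```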
FUNC post_processing
Args:
- mot: The output thunk to finalize.
- conversation: The chat conversation sent to the model, used for logging.
- _format: The structured output format class used during generation, if any.
- tool_calls: Whether tool calling was enabled for this request.
- tools: Available tools, keyed by name.
- seed: The random seed used during generation, or None.
FUNC generate_from_raw
Args:
- actions: Actions to generate completions for.
- ctx: The current generation context.
- format: Optional Pydantic model for structured output decoding.
- model_options: Per-call model options.
- tool_calls: Ignored; tool calling is not supported on this endpoint.
Returns:
- list[ModelOutputThunk]: A list of model output thunks, one per action.
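The one-output-per-action contract can be sketched as an order-preserving batch mapping (a hypothetical stand-in, not the real implementation):

```python
# Hypothetical stand-in: real thunks wrap model outputs, but the
# alignment contract is the same -- output i corresponds to action i.
def generate_from_raw(actions: list[str]) -> list[str]:
    return [f"<completion for {a!r}>" for a in actions]

outs = generate_from_raw(["summarize", "translate"])
print(len(outs))  # one output per action, in order
```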