A backend that uses the Huggingface Transformers library. The purpose of the Huggingface backend is to provide a setting for implementing experimental features. If you want a performant local backend and do not need experimental features such as span-based context or ALoras, consider using an Ollama backend instead.

Classes

CLASS HFAloraCacheInfo

A dataclass for holding a KV cache and associated generation metadata. Used by LocalHFBackend to store intermediate model state that can be reused across generation requests via an LRU cache. Args:
  • kv_cache: The HuggingFace DynamicCache holding precomputed key/value tensors, or None if not available.
  • merged_token_ids: Token IDs corresponding to the cached prefix.
  • merged_attention: Attention mask for the cached prefix tokens.
  • q_end: Index of the last prompt token in the merged token sequence; defaults to -1.
  • scores: Optional logit scores from the generation step; defaults to None.
Attributes:
  • kv_cache: The cached key/value tensors.
  • merged_token_ids: Token IDs for the cached prefix.
  • merged_attention: Attention mask for the cached prefix.
  • q_end: End index of the prompt portion in merged token IDs.
  • scores: Logit scores from generation, or None.
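For illustration, the shape of this dataclass can be mirrored in plain Python. This is a sketch of the documented fields only, not the actual Mellea definition; in the real class, kv_cache holds a HuggingFace DynamicCache rather than an arbitrary object:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class HFAloraCacheInfoSketch:
    """Illustrative stand-in mirroring the documented fields."""
    kv_cache: Optional[Any]       # DynamicCache in the real class, or None
    merged_token_ids: list        # token IDs for the cached prefix
    merged_attention: list        # attention mask for the cached prefix
    q_end: int = -1               # end index of the prompt portion; defaults to -1
    scores: Optional[Any] = None  # optional logit scores; defaults to None

info = HFAloraCacheInfoSketch(
    kv_cache=None,
    merged_token_ids=[1, 2, 3],
    merged_attention=[1, 1, 1],
)
```

Note how the defaults match the documented ones: q_end starts at -1 and scores at None unless set explicitly.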

CLASS LocalHFBackend

The LocalHFBackend uses Huggingface’s transformers library for inference, and uses a Formatter to convert Components into prompts. This backend also supports [Activated LoRAs (ALoras)](https://arxiv.org/pdf/2504.12397). It is designed for running an HF model for small-scale inference locally on your machine; it is NOT designed for inference scaling on CUDA-enabled hardware. Args:
  • model_id: Used to load the model and tokenizer via HuggingFace Auto* classes.
  • formatter: Formatter for rendering components into prompts. Defaults to [TemplateFormatter](../formatters/template_formatter#class-templateformatter).
  • use_caches: If False, KV caching is disabled even if a Cache is provided.
  • cache: Caching strategy; defaults to SimpleLRUCache(0, on_evict=_cleanup_kv_cache).
  • custom_config: Override for tokenizer/model/device; if provided, model_id is not used for loading.
  • default_to_constraint_checking_alora: If False, aLoRA constraint checking is deactivated; mainly for benchmarking and debugging.
  • model_options: Default model options for generation requests.
Attributes:
  • to_mellea_model_opts_map: Mapping from HF-specific option names to Mellea [ModelOption](model_options#class-modeloption) sentinel keys.
  • from_mellea_model_opts_map: Mapping from Mellea sentinel keys to HF-specific option names.
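The eviction behavior of the cache argument can be pictured with a minimal LRU cache that invokes a callback for each dropped entry. This is a simplified stand-in for SimpleLRUCache and _cleanup_kv_cache, not Mellea's implementation:

```python
from collections import OrderedDict

class TinyLRUCache:
    """Minimal LRU cache that calls on_evict for each dropped entry."""
    def __init__(self, capacity, on_evict=None):
        self.capacity = capacity
        self.on_evict = on_evict
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.capacity:
            _, evicted = self._data.popitem(last=False)  # drop least recently used
            if self.on_evict is not None:
                self.on_evict(evicted)  # e.g. free KV-cache tensors

evicted = []
cache = TinyLRUCache(2, on_evict=evicted.append)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)  # capacity exceeded: "a" is evicted through the hook
```

An eviction hook like this is what lets the backend release GPU memory held by stale KV caches instead of waiting for garbage collection.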
Methods:

FUNC processing

processing(self, mot: ModelOutputThunk, chunk: str | GenerateDecoderOnlyOutput, input_ids)
Accumulate decoded text from a streaming chunk or full generation output. For streaming responses the chunk is an already-decoded string from AsyncTextIteratorStreamer; for non-streaming it is a GenerateDecoderOnlyOutput that is decoded here. Args:
  • mot: The output thunk being populated.
  • chunk: A decoded text chunk (streaming) or a full HuggingFace generation output object (non-streaming).
  • input_ids: The prompt token IDs used for decoding; required to slice off the prompt portion from the generated sequences.
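The two cases described above can be sketched in plain Python. This is illustrative only; the real method decodes with the backend's tokenizer and writes into the thunk:

```python
def accumulate_stream(thunk_text, chunk):
    # Streaming case: the chunk is already-decoded text, so just append it.
    return thunk_text + chunk

def decode_completion(sequence, input_ids, decode=lambda toks: toks):
    # Non-streaming case: HF generate returns prompt + completion token IDs,
    # so slice off the prompt prefix before decoding.
    return decode(sequence[len(input_ids):])

text = accumulate_stream("Hello", ", world")
completion = decode_completion([11, 22, 33, 44], input_ids=[11, 22])
```

The slicing step is why input_ids is required even in the non-streaming case: without the prompt length, the prompt tokens would be decoded into the output.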

FUNC post_processing

post_processing(self, mot: ModelOutputThunk, conversation: list[dict], _format: type[BaseModelSubclass] | None, tool_calls: bool, tools: dict[str, AbstractMelleaTool], seed, input_ids)
Finalize the output thunk after HuggingFace generation completes. Stores the KV cache for future reuse, parses tool calls if applicable, records token usage metrics, emits telemetry, and attaches the generate log. Args:
  • mot: The output thunk to finalize.
  • conversation: The chat conversation sent to the model, used for logging.
  • _format: The structured output format class used during generation, if any.
  • tool_calls: Whether tool calling was enabled for this request.
  • tools: Available tools, keyed by name.
  • seed: The random seed used during generation, or None.
  • input_ids: The prompt token IDs; used to compute token counts and for KV cache bookkeeping.

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C]], ctx: Context) -> list[ModelOutputThunk[C]]

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk[C | str]]

FUNC generate_from_raw

generate_from_raw(self, actions: Sequence[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk]
Generate completions for multiple actions without chat templating. Passes formatted prompt strings directly to the HuggingFace model’s generate method as a batch. Tool calling is not supported. Args:
  • actions: Actions to generate completions for.
  • ctx: The current generation context.
  • format: Optional Pydantic model for structured output decoding via llguidance.
  • model_options: Per-call model options.
  • tool_calls: Ignored; tool calling is not supported on this endpoint.
Returns:
  • list[ModelOutputThunk]: A list of model output thunks, one per action.
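The batching contract can be sketched as follows. The formatter and batched-generate callables here are hypothetical placeholders standing in for the backend's Formatter and the HuggingFace generate call:

```python
def generate_from_raw_sketch(actions, formatter, generate_batch):
    # Render each action to a raw prompt string; no chat template is applied.
    prompts = [formatter(action) for action in actions]
    # A single batched call to the model's generate method.
    outputs = generate_batch(prompts)
    # One output per action, in order.
    return list(zip(actions, outputs))

pairs = generate_from_raw_sketch(
    ["summarize X", "translate Y"],
    formatter=str.upper,                              # hypothetical formatter
    generate_batch=lambda ps: [p + "!" for p in ps],  # fake batched model
)
```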

FUNC cache_get

cache_get(self, id: str | int) -> HFAloraCacheInfo | None
Retrieve a cached HFAloraCacheInfo entry by its key. Args:
  • id: The cache key to look up.
Returns:
  • HFAloraCacheInfo | None: The cached entry, or None if not found.

FUNC cache_put

cache_put(self, id: str | int, v: HFAloraCacheInfo)
Store an HFAloraCacheInfo entry in the cache under the given key. Args:
  • id: The cache key to store the entry under.
  • v: The cache entry containing KV cache state and associated generation metadata.

FUNC base_model_name

base_model_name(self)
Returns the base_model_id of the model used by the backend. For example, granite-3.3-8b-instruct for ibm-granite/granite-3.3-8b-instruct.
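One plausible reading of the documented example is that the namespace prefix is stripped from the HuggingFace model id. A sketch of that mapping (not necessarily the actual implementation):

```python
def base_model_name(model_id: str) -> str:
    # Drop the organization/namespace prefix from a HuggingFace model id.
    return model_id.split("/")[-1]

name = base_model_name("ibm-granite/granite-3.3-8b-instruct")
```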

FUNC add_adapter

add_adapter(self, adapter: LocalHFAdapter)
Register a LoRA/aLoRA adapter with this backend so it can be loaded later. Downloads the adapter weights (via adapter.get_local_hf_path) and records the adapter in the backend’s registry. The adapter must not already be registered with a different backend. Args:
  • adapter: The adapter to register with this backend.
Raises:
  • Exception: If adapter has already been added to a different backend.

FUNC load_adapter

load_adapter(self, adapter_qualified_name: str)
Load a previously registered adapter into the underlying HuggingFace model. The adapter must have been registered via add_adapter first. Do not call this method while generation requests are in progress. Args:
  • adapter_qualified_name: The adapter.qualified_name of the adapter to load (i.e., "<name>_<adapter_type>").
Raises:
  • ValueError: If no adapter with the given qualified name has been added to this backend.

FUNC unload_adapter

unload_adapter(self, adapter_qualified_name: str)
Unload a previously loaded adapter from the underlying HuggingFace model. If the adapter is not currently loaded, a log message is emitted and the method returns without error. Args:
  • adapter_qualified_name: The adapter.qualified_name of the adapter to unload.

FUNC list_adapters

list_adapters(self) -> list[str]
List the qualified names of all adapters currently loaded in this backend. Returns:
  • list[str]: Qualified adapter names (i.e. adapter.qualified_name) for all adapters that have been loaded via load_adapter.
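The add/load/unload/list lifecycle described above can be sketched with toy bookkeeping. This is an illustration of the documented contract, not Mellea's implementation; the real methods also download weights and touch the underlying HF model:

```python
class AdapterRegistrySketch:
    """Toy bookkeeping for the add/load/unload/list adapter lifecycle."""
    def __init__(self):
        self._registered = {}  # qualified name -> adapter record
        self._loaded = set()

    def add_adapter(self, name, adapter_type):
        qname = f"{name}_{adapter_type}"  # documented "<name>_<adapter_type>" form
        self._registered[qname] = (name, adapter_type)
        return qname

    def load_adapter(self, qname):
        # Loading requires prior registration, as documented.
        if qname not in self._registered:
            raise ValueError(f"no adapter registered as {qname!r}")
        self._loaded.add(qname)

    def unload_adapter(self, qname):
        # Unloading an adapter that is not loaded is a no-op, as documented.
        self._loaded.discard(qname)

    def list_adapters(self):
        return sorted(self._loaded)

reg = AdapterRegistrySketch()
qname = reg.add_adapter("constraint", "alora")
reg.load_adapter(qname)
reg.unload_adapter("missing_lora")  # not loaded: returns without error
```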