A backend that uses the Huggingface Transformers library. The purpose of the Huggingface backend is to provide a setting for implementing experimental features. If you want a performant local backend and do not need experimental features such as span-based context or ALoras, consider using an Ollama backend instead.

Classes

CLASS HFAloraCacheInfo

A dataclass for holding a KV cache and associated generation metadata. Used by LocalHFBackend to store intermediate model state that can be reused across generation requests via an LRU cache. Args:
  • kv_cache: The HuggingFace DynamicCache holding precomputed key/value tensors, or None if not available.
  • merged_token_ids: Token IDs corresponding to the cached prefix.
  • merged_attention: Attention mask for the cached prefix tokens.
  • q_end: Index of the last prompt token in the merged token sequence; defaults to -1.
  • scores: Optional logit scores from the generation step; defaults to None.
Attributes:
  • kv_cache: The cached key/value tensors.
  • merged_token_ids: Token IDs for the cached prefix.
  • merged_attention: Attention mask for the cached prefix.
  • q_end: End index of the prompt portion in merged token IDs.
  • scores: Logit scores from generation, or None.
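For illustration, the shape of this dataclass can be mirrored in plain Python. This is a sketch of the documented fields only, not the actual Mellea definition; in the real class, kv_cache holds a HuggingFace DynamicCache rather than an arbitrary object:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class HFAloraCacheInfoSketch:
    """Illustrative stand-in mirroring the documented fields."""
    kv_cache: Optional[Any]       # DynamicCache in the real class, or None
    merged_token_ids: list        # token IDs for the cached prefix
    merged_attention: list        # attention mask for the cached prefix
    q_end: int = -1               # end index of the prompt portion; defaults to -1
    scores: Optional[Any] = None  # optional logit scores; defaults to None

info = HFAloraCacheInfoSketch(
    kv_cache=None,
    merged_token_ids=[1, 2, 3],
    merged_attention=[1, 1, 1],
)
```

Note how the defaults match the documented ones: q_end starts at -1 and scores at None unless set explicitly.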

CLASS LocalHFBackend

The LocalHFBackend uses Huggingface’s transformers library for inference, and uses a Formatter to convert Components into prompts. This backend also supports [Activated LoRAs (ALoras)](https://arxiv.org/pdf/2504.12397). It is designed for running an HF model for small-scale inference locally on your machine; it is NOT designed for inference scaling on CUDA-enabled hardware. Args:
  • model_id: Used to load the model and tokenizer via HuggingFace Auto* classes.
  • formatter: Formatter for rendering components into prompts. Defaults to [TemplateFormatter](../formatters/template_formatter#class-templateformatter).
  • use_caches: If False, KV caching is disabled even if a Cache is provided.
  • cache: Caching strategy; defaults to SimpleLRUCache(0, on_evict=_cleanup_kv_cache).
  • custom_config: Override for tokenizer/model/device; if provided, model_id is not used for loading.
  • default_to_constraint_checking_alora: If False, aLoRA constraint checking is deactivated; mainly for benchmarking and debugging.
  • model_options: Default model options for generation requests.
Attributes:
  • to_mellea_model_opts_map: Mapping from HF-specific option names to Mellea [ModelOption](model_options#class-modeloption) sentinel keys.
  • from_mellea_model_opts_map: Mapping from Mellea sentinel keys to HF-specific option names.
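The eviction behavior of the cache argument can be pictured with a minimal LRU cache that invokes a callback for each dropped entry. This is a simplified stand-in for SimpleLRUCache and _cleanup_kv_cache, not Mellea's implementation:

```python
from collections import OrderedDict

class TinyLRUCache:
    """Minimal LRU cache that calls on_evict for each dropped entry."""
    def __init__(self, capacity, on_evict=None):
        self.capacity = capacity
        self.on_evict = on_evict
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.capacity:
            _, evicted = self._data.popitem(last=False)  # drop least recently used
            if self.on_evict is not None:
                self.on_evict(evicted)  # e.g. free KV-cache tensors

evicted = []
cache = TinyLRUCache(2, on_evict=evicted.append)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)  # capacity exceeded: "a" is evicted through the hook
```

An eviction hook like this is what lets the backend release GPU memory held by stale KV caches instead of waiting for garbage collection.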
Methods:

FUNC processing

processing(self, mot: ModelOutputThunk, chunk: str | GenerateDecoderOnlyOutput, input_ids)
Accumulate decoded text from a streaming chunk or full generation output. For streaming responses the chunk is an already-decoded string from AsyncTextIteratorStreamer; for non-streaming it is a GenerateDecoderOnlyOutput that is decoded here. Args:
  • mot: The output thunk being populated.
  • chunk: A decoded text chunk (streaming) or a full HuggingFace generation output object (non-streaming).
  • input_ids: The prompt token IDs used for decoding; required to slice off the prompt portion from the generated sequences.
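The two cases described above can be sketched in plain Python. This is illustrative only; the real method decodes with the backend's tokenizer and writes into the thunk:

```python
def accumulate_stream(thunk_text, chunk):
    # Streaming case: the chunk is already-decoded text, so just append it.
    return thunk_text + chunk

def decode_completion(sequence, input_ids, decode=lambda toks: toks):
    # Non-streaming case: HF generate returns prompt + completion token IDs,
    # so slice off the prompt prefix before decoding.
    return decode(sequence[len(input_ids):])

text = accumulate_stream("Hello", ", world")
completion = decode_completion([11, 22, 33, 44], input_ids=[11, 22])
```

The slicing step is why input_ids is required even in the non-streaming case: without the prompt length, the prompt tokens would be decoded into the output.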

FUNC post_processing

post_processing(self, mot: ModelOutputThunk, conversation: list[dict], _format: type[BaseModelSubclass] | None, tool_calls: bool, tools: dict[str, AbstractMelleaTool], seed, input_ids)
Finalize the output thunk after HuggingFace generation completes. Stores the KV cache for future reuse, parses tool calls if applicable, records token usage metrics, emits telemetry, and attaches the generate log. Args:
  • mot: The output thunk to finalize.
  • conversation: The chat conversation sent to the model, used for logging.
  • _format: The structured output format class used during generation, if any.
  • tool_calls: Whether tool calling was enabled for this request.
  • tools: Available tools, keyed by name.
  • seed: The random seed used during generation, or None.
  • input_ids: The prompt token IDs; used to compute token counts and for KV cache bookkeeping.

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C]], ctx: Context) -> list[ModelOutputThunk[C]]

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk[C | str]]

FUNC generate_from_raw

generate_from_raw(self, actions: Sequence[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk]
Generate completions for multiple actions without chat templating. Passes formatted prompt strings directly to the HuggingFace model’s generate method as a batch. Tool calling is not supported. Args:
  • actions: Actions to generate completions for.
  • ctx: The current generation context.
  • format: Optional Pydantic model for structured output decoding via llguidance.
  • model_options: Per-call model options.
  • tool_calls: Ignored; tool calling is not supported on this endpoint.
Returns:
  • list[ModelOutputThunk]: A list of model output thunks, one per action.
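The batching contract can be sketched as follows. The formatter and batched-generate callables here are hypothetical placeholders standing in for the backend's Formatter and the HuggingFace generate call:

```python
def generate_from_raw_sketch(actions, formatter, generate_batch):
    # Render each action to a raw prompt string; no chat template is applied.
    prompts = [formatter(action) for action in actions]
    # A single batched call to the model's generate method.
    outputs = generate_batch(prompts)
    # One output per action, in order.
    return list(zip(actions, outputs))

pairs = generate_from_raw_sketch(
    ["summarize X", "translate Y"],
    formatter=str.upper,                              # hypothetical formatter
    generate_batch=lambda ps: [p + "!" for p in ps],  # fake batched model
)
```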

FUNC cache_get

cache_get(self, id: str | int) -> HFAloraCacheInfo | None
Retrieve a cached HFAloraCacheInfo entry by its key. Args:
  • id: The cache key to look up.
Returns:
  • HFAloraCacheInfo | None: The cached entry, or None if not found.

FUNC cache_put

cache_put(self, id: str | int, v: HFAloraCacheInfo)
Store an HFAloraCacheInfo entry in the cache under the given key. Args:
  • id: The cache key to store the entry under.
  • v: The cache entry containing KV cache state and associated generation metadata.

FUNC base_model_name

base_model_name(self)
Returns the base_model_id of the model used by the backend. For example, granite-3.3-8b-instruct for ibm-granite/granite-3.3-8b-instruct.
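One plausible reading of the documented example is that the namespace prefix is stripped from the HuggingFace model id. A sketch of that mapping (not necessarily the actual implementation):

```python
def base_model_name(model_id: str) -> str:
    # Drop the organization/namespace prefix from a HuggingFace model id.
    return model_id.split("/")[-1]

name = base_model_name("ibm-granite/granite-3.3-8b-instruct")
```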

FUNC add_adapter

add_adapter(self, adapter: LocalHFAdapter)
Register a LoRA/aLoRA adapter with this backend so it can be loaded later. Downloads the adapter weights (via adapter.get_local_hf_path) and records the adapter in the backend’s registry. The adapter must not already be registered with a different backend. Args:
  • adapter: The adapter to register with this backend.
Raises:
  • Exception: If adapter has already been added to a different backend.

FUNC load_adapter

load_adapter(self, adapter_qualified_name: str)
Load a previously registered adapter into the underlying HuggingFace model. The adapter must have been registered via add_adapter first. Do not call this method while generation requests are in progress. Args:
  • adapter_qualified_name: The adapter.qualified_name of the adapter to load (i.e., "<name>_<adapter_type>").
Raises:
  • ValueError: If no adapter with the given qualified name has been added to this backend.

FUNC unload_adapter

unload_adapter(self, adapter_qualified_name: str)
Unload a previously loaded adapter from the underlying HuggingFace model. If the adapter is not currently loaded, a log message is emitted and the method returns without error. Args:
  • adapter_qualified_name: The adapter.qualified_name of the adapter to unload.

FUNC list_adapters

list_adapters(self) -> list[str]
List the qualified names of all adapters currently loaded in this backend. Returns:
  • list[str]: Qualified adapter names (i.e. adapter.qualified_name) for all adapters that have been loaded via load_adapter.
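The add/load/unload/list lifecycle described above can be sketched with toy bookkeeping. This is an illustration of the documented contract, not Mellea's implementation; the real methods also download weights and touch the underlying HF model:

```python
class AdapterRegistrySketch:
    """Toy bookkeeping for the add/load/unload/list adapter lifecycle."""
    def __init__(self):
        self._registered = {}  # qualified name -> adapter record
        self._loaded = set()

    def add_adapter(self, name, adapter_type):
        qname = f"{name}_{adapter_type}"  # documented "<name>_<adapter_type>" form
        self._registered[qname] = (name, adapter_type)
        return qname

    def load_adapter(self, qname):
        # Loading requires prior registration, as documented.
        if qname not in self._registered:
            raise ValueError(f"no adapter registered as {qname!r}")
        self._loaded.add(qname)

    def unload_adapter(self, qname):
        # Unloading an adapter that is not loaded is a no-op, as documented.
        self._loaded.discard(qname)

    def list_adapters(self):
        return sorted(self._loaded)

reg = AdapterRegistrySketch()
qname = reg.add_adapter("constraint", "alora")
reg.load_adapter(qname)
reg.unload_adapter("missing_lora")  # not loaded: returns without error
```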