Classes
CLASS HFAloraCacheInfo
A dataclass for holding a KV cache and associated generation metadata.
Used by LocalHFBackend to store intermediate model state that can be
reused across generation requests via an LRU cache.
Args:
kv_cache: The HuggingFace DynamicCache holding precomputed key/value tensors, or None if not available.
merged_token_ids: Token IDs corresponding to the cached prefix.
merged_attention: Attention mask for the cached prefix tokens.
q_end: Index of the last prompt token in the merged token sequence; defaults to -1.
scores: Optional logit scores from the generation step; defaults to None.
Attributes:
kv_cache: The cached key/value tensors.
merged_token_ids: Token IDs for the cached prefix.
merged_attention: Attention mask for the cached prefix.
q_end: End index of the prompt portion in merged token IDs.
scores: Logit scores from generation, or None.
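As a rough illustration of the shape of this dataclass, here is a simplified pure-Python sketch (field types are stand-ins; the real class holds HuggingFace tensor and cache objects):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AloraCacheInfoSketch:
    """Illustrative stand-in for HFAloraCacheInfo (not the real class)."""

    kv_cache: Any = None                # would be a transformers DynamicCache
    merged_token_ids: list[int] = field(default_factory=list)
    merged_attention: list[int] = field(default_factory=list)
    q_end: int = -1                     # index of the last prompt token
    scores: Any = None                  # optional logit scores


# A cache entry for a three-token prefix whose prompt ends at index 2.
info = AloraCacheInfoSketch(merged_token_ids=[1, 2, 3], q_end=2)
```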
CLASS LocalHFBackend
The LocalHFBackend uses Hugging Face's transformers library for inference and a Formatter to convert Components into prompts. This backend also supports [Activated LoRAs (ALoras)](https://arxiv.org/pdf/2504.12397).
This backend is designed for running an HF model for small-scale inference locally on your machine.
This backend is NOT designed for inference scaling on CUDA-enabled hardware.
Args:
model_id: Used to load the model and tokenizer via HuggingFace Auto* classes.
formatter: Formatter for rendering components into prompts. Defaults to [TemplateFormatter](../formatters/template_formatter#class-templateformatter).
use_caches: If False, KV caching is disabled even if a Cache is provided.
cache: Caching strategy; defaults to SimpleLRUCache(0, on_evict=_cleanup_kv_cache).
custom_config: Override for tokenizer/model/device; if provided, model_id is not used for loading.
default_to_constraint_checking_alora: If False, aLoRA constraint checking is deactivated; mainly for benchmarking and debugging.
model_options: Default model options for generation requests.
to_mellea_model_opts_map: Mapping from HF-specific option names to Mellea [ModelOption](model_options#class-modeloption) sentinel keys.
from_mellea_model_opts_map: Mapping from Mellea sentinel keys to HF-specific option names.
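The two opts maps translate option names between HF-specific kwargs and backend-neutral keys, in both directions. A minimal sketch of how such a remap works (the key names below are assumptions for illustration, not Mellea's actual sentinel values):

```python
# Hypothetical mapping from HF generation kwargs to backend-neutral keys.
to_mellea = {"max_new_tokens": "@max_tokens", "temperature": "@temperature"}
# The reverse map is just the inverse of the forward map.
from_mellea = {v: k for k, v in to_mellea.items()}


def remap(opts: dict, mapping: dict) -> dict:
    """Rename known option keys; pass unknown keys through unchanged."""
    return {mapping.get(k, k): v for k, v in opts.items()}


hf_opts = {"max_new_tokens": 64, "do_sample": True}
neutral = remap(hf_opts, to_mellea)   # {"@max_tokens": 64, "do_sample": True}
```

Unknown keys passing through unchanged lets callers mix backend-neutral and HF-native options in one dict.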
FUNC processing
Processes a chunk of model output into the output thunk. For streaming requests, each chunk is decoded text from an AsyncTextIteratorStreamer; for non-streaming requests, it is a GenerateDecoderOnlyOutput that is decoded here.
Args:
mot: The output thunk being populated.
chunk: A decoded text chunk (streaming) or a full HuggingFace generation output object (non-streaming).
input_ids: The prompt token IDs used for decoding; required to slice off the prompt portion from the generated sequences.
FUNC post_processing
Args:
mot: The output thunk to finalize.
conversation: The chat conversation sent to the model, used for logging.
_format: The structured output format class used during generation, if any.
tool_calls: Whether tool calling was enabled for this request.
tools: Available tools, keyed by name.
seed: The random seed used during generation, or None.
input_ids: The prompt token IDs; used to compute token counts and for KV cache bookkeeping.
FUNC generate_from_raw
Sends the prompts for all actions to the model's generate method as a batch. Tool calling is not supported.
Args:
actions: Actions to generate completions for.
ctx: The current generation context.
format: Optional Pydantic model for structured output decoding via llguidance.
model_options: Per-call model options.
tool_calls: Ignored; tool calling is not supported on this endpoint.
- list[ModelOutputThunk]: A list of model output thunks, one per action.
FUNC cache_get
Retrieves a cached HFAloraCacheInfo entry by its key.
Args:
id: The cache key to look up.
- HFAloraCacheInfo | None: The cached entry, or None if not found.
FUNC cache_put
Stores an HFAloraCacheInfo entry in the cache under the given key.
Args:
id: The cache key to store the entry under.
v: The cache entry containing KV cache state and associated generation metadata.
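cache_get and cache_put together implement the LRU discipline described for this backend: entries evicted from the cache get a cleanup callback (cf. the SimpleLRUCache(0, on_evict=_cleanup_kv_cache) default above). A self-contained sketch of an LRU cache with an eviction hook; the class and method names here are illustrative, not Mellea's implementation:

```python
from collections import OrderedDict
from typing import Any, Callable


class LRUSketch:
    """Tiny LRU cache that invokes a callback on evicted values."""

    def __init__(self, capacity: int, on_evict: Callable[[Any], None]):
        self.capacity = capacity
        self.on_evict = on_evict
        self._data: OrderedDict[str, Any] = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: Any):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            _, evicted_value = self._data.popitem(last=False)  # drop LRU entry
            self.on_evict(evicted_value)     # e.g. free KV-cache tensors


evicted = []
cache = LRUSketch(2, on_evict=evicted.append)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)                            # capacity exceeded: "a" is evicted
```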
FUNC base_model_name
Returns the base model name with the organization prefix stripped, e.g. granite-3.3-8b-instruct for ibm-granite/granite-3.3-8b-instruct.
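Going by the example above, the base name is presumably obtained by stripping everything up to the final "/" in the model id; a one-line sketch (not the backend's actual code):

```python
def base_model_name(model_id: str) -> str:
    # Keep only the part after the final "/" (a no-op if there is no prefix).
    return model_id.rsplit("/", 1)[-1]


name = base_model_name("ibm-granite/granite-3.3-8b-instruct")
```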
FUNC add_adapter
Resolves the adapter's local weights (via adapter.get_local_hf_path) and records the adapter in the backend's registry. The adapter must not already be registered with a different backend.
Args:
adapter: The adapter to register with this backend.
Raises:
Exception: If adapter has already been added to a different backend.
FUNC load_adapter
Loads a previously registered adapter into the model; the adapter must have been added via add_adapter first. Do not call this method while generation requests are in progress.
Args:
adapter_qualified_name: The adapter.qualified_name of the adapter to load (i.e. "<name>_<adapter_type>").
Raises:
ValueError: If no adapter with the given qualified name has been added to this backend.
FUNC unload_adapter
Args:
adapter_qualified_name: The adapter.qualified_name of the adapter to unload.
FUNC list_adapters
- list[str]: Qualified adapter names (i.e. adapter.qualified_name) for all adapters that have been loaded via load_adapter.
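Per the load_adapter docs above, a qualified name has the form "<name>_<adapter_type>". A sketch of how such names could be built and how list_adapters might filter on load state (the adapter structure here is assumed for illustration, not Mellea's actual classes):

```python
from dataclasses import dataclass


@dataclass
class AdapterSketch:
    """Illustrative adapter record; not the real adapter class."""

    name: str
    adapter_type: str
    loaded: bool = False

    @property
    def qualified_name(self) -> str:
        # "<name>_<adapter_type>", as described in the load_adapter docs.
        return f"{self.name}_{self.adapter_type}"


registry = [
    AdapterSketch("constraint", "alora", loaded=True),
    AdapterSketch("summarizer", "lora"),
]
# list_adapters-style view: only adapters that have actually been loaded.
loaded_names = [a.qualified_name for a in registry if a.loaded]
```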