Functions
FUNC compute_embeddings
Split the documents in a corpus into chunks and compute an embedding for each chunk.

Args:
- corpus: PyArrow Table of documents as returned by read_corpus(). Should have the columns ["id", "url", "title", "text"].
- embedding_model_name: Hugging Face model name for the model that computes embeddings. Also used for tokenizing.
- chunk_size: Maximum size of chunks to split documents into, in embedding model tokens; must be less than or equal to the embedding model's maximum sequence length.
- overlap: Target overlap between adjacent chunks, in embedding model tokens. The actual beginnings and ends of chunks will fall on sentence boundaries.

Returns:
- PyArrow Table of chunks of the corpus, with schema ["id", "url", "title", "begin", "end", "text", "embedding"].
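The chunking behavior described above (size-bounded chunks, sentence-boundary ends, roughly `overlap` tokens shared between neighbors) can be sketched in plain Python. This is a hypothetical helper, not the library's implementation: it operates on a list of per-sentence token counts rather than real tokenizer output, so it stays self-contained.

```python
def chunk_tokens(sent_lens, chunk_size, overlap):
    """Greedy chunker over per-sentence token counts.

    Returns (begin, end) sentence-index pairs. Each chunk holds at
    most `chunk_size` tokens (unless a single sentence exceeds it),
    and adjacent chunks share roughly `overlap` tokens, stepping back
    only in whole sentences so boundaries stay on sentence breaks.
    """
    chunks = []
    start, n = 0, len(sent_lens)
    while start < n:
        # Grow the chunk sentence by sentence while it fits.
        end, total = start, 0
        while end < n and total + sent_lens[end] <= chunk_size:
            total += sent_lens[end]
            end += 1
        if end == start:  # a single sentence longer than chunk_size
            end = start + 1
        chunks.append((start, end))
        if end >= n:
            break
        # Step back whole sentences until ~`overlap` tokens are reused.
        back, reused = end, 0
        while back > start and reused < overlap:
            back -= 1
            reused += sent_lens[back]
        start = max(back, start + 1)  # always make forward progress
    return chunks
```

For example, four 10-token sentences with `chunk_size=25` and `overlap=5` yield chunks of two sentences each, every chunk overlapping its neighbor by one sentence.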
FUNC write_embeddings
Write the output of compute_embeddings() to a directory of Parquet files on local disk.
Args:
- target_dir: Location where the files should be written (in a subdirectory).
- corpus_name: Corpus name used to generate the output directory name.
- embeddings: PyArrow Table produced by compute_embeddings().
- chunks_per_partition: Number of document chunks to write to each Parquet partition file.

Returns:
- Path to the directory where the Parquet files were written.
Classes
CLASS InMemoryRetriever
Simple retriever that keeps docs and embeddings in memory.
Args:
- data_file_or_table: Parquet file of document snippets and embeddings, or an equivalent in-memory PyArrow Table. Should have columns id, begin, end, text, and embedding.
- embedding_model_name: Name of the Sentence Transformers model to use for embeddings. Must match the model used to compute embeddings in the data file.
FUNC retrieve
Retrieve the stored chunks most relevant to a natural language query.

Args:
- query: Natural language query string.
- top_k: Number of top results to return.

Returns:
- List of dicts with keys doc_id, text, and score.
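The retrieval step can be illustrated with a stripped-down in-memory retriever. This sketch is hypothetical: where InMemoryRetriever embeds queries with a Sentence Transformers model, the embedding function here is injected as a plain callable so the example needs only NumPy, and scoring is assumed to be cosine similarity.

```python
import numpy as np


class TinyRetriever:
    """Minimal in-memory retriever: stores document embeddings as a
    matrix and ranks them by cosine similarity to the query embedding.
    """

    def __init__(self, doc_ids, texts, embeddings, embed_fn):
        self.doc_ids = list(doc_ids)
        self.texts = list(texts)
        m = np.asarray(embeddings, dtype=np.float64)
        # L2-normalize rows so a dot product equals cosine similarity.
        self.matrix = m / np.linalg.norm(m, axis=1, keepdims=True)
        self.embed_fn = embed_fn  # maps a query string to a vector

    def retrieve(self, query, top_k=5):
        q = np.asarray(self.embed_fn(query), dtype=np.float64)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [
            {"doc_id": self.doc_ids[i],
             "text": self.texts[i],
             "score": float(scores[i])}
            for i in order
        ]
```

For large corpora the brute-force matrix product would give way to an approximate nearest-neighbor index, but for an in-memory retriever over a modest table it is simple and exact.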