Various utility functions relating to the MTRAG benchmark data set.

Functions

FUNC download_mtrag_corpus

download_mtrag_corpus(target_dir: str, corpus_name: str) -> pathlib.Path
Download a corpus file from the MTRAG benchmark if the file is not already present.
Args:
  • target_dir: Location where the file should be written if not already present.
  • corpus_name: Should be one of "cloud", "clapnq", "fiqa", or "govt".
Returns:
  • Path to the downloaded (or cached) file.
Raises:
  • ValueError: If corpus_name is not one of the supported corpus names.
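
The behavior described above (validate the corpus name, then download only when the file is missing) can be sketched as follows. This is an illustration of the pattern, not the library's implementation: `cached_download`, `SUPPORTED_CORPORA`, the `fetch` callback, and the `<corpus_name>.jsonl` file name are all hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Callable

# Corpus names listed in the documentation above.
SUPPORTED_CORPORA = ("cloud", "clapnq", "fiqa", "govt")


def cached_download(
    target_dir: str,
    corpus_name: str,
    fetch: Callable[[str, Path], None],
) -> Path:
    """Return the path of a corpus file, fetching it only if it is missing.

    `fetch(corpus_name, destination)` is a hypothetical callback that writes
    the file to `destination`; the file name used below is also an assumption.
    """
    if corpus_name not in SUPPORTED_CORPORA:
        raise ValueError(
            f"Unknown corpus {corpus_name!r}; expected one of {SUPPORTED_CORPORA}"
        )
    target = Path(target_dir) / f"{corpus_name}.jsonl"  # assumed file name
    if not target.exists():
        # Only touch the network (via `fetch`) when there is no cached copy.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(corpus_name, target)
    return target
```

Passing the download step in as a callback keeps the caching logic easy to exercise without network access; a second call with the same arguments returns the cached path without invoking `fetch` again.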

FUNC read_mtrag_corpus

read_mtrag_corpus(corpus_file: str | pathlib.Path) -> pa.Table
Read the documents from one of the MTRAG benchmark’s corpora.
Args:
  • corpus_file: Location of the corpus data file.
Returns:
  • Documents from the corpus as a PyArrow table with schema ["id", "url", "title", "text"].
Raises:
  • TypeError: If the ID column cannot be identified or if no text column is present in the corpus file.
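
As a rough illustration of the column normalization and the two `TypeError` conditions described above, the sketch below reads a corpus into four column lists using only the standard library. It assumes the corpus file is JSON Lines and that the ID field has one of a few plausible names; both are assumptions, and `read_corpus_columns` and `ID_CANDIDATES` are hypothetical names, not part of the documented API.

```python
import json
from pathlib import Path
from typing import Dict, Union

OUTPUT_COLUMNS = ["id", "url", "title", "text"]
# Hypothetical candidates for the ID field; the real corpus files may differ.
ID_CANDIDATES = ("id", "_id", "document_id")


def read_corpus_columns(corpus_file: Union[str, Path]) -> Dict[str, list]:
    """Read a JSON Lines corpus (assumed format) into four column lists."""
    columns: Dict[str, list] = {name: [] for name in OUTPUT_COLUMNS}
    with open(corpus_file, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            id_key = next((k for k in ID_CANDIDATES if k in record), None)
            if id_key is None:
                raise TypeError("Could not identify an ID column")
            if "text" not in record:
                raise TypeError("No text column present")
            columns["id"].append(record[id_key])
            # url and title are treated as optional here (an assumption).
            columns["url"].append(record.get("url"))
            columns["title"].append(record.get("title"))
            columns["text"].append(record["text"])
    return columns
```

The returned dictionary maps column names to equal-length lists, which is a shape that `pyarrow.table` accepts if a PyArrow table is wanted.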

FUNC download_mtrag_embeddings

download_mtrag_embeddings(embedding_name: str, corpus_name: str, target_dir: str)
Download precomputed embeddings for a corpus in the MTRAG benchmark.
Args:
  • embedding_name: Name of the SentenceTransformers embedding model used to create the embeddings.
  • corpus_name: Should be one of "cloud", "clapnq", "fiqa", or "govt".
  • target_dir: Location where Parquet files named "part_001.parquet", "part_002.parquet", etc. will be written.
Raises:
  • ValueError: If corpus_name is not one of the supported corpus names, or if no precomputed embeddings are found for the given corpus and embedding model combination.
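
Since the signature above does not show a return value, a caller may want to collect the written part files from `target_dir` afterwards. The following sketch shows one way to do that based on the documented naming scheme; `part_file_name` and `list_part_files` are hypothetical helpers introduced for this example.

```python
from pathlib import Path
from typing import List


def part_file_name(index: int) -> str:
    """Build a name following the documented "part_001.parquet" pattern."""
    return f"part_{index:03d}.parquet"


def list_part_files(target_dir: str) -> List[Path]:
    """Return the part files in target_dir, ordered by part number.

    Zero-padded names sort correctly as plain strings, so a lexicographic
    sort is enough as long as there are fewer than 1000 parts.
    """
    return sorted(Path(target_dir).glob("part_*.parquet"))
```

Each returned path could then be read with a Parquet reader of your choice.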