
mellea.stdlib.chunking

ChunkingStrategy ABC and built-in implementations for streaming validation.

Classes

CLASS ChunkingStrategy

Abstract base class for text chunking strategies used in streaming validation.

A chunking strategy receives the full accumulated text so far and returns a list of complete chunks ready for downstream validation. Any trailing fragment that has not yet reached a chunk boundary is withheld — it is not included in the returned list. Each call is stateless and idempotent given the same input.

Performance: split() is called on every streaming delta, re-scanning the full accumulated text each time (O(n) in total accumulated length per call). The orchestrator tracks prev_chunk_count to extract only the new chunks. This keeps the chunker stateless and removes the need for reset() or deep-copy support, at the cost of re-scanning text already seen. For typical model outputs (a few KB) the cost is negligible; for very long streams, a stateful chunker that only processes the new delta would be more efficient.
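The re-scan-and-slice pattern described above can be sketched as follows. This is a minimal illustration, not the library's orchestrator: the ChunkingStrategy stand-in, LineChunker, and new_chunks_per_delta are hypothetical names defined here so the example is self-contained.

```python
from abc import ABC, abstractmethod


class ChunkingStrategy(ABC):
    """Stand-in for the ABC described above (not imported from mellea)."""

    @abstractmethod
    def split(self, accumulated_text: str) -> list[str]: ...

    def flush(self, accumulated_text: str) -> list[str]:
        return []  # documented default: discard the trailing fragment


class LineChunker(ChunkingStrategy):
    """Hypothetical chunker: each newline-terminated line is a chunk."""

    def split(self, accumulated_text: str) -> list[str]:
        # Everything after the last "\n" is the withheld trailing fragment.
        return accumulated_text.split("\n")[:-1]


def new_chunks_per_delta(chunker, deltas):
    """Yield only the chunks that became complete with each delta."""
    accumulated = ""
    prev_chunk_count = 0
    for delta in deltas:
        accumulated += delta
        chunks = chunker.split(accumulated)  # re-scans the full text each time
        yield chunks[prev_chunk_count:]      # slice off chunks already seen
        prev_chunk_count = len(chunks)


emitted = list(new_chunks_per_delta(LineChunker(), ["ab", "c\nde", "f\n", "gh"]))
print(emitted)  # [[], ['abc'], ['def'], []]
```

Because the chunker holds no state, the only bookkeeping lives in the loop: one string and one integer.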

End-of-stream contract: split() always withholds the trailing fragment. When the stream terminates, callers are responsible for processing any remainder: take the full accumulated text, identify everything after the last returned chunk boundary, and handle it appropriately (e.g. pass it to a final validator or discard it). The flush method documented below is the hook for releasing that remainder.

Note: this ABC operates on text streams only. Multi-modal output (audio segments, image regions) is not supported — the accumulated_text: str signatures on split and flush preclude it.

Methods:

FUNC split

split(self, accumulated_text: str) -> list[str]

Return complete chunks from accumulated_text, excluding any trailing fragment.

Args:

  • accumulated_text: The full text accumulated so far, including all previously seen tokens and the latest delta. Implementations that scan this string are O(n) in accumulated length per call. Stateful implementations that only process the new delta are possible but must never mutate state on self in place — use reassignment (self._buf = self._buf + [x]) so that copy()-based cloning in the orchestrator works correctly.

Returns:

  • A list of complete chunks. If no chunk boundary has been reached yet, returns an empty list. Never includes the trailing incomplete fragment.
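The reassignment rule for stateful implementations matters because a shallow copy shares mutable attributes with the original. The sketch below assumes the orchestrator's cloning behaves like copy.copy; InPlaceChunker, ReassigningChunker, and feed are hypothetical names used only to show the difference.

```python
import copy


class InPlaceChunker:
    """Hypothetical stateful chunker that mutates its buffer in place."""

    def __init__(self):
        self._buf = []

    def feed(self, x):
        self._buf.append(x)  # mutates the list object shared with shallow copies


class ReassigningChunker:
    """Hypothetical stateful chunker that rebinds its buffer instead."""

    def __init__(self):
        self._buf = []

    def feed(self, x):
        self._buf = self._buf + [x]  # new list; earlier clones keep the old one


bad = InPlaceChunker()
bad_clone = copy.copy(bad)   # shallow copy: both objects point at one list
bad.feed("a")

good = ReassigningChunker()
good_clone = copy.copy(good)
good.feed("a")

print(bad_clone._buf)   # ['a']  (state leaked into the clone)
print(good_clone._buf)  # []     (clone unaffected)
```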

FUNC flush

flush(self, accumulated_text: str) -> list[str]

Return any trailing fragment that split withheld.

Called once by the orchestrator after the stream has ended naturally (not on early-exit cancellation). Gives the chunker a chance to release the final fragment that did not reach a terminator.

The default implementation returns an empty list — the trailing fragment is discarded. Built-in chunkers override this to return the withheld fragment as a single-element list when non-empty.

Args:

  • accumulated_text: The full accumulated text at stream end.

Returns:

  • The trailing fragment as [fragment] if it should be treated as a final chunk, or an empty list to discard it.
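A minimal sketch of how split and flush fit together over a whole stream, assuming the stream ends naturally. LineChunker and drain are hypothetical helpers written for this example; they are not part of the module.

```python
class LineChunker:
    """Hypothetical chunker: newline-terminated lines are chunks."""

    def split(self, accumulated_text: str) -> list[str]:
        return accumulated_text.split("\n")[:-1]

    def flush(self, accumulated_text: str) -> list[str]:
        # Release whatever follows the last newline, if anything.
        fragment = accumulated_text.split("\n")[-1]
        return [fragment] if fragment else []


def drain(chunker, deltas) -> list[str]:
    """Collect every chunk from a stream, then flush once at the end."""
    accumulated, seen, out = "", 0, []
    for delta in deltas:
        accumulated += delta
        chunks = chunker.split(accumulated)
        out.extend(chunks[seen:])
        seen = len(chunks)
    # Called once, only after the stream has ended naturally.
    out.extend(chunker.flush(accumulated))
    return out


print(drain(LineChunker(), ["a\nb", "c"]))  # ['a', 'bc']
print(drain(LineChunker(), ["a\n"]))        # ['a']
```

On early-exit cancellation the final flush call would simply be skipped, so the withheld fragment is never surfaced.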

CLASS SentenceChunker

Splits accumulated text on sentence boundaries.

Sentence boundaries are detected by ., !, or ?, optionally followed by a closing quote (straight or curly) or parenthesis, then whitespace. The final sentence is only returned once it is followed by whitespace or another sentence; a trailing fragment with no following whitespace is withheld. Abbreviations are a known edge case: detection uses a simple regex, not NLP, so the period that ends an abbreviation is treated as a sentence boundary. Inter-sentence whitespace (including double-space or tab) is discarded and does not appear as leading whitespace in subsequent chunks.
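The boundary rule can be approximated with a short regex. This is a sketch built only from the description above, not the class's actual pattern; split_sentences, _CLOSERS, and _BOUNDARY are hypothetical names.

```python
import re

# Straight quotes, curly closing quotes, and ")" (assumed set of closers).
_CLOSERS = "\"'\u201d\u2019)"
# Terminator, optional closer, then one or more whitespace characters.
_BOUNDARY = re.compile(r"([.!?][" + re.escape(_CLOSERS) + r"]?)\s+")


def split_sentences(accumulated_text: str) -> list[str]:
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(accumulated_text):
        # Keep the terminator and closer; drop the whitespace after them.
        sentences.append(accumulated_text[start:match.end(1)])
        start = match.end()
    return sentences  # text after the last boundary is withheld


print(split_sentences("Hi there. How are"))            # ['Hi there.']
print(split_sentences('She said "Go!"  Then left.'))   # ['She said "Go!"']
print(split_sentences("Dr. Smith arrived. "))          # ['Dr.', 'Smith arrived.']
```

The last call shows the abbreviation edge case: "Dr." is followed by whitespace, so it is emitted as its own chunk.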

Methods:

FUNC split

split(self, accumulated_text: str) -> list[str]

Return complete sentences from accumulated_text.

Args:

  • accumulated_text: The full text accumulated so far.

Returns:

  • Complete sentences detected so far. The trailing fragment (if any) is withheld.

FUNC flush

flush(self, accumulated_text: str) -> list[str]

Return the trailing sentence fragment (if any) as a final chunk.

Trailing whitespace on the fragment is non-semantic for sentence boundaries and is dropped via rstrip. Leading whitespace is already removed by the loop's lstrip on each advance, so no lstrip is needed here. The result is the fragment's content only, consistent with how split returns sentences without trailing whitespace.

Args:

  • accumulated_text: The full accumulated text at stream end.

Returns:

  • A single-element list containing the trailing sentence fragment with leading and trailing whitespace stripped, or an empty list when there is no fragment (all content ended in a sentence boundary or the input is empty/whitespace-only).
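The documented return values can be reproduced with a small stand-alone function. This is an approximation under the boundary rule stated for the class, not the real implementation: flush_sentence and _BOUNDARY are hypothetical, and strip() is used here in place of the internal lstrip/rstrip steps.

```python
import re

# Terminator, optional closing quote or parenthesis, then whitespace.
_BOUNDARY = re.compile("[.!?][\"'\u201d\u2019)]?\\s+")


def flush_sentence(accumulated_text: str) -> list[str]:
    last_end = 0
    for match in _BOUNDARY.finditer(accumulated_text):
        last_end = match.end()  # remember where the last full sentence ended
    fragment = accumulated_text[last_end:].strip()
    return [fragment] if fragment else []


print(flush_sentence("One. Two"))      # ['Two']
print(flush_sentence("One. Done."))    # ['Done.']
print(flush_sentence("One. Two.  "))   # []
print(flush_sentence("   "))           # []
```

The second call is the common case at stream end: the last sentence has its terminator but no following whitespace, so only flush releases it.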

CLASS WordChunker

Splits accumulated text on whitespace boundaries.

Each word is a chunk. Trailing text not yet followed by whitespace is withheld.

Methods:

FUNC split

split(self, accumulated_text: str) -> list[str]

Return complete words from accumulated_text.

Args:

  • accumulated_text: The full text accumulated so far.

Returns:

  • All whitespace-delimited words except the trailing fragment (if any). An empty list is returned when no whitespace boundary has been seen.

FUNC flush

flush(self, accumulated_text: str) -> list[str]

Return the trailing word fragment (if any) as a final chunk.

The trailing fragment is the text after the last whitespace run when the accumulated text does not end with whitespace. When it does end with whitespace, every word is already complete and no fragment is released.

Args:

  • accumulated_text: The full accumulated text at stream end.

Returns:

  • A single-element list containing the trailing word fragment, or an empty list when the input ends with whitespace (every word already complete) or is empty.
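The split and flush behaviour described for WordChunker can be sketched with plain string methods. split_words and flush_words are hypothetical stand-ins written from the description above; the real class may be implemented differently.

```python
def split_words(accumulated_text: str) -> list[str]:
    words = accumulated_text.split()
    # If the text does not end in whitespace, the last token may still grow.
    if words and not accumulated_text[-1].isspace():
        words = words[:-1]
    return words


def flush_words(accumulated_text: str) -> list[str]:
    # Nothing to release when the text is empty or ends with whitespace.
    if not accumulated_text or accumulated_text[-1].isspace():
        return []
    return [accumulated_text.split()[-1]]


print(split_words("the quick bro"))   # ['the', 'quick']
print(split_words("the quick "))      # ['the', 'quick']
print(split_words("the"))             # []
print(flush_words("the quick bro"))   # ['bro']
print(flush_words("the quick "))      # []
```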

CLASS ParagraphChunker

Splits accumulated text on double-newline paragraph boundaries.

Two or more consecutive newline characters are treated as a paragraph separator. The trailing paragraph fragment (text not yet followed by \n\n) is withheld.

Note: only Unix-style \n\n separators are recognised. CRLF (\r\n\r\n) paragraph separators are not supported.
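A minimal sketch of the separator rule, including the CRLF limitation. split_paragraphs and _PARAGRAPH_BREAK are hypothetical names; how the real class treats empty paragraphs (for example, text that starts with a separator) is not covered here.

```python
import re

# Two or more consecutive "\n" characters form one paragraph separator.
_PARAGRAPH_BREAK = re.compile(r"\n{2,}")


def split_paragraphs(accumulated_text: str) -> list[str]:
    # The piece after the last separator is the withheld trailing fragment.
    return _PARAGRAPH_BREAK.split(accumulated_text)[:-1]


text = "Intro.\n\nBody line 1\nBody line 2\n\n\nTail"
print(split_paragraphs(text))           # ['Intro.', 'Body line 1\nBody line 2']
print(split_paragraphs("a\r\n\r\nb"))   # []
```

In the CRLF input the two "\n" characters are separated by "\r", so no run of two newlines exists and no boundary is found.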

Methods:

FUNC split

split(self, accumulated_text: str) -> list[str]

Return complete paragraphs from accumulated_text.

Args:

  • accumulated_text: The full text accumulated so far.

Returns:

  • Complete paragraphs (separated by two or more newlines). The trailing incomplete paragraph is withheld. Returns an empty list if no paragraph boundary has been reached.

FUNC flush

flush(self, accumulated_text: str) -> list[str]

Return the trailing paragraph fragment (if any) as a final chunk.

Unlike SentenceChunker.flush, the fragment is returned byte-for-byte without stripping. Internal whitespace, including a trailing single \n, can be semantically meaningful inside a paragraph (e.g. a list item or a deliberate line break), and a consumer validating paragraph content should see the fragment as it was withheld.

Args:

  • accumulated_text: The full accumulated text at stream end.

Returns:

  • A single-element list containing the trailing paragraph fragment byte-for-byte, or an empty list when the input ends with a paragraph boundary (\n\n or more) or is empty.
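The byte-for-byte behaviour can be illustrated with a stand-alone function. flush_paragraph and _PARAGRAPH_BREAK are hypothetical names for this sketch, which follows the documented return values rather than the actual source.

```python
import re

_PARAGRAPH_BREAK = re.compile(r"\n{2,}")


def flush_paragraph(accumulated_text: str) -> list[str]:
    # Text after the last separator, returned without any stripping.
    fragment = _PARAGRAPH_BREAK.split(accumulated_text)[-1]
    return [fragment] if fragment else []


print(flush_paragraph("Intro.\n\n- item one\n"))  # ['- item one\n']
print(flush_paragraph("Intro.\n\n"))              # []
print(flush_paragraph(""))                        # []
```

The trailing single "\n" in the first result is preserved, which is the point of not stripping.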