mellea.stdlib.chunking
ChunkingStrategy ABC and built-in implementations for streaming validation.
Classes
CLASS ChunkingStrategy
Abstract base class for text chunking strategies used in streaming validation.
A chunking strategy receives the full accumulated text so far and returns a list of complete chunks ready for downstream validation. Any trailing fragment that has not yet reached a chunk boundary is withheld — it is not included in the returned list. Each call is stateless and idempotent given the same input.
Performance: split() is called on every streaming delta, re-scanning
the full accumulated text each time (O(n) in total accumulated length per
call). The orchestrator tracks prev_chunk_count to extract only the new
chunks. This keeps the chunker stateless and removes the need for reset()
or deep-copy support, at the cost of re-scanning text already seen. For
typical model outputs (a few KB) the cost is negligible; for very long
streams, a stateful chunker that only processes the new delta would be more
efficient.
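The re-scan-and-slice pattern described above can be sketched as follows. This is a minimal illustration, not the orchestrator's actual code: `split_on_newline` and `stream_new_chunks` are hypothetical names, and only `prev_chunk_count` comes from the description above.

```python
def split_on_newline(accumulated_text: str) -> list[str]:
    """Hypothetical stateless chunker: complete lines only."""
    parts = accumulated_text.split("\n")
    return parts[:-1]  # the last element is the unfinished line (possibly "")


def stream_new_chunks(deltas):
    """Yield the chunks that became complete after each delta (assumed loop shape)."""
    accumulated = ""
    prev_chunk_count = 0
    for delta in deltas:
        accumulated += delta
        chunks = split_on_newline(accumulated)  # re-scans all text seen so far
        new_chunks = chunks[prev_chunk_count:]  # keep only chunks not yet emitted
        prev_chunk_count = len(chunks)
        yield new_chunks


results = list(stream_new_chunks(["ab", "c\nde", "f\n", "g"]))
print(results)  # [[], ['abc'], ['def'], []]
```

Because the chunker holds no state, the only bookkeeping is one integer on the caller's side. The trailing "g" is never emitted by this loop, which is why an end-of-stream step is needed.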
End-of-stream contract: split() always withholds the trailing fragment.
When the stream terminates, callers are responsible for processing any remainder:
take the full accumulated text, identify everything after the last returned
chunk boundary, and handle it appropriately (e.g. pass to a final validator
or discard).
Note: this ABC operates on text streams only. Multi-modal output (audio
segments, image regions) is not supported — the accumulated_text: str
signatures on split and flush preclude it.
Methods:
FUNC split
split(self, accumulated_text: str) -> list[str]
Return complete chunks from accumulated_text, excluding any trailing fragment.
Args:
accumulated_text: The full text accumulated so far, including all previously seen tokens and the latest delta. Implementations that scan this string are O(n) in accumulated length per call. Stateful implementations that only process the new delta are possible but must never mutate state on self in place. Use reassignment (self._buf = self._buf + [x]) so that copy()-based cloning in the orchestrator works correctly.
Returns:
- A list of complete chunks. If no chunk boundary has been reached yet, returns an empty list. Never includes the trailing incomplete fragment.
FUNC flush
flush(self, accumulated_text: str) -> list[str]
Return any trailing fragment that split withheld.
Called once by the orchestrator after the stream has ended naturally (not on early-exit cancellation). Gives the chunker a chance to release the final fragment that did not reach a terminator.
The default implementation returns an empty list — the trailing fragment is discarded. Built-in chunkers override this to return the withheld fragment as a single-element list when non-empty.
Args:
accumulated_text: The full accumulated text at stream end.
Returns:
- The trailing fragment as [fragment] if it should be treated as a final chunk, or an empty list to discard it.
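A custom strategy under this contract might look like the sketch below. To keep the example self-contained, it defines a local stand-in for the base class rather than importing from mellea.stdlib.chunking; the stand-in assumes split is the abstract method and that flush defaults to an empty list, as described above. `LineChunker` is a hypothetical name, not a built-in.

```python
from abc import ABC, abstractmethod


class ChunkingStrategy(ABC):
    """Local stand-in mirroring the documented interface (not the mellea class)."""

    @abstractmethod
    def split(self, accumulated_text: str) -> list[str]:
        ...

    def flush(self, accumulated_text: str) -> list[str]:
        return []  # documented default: the trailing fragment is discarded


class LineChunker(ChunkingStrategy):
    """Hypothetical strategy: one chunk per newline-terminated line."""

    def split(self, accumulated_text: str) -> list[str]:
        # Everything before the last "\n" is complete; the rest is withheld.
        return accumulated_text.split("\n")[:-1]

    def flush(self, accumulated_text: str) -> list[str]:
        # Release the withheld fragment once the stream has ended.
        fragment = accumulated_text.split("\n")[-1]
        return [fragment] if fragment else []


chunker = LineChunker()
text = "first\nsecond\nthi"
complete = chunker.split(text)  # ['first', 'second']
final = chunker.flush(text)     # ['thi']
```

Both methods derive their result from the argument alone, so repeated calls with the same input return the same output and no reset step is required.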
CLASS SentenceChunker
Splits accumulated text on sentence boundaries.
Sentence boundaries are detected by ., !, or ?, optionally
followed by a closing quote (straight or curly) or parenthesis, then
whitespace. The final sentence is only returned once it is followed by
whitespace or another sentence — a trailing fragment with no following
whitespace is withheld. Abbreviations are a known edge case: they will
be split on (simple regex, not NLP). Inter-sentence whitespace (including
double-space or tab) is discarded and does not appear as leading whitespace
in subsequent chunks.
Methods:
FUNC split
split(self, accumulated_text: str) -> list[str]
Return complete sentences from accumulated_text.
Args:
accumulated_text: The full text accumulated so far.
Returns:
- Complete sentences detected so far. The trailing fragment (if any) is withheld.
FUNC flush
flush(self, accumulated_text: str) -> list[str]
Return the trailing sentence fragment (if any) as a final chunk.
Trailing whitespace on the fragment is non-semantic for sentence
boundaries and is dropped via rstrip. Leading whitespace is
already removed by the loop's lstrip on each advance, so no
lstrip is needed here. The result is the fragment's content
only, consistent with how split returns sentences without
trailing whitespace.
Args:
accumulated_text: The full accumulated text at stream end.
Returns:
- A single-element list containing the trailing sentence fragment with leading and trailing whitespace stripped, or an empty list when there is no fragment (all content ended in a sentence boundary or the input is empty/whitespace-only).
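The behaviour described for SentenceChunker can be approximated with a small regex-based sketch. The pattern below is an assumption reconstructed from the prose (terminator, one optional closing quote or parenthesis, then whitespace); the actual regex and internals of the class may differ, and the function names are hypothetical.

```python
import re

# Assumed boundary pattern; the real SentenceChunker regex may differ.
_BOUNDARY = re.compile(r"""([.!?]["'\u201d\u2019)]?)\s+""")


def _scan(text: str) -> tuple[list[str], str]:
    """Return (complete sentences, trailing remainder)."""
    sentences: list[str] = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        sentence = text[start:match.end(1)].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()  # skip the inter-sentence whitespace
    return sentences, text[start:]


def split_sentences(text: str) -> list[str]:
    return _scan(text)[0]


def flush_sentences(text: str) -> list[str]:
    fragment = _scan(text)[1].strip()
    return [fragment] if fragment else []


text = 'Hi there.  "Stop!" she said. Dr. Who is he'
print(split_sentences(text))  # ['Hi there.', '"Stop!"', 'she said.', 'Dr.']
print(flush_sentences(text))  # ['Who is he']
```

The "Dr." chunk shows the abbreviation edge case noted above: a purely regex-based splitter treats any terminator followed by whitespace as a boundary.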
CLASS WordChunker
Splits accumulated text on whitespace boundaries.
Each word is a chunk. Trailing text not yet followed by whitespace is withheld.
Methods:
FUNC split
split(self, accumulated_text: str) -> list[str]
Return complete words from accumulated_text.
Args:
accumulated_text: The full text accumulated so far.
Returns:
- All whitespace-delimited words except the trailing fragment (if any). An empty list is returned when no whitespace boundary has been seen.
FUNC flush
flush(self, accumulated_text: str) -> list[str]
Return the trailing word fragment (if any) as a final chunk.
The trailing fragment is the text after the last whitespace run when the accumulated text does not end with whitespace. When it does end with whitespace, every word is already complete and no fragment is released.
Args:
accumulated_text: The full accumulated text at stream end.
Returns:
- A single-element list containing the trailing word fragment, or an empty list when the input ends with whitespace (every word already complete) or is empty.
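The split and flush rules for words can be illustrated with plain string operations. This is a sketch of the documented behaviour using hypothetical function names, not the WordChunker source.

```python
def split_words(accumulated_text: str) -> list[str]:
    """Complete words only: drop the last word unless whitespace follows it."""
    words = accumulated_text.split()
    if accumulated_text and not accumulated_text[-1].isspace():
        words = words[:-1]  # the last word is still being streamed
    return words


def flush_words(accumulated_text: str) -> list[str]:
    """Release the withheld word, if there is one."""
    if not accumulated_text or accumulated_text[-1].isspace():
        return []  # every word is already complete
    return [accumulated_text.split()[-1]]


print(split_words("hello wor"))     # ['hello']
print(flush_words("hello wor"))     # ['wor']
print(split_words("hello world "))  # ['hello', 'world']
print(flush_words("hello world "))  # []
```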
CLASS ParagraphChunker
Splits accumulated text on double-newline paragraph boundaries.
Two or more consecutive newline characters are treated as a paragraph
separator. The trailing paragraph fragment (text not yet followed by \n\n)
is withheld.
Note: only Unix-style \n\n separators are recognised. CRLF
(\r\n\r\n) paragraph separators are not supported.
Methods:
FUNC split
split(self, accumulated_text: str) -> list[str]
Return complete paragraphs from accumulated_text.
Args:
accumulated_text: The full text accumulated so far.
Returns:
- Complete paragraphs (separated by two or more newlines). The trailing incomplete paragraph is withheld. Returns an empty list if no paragraph boundary has been reached.
FUNC flush
flush(self, accumulated_text: str) -> list[str]
Return the trailing paragraph fragment (if any) as a final chunk.
Unlike SentenceChunker.flush, the fragment is returned
byte-for-byte without stripping. Internal whitespace — including
a trailing single \n — can be semantically meaningful inside
a paragraph (e.g. a list item or a deliberate line break), and a
consumer validating paragraph content should see the fragment as
it was withheld.
Args:
accumulated_text: The full accumulated text at stream end.
Returns:
- A single-element list containing the trailing paragraph fragment byte-for-byte, or an empty list when the input ends with a paragraph boundary (\n\n or more) or is empty.
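A sketch of the paragraph rules, assuming the separator is matched as a run of two or more \n characters. The function names are hypothetical, and dropping empty leading paragraphs is an assumption of this sketch rather than documented behaviour.

```python
import re

# Assumed separator: two or more consecutive "\n" (CRLF pairs do not match).
_SEPARATOR = re.compile(r"\n{2,}")


def split_paragraphs(accumulated_text: str) -> list[str]:
    """Complete paragraphs; the text after the last separator is withheld."""
    parts = _SEPARATOR.split(accumulated_text)
    return [part for part in parts[:-1] if part]


def flush_paragraphs(accumulated_text: str) -> list[str]:
    """Release the withheld fragment unchanged (no stripping)."""
    fragment = _SEPARATOR.split(accumulated_text)[-1]
    return [fragment] if fragment else []


text = "Intro line\n\n- item one\n- item two\n"
print(split_paragraphs(text))  # ['Intro line']
print(flush_paragraphs(text))  # ['- item one\n- item two\n']
```

The trailing single \n survives in the flushed fragment. Under the same assumption, text using \r\n\r\n separators never reaches a boundary, so split returns an empty list and the whole text is released only at flush.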