# mellea.stdlib.chunking

> ChunkingStrategy ABC and built-in implementations for streaming validation.

## Classes

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#4ADE8033]/20 text-[#15803D]">CLASS</span> `ChunkingStrategy` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L9" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

Abstract base class for text chunking strategies used in streaming validation.

A chunking strategy receives the full accumulated text so far and returns a
list of complete chunks ready for downstream validation. Any trailing fragment
that has not yet reached a chunk boundary is withheld — it is not included in
the returned list. Each call is stateless and idempotent given the same input.

**Performance:** `split()` is called on every streaming delta and re-scans
the full accumulated text each time, so each call is O(n) in the accumulated
length. The orchestrator tracks `prev_chunk_count` to extract only the new
chunks. This keeps the chunker stateless and removes the need for `reset()`
or deep-copy support, at the cost of re-scanning text already seen. For
typical model outputs (a few KB) the cost is negligible; for very long
streams, a stateful chunker that processes only the new delta would be more
efficient.
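
The re-scan-and-slice pattern described above can be sketched as follows. This is only an illustration, not mellea's orchestrator: `split_on_commas` and `stream_new_chunks` are hypothetical names, and a comma stands in for a real chunk boundary.

```python
def split_on_commas(accumulated_text: str) -> list[str]:
    """Toy stateless splitter: everything before the last comma is complete."""
    # The piece after the last comma is the trailing fragment, so drop it.
    return accumulated_text.split(",")[:-1]


def stream_new_chunks(deltas: list[str]) -> list[list[str]]:
    """For each delta, return only the chunks that became complete on it."""
    accumulated = ""
    prev_chunk_count = 0
    emitted: list[list[str]] = []
    for delta in deltas:
        accumulated += delta
        chunks = split_on_commas(accumulated)  # full re-scan on every delta
        emitted.append(chunks[prev_chunk_count:])  # slice off the new ones
        prev_chunk_count = len(chunks)
    return emitted


new_chunks = stream_new_chunks(["alpha,be", "ta,gam", "ma,"])
print(new_chunks)  # [['alpha'], ['beta'], ['gamma']]
```

Because the splitter is a pure function of the accumulated text, the only state the caller needs is the running count.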

**End-of-stream contract:** `split()` always withholds the trailing fragment.
When the stream terminates, callers are responsible for processing any remainder:
take the full accumulated text, identify everything after the last returned
chunk boundary, and handle it appropriately (e.g. pass it to a final validator
or discard it).
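
One way a caller might locate that remainder by hand is sketched below. The helper `remainder_after_chunks` is hypothetical, and it assumes that every returned chunk appears verbatim and in order inside the accumulated text.

```python
def remainder_after_chunks(accumulated_text: str, chunks: list[str]) -> str:
    """Return the text that follows the last chunk in accumulated_text."""
    cursor = 0
    for chunk in chunks:
        start = accumulated_text.find(chunk, cursor)
        if start == -1:
            # The verbatim-and-in-order assumption does not hold.
            raise ValueError(f"chunk not found in accumulated text: {chunk!r}")
        cursor = start + len(chunk)
    return accumulated_text[cursor:]


text = "First one. Second one. Third on"
chunks = ["First one.", "Second one."]
remainder = remainder_after_chunks(text, chunks)
print(repr(remainder))  # ' Third on'
```

The remainder still carries the separator whitespace; whether to strip it before validation depends on the chunking strategy in use.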

**Note:** this ABC operates on text streams only. Multi-modal output (audio
segments, image regions) is not supported, because the `accumulated_text: str`
signatures on `split` and `flush` preclude it.

<div className="h-8" />

**Methods:**

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `split` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L38" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
split(self, accumulated_text: str) -> list[str]
```

Return complete chunks from `accumulated_text`, excluding any trailing fragment.

**Args:**

* `accumulated_text`: The full text accumulated so far, including all
  previously seen tokens and the latest delta. Implementations
  that scan this string are O(n) in accumulated length per call.
  Stateful implementations that process only the new delta are
  possible, but they must never mutate state on `self` in place.
  Use reassignment (`self._buf = self._buf + [x]`) so that
  `copy()`-based cloning in the orchestrator works correctly.

**Returns:**

* A list of complete chunks. If no chunk boundary has been reached yet,
  returns an empty list. Never includes the trailing incomplete fragment.
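
The reassignment rule for stateful implementations can be illustrated with a small sketch. It assumes the orchestrator's cloning behaves like a shallow `copy.copy`; both buffer classes are hypothetical and exist only to show the difference.

```python
import copy


class InPlaceBuffer:
    """Mutates its list in place, so shallow clones share the same list."""

    def __init__(self) -> None:
        self._buf: list[str] = []

    def add(self, x: str) -> None:
        self._buf.append(x)


class ReassigningBuffer:
    """Builds a new list on every update, so earlier clones are unaffected."""

    def __init__(self) -> None:
        self._buf: list[str] = []

    def add(self, x: str) -> None:
        self._buf = self._buf + [x]


bad = InPlaceBuffer()
bad_clone = copy.copy(bad)   # shallow copy: same list object as bad._buf
bad.add("a")
print(bad_clone._buf)        # ['a']  (state leaked into the clone)

good = ReassigningBuffer()
good_clone = copy.copy(good)
good.add("a")
print(good_clone._buf)       # []     (the clone kept its own view)
```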

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `flush` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L56" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
flush(self, accumulated_text: str) -> list[str]
```

Return any trailing fragment that `split` withheld.

Called once by the orchestrator after the stream has ended naturally
(not on early-exit cancellation).  Gives the chunker a chance to
release the final fragment that did not reach a terminator.

The default implementation returns an empty list — the trailing
fragment is discarded.  Built-in chunkers override this to return
the withheld fragment as a single-element list when non-empty.

**Args:**

* `accumulated_text`: The full accumulated text at stream end.

**Returns:**

* The trailing fragment as `[fragment]` if it should be treated
  as a final chunk, or an empty list to discard it.
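
The `split`/`flush` contract can be sketched with a custom strategy. To keep the example self-contained it does not import mellea: `ChunkingStrategySketch` is a stand-in that mirrors the documented signatures, and the two line-based chunkers are hypothetical.

```python
from abc import ABC, abstractmethod


class ChunkingStrategySketch(ABC):
    """Stand-in mirroring the documented interface; not imported from mellea."""

    @abstractmethod
    def split(self, accumulated_text: str) -> list[str]:
        """Return complete chunks, withholding the trailing fragment."""

    def flush(self, accumulated_text: str) -> list[str]:
        """Default described above: discard the trailing fragment."""
        return []


class LineChunker(ChunkingStrategySketch):
    """Hypothetical chunker: one chunk per newline-terminated line."""

    def split(self, accumulated_text: str) -> list[str]:
        head, sep, _tail = accumulated_text.rpartition("\n")
        # No newline seen yet means no complete line.
        return head.split("\n") if sep else []


class FlushingLineChunker(LineChunker):
    """Same splitting, but releases the unterminated last line at stream end."""

    def flush(self, accumulated_text: str) -> list[str]:
        tail = accumulated_text.rpartition("\n")[2]
        return [tail] if tail else []


text = "first line\nsecond line\nunterminated"
complete = LineChunker().split(text)          # ['first line', 'second line']
discarded = LineChunker().flush(text)         # []
released = FlushingLineChunker().flush(text)  # ['unterminated']
```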

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#4ADE8033]/20 text-[#15803D]">CLASS</span> `SentenceChunker` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L92" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

Splits accumulated text on sentence boundaries.

Sentence boundaries are detected by `.`, `!`, or `?`, optionally
followed by a closing quote (straight or curly) or parenthesis, then
whitespace. The final sentence is returned only once it is followed by
whitespace or another sentence; a trailing fragment with no following
whitespace is withheld. Abbreviations are a known edge case: detection
uses a simple regex rather than NLP, so text such as `Dr. Smith` is split
after the abbreviation. Inter-sentence whitespace (including a double
space or tab) is discarded and does not appear as leading whitespace
in subsequent chunks.
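
The boundary rule above can be approximated with a short regex sketch. The pattern and the function name `split_sentences` are assumptions made for illustration; the library's actual regex may differ.

```python
import re

# Characters allowed between the terminator and the following whitespace:
# straight quotes, closing curly quotes, and a closing parenthesis.
_CLOSERS = "\"'\u201d\u2019)"
_BOUNDARY = re.compile(r"([.!?][" + re.escape(_CLOSERS) + r"]?)\s+")


def split_sentences(accumulated_text: str) -> list[str]:
    """Return sentences that are already followed by whitespace."""
    sentences: list[str] = []
    start = 0
    for match in _BOUNDARY.finditer(accumulated_text):
        # Keep the terminator and optional closer, drop the whitespace run.
        sentence = accumulated_text[start:match.end(1)].lstrip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    return sentences  # text after the last boundary is withheld


result = split_sentences("Dr. Smith arrived.  He sat down. And the")
print(result)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

The output shows both documented behaviours: the abbreviation `Dr.` is split off as its own chunk, and the double space after `arrived.` does not leak into the next sentence.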

<div className="h-8" />

**Methods:**

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `split` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L105" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
split(self, accumulated_text: str) -> list[str]
```

Return complete sentences from `accumulated_text`.

**Args:**

* `accumulated_text`: The full text accumulated so far.

**Returns:**

* Complete sentences detected so far. The trailing fragment (if any)
  is withheld.

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `flush` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L136" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
flush(self, accumulated_text: str) -> list[str]
```

Return the trailing sentence fragment (if any) as a final chunk.

Trailing whitespace on the fragment is non-semantic for sentence
boundaries and is dropped via `rstrip`. Leading whitespace is
already removed by the loop's `lstrip` on each advance, so no
`lstrip` is needed here. The result is the fragment's content
only, consistent with how `split` returns sentences without
trailing whitespace.

**Args:**

* `accumulated_text`: The full accumulated text at stream end.

**Returns:**

* A single-element list containing the trailing sentence fragment
  with leading and trailing whitespace stripped, or an empty list
  when there is no fragment (all content ended in a sentence
  boundary or the input is empty/whitespace-only).

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#4ADE8033]/20 text-[#15803D]">CLASS</span> `WordChunker` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L167" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

Splits accumulated text on whitespace boundaries.

Each word is a chunk. Trailing text not yet followed by whitespace is
withheld.
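
A minimal sketch of the word-level behaviour, covering both the withheld fragment and its release at stream end. The functions `split_words` and `flush_words` are hypothetical stand-ins, not the library's implementation.

```python
def split_words(accumulated_text: str) -> list[str]:
    """Return words that are already followed by whitespace."""
    words = accumulated_text.split()
    # If the text does not end in whitespace, the last word may still grow.
    if words and not accumulated_text[-1].isspace():
        words.pop()
    return words


def flush_words(accumulated_text: str) -> list[str]:
    """Return the trailing word fragment, if the text ends mid-word."""
    if not accumulated_text or accumulated_text[-1].isspace():
        return []  # every word is already complete
    return [accumulated_text.split()[-1]]


print(split_words("hello wor"))     # ['hello']
print(flush_words("hello wor"))     # ['wor']
print(flush_words("hello world "))  # []
```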

<div className="h-8" />

**Methods:**

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `split` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L174" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
split(self, accumulated_text: str) -> list[str]
```

Return complete words from `accumulated_text`.

**Args:**

* `accumulated_text`: The full text accumulated so far.

**Returns:**

* All whitespace-delimited words except the trailing fragment (if any).
  An empty list is returned when no whitespace boundary has been seen.

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `flush` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L206" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
flush(self, accumulated_text: str) -> list[str]
```

Return the trailing word fragment (if any) as a final chunk.

The trailing fragment is the text after the last whitespace run when
the accumulated text does not end with whitespace.  When it does end
with whitespace, every word is already complete and no fragment is
released.

**Args:**

* `accumulated_text`: The full accumulated text at stream end.

**Returns:**

* A single-element list containing the trailing word fragment, or
  an empty list when the input ends with whitespace (every word
  already complete) or is empty.

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#4ADE8033]/20 text-[#15803D]">CLASS</span> `ParagraphChunker` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L233" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

Splits accumulated text on double-newline paragraph boundaries.

Two or more consecutive newline characters are treated as a paragraph
separator. The trailing paragraph fragment (text not yet followed by `\n\n`)
is withheld.

Note: only Unix-style `\n\n` separators are recognised. CRLF
(`\r\n\r\n`) paragraph separators are not supported.
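
The separator rule, including the CRLF limitation, can be sketched as below. The function `split_paragraphs` is a hypothetical approximation; dropping empty pieces is an assumption, not documented behaviour.

```python
import re

_PARAGRAPH_BREAK = re.compile(r"\n{2,}")


def split_paragraphs(accumulated_text: str) -> list[str]:
    """Return paragraphs that are already followed by a blank-line separator."""
    parts = _PARAGRAPH_BREAK.split(accumulated_text)
    # The last part has not been followed by a separator yet: withhold it.
    complete = parts[:-1]
    # Assumption: empty pieces (e.g. from leading newlines) are dropped.
    return [part for part in complete if part]


unix = split_paragraphs("First.\n\nSecond.\n\n\nThird so far")
crlf = split_paragraphs("First.\r\n\r\nSecond.\r\n\r\nThird so far")
print(unix)  # ['First.', 'Second.']
print(crlf)  # []
```

The CRLF input yields no chunks because `\r\n\r\n` never contains two consecutive `\n` characters.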

<div className="h-8" />

**Methods:**

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `split` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L244" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
split(self, accumulated_text: str) -> list[str]
```

Return complete paragraphs from `accumulated_text`.

**Args:**

* `accumulated_text`: The full text accumulated so far.

**Returns:**

* Complete paragraphs (separated by two or more newlines). The
  trailing incomplete paragraph is withheld. Returns an empty list
  if no paragraph boundary has been reached.

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />

#### <span className="ml-2 inline-flex items-center rounded-full px-2 py-1 text-[0.7rem] font-bold tracking-wide bg-[#3064E3]/20 text-[#1D4ED8]">FUNC</span> `flush` <sup><a href="https://github.com/generative-computing/mellea/blob/v0.6.0/mellea/stdlib/chunking.py#L267" target="_blank"><Icon icon="github" style="width: 14px; height: 14px;" /></a></sup>

```python theme={null}
flush(self, accumulated_text: str) -> list[str]
```

Return the trailing paragraph fragment (if any) as a final chunk.

Unlike `SentenceChunker.flush`, the fragment is returned
byte-for-byte without stripping. Internal whitespace, including
a trailing single `\n`, can be semantically meaningful inside
a paragraph (e.g. a list item or a deliberate line break), and a
consumer validating paragraph content should see the fragment as
it was withheld.

**Args:**

* `accumulated_text`: The full accumulated text at stream end.

**Returns:**

* A single-element list containing the trailing paragraph fragment
  byte-for-byte, or an empty list when the input ends with a
  paragraph boundary (`\n\n` or more) or is empty.
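
The byte-for-byte behaviour can be sketched as follows. The function `flush_paragraph` is a hypothetical approximation of the description above, not the library's code.

```python
import re


def flush_paragraph(accumulated_text: str) -> list[str]:
    """Return the text after the last paragraph separator, unmodified."""
    tail = re.split(r"\n{2,}", accumulated_text)[-1]
    return [tail] if tail else []


text = "Intro.\n\n- item one\n- item two\n"
fragment = flush_paragraph(text)
print(fragment)  # ['- item one\n- item two\n']
```

The single trailing `\n` survives, whereas a stripping flush in the style of `SentenceChunker.flush` would have removed it.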

<div className="w-full h-px bg-gray-200 dark:bg-gray-700 my-4" />
