feat(chunking): repeat table headers on continuation chunks #4298

Open
cragwolfe wants to merge 8 commits into main from crag/codex-draft-preview-headers-copy

@cragwolfe
Contributor

Behavior summary

Before

  • Oversized table chunks only preserved headers in the first chunk; continuation chunks could lose column context.
  • Table header semantics (<thead> / <th>) were not retained as explicit row-level metadata after compactification.

After

  • Added repeat_table_headers (default True) to chunking APIs and strategy plumbing:
    • chunk_elements(..., repeat_table_headers=...)
    • chunk_by_title(..., repeat_table_headers=...)
    • add_chunking_strategy(...) (argument forwarded, docs updated)
  • _TableChunker now detects contiguous leading header rows and repeats them on non-initial continuation chunks.
  • Repeated header rows are prepended to both continuation chunk text and text_as_html.
  • First chunk behavior remains unchanged relative to legacy output.
  • Added a guardrail: if the repeated header rows would consume more than half the chunk window, the splitter falls back to legacy non-repeating behavior.
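
For reviewers who want a mental model of the repetition and the guardrail, here is a minimal sketch. It is not the `_TableChunker` code from this PR: `split_rows_with_repeated_headers` and its `header_rows`, `body_rows`, and `max_characters` parameters are hypothetical names. It handles plain text only (not `text_as_html`) and assumes the header plus any single body row fits in the window.

```python
from typing import List


def split_rows_with_repeated_headers(
    header_rows: List[str],
    body_rows: List[str],
    max_characters: int,
    repeat_table_headers: bool = True,
) -> List[str]:
    """Split table rows into text chunks of at most max_characters.

    Hypothetical illustration only; rows are joined with newlines.
    """
    header_text = "\n".join(header_rows)
    # Guardrail: repeat only when the header block takes at most half the window.
    repeat = (
        repeat_table_headers
        and bool(header_rows)
        and len(header_text) * 2 <= max_characters
    )

    chunks: List[str] = []
    current: List[str] = []
    for row in header_rows + body_rows:
        candidate = "\n".join(current + [row])
        if current and len(candidate) > max_characters:
            # Window is full: emit it and start a continuation chunk.
            chunks.append("\n".join(current))
            # Continuation chunks start with the headers when repeating.
            current = list(header_rows) if repeat else []
        current.append(row)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

With `repeat_table_headers=False`, or when the guardrail trips, the sketch never re-seeds a continuation chunk with the header rows, which mirrors the legacy behavior described above.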

Invariants

  • No body-row drop, duplication, or reordering across emitted continuation chunks.
  • Opt-out behavior (repeat_table_headers=False) matches legacy table splitting behavior.
  • Chunk windows still respect max-size constraints, including near-boundary continuation windows.
  • Only contiguous leading header rows are repeated; later non-leading header-like rows are not promoted.
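
The first invariant can be checked mechanically. The sketch below is one possible checker, not a test from this PR: `body_rows_preserved` is a hypothetical helper that models each chunk as a list of row strings and strips a repeated header block from continuation chunks before comparing against the original rows.

```python
from typing import List, Sequence


def body_rows_preserved(
    chunks: Sequence[Sequence[str]],
    header_rows: Sequence[str],
    original_rows: Sequence[str],
) -> bool:
    """Return True when the chunks carry every original row once, in order.

    Hypothetical illustration: original_rows includes the leading header rows,
    and every chunk after the first may begin with a repeated copy of them.
    """
    n = len(header_rows)
    recovered: List[str] = []
    for index, chunk in enumerate(chunks):
        rows = list(chunk)
        # Strip the repeated header block from continuation chunks only.
        if index > 0 and n and rows[:n] == list(header_rows):
            rows = rows[n:]
        recovered.extend(rows)
    return recovered == list(original_rows)
```

One limitation of this sketch: a body row that is identical to the header block and happens to open a continuation chunk would be mistaken for a repeated header.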

Edge cases covered

  • No headers, single leading header row, multiple leading header rows.
  • Header detection from both <thead> and <th> rows.
  • Exact-fit and near-boundary continuation sizing.
  • Cascading repetition across 3+ continuation chunks.
  • Pathologically large header rows trigger safe fallback to non-repeating behavior.
  • Strategy-path forwarding validated through partition_html(..., chunking_strategy="by_title").
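
To make the header-detection cases concrete, here is a standalone sketch using only the standard library's `html.parser`. It is not the detection code in this PR. It assumes a row counts as a header row when it sits inside `<thead>` or when all of its cells are `<th>`, and that the table is well formed and not nested. `count_leading_header_rows` is a hypothetical name.

```python
from html.parser import HTMLParser
from typing import List


class _RowHeaderScanner(HTMLParser):
    """Record, for each <tr>, whether it looks like a header row."""

    def __init__(self) -> None:
        super().__init__()
        self.flags: List[bool] = []
        self._in_thead = False
        self._in_row = False
        self._row_in_thead = False
        self._cell_tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr":
            self._in_row = True
            self._row_in_thead = self._in_thead
            self._cell_tags = []
        elif tag in ("th", "td") and self._in_row:
            self._cell_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "tr" and self._in_row:
            # Assumed rule: inside <thead>, or every cell in the row is a <th>.
            all_th = bool(self._cell_tags) and all(t == "th" for t in self._cell_tags)
            self.flags.append(self._row_in_thead or all_th)
            self._in_row = False


def count_leading_header_rows(table_html: str) -> int:
    """Count contiguous header rows at the top of the table."""
    scanner = _RowHeaderScanner()
    scanner.feed(table_html)
    scanner.close()
    count = 0
    for is_header in scanner.flags:
        if not is_header:
            break  # later header-like rows are not promoted
        count += 1
    return count
```

The early `break` is what keeps a header-like row further down the table from being treated as a repeatable header.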

Test evidence

  • uv run --no-sync pytest -q test_unstructured/chunking/test_dispatch.py (6 passed)
  • uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "Describe_TableChunker" (26 passed)
  • uv run --no-sync pytest -q test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers (1 passed)
  • uv run --no-sync pytest -q test_unstructured/chunking/test_title.py -k "repeat_table_headers" (5 passed)
  • uv run --with python-docx pytest -q test_unstructured/chunking/test_basic.py -k "repeat_table_headers" (4 passed)
  • uv run --no-sync pytest -q test_unstructured/common/test_html_table.py (26 passed)

authored by codex

@cragwolfe cragwolfe marked this pull request as ready for review March 25, 2026 04:13
