fix: improve multi-column layout sorting for academic papers (#4104) #4283

Open
Gopesh111 wants to merge 1 commit into Unstructured-IO:main from Gopesh111:fix-academic-sorting

@Gopesh111

This PR addresses the reading order issues in multi-column documents (specifically academic papers) as reported in #4104.

Key Changes:

Hybrid Sorting Logic: Introduced sort_page_elements_columns in sorting.py to bin elements into Top (Header/Title), Bottom (Footer), and Middle (Body) zones.

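Column-Aware Binning: Body elements are now split into Left and Right columns at the page's horizontal midpoint, so the left column is read in full before the right one. This prevents the 'Z-pattern' reading order, in which lines alternate between columns.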

Noise-Resistant XY-Cut: Updated xycut.py with increased min_gap (10px for X, 2px for Y) and min_value thresholds. This allows the parser to ignore scanning noise and correctly identify narrow gutters between columns in research papers.
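
For reviewers, the effect of the two thresholds can be illustrated with a minimal sketch. This is not the code in xycut.py: find_gaps is a hypothetical helper, and the projection profile is assumed to be a plain list of per-pixel ink counts along one axis. A position counts as empty when its value is at or below min_value, and a run of empty positions is only reported as a gap when it is at least min_gap long.

```python
from typing import List, Tuple


def find_gaps(profile: List[int], min_value: int, min_gap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile stays at or
    below min_value for at least min_gap consecutive positions.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    gaps: List[Tuple[int, int]] = []
    start = None  # index where the current low-ink run began, if any
    for i, value in enumerate(profile):
        if value <= min_value:
            if start is None:
                start = i
        else:
            # The run just ended; keep it only if it is wide enough.
            if start is not None and i - start >= min_gap:
                gaps.append((start, i))
            start = None
    # Handle a run that extends to the end of the profile.
    if start is not None and len(profile) - start >= min_gap:
        gaps.append((start, len(profile)))
    return gaps


# Assumed X-axis profile: a text column, a 12 px gutter containing faint
# scanning noise (value 1), a second column, a 3 px clean sliver, more text.
profile = [5] * 20 + [1] * 12 + [6] * 20 + [0] * 3 + [4] * 5

# Tolerant thresholds: the noisy gutter is found, the sliver is ignored.
tolerant = find_gaps(profile, min_value=1, min_gap=10)

# Strict thresholds: noise hides the gutter and the sliver is reported instead.
strict = find_gaps(profile, min_value=0, min_gap=1)
```

With the tolerant settings the only reported gap is the gutter between the two columns, which is the behaviour the description above attributes to the larger thresholds.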

Verification:
Tested with the NAACL 2025 findings paper. Verified that the sequence now correctly follows: Title -> Abstract -> Intro (Left Col) -> Intro (Right Col) -> Footer.
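
The expected sequence can be reproduced on synthetic boxes with a minimal sketch of the zone and column binning described under Key Changes. This is not the body of sort_page_elements_columns: the Box class, the function name sort_zones_and_columns, and the top_frac and bottom_frac zone boundaries are all assumptions made for illustration, since the description does not state how the zones are delimited.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical stand-in for a layout element with page coordinates."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def sort_zones_and_columns(
    boxes: List[Box],
    page_width: float,
    page_height: float,
    top_frac: float = 0.15,    # assumed boundary of the Top zone
    bottom_frac: float = 0.90,  # assumed boundary of the Bottom zone
) -> List[Box]:
    """Order boxes as Top, Left column, Right column, Bottom."""
    top, left, right, bottom = [], [], [], []
    mid_x = page_width / 2
    for box in boxes:
        center_x = (box.x1 + box.x2) / 2
        center_y = (box.y1 + box.y2) / 2
        if center_y < top_frac * page_height:
            top.append(box)
        elif center_y > bottom_frac * page_height:
            bottom.append(box)
        elif center_x < mid_x:
            left.append(box)
        else:
            right.append(box)

    def reading_key(box: Box):
        # Within a bin, read top to bottom, then left to right.
        return (box.y1, box.x1)

    ordered: List[Box] = []
    for group in (top, left, right, bottom):
        ordered.extend(sorted(group, key=reading_key))
    return ordered


# Synthetic 600 x 800 page, listed in deliberately scrambled order.
page = [
    Box("Intro (Right Col)", 310, 150, 550, 700),
    Box("Footer", 50, 760, 550, 780),
    Box("Abstract", 50, 150, 290, 300),
    Box("Title", 100, 40, 500, 80),
    Box("Intro (Left Col)", 50, 320, 290, 700),
]
labels = [b.label for b in sort_zones_and_columns(page, 600, 800)]
```

A plain top-to-bottom sort of the same boxes would place the right-column Intro before the left-column one whenever their top edges tie or the right box starts higher, which is the Z-pattern the PR is meant to avoid.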
