Clarification on Metadata, Semantics, Description, Storage & Tree-Based Retrieval in PageIndex #171

rakeshofficework01-oss · 2026-03-18T06:54:33Z

Hi Team,
I’m currently exploring PageIndex based on the official documentation under Doc Search (metadata, semantics, and description). I have a few clarifications to ensure I’m implementing it correctly:

Metadata Approach

The documentation mentions setting up SQL tables and storing doc_id along with metadata.
• Should we create and manage this SQL database ourselves?
• Or does PageIndex provide or internally manage any metadata storage and filtering layer?

Semantics Approach

For semantic search:
• Does PageIndex internally handle chunking, embedding, and indexing of documents?
• Or are we expected to generate embeddings and store them in our own vector database?

Description-Based Retrieval

For the description method:
• Is the correct approach to generate document descriptions using an LLM and rely on those descriptions for document selection before querying?
• Does PageIndex use these descriptions internally for routing queries to the right documents?

Document Processing Difference

I observed that:
• When I process a PDF locally (open-source code), some documents fail to extract content (especially scanned PDFs).
• But when I upload the same file via PageIndex (as per the docs), it works correctly.
Could you clarify:
• Does PageIndex use additional internal parsing/OCR mechanisms not available in the open-source repo?
• Or is there a recommended setup to achieve similar results locally?

Tree Search / Retrieval Strategy

I noticed references to tree-based approaches:
• What exactly is “LLM tree search” in PageIndex?

Document Storage & Tree Structure

I have some questions about how PageIndex stores and processes documents internally:
• When we upload a document, where is the document actually stored?
• The system creates a tree structure (ToC). Where is this structure stored?
• Is the LLM reasoning performed over this tree to identify relevant nodes (based on summaries), and then navigate to the exact page?
Additionally:
• Once the relevant node/page is identified, where does PageIndex fetch the actual page content from?
• Is the page content stored internally by PageIndex, or retrieved dynamically during query time?

I’d appreciate clarification on these points to better understand the internal architecture and align with the intended usage.
Thanks!

Manoj-Gujare · 2026-03-25T17:52:47Z

Hi @rakeshofficework01-oss
Great set of questions; they touch the core design differences between PageIndex and traditional RAG systems. Let me address each point in turn.

Metadata Approach

PageIndex is not a full storage or metadata management system.

- You are responsible for managing document-level metadata (SQL/NoSQL, doc_id mapping, etc.).
- PageIndex internally maintains references between nodes and pages, but higher-level concerns (tags, permissions, filtering) are handled outside it.

Think of PageIndex as a retrieval layer, not a database.
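
As a rough illustration of keeping that metadata on your side, here is a minimal sketch using Python's built-in sqlite3. The table name, the columns, and the doc_id values are assumptions made up for the example; the only idea taken from the docs is storing doc_id next to your own metadata and filtering on it before retrieval.

```python
import sqlite3

# Hypothetical schema: one row per document, keyed by the doc_id
# you received when the document was indexed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        doc_id   TEXT PRIMARY KEY,
        company  TEXT NOT NULL,
        year     INTEGER NOT NULL,
        doc_type TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("doc-001", "Acme", 2023, "annual_report"),
        ("doc-002", "Acme", 2024, "annual_report"),
        ("doc-003", "Globex", 2024, "annual_report"),
    ],
)


def find_doc_ids(conn, company, year):
    """Return the doc_ids that match the metadata filter, in a stable order."""
    rows = conn.execute(
        "SELECT doc_id FROM documents "
        "WHERE company = ? AND year = ? ORDER BY doc_id",
        (company, year),
    )
    return [row[0] for row in rows]


# Only these doc_ids would then be passed on to retrieval.
candidates = find_doc_ids(conn, "Acme", 2024)
```

The filter runs entirely in your own database; the retrieval layer only ever sees the doc_ids that survive it.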

Semantics Approach

This is where PageIndex differs significantly from traditional RAG.

PageIndex follows a vectorless-first approach.
It does NOT rely on:
- manual chunking
- embedding generation
- vector similarity search (by default)

Instead:

- documents are parsed into a hierarchical structure
- each node (section/page) is summarized using an LLM

Semantics come from summaries + structure, not embeddings.

Note:

Hybrid setups (adding embeddings externally) are still possible if needed.
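
To make the "parse into a hierarchy, then summarize each node" idea concrete, here is a small sketch. It is not PageIndex's actual pipeline: the (level, title, text) input format, the dict fields, and the first_sentence summarizer are all assumptions for illustration, and a real setup would call an LLM where the stand-in summarizer is used.

```python
def build_tree(sections, summarize):
    """Nest (level, title, text) sections by heading level and summarize each node."""
    root = {"title": "ROOT", "level": 0, "summary": "", "children": []}
    stack = [root]
    for level, title, text in sections:
        node = {
            "title": title,
            "level": level,
            "summary": summarize(text),
            "children": [],
        }
        # Pop until the top of the stack is a shallower heading: that is the parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def first_sentence(text):
    """Stand-in summarizer; a real pipeline would call an LLM here."""
    return text.split(".")[0].strip() + "."


sections = [
    (1, "Introduction", "Overview of the report. More background follows."),
    (2, "Scope", "Covers fiscal year 2024. Earlier years are excluded."),
    (1, "Results", "Revenue grew in every segment. Details are below."),
]
tree = build_tree(sections, first_sentence)
```

Heading levels are assumed to start at 1, so the synthetic root (level 0) is never popped off the stack.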

Description-Based Retrieval

Descriptions are not just a filtering step; they are central to retrieval.

Each node in the document tree has an LLM-generated summary.
During retrieval:
- the LLM reads these summaries
- reasons about relevance
- selects the most appropriate nodes

This is a coarse-to-fine retrieval mechanism:
summaries → node selection → content fetch
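
The node-selection step above can be sketched as a prompt built from summaries plus a defensive parse of the model's reply. The prompt wording, the node_id field, and the JSON reply format are assumptions for the example, not PageIndex's actual prompt or schema, and the LLM call itself is left out.

```python
import json


def build_selection_prompt(question, nodes):
    """Format node summaries into a prompt that asks for relevant node ids."""
    lines = [f"[{n['node_id']}] {n['title']}: {n['summary']}" for n in nodes]
    return (
        "Question: " + question + "\n\n"
        "Sections:\n" + "\n".join(lines) + "\n\n"
        'Reply with JSON like {"node_ids": ["..."]} listing the relevant sections.'
    )


def parse_selection(reply, nodes):
    """Keep only ids that exist in the tree, so a made-up id is dropped."""
    known = {n["node_id"] for n in nodes}
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    picked = data.get("node_ids", []) if isinstance(data, dict) else []
    if not isinstance(picked, list):
        return []
    return [nid for nid in picked if isinstance(nid, str) and nid in known]


nodes = [
    {"node_id": "0001", "title": "Introduction", "summary": "Scope and goals."},
    {"node_id": "0002", "title": "Revenue", "summary": "Revenue by segment."},
]
prompt = build_selection_prompt("How did revenue change?", nodes)
# Hypothetical model reply; "9999" does not exist and is filtered out.
selected = parse_selection('{"node_ids": ["0002", "9999"]}', nodes)
```

Validating the returned ids against the tree is a cheap guard against the model naming a section that does not exist.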

Document Processing Differences

Your observation is expected.

Local/open-source pipelines:
- rely on standard parsers
- may fail on scanned PDFs
Hosted/API versions:
- likely include OCR + layout-aware parsing
- may use vision-based models for better extraction

To improve local performance:

- integrate OCR (for example PaddleOCR, which tends to handle complex layouts well)
- use layout-aware parsing (hi_res strategies)
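
One way to wire OCR into a local pipeline is a simple fallback: use the embedded text layer when it looks usable, and only run OCR when it does not. The sketch below shows the pattern only. The function names, the min_chars threshold, and the dict-shaped pages are assumptions; in practice text_extractor would wrap your PDF parser and ocr_extractor would wrap an OCR engine such as PaddleOCR.

```python
def extract_page_text(page, text_extractor, ocr_extractor, min_chars=20):
    """Use the embedded text layer when it looks usable, otherwise fall back to OCR."""
    text = (text_extractor(page) or "").strip()
    if len(text) >= min_chars:
        return text, "text_layer"
    return (ocr_extractor(page) or "").strip(), "ocr"


# Stand-ins so the sketch runs without any PDF or OCR dependency.
def fake_text_layer(page):
    return page.get("embedded_text", "")


def fake_ocr(page):
    return page.get("image_text", "")


digital_page = {"embedded_text": "Quarterly revenue grew across all segments."}
scanned_page = {"embedded_text": "", "image_text": "Scanned invoice total: 120 USD"}

digital_result = extract_page_text(digital_page, fake_text_layer, fake_ocr)
scanned_result = extract_page_text(scanned_page, fake_text_layer, fake_ocr)
```

Scanned PDFs typically have an empty or near-empty text layer, which is why a character-count threshold is enough to trigger the fallback in this sketch.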

Tree Search / LLM Tree Search

“LLM tree search” replaces traditional nearest-neighbor retrieval.

Documents are represented as a hierarchy:
- document → sections → pages

Retrieval flow:

1. LLM inspects summaries (like a Table of Contents)
2. Selects relevant nodes
3. Traverses down the tree
4. Retrieves exact content

This is reasoning-driven retrieval instead of similarity search.
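
The traversal can be sketched as a recursive descent where a selector decides which children to follow at each level. This is an illustration under assumptions, not PageIndex's implementation: the Node class and tree_search are hypothetical, and keyword_selector is a deliberately crude stand-in for the LLM call that would read the summaries and reason about relevance.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    summary: str
    pages: tuple  # (start_page, end_page), inclusive
    children: list = field(default_factory=list)


def keyword_selector(question, candidates):
    """Stand-in for the LLM: keep children whose summary shares a word with the question."""
    words = set(question.lower().split())
    return [c for c in candidates if words & set(c.summary.lower().split())]


def tree_search(node, question, select):
    """Descend from node, following only the children the selector picks."""
    if not node.children:
        return [node]
    hits = []
    for child in select(question, node.children):
        hits.extend(tree_search(child, question, select))
    return hits


root = Node("Annual Report", "whole report", (1, 40), [
    Node("Financials", "revenue and expenses overview", (1, 20), [
        Node("Revenue", "revenue by segment", (1, 10)),
        Node("Expenses", "operating expenses breakdown", (11, 20)),
    ]),
    Node("Risk Factors", "market and legal risk", (21, 40)),
])

hits = tree_search(root, "revenue by segment", keyword_selector)
```

Swapping keyword_selector for a function that prompts an LLM with the child summaries leaves tree_search unchanged, which is the point of keeping the selector pluggable.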

Document Storage & Tree Structure

Raw documents:
- stored externally (filesystem, object storage, etc.)
Tree structure:
- stored as a JSON-like hierarchy (summaries + references)
Retrieval:
1. LLM selects relevant nodes via summaries
2. System fetches corresponding pages/content dynamically

Content is retrieved on demand after node selection.
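
To illustrate "tree stores references, content is fetched on demand", here is a minimal sketch. The JSON field names (doc_id, node_id, start_page, end_page) and the in-memory page_store are assumptions for the example; in a real setup the store would be your filesystem or object storage, and the tree schema may differ.

```python
import json

# Hypothetical tree: summaries plus page references, no page text.
tree_json = """
{
  "doc_id": "doc-001",
  "nodes": [
    {"node_id": "0001", "title": "Introduction",
     "summary": "Scope and goals.", "start_page": 1, "end_page": 2},
    {"node_id": "0002", "title": "Methods",
     "summary": "How data was collected.", "start_page": 3, "end_page": 5}
  ]
}
"""

# Stand-in for external storage, keyed by (doc_id, page_number).
page_store = {("doc-001", p): f"text of page {p}" for p in range(1, 6)}


def fetch_node_content(tree, node_id, store):
    """Resolve a selected node to its page range and load those pages from the store."""
    node = next(n for n in tree["nodes"] if n["node_id"] == node_id)
    pages = range(node["start_page"], node["end_page"] + 1)
    return [store[(tree["doc_id"], p)] for p in pages]


tree = json.loads(tree_json)
content = fetch_node_content(tree, "0002", page_store)
```

Because the tree holds only summaries and page ranges, it stays small enough to hand to the LLM, while the page text is read only for the nodes that were actually selected.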

Summary

Traditional RAG:

chunk → embed → vector search

PageIndex:

structure → summarize → LLM reasoning → targeted retrieval

This shift (from similarity search to reasoning over structure) is the key design idea behind PageIndex.
