Clarification on Metadata, Semantics, Description, Storage & Tree-Based Retrieval in PageIndex #171
Replies: 1 comment
Hi @rakeshofficework01-oss
PageIndex is not a full storage or metadata management system.
Think of PageIndex as a retrieval layer, not a database.
This is where PageIndex differs significantly from traditional RAG.
Instead:
Semantics come from summaries + structure, not embeddings. Note:
Descriptions are not just a filtering step - they are central to retrieval.
This is a coarse-to-fine retrieval mechanism:
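As a rough illustration of the coarse-to-fine idea, here is a minimal sketch, not PageIndex's actual code or API. It assumes each document has a short description and each section a short summary. The `keyword_overlap` function is a hypothetical stand-in for the LLM judgment so the example runs offline, and the document ids and field names are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LLM relevance judgment: a real system would
# prompt a model with the query and the text. Here we only check word overlap.
def keyword_overlap(query: str, text: str) -> bool:
    return bool(set(query.lower().split()) & set(text.lower().split()))

def coarse_to_fine(
    query: str,
    docs: Dict[str, dict],
    is_relevant: Callable[[str, str], bool] = keyword_overlap,
) -> List[Tuple[str, str]]:
    """Return (doc_id, section_title) pairs judged relevant to the query."""
    hits = []
    for doc_id, doc in docs.items():
        # Coarse stage: decide from the document description alone.
        if not is_relevant(query, doc["description"]):
            continue
        # Fine stage: only inside selected documents, look at section summaries.
        for title, summary in doc["sections"].items():
            if is_relevant(query, summary):
                hits.append((doc_id, title))
    return hits

# Illustrative data (invented for this sketch).
docs = {
    "report_2023": {
        "description": "annual financial report for 2023",
        "sections": {
            "Revenue": "quarterly revenue figures",
            "Risks": "market and credit risks",
        },
    },
    "handbook": {
        "description": "employee onboarding handbook",
        "sections": {"Leave": "vacation and leave policy"},
    },
}

hits = coarse_to_fine("financial revenue", docs)
```

The point of the two stages is that section summaries of unselected documents are never examined, which is why the quality of the descriptions matters so much.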
Your observation is expected.
To improve local performance:
“LLM tree search” replaces traditional nearest-neighbor retrieval.
Retrieval flow:
This is reasoning-driven retrieval instead of similarity search.
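To make the idea concrete, here is a minimal sketch of a tree search over node summaries. It is an illustration under assumptions, not the PageIndex implementation: the `Node` fields, `keyword_judge` (a hypothetical stand-in for the LLM call) and the in-memory `page_store` dict are all invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Node:
    title: str
    summary: str
    pages: Tuple[int, int]  # inclusive (start_page, end_page)
    children: List["Node"] = field(default_factory=list)

# Hypothetical stand-in for an LLM judgment ("is this section worth opening?").
def keyword_judge(query: str, node: Node) -> bool:
    words = set((node.title + " " + node.summary).lower().split())
    return bool(set(query.lower().split()) & words)

def tree_search(root: Node, query: str,
                judge: Callable[[str, Node], bool]) -> List[Node]:
    """Descend only into branches judged relevant; keep the deepest relevant nodes."""
    selected: List[Node] = []

    def visit(node: Node) -> None:
        if not judge(query, node):
            return  # prune the whole subtree
        before = len(selected)
        for child in node.children:
            visit(child)
        if len(selected) == before:
            # No descendant was selected, so this node is the best match here.
            selected.append(node)

    for child in root.children:
        visit(child)
    return selected

def fetch_pages(page_store: Dict[int, str], node: Node) -> List[str]:
    """Load page text only after a node has been selected."""
    start, end = node.pages
    return [page_store[p] for p in range(start, end + 1) if p in page_store]

# Illustrative tree and page store (invented for this sketch).
root = Node("Annual Report", "full report", (1, 10), [
    Node("Financials", "revenue and expenses overview", (2, 6), [
        Node("Revenue", "quarterly revenue breakdown", (2, 3)),
        Node("Expenses", "operating expenses detail", (4, 6)),
    ]),
    Node("Governance", "board and committees", (7, 10)),
])
page_store = {n: f"text of page {n}" for n in range(1, 11)}

nodes = tree_search(root, "quarterly revenue", keyword_judge)
pages = fetch_pages(page_store, nodes[0])
```

No embeddings or distance metrics are involved: the search only ever reads titles and summaries, and page text is touched once a node has been chosen.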
Content is retrieved on demand after node selection.
Summary
Traditional RAG:
PageIndex:
This shift (from similarity search to reasoning over structure) is the key design idea behind PageIndex.
Hi Team,
I’m currently exploring PageIndex based on the official documentation under Doc Search (metadata, semantics, and description). I have a few clarifications to ensure I’m implementing it correctly:
From the documentation, it mentions setting up SQL tables and storing doc_id along with metadata.
• Should we create and manage this SQL database ourselves?
• Or does PageIndex provide or internally manage any metadata storage and filtering layer?
For semantic search:
• Does PageIndex internally handle chunking, embedding, and indexing of documents?
• Or are we expected to generate embeddings and store them in our own vector database?
For the description method:
• Is the correct approach to generate document descriptions using an LLM and rely on those descriptions for document selection before querying?
• Does PageIndex use these descriptions internally for routing queries to the right documents?
I observed that:
• When I process a PDF locally (open-source code), some documents fail to extract content (especially scanned PDFs)
• But when uploading the same file via PageIndex (as per docs), it works correctly
Could you clarify:
• Does PageIndex use additional internal parsing/OCR mechanisms not available in the open-source repo?
• Or is there a recommended setup to achieve similar results locally?
I noticed references to tree-based approaches:
• What exactly is “LLM tree search” in PageIndex?
I have some questions about how PageIndex stores and processes documents internally:
• When we upload a document, where is the document actually stored?
• The system creates a tree structure (ToC) — where is this structure stored?
• Is the LLM reasoning performed over this tree to identify relevant nodes (based on summaries), and then navigate to the exact page?
Additionally:
• Once the relevant node/page is identified, where does PageIndex fetch the actual page content from?
• Is the page content stored internally by PageIndex, or retrieved dynamically during query time?
I’d appreciate clarification on these points to better understand the internal architecture and align with the intended usage.
Thanks!