This repository contains a Python script for processing PDF documents (specifically SEC Form 10-K filings) using Graph Retrieval Augmented Generation (GraphRAG) techniques. The implementation uses Neo4j for graph storage and OpenAI for embeddings and language model capabilities.
- graphrag_script.py: Main script for processing PDFs and querying the knowledge graph
- data/: Directory containing sample data
- form10k-clean_short/: Directory containing sample Form 10-K PDF files
- cik-10k-urls_short_list.csv: Mapping of CUSIPs to CIKs and Form 10-K URLs
- Company_Financial_Statements.csv: Sample financial statement data
- Asset_Manager_Holdings.csv: Sample asset manager holdings data
- Direct PDF loading using PdfLoader from neo4j_graphrag
- Text chunking with FixedSizeSplitter for optimal context size
- Embedding generation with TextChunkEmbedder
- Metadata extraction (CIK, CUSIP) from PDF content
- Graph creation in Neo4j with proper relationships
- Support for various retrieval methods (vector, keyword, graph, hybrid)
- Loading of company financial data and asset manager holdings from CSV files
- Python 3.8+
- Neo4j AuraDB instance or local Neo4j server
- OpenAI API key
- Required Python packages (see below)
- Clone this repository:

      git clone https://github.com/alisoncossette/GraphRAG_Ebook.git
      cd GraphRAG_Ebook

- Install required packages (the version specifiers are quoted so that the shell does not interpret `>` as output redirection):

      pip install "neo4j>=5.14.0" "neo4j-graphrag>=0.2.0" "langchain>=0.1.0" "python-dotenv>=1.0.0" "requests>=2.31.0" "openai>=1.12.0" "tqdm>=4.66.0" pypdf pdfplumber pandas

- Create a .env file in the root directory with the following content:

      NEO4J_URI=neo4j+s://your-instance-id.databases.neo4j.io
      NEO4J_USERNAME=neo4j
      NEO4J_PASSWORD=your-password
      OPENAI_API_KEY=your-openai-api-key
      PDF_PATH=data/form10k-clean_short
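As a rough illustration of how these settings might be consumed, the sketch below reads them from the environment and fails early when one is missing. The function name `load_settings`, and the decision to treat PDF_PATH as optional with the sample directory as its fallback, are assumptions made for this example and are not taken from graphrag_script.py.

```python
import os

# The four settings the .env file above is expected to provide.
REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY")


def load_settings(env=None):
    """Collect the settings from a mapping (os.environ by default).

    Hypothetical helper: raises RuntimeError listing every missing key.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_KEYS}
    # Assumption: PDF_PATH is optional and falls back to the sample directory.
    settings["PDF_PATH"] = env.get("PDF_PATH", "data/form10k-clean_short")
    return settings
```

With python-dotenv installed, calling `load_dotenv()` (imported from `dotenv`) before `load_settings()` copies the values from the .env file into the process environment.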
    usage: graphrag_script.py [-h] [--pdf PDF] [--dir DIR] [--query QUERY]
                              [--retriever {vector,keyword,graph,hybrid}]
                              [--load] [--load-companies] [--load-holdings]
                              [--use-lexical-graph]

    GraphRAG PDF Processing and Querying Tool

    options:
      -h, --help            show this help message and exit
      --pdf PDF             Path to a single PDF file to process
      --dir DIR             Directory containing PDF files to process
      --query QUERY         Query the knowledge graph
      --retriever {vector,keyword,graph,hybrid}
                            Retriever type to use for querying (default: hybrid)
      --load                Load all PDFs from the default directory
      --load-companies      Load company financials from CSV
      --load-holdings       Load asset manager holdings from CSV
      --use-lexical-graph   Use lexical graph builder instead of default

Example commands:

    python graphrag_script.py --pdf "data/form10k-clean_short/0000320193-23-000106.pdf"
    python graphrag_script.py --load
    python graphrag_script.py --load --use-lexical-graph
    python graphrag_script.py --load-companies
    python graphrag_script.py --load-holdings
    python graphrag_script.py --query "What was Apple's revenue in 2023?" --retriever hybrid

- PDF Loading: The script loads PDF documents using the PdfLoader component.
- Text Extraction: Text is extracted from the PDFs using PyPDF2 or pdfplumber.
- Metadata Extraction: CIK and CUSIP identifiers are extracted using regex patterns.
- Text Chunking: The extracted text is split into manageable chunks using FixedSizeSplitter.
- Embedding Generation: Text chunks are embedded using OpenAI embeddings.
- Graph Creation: The script creates Document, CIK, CUSIP, and Chunk nodes in Neo4j.
- Relationship Establishment: Relationships are created between nodes (e.g., Document-CIK, Document-CUSIP, Chunk-Document).
- CSV Data Loading: Company financial data and asset manager holdings can be loaded from CSV files.
- Querying: The knowledge graph can be queried using various retrieval methods.
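The metadata extraction and chunking steps above can be sketched in plain Python. This is a minimal illustration and not the script's code: the regex patterns, the function names, and the default sizes are assumptions, and `split_fixed` only mirrors the idea behind FixedSizeSplitter (fixed-length windows with overlap) without calling the library.

```python
import re

# Assumed patterns: a CIK is up to 10 digits; a CUSIP is 8 alphanumeric
# characters followed by a numeric check digit. Only the keyword itself
# is matched case-insensitively.
CIK_RE = re.compile(r"\b(?i:CIK)[:\s#]*(\d{1,10})\b")
CUSIP_RE = re.compile(r"\b(?i:CUSIP)[:\s#]*([0-9A-Z]{8}\d)\b")


def extract_metadata(text):
    """Return the first CIK and CUSIP found in the text, or None for each."""
    cik = CIK_RE.search(text)
    cusip = CUSIP_RE.search(text)
    return {
        # CIKs are conventionally zero-padded to 10 digits.
        "cik": cik.group(1).zfill(10) if cik else None,
        "cusip": cusip.group(1) if cusip else None,
    }


def split_fixed(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size characters that share
    `overlap` characters with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.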
- Document: Nodes representing PDF documents
- CIK: Nodes representing Central Index Key identifiers
- CUSIP: Nodes representing Committee on Uniform Securities Identification Procedures identifiers
- Chunk: Nodes containing text chunks with embeddings
- Company: Nodes representing companies with financial data
- AssetManager: Nodes representing asset management firms
- HOLDS: Relationships between asset managers and companies they invest in
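To make the schema above concrete, the following sketch shows Cypher statements that could create such nodes and relationships, wrapped in Python. Only the node labels and the HOLDS relationship come from the list above; the relationship names HAS_CIK and PART_OF, every property name, and the CSV column names (`manager`, `company`, `shares`) are hypothetical. The `driver` argument is assumed to be a Neo4j Python driver (5.x) instance, whose `execute_query` method accepts the query text and a `parameters_` dictionary.

```python
# Hypothetical relationship and property names; see the note above.
MERGE_DOCUMENT = """
MERGE (d:Document {path: $path})
MERGE (c:CIK {value: $cik})
MERGE (d)-[:HAS_CIK]->(c)
"""

MERGE_CHUNK = """
MATCH (d:Document {path: $path})
MERGE (k:Chunk {id: $chunk_id})
SET k.text = $text, k.embedding = $embedding
MERGE (k)-[:PART_OF]->(d)
"""

MERGE_HOLDING = """
MERGE (m:AssetManager {name: $manager})
MERGE (c:Company {name: $company})
MERGE (m)-[h:HOLDS]->(c)
SET h.shares = $shares
"""


def holding_params(row):
    """Turn one CSV row (a dict with assumed column names) into query parameters."""
    return {
        "manager": row["manager"].strip(),
        "company": row["company"].strip(),
        "shares": int(row["shares"]),
    }


def write_holding(driver, row):
    """Upsert one AssetManager-HOLDS->Company edge through the given driver."""
    return driver.execute_query(MERGE_HOLDING, parameters_=holding_params(row))
```

MERGE is used instead of CREATE so that reloading the same CSV file does not duplicate nodes or relationships.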
    # Neo4j Connection Details
    NEO4J_URI=neo4j+s://your-instance-id.databases.neo4j.io
    NEO4J_USERNAME=neo4j
    NEO4J_PASSWORD=your-secure-password

    # OpenAI API Key
    OPENAI_API_KEY=sk-your-openai-api-key

    # PDF Directory Path
    PDF_PATH=data/form10k-clean_short

This project is licensed under the MIT License - see the LICENSE file for details.