This project provides a command-line interface to interact with a graph database representing Wikipedia's Main Topic Classification categories. It allows efficient querying and exploration of the category hierarchy.
Link to Wikipedia classifications
- Technologies Used
- Architecture
- Prerequisites
- Installation
- Setup
- Design and Implementation
- Usage
- Results
- Self-Evaluation
- Contributing
- Support
- License
- Neo4j (version 5.20.0)
- Python (version 3.12.3)
- Neo4j Database: Stores the graph data representing Wikipedia classifications.
- Python Scripts:
  - import_data.py: Imports data from CSV files into Neo4j.
  - utils.py: Provides utility functions for database operations.
  - goals.py: Defines functions for various database queries.
  - dbcli.py: Command-line interface for interacting with the database.
- Configuration:
  - config.py: Stores database connection details and other settings.
- Data Import: import_data.py processes and imports data from taxonomy_iw.csv.gz into Neo4j.
- Database Interaction: dbcli.py executes user commands using functions from goals.py.
- Utility Functions: utils.py manages database connections and query execution.
- Python 3.12.3
- Neo4j server
- Virtual environment
- Required Python packages: neo4j, pandas, tqdm (optional)
    sudo apt update && sudo apt upgrade -y
    wget -O - https://debian.neo4j.com/neotechnology.gpg.key | sudo apt-key add -
    echo 'deb https://debian.neo4j.com stable 4.x' | sudo tee /etc/apt/sources.list.d/neo4j.list
    sudo apt update
    sudo apt install neo4j -y

Verify installation:

    neo4j --version

    sudo apt install python3 python3-venv python3-pip -y
    python3 -m venv myenv
    source myenv/bin/activate
    pip install neo4j pandas tqdm

- Download taxonomy_iw.csv.gz to the project directory.
- Download project files: config.py, utils.py, import_data.py, goals.py, dbcli.py.
- Navigate to the project directory:
    cd dbproject
- Activate the virtual environment:
    source myenv/bin/activate
- Start Neo4j server:
    sudo systemctl enable neo4j
    sudo systemctl start neo4j
- Set up Neo4j browser:
  - Open http://localhost:7474 in your web browser.
  - Set a password for the default neo4j user.
- Update config.py with your Neo4j credentials.
- Import data:
    python import_data.py
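To illustrate the credentials step, config.py might look roughly like the sketch below. The variable names (`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`), the helper `connection_settings`, and the environment-variable fallback are all assumptions made for illustration, not the contents of the project's actual file. `bolt://localhost:7687` is the default Bolt address of a local Neo4j server.

```python
# config.py (hypothetical layout; adapt the names to the project's real file)
import os

# Default Bolt endpoint of a local Neo4j server.
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")

# Default user created by Neo4j.
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")

# Password chosen in the Neo4j browser. Reading it from the environment
# keeps the real value out of version control; "change-me" is a placeholder.
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "change-me")


def connection_settings():
    """Return (uri, (user, password)), the two values a driver needs to connect."""
    return NEO4J_URI, (NEO4J_USER, NEO4J_PASSWORD)
```

With settings in this shape, a connection helper could pass them to the official driver as `GraphDatabase.driver(uri, auth=auth)`.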
- Schema Design:
  - Nodes: Categories with a name property
  - Relationships: HAS_SUBCATEGORY between parent and child nodes
  - Unique constraint on the name property
  - Index on the name property for all nodes
- Data Import:
  - Batch processing with multi-threading (4 cores)
  - Error handling and retries
  - Progress tracking with tqdm
- Query Functions:
  - Implemented in goals.py
  - Use Cypher queries together with asynchronous processing for efficient graph traversal
  - Yield results for streaming
- Command Line Interface:
  - dbcli.py provides a user-friendly interface
  - Executes queries and streams results
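The schema and batched import described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in import_data.py: the `Category` label, the constraint name `category_name`, the row keys `parent` and `child`, and the helpers `batched` and `import_rows` are all hypothetical. Multi-threading and tqdm progress reporting are left out to keep the example short. The `session` argument is expected to behave like a session from the official neo4j Python driver, whose `run(query, **parameters)` returns a result with a `consume()` method.

```python
from itertools import islice

# Assumed label and constraint name; the design only states that `name` is unique.
CONSTRAINT_QUERY = (
    "CREATE CONSTRAINT category_name IF NOT EXISTS "
    "FOR (c:Category) REQUIRE c.name IS UNIQUE"
)

# One query per batch: UNWIND turns the list parameter into rows,
# MERGE makes the write idempotent so a retried batch creates no duplicates.
IMPORT_QUERY = """
UNWIND $rows AS row
MERGE (p:Category {name: row.parent})
MERGE (c:Category {name: row.child})
MERGE (p)-[:HAS_SUBCATEGORY]->(c)
"""


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def import_rows(session, rows, batch_size=1000, max_retries=3):
    """Send rows batch by batch, retrying a failed batch up to `max_retries` times."""
    imported = 0
    for batch in batched(rows, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                session.run(IMPORT_QUERY, rows=batch).consume()
                imported += len(batch)
                break
            except Exception:
                # Give up only after the last allowed attempt.
                if attempt == max_retries:
                    raise
    return imported
```

Running `session.run(CONSTRAINT_QUERY)` once before the import would create the uniqueness constraint; in Neo4j such a constraint is backed by an index on the same property.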
Activate the virtual environment and run:

    python dbcli.py <goal_number> [arguments]

Available goals:
- Find children of a node:
    python dbcli.py 1 <node_name>
- Count children of a node:
    python dbcli.py 2 <node_name>
- Find grandchildren of a node:
    python dbcli.py 3 <node_name>
- Find parents of a node:
    python dbcli.py 4 <node_name>
- Count parents of a node:
    python dbcli.py 5 <node_name>
- Find grandparents of a node:
    python dbcli.py 6 <node_name>
- Count unique nodes:
    python dbcli.py 7
- Find root nodes:
    python dbcli.py 8
- Find nodes with most children:
    python dbcli.py 9
- Find nodes with least children:
    python dbcli.py 10
- Rename a node:
    python dbcli.py 11 <old_name> <new_name>
- Find paths between nodes:
    python dbcli.py 12 <start_node> <end_node> [search_depth]
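One way a goal number could be mapped to a parameterized Cypher query is sketched below. It is an illustration only and may differ from how dbcli.py and goals.py are actually written: the `GOALS` table, the `build_request` helper, the `Category` label, and the exact Cypher text are assumptions, and only a few of the twelve goals are included.

```python
import argparse

# Hypothetical goal table: goal number -> (Cypher query, parameter names).
GOALS = {
    1: ("MATCH (:Category {name: $name})-[:HAS_SUBCATEGORY]->(c) "
        "RETURN c.name AS name", ["name"]),
    2: ("MATCH (:Category {name: $name})-[:HAS_SUBCATEGORY]->(c) "
        "RETURN count(c) AS count", ["name"]),
    3: ("MATCH (:Category {name: $name})-[:HAS_SUBCATEGORY*2]->(g) "
        "RETURN DISTINCT g.name AS name", ["name"]),
    4: ("MATCH (p)-[:HAS_SUBCATEGORY]->(:Category {name: $name}) "
        "RETURN p.name AS name", ["name"]),
    8: ("MATCH (n:Category) WHERE NOT ()-[:HAS_SUBCATEGORY]->(n) "
        "RETURN n.name AS name", []),
    11: ("MATCH (n:Category {name: $old_name}) SET n.name = $new_name "
         "RETURN n.name AS name", ["old_name", "new_name"]),
}


def build_request(argv):
    """Parse CLI arguments and return (query, parameters) for the chosen goal."""
    parser = argparse.ArgumentParser(prog="dbcli.py")
    parser.add_argument("goal", type=int, choices=sorted(GOALS))
    parser.add_argument("args", nargs="*")
    namespace = parser.parse_args(argv)

    query, param_names = GOALS[namespace.goal]
    if len(namespace.args) != len(param_names):
        # parser.error prints a usage message and exits with status 2.
        parser.error(
            f"goal {namespace.goal} expects {len(param_names)} argument(s): "
            + ", ".join(param_names)
        )
    return query, dict(zip(param_names, namespace.args))
```

Passing node names as query parameters instead of formatting them into the Cypher string avoids quoting problems with category names and lets the server reuse the query plan.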
Detailed query results can be found in the Results folder.
The initial implementation faced performance issues with complex queries. The following improvements were made:
- Search Depth Limit: Introduced a parameter to limit search depth, preventing exploration of irrelevant paths.
- Asynchronous Processing: Implemented parallel path-finding from child nodes of the start node to the end node.
Performance comparison for the query from "Centuries" to "2020s_anime_films":
The optimized version, with a default depth of 10 and asynchronous processing, identifies relevant paths quickly, balancing search depth against run time. A custom depth can be set for more extensive searches when time is not a constraint.
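The two ideas, a search depth limit and parallel searches started from each child of the start node, can be shown on a small in-memory graph. The sketch below is a stand-in for illustration and not the project's Cypher-based implementation: `paths_from`, `find_paths`, the dictionary graph, and the intermediate category names in the toy data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def paths_from(graph, node, end, depth, prefix):
    """Depth-limited DFS: yield paths from `node` to `end` using at most `depth` more edges."""
    path = prefix + [node]
    if node == end:
        yield path
        return
    if depth == 0:
        return
    for child in graph.get(node, ()):
        if child not in path:  # do not revisit nodes already on this path
            yield from paths_from(graph, child, end, depth - 1, path)


def find_paths(graph, start, end, search_depth=10, workers=4):
    """Search from every child of `start` in parallel, each with the remaining depth."""
    if start == end:
        return [[start]]
    if search_depth < 1:
        return []
    children = list(graph.get(start, ()))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            # One edge is spent going from `start` to the child.
            pool.submit(lambda c=c: list(paths_from(graph, c, end, search_depth - 1, [start])))
            for c in children
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results


# Toy data: only "Centuries" and "2020s_anime_films" come from the text above.
toy_graph = {
    "Centuries": ["21st_century", "20th_century"],
    "21st_century": ["2020s"],
    "2020s": ["2020s_anime_films"],
    "20th_century": [],
}
print(find_paths(toy_graph, "Centuries", "2020s_anime_films"))
```

In this pure-Python form the threads mainly demonstrate the structure. The parallelism pays off when each branch is a separate database query, so that the time spent waiting on Neo4j overlaps across branches.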
Contributions to improve this project are welcome. Please submit pull requests or open issues in the project repository.
For assistance or inquiries, please open an issue in the project's issue tracker or contact Pritam.Chakraborty1@outlook.com.
This project is licensed under the MIT License. See the LICENSE file for details.
