Wikipedia Graph Database Project

Introduction

This project provides a command-line interface to interact with a graph database representing Wikipedia's Main Topic Classification categories. It allows efficient querying and exploration of the category hierarchy.

Link to Wikipedia classifications

Technologies Used

Neo4j (version 5.20.0)
Python (version 3.12.3)

Architecture

Components

Neo4j Database: Stores the graph data representing Wikipedia classifications.
Python Scripts:
- import_data.py: Imports data from CSV files into Neo4j.
- utils.py: Provides utility functions for database operations.
- goals.py: Defines functions for various database queries.
- dbcli.py: Command-line interface for interacting with the database.
Configuration:
- config.py: Stores database connection details and other settings.

Data Flow

Data Import: import_data.py processes and imports data from taxonomy_iw.csv.gz into Neo4j.
Database Interaction: dbcli.py executes user commands using functions from goals.py.
Utility Functions: utils.py manages database connections and query execution.
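
The utility layer described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the project's actual utils.py: the name stream_query is hypothetical, and the only driver behaviour relied on is session() as a context manager, session.run() accepting keyword parameters, and record.data() from the official neo4j Python driver.

```python
from typing import Any, Dict, Iterator


def stream_query(driver: Any, cypher: str, **params: Any) -> Iterator[Dict[str, Any]]:
    """Run one Cypher statement and yield each record as a plain dict.

    `driver` is expected to behave like a neo4j.Driver. This helper is a
    hypothetical stand-in for whatever utils.py really provides.
    """
    with driver.session() as session:
        # Records are consumed lazily, so a caller such as the CLI can
        # start printing before the full result set has been read.
        for record in session.run(cypher, **params):
            yield record.data()
```

With the real driver, the first argument would typically come from GraphDatabase.driver(uri, auth=(user, password)), using the values stored in config.py.
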

Prerequisites

Python 3.12.3
Neo4j server
Virtual environment
Required Python packages: neo4j, pandas, tqdm (optional)

Installation

Installing Neo4j (Ubuntu)

sudo apt update && sudo apt upgrade -y
wget -O - https://debian.neo4j.com/neotechnology.gpg.key | sudo apt-key add -
echo 'deb https://debian.neo4j.com stable 5' | sudo tee /etc/apt/sources.list.d/neo4j.list
sudo apt update
sudo apt install neo4j -y

Verify installation: neo4j --version

Installing Python and Setting Up Environment

sudo apt install python3 python3-venv python3-pip -y
python3 -m venv myenv
source myenv/bin/activate
pip install neo4j pandas tqdm

Setup

Download taxonomy_iw.csv.gz to the project directory.
Download project files: config.py, utils.py, import_data.py, goals.py, dbcli.py.
Navigate to the project directory: cd dbproject
Activate the virtual environment: source myenv/bin/activate

Start Neo4j server:

sudo systemctl enable neo4j
sudo systemctl start neo4j

Set up Neo4j browser:
- Open http://localhost:7474 in your web browser.
- Set a password for the default neo4j user.
Update config.py with your Neo4j credentials.
Import data: python import_data.py
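
For the config.py step, the following is a minimal sketch of what such a settings module might look like. The class name, function name, and environment variable names are assumptions made for illustration; the project's real config.py may simply hold constants. The default URI uses Neo4j's standard Bolt port, 7687.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Neo4jSettings:
    uri: str
    user: str
    password: str


def load_settings(env: Mapping[str, str] = os.environ) -> Neo4jSettings:
    """Read connection settings, falling back to local defaults.

    The NEO4J_* variable names are hypothetical, not taken from the project.
    """
    return Neo4jSettings(
        uri=env.get("NEO4J_URI", "bolt://localhost:7687"),  # standard Bolt port
        user=env.get("NEO4J_USER", "neo4j"),                # default user name
        password=env.get("NEO4J_PASSWORD", ""),             # set in the browser step
    )
```

Reading the password from the environment keeps it out of version control, which may or may not matter for a local setup.
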

Design and Implementation

Schema Design:
- Nodes: Categories with name property
- Relationships: HAS_SUBCATEGORY between parent-child nodes
- Unique constraint on name property
- Index on name property for all nodes
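
As a rough illustration of the schema above, the uniqueness rule could be expressed in Neo4j 5 Cypher as shown below. The Category label, the constraint name, and the apply_schema helper are assumptions; the source only states that nodes carry a name property. In Neo4j, a uniqueness constraint is backed by an index on the same property, so lookups by name benefit from it as well.

```python
from typing import Callable, List

# Assumed label `Category` and assumed constraint name.
UNIQUE_NAME_CONSTRAINT = (
    "CREATE CONSTRAINT category_name_unique IF NOT EXISTS "
    "FOR (c:Category) REQUIRE c.name IS UNIQUE"
)


def apply_schema(run: Callable[[str], object]) -> List[str]:
    """Send each schema statement to `run` and return what was sent.

    `run` is any callable that executes one Cypher string, for example
    a thin wrapper around a driver session (hypothetical here).
    """
    statements = [UNIQUE_NAME_CONSTRAINT]
    for statement in statements:
        run(statement)
    return statements
```
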
Data Import:
- Batch processing with multi-threading (4 cores)
- Error handling and retries
- Progress tracking with tqdm
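
The import strategy above (batches, four workers, retries) might look roughly like the sketch below. It is not the project's import_data.py: the function names, the batch size, the retry count, and the Cypher text are all assumptions, and write_batch stands in for a function that would send one batch to Neo4j. Retries matter in practice because concurrent writers that touch the same nodes can hit transient lock errors.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

Edge = Tuple[str, str]  # (parent, child)

# Assumed Cypher for one batch; the `Category` label is a guess.
BATCH_MERGE = (
    "UNWIND $rows AS row "
    "MERGE (p:Category {name: row.parent}) "
    "MERGE (c:Category {name: row.child}) "
    "MERGE (p)-[:HAS_SUBCATEGORY]->(c)"
)


def chunked(items: Iterable[Edge], size: int) -> Iterator[List[Edge]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_with_retry(write_batch: Callable[[List[Edge]], None],
                     batch: List[Edge], attempts: int = 3,
                     delay: float = 0.0) -> int:
    """Call write_batch, retrying on any exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            write_batch(batch)
            return len(batch)
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)
    return 0  # only reached when attempts < 1


def import_edges(edges: Iterable[Edge],
                 write_batch: Callable[[List[Edge]], None],
                 batch_size: int = 1000, workers: int = 4) -> int:
    """Write all edges in batches across a thread pool; return the edge count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda b: write_with_retry(write_batch, b),
                          chunked(edges, batch_size))
        return sum(counts)
```
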
Query Functions:
- Implemented in goals.py
- Use Cypher queries with asynchronous processing for efficient graph traversal
- Yield results for streaming
Command Line Interface:
- dbcli.py provides a user-friendly interface
- Executes queries and streams results
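
To make the query layer more concrete, here is a sketch of how the parent and child lookups could be phrased in Cypher and streamed from a generator. The query text, the Category label, and the names helper are assumptions rather than the contents of goals.py; only the HAS_SUBCATEGORY relationship and the name property come from the schema described above.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

# Direct children: one HAS_SUBCATEGORY hop away from the named node.
CHILDREN = (
    "MATCH (:Category {name: $name})-[:HAS_SUBCATEGORY]->(c) "
    "RETURN c.name AS name"
)
# Grandchildren: exactly two hops; DISTINCT removes nodes reached twice.
GRANDCHILDREN = (
    "MATCH (:Category {name: $name})-[:HAS_SUBCATEGORY*2]->(g) "
    "RETURN DISTINCT g.name AS name"
)
# Parents: the same relationship, followed in the opposite direction.
PARENTS = (
    "MATCH (p)-[:HAS_SUBCATEGORY]->(:Category {name: $name}) "
    "RETURN p.name AS name"
)

Runner = Callable[..., Iterable[Dict[str, Any]]]


def names(run: Runner, cypher: str, name: str) -> Iterator[str]:
    """Yield the `name` column of each row as it arrives.

    `run` is any callable that executes Cypher with keyword parameters
    and returns rows as dicts (hypothetical interface).
    """
    for row in run(cypher, name=name):
        yield row["name"]
```
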

Usage

Activate the virtual environment and run:

python dbcli.py <goal_number> [arguments]

Available goals:

Find children of a node: python dbcli.py 1 <node_name>
Count children of a node: python dbcli.py 2 <node_name>
Find grandchildren of a node: python dbcli.py 3 <node_name>
Find parents of a node: python dbcli.py 4 <node_name>
Count parents of a node: python dbcli.py 5 <node_name>
Find grandparents of a node: python dbcli.py 6 <node_name>
Count unique nodes: python dbcli.py 7
Find root nodes: python dbcli.py 8
Find nodes with most children: python dbcli.py 9
Find nodes with least children: python dbcli.py 10
Rename a node: python dbcli.py 11 <old_name> <new_name>
Find paths between nodes: python dbcli.py 12 <start_node> <end_node> [search_depth]
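
The goal list above maps naturally onto a small dispatch table. The sketch below shows one way the argument handling could work with argparse; it is a guess at the shape of dbcli.py, and the GOALS table and parse_cli function are hypothetical names. Only the goal numbers, descriptions, and argument names are taken from the list.

```python
import argparse
from typing import Dict, Sequence, Tuple

# goal number -> (description, required argument names, optional argument names)
GOALS: Dict[int, Tuple[str, Tuple[str, ...], Tuple[str, ...]]] = {
    1: ("Find children of a node", ("node_name",), ()),
    2: ("Count children of a node", ("node_name",), ()),
    3: ("Find grandchildren of a node", ("node_name",), ()),
    4: ("Find parents of a node", ("node_name",), ()),
    5: ("Count parents of a node", ("node_name",), ()),
    6: ("Find grandparents of a node", ("node_name",), ()),
    7: ("Count unique nodes", (), ()),
    8: ("Find root nodes", (), ()),
    9: ("Find nodes with most children", (), ()),
    10: ("Find nodes with least children", (), ()),
    11: ("Rename a node", ("old_name", "new_name"), ()),
    12: ("Find paths between nodes", ("start_node", "end_node"), ("search_depth",)),
}


def parse_cli(argv: Sequence[str]) -> Tuple[int, Dict[str, str]]:
    """Validate `<goal_number> [arguments]` and return (goal, named arguments)."""
    parser = argparse.ArgumentParser(prog="dbcli.py")
    parser.add_argument("goal", type=int, choices=sorted(GOALS))
    parser.add_argument("args", nargs="*")
    ns = parser.parse_args(argv)

    _, required, optional = GOALS[ns.goal]
    if not len(required) <= len(ns.args) <= len(required) + len(optional):
        expected = " ".join([f"<{a}>" for a in required] + [f"[{a}]" for a in optional])
        parser.error(f"goal {ns.goal} expects: {expected}")
    # zip stops at the shorter sequence, so missing optional args are omitted.
    return ns.goal, dict(zip(required + optional, ns.args))
```

Called as parse_cli(sys.argv[1:]), this returns the goal number and a dict of named arguments that a dispatcher could pass to the matching function in goals.py.
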

Results

Detailed query results can be found in the Results folder.

Self-Evaluation

Optimization of Goal 12 (Find all paths between two given nodes)

The initial implementation performed poorly on complex queries. Two improvements were made:

Search Depth Limit: Introduced a parameter to limit search depth, preventing exploration of irrelevant paths.
Asynchronous Processing: Implemented parallel path-finding from child nodes of the start node to the end node.

Performance comparison for the query from "Centuries" to "2020s_anime_films": see Optimization-Comparison.png in the repository root.

The optimized version, with a default depth of 10 and asynchronous processing, identifies relevant paths quickly, balancing search depth against run time. A custom depth can be set for more extensive searches when time is not a constraint.
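
The two optimizations can be illustrated on an in-memory graph. This sketch is not the project's implementation, which queries Neo4j: it uses a plain adjacency dict and a thread pool as a stand-in for whatever asynchronous mechanism goals.py uses, and the function names are hypothetical. It shows the idea of capping the number of edges explored and starting one independent search per child of the start node.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List

Graph = Dict[str, List[str]]  # parent name -> child names


def paths_from(graph: Graph, node: str, end: str, depth: int,
               prefix: List[str]) -> Iterator[List[str]]:
    """Yield simple paths from `node` to `end` using at most `depth` more edges."""
    path = prefix + [node]
    if node == end:
        yield path
        return
    if depth == 0:
        return  # depth limit reached: stop exploring this branch
    for child in graph.get(node, []):
        if child not in path:  # skip nodes already on the path (no cycles)
            yield from paths_from(graph, child, end, depth - 1, path)


def find_paths(graph: Graph, start: str, end: str,
               max_depth: int = 10, workers: int = 4) -> List[List[str]]:
    """Search from each child of `start` in parallel, then merge the results."""
    children = [c for c in graph.get(start, []) if c != start]
    if max_depth < 1 or not children:
        return []

    def search(child: str) -> List[List[str]]:
        # One edge is already used to reach the child.
        return list(paths_from(graph, child, end, max_depth - 1, [start]))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search, children)  # keeps the order of `children`
        return [path for sub in results for path in sub]
```

In pure Python the threads bring little speed-up, since the work is CPU-bound. The split pays off when each search is a database query that spends most of its time waiting on the server.
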

Contributing

Contributions to improve this project are welcome. Please submit pull requests or open issues in the project repository.

Support

For assistance or inquiries, please open an issue in the project's issue tracker or contact Pritam.Chakraborty1@outlook.com.

License

This project is licensed under the MIT License. See the LICENSE file for details.
