This repository contains a PyTorch implementation of the Skip-gram with Negative Sampling (SGNS) model from scratch, trained on the enwik8 dataset (a cleaned Wikipedia corpus).
The goal is to learn distributed word representations (embeddings) that capture semantic relationships. The implementation follows the original Word2Vec formulation by Mikolov et al. (2013).
Key Features:
- Model: Skip-gram with Negative Sampling.
- Dataset: enwik8 (cleaned English Wikipedia).
- Training: Mini-batch training with the Adam optimizer.
- Evaluation: Cosine similarity comparison with pretrained Gensim (Google News) vectors, word analogy tasks, and bias detection.
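The SGNS objective implemented in `src/model.py` is not reproduced here; as an illustrative sketch (the repository's actual `SkipGramNS` may differ in naming and layout), a minimal skip-gram module with negative sampling could look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Minimal sketch of skip-gram with negative sampling (hypothetical
    re-creation; the repo's src/model.py may be organized differently)."""

    def __init__(self, vocab_size: int, embed_dim: int = 100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, embed_dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, K)
        v = self.in_embed(center)                                # (B, D)
        u_pos = self.out_embed(context)                          # (B, D)
        u_neg = self.out_embed(negatives)                        # (B, K, D)

        pos_score = torch.sum(v * u_pos, dim=1)                  # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)  # (B, K)

        # SGNS loss: -log sigma(u_pos . v) - sum_k log sigma(-u_neg_k . v)
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=1))
        return loss.mean()
```

Each word gets two vectors (input and output); typically the input embeddings are the ones exported after training.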
```
.
├── data/
│   ├── enwik8/
│   │   └── enwik8              # Raw input data (wiki text)
│   └── wiki_clean.txt          # Preprocessed text file (generated by preprocess.py)
├── results/                    # Stores trained embeddings, logs, and evaluation results
│   ├── embeddings.pt
│   ├── idx2word.pkl
│   ├── word2idx.pkl
│   └── cosine_similarity.txt
├── src/
│   ├── analogy.py              # Word analogy evaluation script
│   ├── bias.py                 # Bias detection script
│   ├── dataset.py              # Dataset loading, vocabulary building, and negative sampling distribution
│   ├── evaluate_similarity.py  # Cosine similarity evaluation vs Gensim
│   ├── model.py                # SkipGramNS PyTorch model definition
│   ├── preprocess.py           # Data cleaning script (lowercasing, removing bad chars)
│   └── train.py                # Main training loop
├── test/
│   ├── test_dataset.py         # Unit tests for dataset
│   └── test_model.py           # Unit tests for model
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
└── Report.pdf                  # Detailed project report
```

- Clone the repository (if not already present).
- Install dependencies:
```shell
pip install -r requirements.txt
```
The raw data is expected to be in data/enwik8/enwik8. Run the preprocessing script to clean the text and generate data/wiki_clean.txt.
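The exact cleaning rules live in `src/preprocess.py`; as a rough sketch of "lowercasing, removing bad chars" (the regex below is an assumption, not necessarily the repository's actual filter):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase and keep only letters and whitespace (illustrative rule;
    the real preprocess.py may apply a different character filter)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # replace non-letter characters with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
```

For example, `clean_text("Hello, World! 123")` yields `"hello world"`.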
```shell
python src/preprocess.py
```

Train the Skip-gram model. This will generate embeddings and save them to the results/ directory.
```shell
python src/train.py
```

Hyperparameters (defaults in `train.py`):
- Embedding Dimension: 100
- Window Size: 5
- Negative Samples: 5
- Epochs: 3
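`src/dataset.py` builds the negative-sampling distribution. In the original Word2Vec formulation this is the unigram distribution raised to the 3/4 power, which boosts rarer words; assuming the repository follows that standard choice, the idea can be sketched as:

```python
import numpy as np

def negative_sampling_dist(word_counts: np.ndarray, power: float = 0.75) -> np.ndarray:
    """Unigram distribution raised to the 3/4 power, as in Mikolov et al. (2013).
    word_counts[i] is the corpus frequency of vocabulary index i."""
    probs = word_counts.astype(np.float64) ** power
    return probs / probs.sum()

def sample_negatives(probs: np.ndarray, batch_size: int, k: int, rng=None) -> np.ndarray:
    """Draw k negative word indices per example from the smoothed distribution."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(probs), size=(batch_size, k), p=probs)
```

The 0.75 exponent flattens the distribution so frequent words are sampled less often than their raw counts would dictate.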
After training, you can run the evaluation scripts.
Cosine Similarity Comparison: Compares learned embeddings with pretrained Google News vectors using gensim. Note: This script will download the word2vec-google-news-300 model (~1.6GB) if not already present via gensim.downloader.
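Because the learned vectors (100-d) and the Google News vectors (300-d) live in different spaces, the comparison is between similarity *scores* for the same word pair under each model, not between the vectors themselves. The core computation, shown here without the gensim download, is plain cosine similarity:

```python
import torch
import torch.nn.functional as F

def cosine_similarity(vec_a: torch.Tensor, vec_b: torch.Tensor) -> float:
    """Cosine similarity between two word vectors from the SAME embedding
    space; sim(w1, w2) under our model is then compared against
    sim(w1, w2) under the pretrained Google News model."""
    return F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0)).item()
```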
```shell
python src/evaluate_similarity.py
```

Word Analogy: Evaluates vector arithmetic (e.g., King - Man + Woman = Queen).
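The analogy test solves a : b :: c : ? by finding the word nearest to b − a + c, excluding the three input words. A sketch of that lookup (function and variable names are assumptions; `embeddings` and `word2idx` correspond to the files saved in `results/`):

```python
import torch
import torch.nn.functional as F

def analogy(embeddings: torch.Tensor, word2idx: dict, a: str, b: str, c: str, topk: int = 1):
    """Solve a : b :: c : ? by nearest cosine neighbor to (b - a + c).
    embeddings: (V, D) matrix; word2idx: word -> row index mapping."""
    query = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    sims = F.cosine_similarity(query.unsqueeze(0), embeddings)  # (V,)
    for w in (a, b, c):                 # exclude the input words from candidates
        sims[word2idx[w]] = -float("inf")
    idx2word = {i: w for w, i in word2idx.items()}
    return [idx2word[i] for i in sims.topk(topk).indices.tolist()]
```

With real embeddings, `analogy(emb, w2i, "man", "king", "woman")` should rank "queen" highly.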
```shell
python src/analogy.py
```

Bias Detection: Analyzes social biases in the learned embeddings (e.g., gender bias in professions).
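The exact method in `src/bias.py` is not shown here; a common approach, sketched below as an assumption, projects profession vectors onto a gender direction such as he − she:

```python
import torch

def gender_bias_score(embeddings: torch.Tensor, word2idx: dict, word: str,
                      pole_a: str = "he", pole_b: str = "she") -> float:
    """Signed projection of `word` onto the (pole_a - pole_b) direction.
    Positive values lean toward pole_a, negative toward pole_b. This is an
    illustrative measure, not necessarily the one in src/bias.py."""
    direction = embeddings[word2idx[pole_a]] - embeddings[word2idx[pole_b]]
    direction = direction / direction.norm()
    return torch.dot(embeddings[word2idx[word]], direction).item()
```

Scoring a list of profession words this way surfaces which ones the embeddings associate more with one gender pole.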
```shell
python src/bias.py
```

After training, the embeddings are saved as `results/embeddings.pt` along with the vocabulary mappings (`word2idx.pkl`, `idx2word.pkl`). The model captures clear semantic relationships despite being trained on a corpus far smaller than those used for industrial-scale models.