
IMDB Sentiment Analysis using LSTM

Deep Learning Assignment - KIIT University

A deep learning project that classifies movie reviews as positive or negative using LSTM (Long Short-Term Memory) neural networks.


Overview

This project implements a sentiment analysis model using Bidirectional LSTM networks to classify IMDB movie reviews. The model processes text sequences and predicts whether a review expresses positive or negative sentiment.

Key Features:

  • Binary sentiment classification (Positive/Negative)
  • Bidirectional LSTM architecture for better context understanding
  • Word embeddings for semantic representation
  • Comprehensive evaluation metrics
  • Visualization of training progress
  • Sample predictions on custom reviews

Requirements

Python Version

  • Python 3.8 or higher

Required Packages

```
tensorflow>=2.10.0
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
jupyter>=1.0.0
```

Hardware Recommendations

  • Minimum: 8GB RAM, CPU-based training (~30-45 minutes)
  • Recommended: 16GB RAM, GPU support for faster training (~5-10 minutes)

Installation

Step 1: Clone or Download the Project

cd /path/to/lstm-project

Step 2: Create a Virtual Environment (Recommended)

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
```

Step 3: Install Required Packages

```bash
# Install all dependencies
pip install tensorflow numpy pandas matplotlib seaborn scikit-learn jupyter

# Or if you have a requirements.txt:
pip install -r requirements.txt
```

Step 4: Verify Installation

```bash
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"
```

How to Run

Option 1: Jupyter Notebook (Recommended)

  1. Launch Jupyter Notebook:

    jupyter notebook
  2. Open the notebook:

    • Navigate to imdb_sentiment_analysis.ipynb
    • Click to open
  3. Run the cells:

    • Click "Cell" → "Run All" to execute all cells
    • Or run cells individually with Shift + Enter
  4. View outputs:

    • Training progress and metrics will be displayed inline
    • Plots and visualizations will appear below each cell

Option 2: JupyterLab

```bash
# Install JupyterLab if not already installed
pip install jupyterlab

# Launch JupyterLab
jupyter lab

# Open imdb_sentiment_analysis.ipynb
```

Option 3: VS Code with Jupyter Extension

  1. Open VS Code
  2. Install the Jupyter extension
  3. Open imdb_sentiment_analysis.ipynb
  4. Click "Run All" or run cells individually

Project Structure

```
lstm-project/
│
├── imdb_sentiment_analysis.ipynb   # Main Jupyter notebook with complete implementation
├── README.md                       # This file - project documentation
├── requirements.txt                # Python package dependencies (optional)
└── (Generated after running notebook)
    └── imdb_sentiment_lstm_model.h5  # Saved trained model
```

Model Architecture

Network Design

The model uses a Bidirectional LSTM architecture:

1. Embedding Layer (10,000 → 128)
   - Converts word indices to dense vectors
   - Vocabulary size: 10,000 most frequent words
   - Embedding dimension: 128
2. Dropout Layer (20%)
   - Regularization to prevent overfitting
3. Bidirectional LSTM Layer (128 units)
   - Processes sequences in both directions
   - Returns sequences for stacking
4. Dropout Layer (20%)
5. LSTM Layer (64 units)
   - Second recurrent layer for feature extraction
6. Dropout Layer (20%)
7. Dense Layer (64 units, ReLU)
   - Fully connected layer
8. Dropout Layer (20%)
9. Output Layer (1 unit, Sigmoid)
   - Binary classification output
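The stack above can be sketched in Keras roughly as follows (layer sizes are taken from the description; variable names are illustrative, and the notebook's exact code may differ):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000  # top-10k vocabulary
EMBED_DIM = 128      # embedding dimension

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),           # word indices -> dense vectors
    layers.Dropout(0.2),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Dropout(0.2),
    layers.LSTM(64),                                   # second recurrent layer
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),             # binary output in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The sigmoid output is a probability of the positive class; a 0.5 threshold turns it into a hard label.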

Model Hyperparameters

| Parameter | Value |
| --- | --- |
| Vocabulary Size | 10,000 words |
| Sequence Length | 250 words |
| Embedding Dimension | 128 |
| LSTM Units (Bi-LSTM) | 128 |
| LSTM Units (2nd layer) | 64 |
| Batch Size | 128 |
| Epochs | 10 (with early stopping) |
| Optimizer | Adam |
| Loss Function | Binary Cross-Entropy |

Training Strategy

  • Validation Split: 20% of training data
  • Early Stopping: Monitors validation loss (patience=3)
  • Learning Rate Reduction: Reduces LR on plateau (factor=0.5, patience=2)
  • Dropout: 20% dropout rate for regularization
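With the patience and factor values listed above, the early-stopping and learning-rate schedule can be expressed with standard Keras callbacks (a sketch; the notebook's exact arguments may differ):

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop when validation loss hasn't improved for 3 epochs,
    # restoring the best weights seen so far.
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    # Halve the learning rate after 2 stagnant epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]

# Passed to model.fit, e.g.:
# model.fit(x_train, y_train, validation_split=0.2,
#           batch_size=128, epochs=10, callbacks=callbacks)
```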

Results

Expected Performance

The model typically achieves:

  • Test Accuracy: 85-88%
  • Precision: 0.85-0.88
  • Recall: 0.85-0.88
  • F1-Score: 0.85-0.88

Outputs Included

  1. Dataset Exploration

    • Class distribution visualization
    • Sequence length distribution
    • Sample decoded reviews
  2. Training Curves

    • Accuracy over epochs (training & validation)
    • Loss over epochs (training & validation)
  3. Evaluation Metrics

    • Confusion matrix heatmap
    • Classification report (precision, recall, F1-score)
    • Detailed performance statistics
  4. Sample Predictions

    • Correct predictions with confidence scores
    • Incorrect predictions for error analysis
    • Custom review predictions

Dataset Information

IMDB Movie Review Dataset

  • Source: Keras Datasets / TensorFlow Datasets
  • Total Samples: 50,000 reviews
  • Training Set: 25,000 reviews
  • Test Set: 25,000 reviews
  • Classes: Binary (Positive/Negative)
  • Class Balance: 50% positive, 50% negative

Dataset Characteristics

  • Average Review Length: ~230 words
  • Vocabulary: 88,585 unique words (using top 10,000)
  • Format: Pre-tokenized integer sequences
  • Language: English

Preprocessing

  1. Tokenization: Words converted to integer indices
  2. Vocabulary Limiting: Top 10,000 most frequent words
  3. Padding: Sequences padded/truncated to 250 words
  4. Padding Strategy: Post-padding (zeros added at end)
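Steps 3-4 amount to a single `pad_sequences` call. A minimal sketch on toy data (the real dataset already arrives as integer sequences):

```python
from tensorflow.keras.utils import pad_sequences

# Toy already-tokenized reviews of different lengths
sequences = [[12, 7, 485, 3], [5, 91]]

# Pad/truncate to 250 tokens, zeros appended at the end (post-padding)
padded = pad_sequences(sequences, maxlen=250, padding="post", truncating="post")
print(padded.shape)  # (2, 250)
```

Note that Keras defaults to pre-padding; `padding="post"` is passed explicitly to match the strategy described above.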

Assignment Details

Tasks Completed

Dataset Loading

  • Loaded IMDB dataset via Keras
  • Explored dataset statistics and distribution
  • Displayed sample reviews
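The loading step is a one-liner against the bundled Keras dataset (the `num_words` value mirrors the 10,000-word vocabulary limit used here):

```python
from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequent words; rarer words map to the OOV index
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10_000)

print(len(x_train), len(x_test))  # 25000 25000
print(y_train[:5])                # labels are 0 (negative) or 1 (positive)
```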

Text Preprocessing

  • Tokenized text data (word to integer mapping)
  • Padded sequences to uniform length (250 words)
  • Created vocabulary of top 10,000 words

Model Building

  • Built Bidirectional LSTM model
  • Included embedding layer (128 dimensions)
  • Used sigmoid activation for binary classification

Training

  • Split data into train/validation/test sets
  • Compiled with binary cross-entropy loss
  • Used Adam optimizer
  • Trained for up to 10 epochs with early stopping

Evaluation

  • Evaluated on test set
  • Reported accuracy, loss, precision, recall, F1-score
  • Generated confusion matrix
  • Provided sample predictions
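The reported metrics come from scikit-learn. On a toy pair of arrays (stand-ins for the real test labels and thresholded model predictions), the calls look like:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Stand-ins for the real test labels and thresholded model outputs
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1-score :", f1_score(y_true, y_pred))         # 0.75
print(confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted
```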

Submission Checklist

  • ✅ Jupyter Notebook (.ipynb) with complete implementation
  • ✅ Proper comments explaining each step
  • ✅ Training/validation curves
  • ✅ Evaluation metrics and confusion matrix
  • ✅ Sample predictions
  • ✅ README file with setup instructions

Troubleshooting

Common Issues

1. TensorFlow Installation Fails

```bash
# Try installing the TensorFlow CPU version explicitly
pip install tensorflow-cpu
```

2. Out of Memory Error

```python
# Reduce batch size in the notebook
batch_size = 64  # Instead of 128
```

3. Jupyter Kernel Dies During Training

  • Close other applications
  • Reduce model complexity or batch size
  • Use CPU instead of GPU if GPU memory is limited

4. Missing Word Index

```python
# If word_index fails to load, the notebook handles this automatically.
# The dataset will still work for training and evaluation.
```
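When the word index does load, a review can be decoded back to text. Note that Keras reserves indices 0-2 for padding/start/unknown tokens, so the mapping must be offset by 3 (a sketch, not the notebook's exact code):

```python
from tensorflow.keras.datasets import imdb

word_index = imdb.get_word_index()  # word -> frequency rank (1-based)
# Dataset indices are shifted by 3: 0=<PAD>, 1=<START>, 2=<UNK>
index_to_word = {rank + 3: word for word, rank in word_index.items()}
index_to_word.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})

def decode_review(encoded):
    """Map an integer sequence back to (approximate) English text."""
    return " ".join(index_to_word.get(i, "<UNK>") for i in encoded)

(x_train, _), _ = imdb.load_data(num_words=10_000)
print(decode_review(x_train[0])[:60])  # each review begins with <START>
```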

Model Improvements (Future Work)

Potential enhancements to improve performance:

  1. Pre-trained Embeddings: Use GloVe or Word2Vec embeddings
  2. Attention Mechanism: Add attention layers for better focus
  3. Ensemble Methods: Combine multiple models
  4. Data Augmentation: Synonym replacement, back-translation
  5. Hyperparameter Tuning: Grid search or Bayesian optimization
  6. Transformer Models: Experiment with BERT or similar architectures


Author

Aniruddha Mukherjee, KIIT University (Deep Learning Assignment)


Submission Information

Submit to: sunil.sawantfcs@kiit.ac.in

Files to Submit:

  • imdb_sentiment_analysis.ipynb
  • README.md

License

This project is created for educational purposes as part of a university assignment.


Acknowledgments

  • KIIT University Faculty
  • TensorFlow/Keras Development Team
  • Andrew Maas et al. for the IMDB dataset
