
IMDB Sentiment Analysis using LSTM

Deep Learning Assignment - KIIT University

A deep learning project that classifies movie reviews as positive or negative using LSTM (Long Short-Term Memory) neural networks.


Overview

This project implements a sentiment analysis model using Bidirectional LSTM networks to classify IMDB movie reviews. The model processes text sequences and predicts whether a review expresses positive or negative sentiment.

Key Features:

  • Binary sentiment classification (Positive/Negative)
  • Bidirectional LSTM architecture for better context understanding
  • Word embeddings for semantic representation
  • Comprehensive evaluation metrics
  • Visualization of training progress
  • Sample predictions on custom reviews

Requirements

Python Version

  • Python 3.8 or higher

Required Packages

```
tensorflow>=2.10.0
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
jupyter>=1.0.0
```

Hardware Recommendations

  • Minimum: 8GB RAM, CPU-based training (~30-45 minutes)
  • Recommended: 16GB RAM, GPU support for faster training (~5-10 minutes)

Installation

Step 1: Clone or Download the Project

cd /path/to/lstm-project

Step 2: Create a Virtual Environment (Recommended)

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
```

Step 3: Install Required Packages

```bash
# Install all dependencies
pip install tensorflow numpy pandas matplotlib seaborn scikit-learn jupyter

# Or if you have a requirements.txt:
pip install -r requirements.txt
```

Step 4: Verify Installation

```bash
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"
```

How to Run

Option 1: Jupyter Notebook (Recommended)

  1. Launch Jupyter Notebook:

    jupyter notebook
  2. Open the notebook:

    • Navigate to imdb_sentiment_analysis.ipynb
    • Click to open
  3. Run the cells:

    • Click "Cell" → "Run All" to execute all cells
    • Or run cells individually with Shift + Enter
  4. View outputs:

    • Training progress and metrics will be displayed inline
    • Plots and visualizations will appear below each cell

Option 2: JupyterLab

```bash
# Install JupyterLab if not already installed
pip install jupyterlab

# Launch JupyterLab
jupyter lab

# Open imdb_sentiment_analysis.ipynb
```

Option 3: VS Code with Jupyter Extension

  1. Open VS Code
  2. Install the Jupyter extension
  3. Open imdb_sentiment_analysis.ipynb
  4. Click "Run All" or run cells individually

Project Structure

```
lstm-project/
│
├── imdb_sentiment_analysis.ipynb   # Main Jupyter notebook with complete implementation
├── README.md                       # This file - project documentation
├── requirements.txt                # Python package dependencies (optional)
└── (Generated after running notebook)
    └── imdb_sentiment_lstm_model.h5  # Saved trained model
```

Model Architecture

Network Design

The model uses a Bidirectional LSTM architecture:

1. Embedding Layer (10,000 → 128)
   - Converts word indices to dense vectors
   - Vocabulary size: 10,000 most frequent words
   - Embedding dimension: 128
2. Dropout Layer (20%)
   - Regularization to prevent overfitting
3. Bidirectional LSTM Layer (128 units)
   - Processes sequences in both directions
   - Returns sequences for stacking
4. Dropout Layer (20%)
5. LSTM Layer (64 units)
   - Second recurrent layer for feature extraction
6. Dropout Layer (20%)
7. Dense Layer (64 units, ReLU)
   - Fully connected layer
8. Dropout Layer (20%)
9. Output Layer (1 unit, Sigmoid)
   - Binary classification output
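The stack above can be sketched in Keras roughly as follows (layer sizes are taken from the description; variable names are illustrative, and the notebook's exact code may differ):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000  # top-10k vocabulary
EMBED_DIM = 128      # embedding dimension

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),           # word indices -> dense vectors
    layers.Dropout(0.2),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Dropout(0.2),
    layers.LSTM(64),                                   # second recurrent layer
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),             # binary output in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The sigmoid output is a probability of the positive class; a 0.5 threshold turns it into a hard label.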

Model Hyperparameters

| Parameter | Value |
| --- | --- |
| Vocabulary Size | 10,000 words |
| Sequence Length | 250 words |
| Embedding Dimension | 128 |
| LSTM Units (Bi-LSTM) | 128 |
| LSTM Units (2nd layer) | 64 |
| Batch Size | 128 |
| Epochs | 10 (with early stopping) |
| Optimizer | Adam |
| Loss Function | Binary Cross-Entropy |

Training Strategy

  • Validation Split: 20% of training data
  • Early Stopping: Monitors validation loss (patience=3)
  • Learning Rate Reduction: Reduces LR on plateau (factor=0.5, patience=2)
  • Dropout: 20% dropout rate for regularization
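With the patience and factor values listed above, the early-stopping and learning-rate schedule can be expressed with standard Keras callbacks (a sketch; the notebook's exact arguments may differ):

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop when validation loss hasn't improved for 3 epochs,
    # restoring the best weights seen so far.
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    # Halve the learning rate after 2 stagnant epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]

# Passed to model.fit, e.g.:
# model.fit(x_train, y_train, validation_split=0.2,
#           batch_size=128, epochs=10, callbacks=callbacks)
```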

Results

Expected Performance

The model typically achieves:

  • Test Accuracy: 85-88%
  • Precision: 0.85-0.88
  • Recall: 0.85-0.88
  • F1-Score: 0.85-0.88

Outputs Included

  1. Dataset Exploration

    • Class distribution visualization
    • Sequence length distribution
    • Sample decoded reviews
  2. Training Curves

    • Accuracy over epochs (training & validation)
    • Loss over epochs (training & validation)
  3. Evaluation Metrics

    • Confusion matrix heatmap
    • Classification report (precision, recall, F1-score)
    • Detailed performance statistics
  4. Sample Predictions

    • Correct predictions with confidence scores
    • Incorrect predictions for error analysis
    • Custom review predictions

Dataset Information

IMDB Movie Review Dataset

  • Source: Keras Datasets / TensorFlow Datasets
  • Total Samples: 50,000 reviews
  • Training Set: 25,000 reviews
  • Test Set: 25,000 reviews
  • Classes: Binary (Positive/Negative)
  • Class Balance: 50% positive, 50% negative

Dataset Characteristics

  • Average Review Length: ~230 words
  • Vocabulary: 88,585 unique words (using top 10,000)
  • Format: Pre-tokenized integer sequences
  • Language: English

Preprocessing

  1. Tokenization: Words converted to integer indices
  2. Vocabulary Limiting: Top 10,000 most frequent words
  3. Padding: Sequences padded/truncated to 250 words
  4. Padding Strategy: Post-padding (zeros added at end)
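Steps 3-4 amount to a single `pad_sequences` call. A minimal sketch on toy data (the real dataset already arrives as integer sequences):

```python
from tensorflow.keras.utils import pad_sequences

# Toy already-tokenized reviews of different lengths
sequences = [[12, 7, 485, 3], [5, 91]]

# Pad/truncate to 250 tokens, zeros appended at the end (post-padding)
padded = pad_sequences(sequences, maxlen=250, padding="post", truncating="post")
print(padded.shape)  # (2, 250)
```

Note that Keras defaults to pre-padding; `padding="post"` is passed explicitly to match the strategy described above.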

Assignment Details

Tasks Completed

Dataset Loading

  • Loaded IMDB dataset via Keras
  • Explored dataset statistics and distribution
  • Displayed sample reviews
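The loading step is a one-liner against the bundled Keras dataset (the `num_words` value mirrors the 10,000-word vocabulary limit used here):

```python
from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequent words; rarer words map to the OOV index
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10_000)

print(len(x_train), len(x_test))  # 25000 25000
print(y_train[:5])                # labels are 0 (negative) or 1 (positive)
```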

Text Preprocessing

  • Tokenized text data (word to integer mapping)
  • Padded sequences to uniform length (250 words)
  • Created vocabulary of top 10,000 words

Model Building

  • Built Bidirectional LSTM model
  • Included embedding layer (128 dimensions)
  • Used sigmoid activation for binary classification

Training

  • Split data into train/validation/test sets
  • Compiled with binary cross-entropy loss
  • Used Adam optimizer
  • Trained for up to 10 epochs with early stopping

Evaluation

  • Evaluated on test set
  • Reported accuracy, loss, precision, recall, F1-score
  • Generated confusion matrix
  • Provided sample predictions
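The reported metrics come from scikit-learn. On a toy pair of arrays (stand-ins for the real test labels and thresholded model predictions), the calls look like:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Stand-ins for the real test labels and thresholded model outputs
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1-score :", f1_score(y_true, y_pred))         # 0.75
print(confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted
```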

Submission Checklist

  • ✅ Jupyter Notebook (.ipynb) with complete implementation
  • ✅ Proper comments explaining each step
  • ✅ Training/validation curves
  • ✅ Evaluation metrics and confusion matrix
  • ✅ Sample predictions
  • ✅ README file with setup instructions

Troubleshooting

Common Issues

1. TensorFlow Installation Fails

```bash
# Try installing the TensorFlow CPU version explicitly
pip install tensorflow-cpu
```

2. Out of Memory Error

```python
# Reduce batch size in the notebook
batch_size = 64  # Instead of 128
```

3. Jupyter Kernel Dies During Training

  • Close other applications
  • Reduce model complexity or batch size
  • Use CPU instead of GPU if GPU memory is limited

4. Missing Word Index

```python
# If word_index fails to load, the notebook handles this automatically.
# The dataset will still work for training and evaluation.
```
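When the word index does load, a review can be decoded back to text. Note that Keras reserves indices 0-2 for padding/start/unknown tokens, so the mapping must be offset by 3 (a sketch, not the notebook's exact code):

```python
from tensorflow.keras.datasets import imdb

word_index = imdb.get_word_index()  # word -> frequency rank (1-based)
# Dataset indices are shifted by 3: 0=<PAD>, 1=<START>, 2=<UNK>
index_to_word = {rank + 3: word for word, rank in word_index.items()}
index_to_word.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})

def decode_review(encoded):
    """Map an integer sequence back to (approximate) English text."""
    return " ".join(index_to_word.get(i, "<UNK>") for i in encoded)

(x_train, _), _ = imdb.load_data(num_words=10_000)
print(decode_review(x_train[0])[:60])  # each review begins with <START>
```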

Model Improvements (Future Work)

Potential enhancements to improve performance:

  1. Pre-trained Embeddings: Use GloVe or Word2Vec embeddings
  2. Attention Mechanism: Add attention layers for better focus
  3. Ensemble Methods: Combine multiple models
  4. Data Augmentation: Synonym replacement, back-translation
  5. Hyperparameter Tuning: Grid search or Bayesian optimization
  6. Transformer Models: Experiment with BERT or similar architectures


Author

Aniruddha Mukherjee, KIIT University (Deep Learning Assignment)


Submission Information

Submit to: sunil.sawantfcs@kiit.ac.in

Files to Submit:

  • imdb_sentiment_analysis.ipynb
  • README.md

License

This project is created for educational purposes as part of a university assignment.


Acknowledgments

  • KIIT University Faculty
  • TensorFlow/Keras Development Team
  • Andrew Maas et al. for the IMDB dataset
