Sentiment Analysis on Movie Reviews

Overview
This project analyzes sentiment (positive or negative) in the IMDB dataset of 50,000 movie reviews. The workflow covers text cleaning, preprocessing, feature engineering, and visualization to uncover patterns in the data. Using techniques such as Bag of Words (BoW) and PCA, the study shows how natural language processing (NLP) methods can turn unstructured text into meaningful insights.

Objective
To analyze and predict the sentiment (positive or negative) of movie reviews in the 50,000-review IMDB dataset using NLP techniques.

Steps and Techniques
1. Data Cleaning:
  • Removing duplicates: Ensured data integrity by dropping duplicate entries.
  • Lowercasing: Converted all reviews to lowercase for uniformity.
  • Removing HTML tags and URLs: Used regex to clean extraneous HTML content and links from the reviews.
  • Expanding contractions: Expanded contractions (e.g., "can't" to "cannot") so that different forms of the same phrase are normalized.
  • Spelling corrections: Corrected spelling errors using TextBlob.
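
The cleaning steps above could be sketched roughly as follows. This is a minimal standard-library sketch, not the project's actual code: `clean_review`, `clean_corpus`, and the tiny `CONTRACTIONS` map are hypothetical names introduced for illustration, and the spelling-correction step is left out because it depends on TextBlob.

```python
import html
import re

# Hypothetical, deliberately tiny contraction map; a real project
# would use a much fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
}

TAG_RE = re.compile(r"<[^>]+>")                # HTML tags such as <br />
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # bare links
CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)


def clean_review(text):
    """Lowercase a review, strip HTML tags and URLs, expand contractions."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)  # decode entities such as &amp;
    text = URL_RE.sub(" ", text)
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


def clean_corpus(reviews):
    """Drop exact duplicate reviews (keeping first occurrences), then clean."""
    seen = set()
    cleaned = []
    for review in reviews:
        if review in seen:
            continue
        seen.add(review)
        cleaned.append(clean_review(review))
    return cleaned
```

Duplicates are removed on the raw text before cleaning here; deduplicating after cleaning would also merge reviews that differ only in markup or casing, which may or may not be desirable.
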
2. Text Preprocessing:
  • Punctuation removal: Stripped out punctuation to reduce noise.
  • Tokenization: Split text into individual words using NLTK's word_tokenize.
  • Stopword removal: Filtered out common stopwords like "and", "the", and "is" to focus on meaningful words.
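
A rough sketch of these preprocessing steps is shown here. To stay dependency-free it swaps in plain whitespace splitting for NLTK's `word_tokenize` and uses a small hand-written stopword set; both substitutions, and the name `preprocess`, are assumptions made for illustration only.

```python
import string

# Hypothetical mini stopword list; NLTK's English list is far longer.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "was", "this"}

# Map every ASCII punctuation character to a space, so that words joined
# by punctuation ("well-made") are split rather than glued together.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Remove punctuation, split into tokens, and drop stopwords."""
    no_punct = text.translate(PUNCT_TABLE)
    tokens = no_punct.split()  # stand-in for nltk.word_tokenize
    return [tok for tok in tokens if tok not in STOPWORDS]
```

The function expects text that has already been lowercased, since the stopword comparison is case-sensitive.
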
3. Feature Engineering:
  • Character and word lengths: Added features for the length of reviews to identify patterns in text length versus sentiment.
  • N-grams: Extracted and analyzed bigrams and trigrams to capture context in word sequences.
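
The length features and n-gram counts described above might look something like this. It is a minimal sketch with hypothetical helper names (`length_features`, `ngrams`, `top_ngrams`), operating on token lists such as those produced by the preprocessing step.

```python
from collections import Counter


def length_features(text):
    """Character and word counts for a single review."""
    return {"char_count": len(text), "word_count": len(text.split())}


def ngrams(tokens, n):
    """All contiguous n-token sequences, as tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(token_lists, n, k=10):
    """The k most frequent n-grams across a collection of token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)
```

Calling `top_ngrams` separately on the positive and negative reviews, with `n=2` and `n=3`, gives the bigram and trigram frequency tables to compare between the two classes.
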
4. Visualization:
  • Word Clouds: Generated word clouds for positive and negative reviews to highlight frequently used terms.
  • Distribution Plots: Analyzed the relationship between sentiment and review length.
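
Before plotting, the relationship between sentiment and review length can be checked numerically. The sketch below is an illustration only: `length_summary` is a hypothetical helper that groups word counts by label and reports the statistics a distribution plot would show visually.

```python
from collections import defaultdict
from statistics import mean, median


def length_summary(reviews, labels):
    """Per-label count, mean, and median of review word counts."""
    groups = defaultdict(list)
    for text, label in zip(reviews, labels):
        groups[label].append(len(text.split()))
    return {
        label: {"n": len(counts), "mean": mean(counts), "median": median(counts)}
        for label, counts in groups.items()
    }
```
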
5. Modeling:
  • Bag of Words (BoW): Created a feature matrix using unigrams, bigrams, and trigrams with CountVectorizer.
  • Dimensionality Reduction: Reduced feature dimensions with PCA and visualized results with scatterplots.
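
A minimal sketch of the BoW and PCA step, assuming scikit-learn is installed. The wrapper `bow_pca` and the `max_features=5000` cap are assumptions for illustration, not values taken from the project.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer


def bow_pca(reviews, max_features=5000, n_components=2):
    """Build a unigram-to-trigram count matrix and project it with PCA."""
    # ngram_range=(1, 3) counts unigrams, bigrams, and trigrams together.
    vectorizer = CountVectorizer(ngram_range=(1, 3), max_features=max_features)
    bow = vectorizer.fit_transform(reviews)  # sparse matrix: reviews x terms

    # The matrix is converted to dense here for simplicity; on the full
    # 50,000-review corpus this can use a lot of memory.
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(bow.toarray())
    return vectorizer, bow, coords
```

The two columns of `coords` can be drawn as a scatterplot colored by sentiment label. If memory becomes a problem, scikit-learn's `TruncatedSVD` accepts sparse input directly and may be a practical substitute for PCA.
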
6. Insights:
  • Positive reviews tended to use longer sentences with specific terms associated with positive sentiments.
  • Negative reviews tended to express dissatisfaction in shorter, more direct wording.

Conclusion
The case study provided valuable insights into the linguistic patterns of movie reviews. It demonstrated the efficacy of NLP techniques in transforming raw text data into actionable sentiment predictions.

Code link:
Click here to access the code