This project employs advanced natural language processing and machine learning to analyze Amazon reviews, predicting sentiment shifts. It assists users and businesses in understanding and anticipating public perceptions of products, guiding development, and enhancing decision-making within the e-commerce landscape.

anandr07/Sentiment-prediction-from-Amazon-reviews

Sentiment Prediction from Amazon Reviews

In the vast landscape of e-commerce, understanding customer sentiment is crucial for businesses seeking to enhance user experiences and optimize product offerings. Amazon, one of the world's largest online marketplaces, accumulates an immense volume of user-generated reviews. This project employs advanced natural language processing and machine learning to analyze Amazon reviews, predicting sentiment shifts. It assists users and businesses in understanding and anticipating public perceptions of products, guiding development, and enhancing decision-making within the e-commerce landscape.

Table of Contents

  1. Problem Statement
  2. About DataSet
  3. Performance Metric
  4. Importing Needed Libraries and accessing other py files (feature-extraction)
  5. Load the Data and Perform Data Analysis
  6. Distribution of data points among output classes
  7. Feature Engineering
  8. Splitting into Train and Test Data
  9. Models used:
  10. Results

Problem Statement:

In the ever-expanding realm of e-commerce, businesses grapple with the challenge of distilling meaningful insights from the vast troves of customer-generated content, particularly in the form of reviews on platforms like Amazon. The problem at hand lies in deciphering the sentiments expressed within these reviews, ranging from glowing endorsements to pointed criticisms. Understanding customer sentiment is pivotal for businesses seeking to enhance product offerings, improve customer experiences, and stay competitive in the dynamic marketplace.

The challenge further extends to the sheer volume and diversity of textual data. As the number of reviews grows exponentially, manual analysis becomes impractical. Traditional methods fall short in efficiently extracting sentiments at scale, necessitating the implementation of advanced technologies such as Natural Language Processing (NLP) and Machine Learning (ML).

Doing so will empower businesses with actionable insights derived from customer sentiments, enabling them to make informed decisions, improve products and services, and ultimately thrive in the highly competitive landscape of e-commerce.

About DataSet

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

  • Contents:
      • database.sqlite: Contains the table 'Reviews'
      • Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite
  • Data includes:
      • Reviews from Oct 1999 to Oct 2012
      • 568,454 reviews
      • 256,059 users
      • 74,258 products
      • 260 users with > 50 reviews

Performance Metric

Metric(s):

  • Log-Loss
  • Binary Confusion Matrix
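
As a rough illustration of the two metrics above, the sketch below computes binary log-loss and a 2x2 confusion matrix with the standard library only. The function names and the toy labels are assumptions made for this example; the project itself may rely on a library implementation instead.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_prob holds the predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] (rows: actual class, columns: predicted class)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels and probabilities, invented for illustration only.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

loss = log_loss(y_true, y_prob)
cm = binary_confusion_matrix(y_true, y_pred)
print(f"log-loss={loss:.4f}", cm)
```

Log-loss penalizes confident wrong probabilities heavily, while the confusion matrix shows which kind of error (false positive or false negative) a thresholded classifier makes.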

Importing Needed Libraries and accessing other py files (feature-extraction)

The project begins by importing the Python libraries it needs for feature extraction, data visualization, and machine learning algorithms. It also imports specific functionality from the project's 'feature_extraction' and 'ml_algorithms' modules for later use.

Load the Data and Perform Data Analysis

The CSV file is read into a Pandas DataFrame, the first five rows are displayed, and summary information about the dataset is printed. Rows are then selected on the basis of the score column: rows with values above the chosen threshold are dropped, and duplicated rows are removed. The dataset initially has 568,454 reviews; for faster computation, 250,000 are used. Dropping rows by the score threshold leaves 230,478 reviews, and removing duplicates leaves 182,285 reviews (a second figure, 404,287, is also listed for this step). The data contain many duplicate rows and rows with a score above 3; those rows are dropped.
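
The filtering described above can be sketched roughly as follows. This is a dependency-free illustration using the `csv` module rather than Pandas, and several details are assumptions: the `max_score` parameter, the choice of columns that define a duplicate, and the tiny inline CSV are all made up for the example. Only the column names (`UserId`, `ProfileName`, `Time`, `Score`, `Text`) come from Reviews.csv.

```python
import csv
import io

def filter_and_dedupe(rows, max_score=3,
                      key_fields=("UserId", "ProfileName", "Time", "Text")):
    """Drop rows whose Score exceeds max_score, then keep the first row per key."""
    seen = set()
    kept = []
    for row in rows:
        if int(row["Score"]) > max_score:
            continue  # score above the threshold: dropped
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue  # same user, time, and text already kept: duplicate
        seen.add(key)
        kept.append(row)
    return kept

# Tiny stand-in for Reviews.csv, invented for illustration.
SAMPLE_CSV = """UserId,ProfileName,Time,Score,Text
A1,ann,100,5,Great coffee
A2,bob,200,2,Too bitter
A2,bob,200,2,Too bitter
A3,cy,300,1,Stale
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = filter_and_dedupe(rows)
print(len(rows), "->", len(cleaned))
```

With a DataFrame, the same two steps would typically be a boolean mask on the score column followed by `drop_duplicates` on the chosen subset of columns.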

Distribution of data points among output classes

  • Distribution of Duplicate and Non-duplicate reviews

  • Number of reviews above the given score thresholds: analyzing the dataset reveals 250,000 unique reviews. About 27.08% of reviews appear more than once.

  • Checking for Duplicates

  • Cleaning the sentences, text, and words
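
One way a "share of reviews that appear more than once" figure could be computed is sketched below. The definition used here (a review counts as repeated when its exact text occurs at least twice) is an assumption, and the README does not say how its 27.08% figure was derived; the sample reviews are invented.

```python
from collections import Counter

def duplicate_share(texts):
    """Fraction of reviews whose exact text occurs more than once in `texts`."""
    counts = Counter(texts)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(texts)

# Invented sample: "good tea" occurs three times out of five reviews.
reviews = ["good tea", "bad chips", "good tea", "fine", "good tea"]
share = duplicate_share(reviews)
print(f"{share:.2%} of reviews appear more than once")
```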

Feature Engineering

Data preprocessor

  • dropping rows
  • assigning values
  • removing duplicates
  • checking the helpfulness numerator against the denominator
  • removing punctuation, HTML tags, URLs, and non-alphanumeric characters
  • removing stop words
  • tokenizing the text
  • stemming words with the Snowball stemmer
  • keeping only alphanumeric tokens
  • splitting sentences into words
  • double-checking for alphanumeric characters
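
The text-cleaning steps in the list above can be sketched as a single function. This is a minimal illustration, not the project's actual preprocessor: the stop-word set is a tiny stand-in for NLTK's list, the regular expressions are assumptions, and stemming is left as a pluggable `stem` callable that defaults to doing nothing.

```python
import re
import string

# Tiny stand-in stop-word list; the README uses NLTK's English stop words.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "i", "was", "to", "of", "for"}

TAG_RE = re.compile(r"<[^>]+>")                   # HTML tags such as <br />
URL_RE = re.compile(r"https?://\S+|www\.\S+")     # bare URLs
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_review(text, stem=lambda word: word):
    """Strip tags, URLs, and punctuation; lowercase; drop stop words; stem."""
    text = TAG_RE.sub(" ", text)         # remove tags first so URLs inside tags go too
    text = URL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)   # punctuation becomes whitespace
    tokens = text.lower().split()        # tokenize on whitespace
    words = [t for t in tokens if t.isalnum() and t not in STOP_WORDS]
    return [stem(w) for w in words]

sample = "I loved it!<br />See http://example.com for the <b>best</b> coffee, 100% fresh."
cleaned_tokens = clean_review(sample)
print(cleaned_tokens)
```

To mirror the Snowball stemming step, one could pass `stem=SnowballStemmer("english").stem` after `from nltk.stem import SnowballStemmer`, assuming NLTK is installed.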

Feature Extraction after pre-processing.

Featurization (NLP and Fuzzy Features) Definition:

  • Token: A token is obtained by splitting a sentence on spaces
  • Stop_Word: A stop word as defined by NLTK
  • Word: A token that is not a stop_word

Some Additional Features

  • Bag of words
  • Uni-, bi-, and trigrams
  • Tf-Idf Vectorization
  • diff_chars
  • word2vec Model
  • Average word2vec
  • Tf-Idf Word2vec
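
To make the first three features in this list concrete, here is a small sketch of n-gram extraction, bag-of-words counting, and Tf-Idf weighting using only the standard library. The weighting formula (term frequency times ln(N / df)) is the textbook variant and is an assumption; library vectorizers such as scikit-learn's TfidfVectorizer apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All contiguous n-grams for n = 1..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def bag_of_words(docs, n_max=1):
    """One Counter of n-gram counts per tokenized document."""
    return [Counter(ngrams(doc, n_max)) for doc in docs]

def tfidf(bags):
    """Weight each term by (count / doc length) * ln(N / document frequency)."""
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags for term in bag)
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({term: (count / total) * math.log(n_docs / doc_freq[term])
                         for term, count in bag.items()})
    return weighted

# Invented, already-tokenized documents.
docs = [["good", "coffee"], ["bad", "coffee"], ["good", "tea"]]
bigram_bags = bag_of_words(docs, n_max=2)
weights = tfidf(bag_of_words(docs))
print(bigram_bags[0])
print(weights[2])
```

Terms that occur in fewer documents (here "tea") receive a larger weight than terms shared across documents (here "good").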

Due to limited computational power, the models are trained on 250,000 rows.

  • After the features are converted to numeric format, the code checks whether the DataFrame contains any NA (missing) values. If so, it prints "NA Values Present"; otherwise, it prints "No NA Values Present." It then displays the number of NaN values in each column. Finally, it converts the target variable y_true to a list of integers and shows the first few rows of the DataFrame.

Comparing the Original Text and the processed text

image

Get the top 10 words most similar to "quality"

image
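
A "most similar words" lookup over word vectors usually ranks the vocabulary by cosine similarity to the query word. The sketch below shows that idea with made-up 3-dimensional vectors; the vectors, the function names, and the vocabulary are all assumptions for illustration and are not taken from the project's trained word2vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, topn=10):
    """Return up to `topn` (word, similarity) pairs, most similar first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, purely for illustration.
toy_vectors = {
    "quality": [0.9, 0.1, 0.0],
    "excellent": [0.8, 0.2, 0.1],
    "great": [0.7, 0.3, 0.0],
    "stale": [0.0, 0.1, 0.9],
}

top = most_similar("quality", toy_vectors, topn=2)
print(top)
```

Word-embedding libraries such as gensim expose a comparable lookup on a trained model, so in practice this function would rarely be written by hand.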

Splitting into Train and Test Data

Train Data: 70%, Test Data: 30%
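
A 70/30 split can be sketched as below. The README does not say whether the split is random or time-based, so the seeded random shuffle here is an assumption, as are the function name and its parameters.

```python
import random

def train_test_split(items, labels, test_fraction=0.3, seed=42):
    """Shuffle indices with a fixed seed and hold out `test_fraction` for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data: ten items whose label is the item's parity.
items = list(range(10))
labels = [i % 2 for i in items]
X_train, X_test, y_train, y_test = train_test_split(items, labels)
print(len(X_train), len(X_test))
```

Fixing the seed keeps the split reproducible across runs, which matters when several models are compared on the same test set.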

Models used:

KNN on Bag of words

image image image
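
For readers unfamiliar with KNN on bag-of-words features, the following sketch classifies a review by majority vote among its k most similar training reviews. Cosine similarity, k=3, and the toy reviews are assumptions for this example; the README does not state which distance metric or value of k the project uses.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter or dict)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query, train_bags, train_labels, k=3):
    """Majority label among the k training reviews most similar to `query`."""
    ranked = sorted(range(len(train_bags)),
                    key=lambda i: cosine_similarity(query, train_bags[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented training reviews: 1 = positive, 0 = negative.
train_docs = ["great tasty coffee", "tasty fresh tea", "great fresh snack",
              "stale bitter coffee", "bitter awful tea", "awful stale snack"]
train_labels = [1, 1, 1, 0, 0, 0]
train_bags = [Counter(doc.split()) for doc in train_docs]

prediction = knn_predict(Counter("fresh tasty snack".split()), train_bags, train_labels)
print(prediction)
```

Because every prediction compares the query against all training rows, brute-force KNN becomes slow on hundreds of thousands of reviews.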

KNN on Tf-Idf

image image image

Naive Bayes on Bag of Words

image
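
As a sketch of what a Naive Bayes model on bag-of-words features does, the class below implements multinomial Naive Bayes with additive (Laplace) smoothing. The class name, the `alpha` default, and the toy reviews are assumptions for illustration; the project may use a library implementation with different settings.

```python
import math
from collections import Counter

class TinyMultinomialNB:
    """Multinomial Naive Bayes over tokenized documents (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing strength

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in docs for term in doc}
        self.term_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        class_sizes = Counter(labels)
        self.log_prior = {c: math.log(class_sizes[c] / len(docs)) for c in self.classes}
        self.totals = {c: sum(self.term_counts[c].values()) for c in self.classes}
        return self

    def log_scores(self, doc):
        """Unnormalized log P(class) + sum of log P(term | class)."""
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            denom = self.totals[c] + self.alpha * vocab_size
            for term in doc:
                if term in self.vocab:  # terms never seen in training are ignored
                    score += math.log((self.term_counts[c][term] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

# Invented training reviews: 1 = positive, 0 = negative.
nb_docs = [d.split() for d in ["great tasty coffee", "tasty fresh tea", "great fresh snack",
                               "stale bitter coffee", "bitter awful tea", "awful stale snack"]]
nb_labels = [1, 1, 1, 0, 0, 0]
nb_model = TinyMultinomialNB().fit(nb_docs, nb_labels)
print(nb_model.predict(["fresh", "tasty"]), nb_model.predict(["stale", "awful"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied for long reviews.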

Naive Bayes on Tf-Idf

image

SGD Classifier on Bag of Words

image

SGD Classifier on Tf-Idf

image

SGD Classifier on word2vec

image
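
The three SGD Classifier variants above differ only in how reviews are turned into features. The sketch below shows the underlying idea with a logistic-regression model trained by stochastic gradient descent on sparse features. The logistic loss, the constant learning rate, the L2 penalty applied only to active features, and the toy data are all assumptions; `alpha` and `eta0` are named after the hyperparameters the README reports, with `alpha` rounded from its reported value.

```python
import math
import random
from collections import Counter

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd_logistic(X, y, alpha=1e-4, eta0=0.01, epochs=200, seed=0):
    """Fit weights by SGD. X: list of sparse feature dicts; y: 0/1 labels."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            z = bias + sum(weights.get(f, 0.0) * v for f, v in X[i].items())
            error = sigmoid(z) - y[i]  # gradient of the log loss w.r.t. z
            for f, v in X[i].items():
                w = weights.get(f, 0.0)
                # Simplification: the L2 penalty touches only this sample's features.
                weights[f] = w - eta0 * (error * v + alpha * w)
            bias -= eta0 * error
    return weights, bias

def predict_proba(x, weights, bias):
    """Predicted probability of the positive class for one sparse sample."""
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in x.items()))

# Invented training reviews as bag-of-words counts: 1 = positive, 0 = negative.
sgd_docs = ["great tasty coffee", "tasty fresh tea", "great fresh snack",
            "stale bitter coffee", "bitter awful tea", "awful stale snack"]
sgd_X = [Counter(doc.split()) for doc in sgd_docs]
sgd_y = [1, 1, 1, 0, 0, 0]

sgd_weights, sgd_bias = sgd_logistic(sgd_X, sgd_y, alpha=1.92e-06, eta0=0.01)
print(predict_proba(Counter(["great", "fresh"]), sgd_weights, sgd_bias))
```

Swapping the bag-of-words counts for Tf-Idf weights or averaged word vectors changes only the feature dicts passed in, not the training loop.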

Results

  • The SGD Classifier on Tf-Idf gives the best accuracy among all the models used.
  • Best Hyperparameters: {'alpha': 1.9211659757411964e-06, 'eta0': 0.01}
  • AUC Score (CV): 0.9582384628250535, Accuracy (CV): 0.9266463021839869
  • AUC Score (Train): 0.9995024371625396, Accuracy (Train): 0.9932857742341586
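
For reference, the two scores reported above can be computed as in the sketch below. AUC is calculated here by direct pairwise comparison, which is only practical for small samples; the function names and the toy scores are assumptions for this example. The gap between the train and CV figures in the list may indicate some overfitting, although the README does not discuss it.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)

# Toy labels and scores, invented for illustration.
eval_true = [1, 0, 1, 1, 0]
eval_scores = [0.9, 0.5, 0.6, 0.4, 0.1]
eval_pred = [1 if s >= 0.5 else 0 for s in eval_scores]

auc = roc_auc(eval_true, eval_scores)
acc = accuracy(eval_true, eval_pred)
print(f"AUC={auc:.4f} accuracy={acc:.2f}")
```

AUC depends only on how the scores rank the reviews, whereas accuracy depends on the chosen decision threshold (0.5 here).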

image
