Sentiment Prediction from Amazon Reviews

In the vast landscape of e-commerce, understanding customer sentiment is crucial for businesses seeking to enhance user experiences and optimize product offerings. Amazon, one of the world's largest online marketplaces, accumulates an immense volume of user-generated reviews. This project applies natural language processing and machine learning to Amazon reviews in order to predict sentiment. It helps users and businesses understand and anticipate public perception of products, which can guide product development and decision-making.

Problem Statement:

In the ever-expanding realm of e-commerce, businesses grapple with the challenge of distilling meaningful insights from the vast troves of customer-generated content, particularly in the form of reviews on platforms like Amazon. The problem at hand lies in deciphering the sentiments expressed within these reviews, ranging from glowing endorsements to pointed criticisms. Understanding customer sentiment is pivotal for businesses seeking to enhance product offerings, improve customer experiences, and stay competitive in a dynamic marketplace.

The challenge extends to the sheer volume and diversity of the textual data. As the number of reviews grows, manual analysis becomes impractical. Traditional methods fall short of extracting sentiment efficiently at scale, which calls for Natural Language Processing (NLP) and Machine Learning (ML).

Solving this gives businesses actionable insights derived from customer sentiment, enabling them to make informed decisions and improve their products and services.

About DataSet

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

Contents

database.sqlite: Contains the table 'Reviews'
Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite

Data includes:

Reviews from Oct 1999 - Oct 2012
568,454 reviews
256,059 users
74,258 products
260 users with > 50 reviews

Performance Metric

Metric(s):

Log-Loss
Binary Confusion Matrix
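As a rough illustration of the two metrics above, here is a minimal standard-library sketch. The function names and the toy labels are assumptions made for this example; the project itself may well rely on a library implementation instead.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def confusion_matrix(y_true, y_pred):
    """2x2 matrix [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(y_true, y_pred):
        matrix[y][p] += 1
    return matrix


# Toy labels and probabilities, invented for illustration only.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.1]
predictions = [int(p >= 0.5) for p in probabilities]

print(log_loss(labels, probabilities))
print(confusion_matrix(labels, predictions))
```

Log-loss penalizes confident wrong probabilities heavily, while the confusion matrix shows which kind of error (false positive or false negative) a thresholded classifier makes.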

Importing Needed Libraries and accessing other py files(feature-extraction)

The project begins by importing the Python libraries it needs for feature extraction, data visualization, and machine learning algorithms. It also imports specific functionality from the project's own 'feature_extraction' and 'ml_algorithms' modules for later use.

Load the Data and Perform Data Analysis

The CSV file is read into a Pandas DataFrame, the first five rows are displayed, and summary information about the dataset is printed. Rows are then filtered on the score column against the chosen threshold, and duplicated reviews are removed. The dataset initially has 568,454 reviews; for faster computation we used 250,000. After dropping rows by the score threshold, 230,478 reviews are left, and after removing duplicate values, 182,285 404,287 reviews. Many rows are duplicates or have a score above 3; those rows are dropped.
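The loading and filtering step might look roughly like the sketch below. Several things here are assumptions rather than the project's confirmed logic: the threshold rule is taken to mean "discard reviews whose score equals a neutral value of 3 and label the rest", the `Sentiment` column and the `load_and_filter` helper are hypothetical names, and duplicates are identified by user, profile name, timestamp, and text. The in-memory sample stands in for Reviews.csv and uses made-up rows.

```python
import io

import pandas as pd

# Four made-up rows standing in for Reviews.csv (real column names, fake data).
SAMPLE_CSV = io.StringIO(
    "Id,UserId,ProfileName,Time,Score,Text\n"
    "1,U1,ann,100,5,Great tea\n"
    "2,U1,ann,100,5,Great tea\n"
    "3,U2,bob,200,3,It is okay\n"
    "4,U3,cat,300,1,Stale chips\n"
)


def load_and_filter(source, neutral_score=3, nrows=None):
    """Read reviews, drop neutral scores, label sentiment, drop duplicates."""
    df = pd.read_csv(source, nrows=nrows)  # nrows=250_000 would cap the sample
    df = df[df["Score"] != neutral_score].copy()  # assumed threshold rule
    df["Sentiment"] = (df["Score"] > neutral_score).astype(int)  # 1 = positive
    df = df.drop_duplicates(
        subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
    )
    return df.reset_index(drop=True)


reviews = load_and_filter(SAMPLE_CSV)
print(reviews.head())
reviews.info()
```

If the project's rule differs (for example, dropping on a different score condition), only the boolean mask on `Score` needs to change.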

Distribution of data points among output classes

Distribution of duplicate and non-duplicate reviews
Number of reviews above the given score thresholds
Analyzing the dataset reveals 250,000 unique reviews. About 27.08% of reviews appear more than once.
Checking for duplicates
Cleaning the sentences along with text and words

Feature Engineering

Data preprocessor

dropping
assigning the value
removing the duplicates
checking using the helpfulness numerator and denominator
removing punctuation, HTML tags, URLs and non-alphanumeric characters
removing the stop words
tokenization of text
Snowball stemmer for stemming of words
keeping only alphanumeric tokens
splitting sentences into words
double check for alphanumeric
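The text-cleaning steps listed above can be sketched as follows. This is a simplified illustration, not the project's actual preprocessor: the `preprocess` function, the regular expressions, and the tiny stop-word set are all assumptions (the project uses NLTK's stop-word list), and stemming is passed in as a callable so the sketch runs without NLTK installed.

```python
import re

# Tiny illustrative stop-word set (an assumption); the project uses NLTK's list.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "i", "of", "to"}

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <br />
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) and www links
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # punctuation and other symbols


def preprocess(text, stem=lambda word: word):
    """Return the cleaned, stemmed tokens of one review."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = text.split()  # whitespace tokenization
    # Drop stop words, double-check that tokens are alphanumeric, then stem.
    return [stem(t) for t in tokens if t not in STOP_WORDS and t.isalnum()]


print(preprocess("<br />This is GREAT!! Visit http://example.com now."))
```

With NLTK installed, `SnowballStemmer("english").stem` from `nltk.stem` can be passed as the `stem` argument to apply Snowball stemming.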

Feature Extraction after pre-processing.

Featurization (NLP and Fuzzy Features) Definition:

Token: You get a token by splitting a sentence on spaces
Stop_Word: stop words as per NLTK.
Word: A token that is not a stop_word

Some Additional Feature

Bag of words
Uni-, bi- and tri-grams
Tf-Idf Vectorization
diff_chars
word2vec Model
Average word2vec
Tf-Idf Word2vec
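To make a few of the features above concrete, here is a small standard-library sketch of n-gram bag of words, TF-IDF, and (TF-IDF weighted) average word vectors. It is only an illustration under stated assumptions: the function names are hypothetical, the TF-IDF formula is the textbook variant (library vectorizers typically add smoothing and normalization), and the two-dimensional word vectors are invented stand-ins for a trained word2vec model.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """All n-grams up to n_max (uni-, bi-, tri-grams by default), space-joined."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def bag_of_words(docs, n_max=1):
    """One Counter of n-gram counts per tokenized document."""
    return [Counter(ngrams(doc, n_max)) for doc in docs]


def tfidf(bags):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    n_docs = len(bags)
    df = Counter(term for bag in bags for term in bag)  # one count per document
    return [{t: (c / sum(bag.values())) * math.log(n_docs / df[t])
             for t, c in bag.items()} for bag in bags]


def average_vector(tokens, vectors, weights=None):
    """Plain average of word vectors, or a weighted average if weights are given."""
    known = [t for t in tokens if t in vectors]
    w = [1.0 if weights is None else weights.get(t, 0.0) for t in known]
    total = sum(w)
    if total == 0:
        return None  # no known words, or all weights are zero
    dim = len(vectors[known[0]])
    return [sum(wi * vectors[t][d] for wi, t in zip(w, known)) / total
            for d in range(dim)]


docs = [["good", "tea"], ["bad", "tea"]]
toy_vectors = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "tea": [0.0, 1.0]}  # invented
scores = tfidf(bag_of_words(docs))
print(ngrams(docs[0], n_max=2))
print(average_vector(docs[0], toy_vectors))             # average word2vec
print(average_vector(docs[0], toy_vectors, scores[0]))  # TF-IDF weighted word2vec
```

Note how "tea", which occurs in every document, receives a TF-IDF weight of zero and therefore drops out of the weighted average.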

Due to lack of Computation Power the models are trained on 250,000 Rows.

After converting the features to numeric format, the code checks whether the DataFrame contains any NA (missing) values. If any are present, it prints "NA Values Present"; otherwise, it prints "No NA Values Present." It then displays the number of NaN values in each column after the conversion. Finally, it converts the target variable y_true to a list of integers and shows the first few rows of the DataFrame.
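A minimal sketch of this check, assuming the features live in a pandas DataFrame: the toy values and the `word_count` column are invented for illustration, and only `diff_chars` and `y_true` are names taken from this README.

```python
import pandas as pd

# Toy feature table; values are made up and "word_count" is a hypothetical column.
features = pd.DataFrame({
    "diff_chars": ["3", "7", "n/a"],
    "word_count": ["2", "5", "4"],
})

# Convert every column to numeric; unparseable entries become NaN.
numeric = features.apply(pd.to_numeric, errors="coerce")

if numeric.isna().any().any():
    print("NA Values Present")
else:
    print("No NA Values Present")

# Number of NaN values per column after the conversion.
print(numeric.isna().sum())

# Target labels converted to a list of integers.
y_true = [int(label) for label in ["1", "0", "1"]]
print(numeric.head())
```

Using `errors="coerce"` turns unparseable values into NaN instead of raising, which is what makes the per-column count afterwards informative.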

Table of Contents

Problem Statement:

About DataSet

Performance Metric

Importing Needed Libraries and accessing other py files(feature-extraction)

Load the Data and Perform Data Analysis

Distribution of data points among output classes

Feature Engineering

Data preprocessor

Feature Extraction after pre-processing.

Some Additional Feature

Due to lack of Computation Power the models are trained on 250,000 Rows.

Comparing the Original Text and the processed text

Get the top 10 words most similar words to "quality"

Splitting into Train and Test Data

Models used:

KNN on Bag of words

KNN on tfidf

Naive Bayes on Bag of Words

Naive Bayes on Tf-Idf

SGD Classifier on Bag of Words

SGD Classifier on Tf-Idf

SGD Classifier on word2vec

Results

anandr07/Sentiment-prediction-from-Amazon-reviews
