In e-commerce, understanding customer sentiment is crucial for businesses seeking to enhance user experiences and optimize product offerings. Amazon, one of the world's largest online marketplaces, accumulates an immense volume of user-generated reviews. This project applies natural language processing and machine learning to analyze Amazon reviews and predict sentiment shifts. It helps users and businesses understand and anticipate public perception of products, guiding product development and decision-making.
- Problem Statement
- About DataSet
- Performance Metric
- Load the Data and Perform Data Analysis
- Distribution of data points among output classes
- Feature Engineering
- Splitting into Train and Test Data
- Models used:
- Results
In e-commerce, businesses face the challenge of distilling meaningful insights from large amounts of customer-generated content, particularly reviews on platforms like Amazon. The problem lies in deciphering the sentiments expressed within these reviews, which range from glowing endorsements to pointed criticisms. Understanding customer sentiment is pivotal for businesses seeking to enhance product offerings, improve customer experiences, and stay competitive. The challenge extends to the sheer volume and diversity of the textual data. As the number of reviews grows, manual analysis becomes impractical, and traditional methods fall short of extracting sentiment efficiently at scale. This calls for Natural Language Processing (NLP) and Machine Learning (ML). Applying these techniques gives businesses actionable insights derived from customer sentiment, enabling them to make informed decisions and improve products and services.
This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.
- Contents:
  - database.sqlite: contains the table 'Reviews'
  - Reviews.csv: pulled from the corresponding SQLite table named Reviews in database.sqlite
- Data includes:
  - Reviews from Oct 1999 to Oct 2012
  - 568,454 reviews
  - 256,059 users
  - 74,258 products
  - 260 users with > 50 reviews
Metric(s):
- Log-Loss
- Binary Confusion Matrix
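As an illustration of the two metrics, here is a minimal pure-Python sketch. The labels, probabilities, and the 0.5 decision cut-off are made-up values for demonstration, not results from this project. scikit-learn's `sklearn.metrics.log_loss` and `sklearn.metrics.confusion_matrix` compute the same quantities with the same matrix layout.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels given predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Illustrative values only (not taken from the project's data).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # 0.5 cut-off is an assumption

loss = binary_log_loss(y_true, y_prob)
cm = binary_confusion_matrix(y_true, y_pred)
print(f"log-loss = {loss:.4f}")
print("confusion matrix =", cm)
```

Log-loss penalizes confident wrong probabilities heavily, while the confusion matrix shows which kind of error (false positive or false negative) the classifier makes.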
The project begins by importing the Python libraries needed for data analysis and machine learning, covering feature extraction, data visualization, and learning algorithms. It also imports specific functionality from the 'feature_extraction' and 'ml_algorithms' modules for later use.
The CSV file is read into a Pandas DataFrame, the first five rows are displayed, and summary information about the dataset is printed. Rows are then selected on the basis of the score column: rows with a score above the required threshold are dropped, and duplicated entries are removed. The dataset initially has 568,454 reviews; for faster computation, 250,000 were used. After dropping rows beyond the threshold limit, 230,478 reviews are left, and after removing duplicate values, 182,285 404,287 reviews remain. Many rows are duplicates or have a score above 3, and those rows are dropped.
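The filtering and de-duplication described above can be sketched as follows. This is a hedged illustration on a toy DataFrame: the column names follow the Reviews.csv layout (`UserId`, `ProductId`, `Time`, `Score`, `Text`), while the threshold value, the direction of the filter, and the duplicate key are assumptions read from the text rather than the project's confirmed code.

```python
import pandas as pd

# Toy stand-in for Reviews.csv; the real file would be loaded with pd.read_csv("Reviews.csv").
reviews = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3", "u4"],
    "ProductId": ["p1", "p2", "p3", "p4", "p5"],
    "Time": [100, 100, 200, 300, 400],
    "Score": [1, 1, 3, 5, 2],
    "Text": ["bad taste", "bad taste", "decent", "love it", "stale"],
})

print(reviews.head())   # first five rows
reviews.info()          # column types and non-null counts

SCORE_THRESHOLD = 3  # assumption: rows scoring above this value are dropped

filtered = reviews[reviews["Score"] <= SCORE_THRESHOLD]

# Assumed duplicate key: same user, same timestamp, same review text.
deduplicated = filtered.drop_duplicates(subset=["UserId", "Time", "Text"], keep="first")

print(len(reviews), len(filtered), len(deduplicated))
```

Choosing the duplicate key is a design decision: the same review text posted by one user at one timestamp under several product IDs is treated here as a single review.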
- Distribution of duplicate and non-duplicate reviews
- Number of reviews above the given score thresholds: analyzing the dataset reveals 250,000 unique reviews. About 27.08% of reviews appear more than once.
- Checking for duplicates
Cleaning the sentences, text, and words:
- dropping
- assigning the value
- removing the duplicates
- checking using the numerator and denominator
- removing punctuation, HTML tags, URLs, and non-alphanumeric characters
- removing the stop words
- tokenizing the text
- stemming words with the Snowball stemmer
- keeping only alphanumeric tokens
- splitting sentences into words
- double-checking for alphanumeric tokens
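The text-cleaning steps in the list above can be sketched roughly as below. This is not the project's actual function: `clean_review` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the project uses the NLTK stop-word list), and stemming falls back to a no-op if NLTK is not installed.

```python
import re

try:
    from nltk.stem import SnowballStemmer
    stem = SnowballStemmer("english").stem
except ImportError:  # fallback so the sketch still runs without NLTK
    def stem(word):
        return word

# Small illustrative stop-word set; an assumption standing in for NLTK's list.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to", "i", "was"}

def clean_review(text):
    """Return a list of cleaned, stemmed tokens for one review."""
    text = re.sub(r"<[^>]+>", " ", text)               # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)         # strip punctuation and other symbols
    tokens = text.lower().split()                       # split the sentence into words
    # Keep alphanumeric, non-stop-word tokens and stem them.
    return [stem(t) for t in tokens if t.isalnum() and t not in STOP_WORDS]

sample = "I LOVED this!<br />Best cookies ever, see http://example.com/cookies"
tokens = clean_review(sample)
print(tokens)
```

HTML tags are removed before URLs so that a link inside a tag does not swallow the visible text that follows it.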
Featurization (NLP and Fuzzy Features). Definitions:
- Token: you get a token by splitting a sentence on spaces
- Stop_Word: stop words as per NLTK
- Word: a token that is not a stop_word
- Bag of words
- Uni-, bi-, and tri-grams
- Tf-Idf Vectorization
- diff_chars
- word2vec Model
- Average word2vec
- Tf-Idf Word2vec
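To make the listed features concrete, here is a small sketch using scikit-learn for bag-of-words n-grams and TF-IDF, plus NumPy for the two word2vec averages. The three documents are invented, and `word_vectors` is a hypothetical dictionary of random 3-dimensional vectors standing in for a trained word2vec model, so only the mechanics are meaningful.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great tasty coffee", "stale coffee bad taste", "great snack"]

# Bag of words over uni-, bi-, and tri-grams.
bow = CountVectorizer(ngram_range=(1, 3))
X_bow = bow.fit_transform(docs)

# TF-IDF over unigrams; keep the per-word IDF for the weighted average below.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
idf = {word: tfidf.idf_[index] for word, index in tfidf.vocabulary_.items()}

# Hypothetical word vectors (random), standing in for a trained word2vec model.
rng = np.random.default_rng(0)
word_vectors = {word: rng.normal(size=3) for word in idf}

def average_word2vec(tokens):
    """Plain mean of the vectors of all known tokens."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

def tfidf_word2vec(tokens):
    """Mean of token vectors weighted by term frequency times IDF."""
    known = [t for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(3)
    weights = np.array([idf[t] * known.count(t) / len(known) for t in known])
    vectors = np.array([word_vectors[t] for t in known])
    return weights @ vectors / weights.sum()

print(X_bow.shape, X_tfidf.shape)
print(average_word2vec("great coffee".split()))
print(tfidf_word2vec("great coffee".split()))
```

The TF-IDF weighting lets rarer, more informative words contribute more to the sentence vector than words that appear in most reviews.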
- After the features are converted to numeric format, the code checks the DataFrame for NA (missing) values. If any are present, it prints "NA Values Present"; otherwise, it prints "No NA Values Present." It then displays the number of NaN values in each column after the conversion. Finally, it converts the target variable y_true to a list of integers and shows the first few rows of the DataFrame.
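A minimal sketch of this check, assuming the features live in a pandas DataFrame. The toy values and the `word_count` and `label` column names are hypothetical; only `diff_chars` and `y_true` appear in the text.

```python
import pandas as pd

# Toy feature table; values arrive as strings, as they might after loading from CSV.
features = pd.DataFrame({
    "diff_chars": ["12", "7", "n/a"],
    "word_count": [5, 3, 8],        # hypothetical feature column
    "label": ["1", "0", "1"],       # hypothetical target column
})

feature_cols = ["diff_chars", "word_count"]

# Convert to numeric; anything unparseable becomes NaN instead of raising.
features[feature_cols] = features[feature_cols].apply(pd.to_numeric, errors="coerce")

if features[feature_cols].isna().any().any():
    print("NA Values Present")
else:
    print("No NA Values Present")

na_counts = features[feature_cols].isna().sum()  # NaN count per column
print(na_counts)

y_true = list(map(int, features["label"]))
print(features.head())
```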
Train data: 70%; test data: 30%
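A 70/30 split of this kind can be produced with scikit-learn's `train_test_split`. The sketch below uses placeholder data; stratifying on the label and the fixed `random_state` are assumptions, not settings confirmed by the project.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels.
X = [[i] for i in range(10)]
y = [0, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,     # 30% test, 70% train
    stratify=y,         # assumption: keep class proportions equal in both parts
    random_state=42,    # assumption: fixed seed for reproducibility
)

print(len(X_train), len(X_test))
```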
- The SGD classifier on TF-IDF features gives the best accuracy among all the models used.
- Best Hyperparameters: {'alpha': 1.9211659757411964e-06, 'eta0': 0.01}
- AUC Score (CV): 0.9582384628250535; Accuracy (CV): 0.9266463021839869
- AUC Score (Train): 0.9995024371625396; Accuracy (Train): 0.9932857742341586