Amazon product reviews in four categories: books, dvd, electronics, and kitchen & housewares.
Each category contains 1000 positive reviews, 1000 negative reviews, and a varying number of unlabeled reviews.
Data is available here.
sorted_data_acl/
├── books/
│   ├── negative.review
│   ├── positive.review
│   └── unlabeled.review
├── dvd/
│   ├── negative.review
│   ├── positive.review
│   └── unlabeled.review
├── electronics/
│   ├── negative.review
│   ├── positive.review
│   └── unlabeled.review
└── kitchen_&_housewares/
    ├── negative.review
    ├── positive.review
    └── unlabeled.review
This will create a reviews_forEmbedding.txt file in each category folder. The file contains all reviews (positive, negative and unlabeled) of that category, with one sentence of a review per line. The sentences contain no special characters or punctuation.
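As a rough illustration of the one-sentence-per-line output described above, the following sketch splits a raw review into sentences and strips punctuation. The function name `review_to_sentences`, the naive sentence splitter, and the optional lowercasing are assumptions made for this example, not necessarily what the actual script does.

```python
import re

def review_to_sentences(text, lowercase=True):
    """Split a raw review into sentences without punctuation (illustrative only)."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for part in parts:
        # Drop apostrophes so that "didn't" becomes "didnt" rather than "didn t".
        part = part.replace("'", "")
        # Replace every remaining non-alphanumeric character with a space.
        cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", part)
        # Collapse repeated whitespace.
        cleaned = " ".join(cleaned.split())
        if lowercase:  # lowercasing is an assumption, not stated in this README
            cleaned = cleaned.lower()
        if cleaned:
            sentences.append(cleaned)
    return sentences
```

Each returned string would correspond to one line of reviews_forEmbedding.txt.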
This will merge all of the above files into a single file and store it in the sorted_data_acl/all/ folder.
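The merge step amounts to concatenating the per-category files. A minimal sketch, assuming a hypothetical helper `merge_category_files` and that the category folder names passed in match the names on disk:

```python
from pathlib import Path

def merge_category_files(root, categories, filename="reviews_forEmbedding.txt"):
    """Concatenate <root>/<category>/<filename> into <root>/all/<filename>."""
    root = Path(root)
    out_dir = root / "all"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open("w", encoding="utf-8") as out:
        for category in categories:
            src = root / category / filename
            with src.open(encoding="utf-8") as f:
                for line in f:
                    # Ensure every line ends with a newline so that the last
                    # sentence of one file does not run into the next file.
                    out.write(line.rstrip("\n") + "\n")
    return out_path
```

For example, `merge_category_files("sorted_data_acl", ["books", "dvd", "electronics"])` would write sorted_data_acl/all/reviews_forEmbedding.txt from those three folders.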
This will create word embeddings for the reviews of each category (including all) using GloVe. In particular, it creates the following four files in each category folder:
- reviews.vocab: word count per category in the format word -> count
- reviews.cooccur: co-occurrence matrix of words
- reviews.cooccur.shuf: shuffled co-occurrence matrix
- reviews.vectors.txt: word embeddings per category in the format word -> vector
This will create Python dictionaries in the format word -> vector from the reviews.vectors.txt files and store them in reviews.vectors.pkl files.
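The conversion from a vectors file to a pickled dictionary could look roughly like this. The sketch assumes GloVe's usual text output, where each line holds a word followed by its space-separated vector components. The helper names `load_vectors` and `save_pickle` are made up for this example.

```python
import pickle

def load_vectors(path):
    """Read a GloVe-style text file into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def save_pickle(vectors, path):
    """Store the dictionary in binary pickle format."""
    with open(path, "wb") as f:
        pickle.dump(vectors, f)
```

A call such as `save_pickle(load_vectors("reviews.vectors.txt"), "reviews.vectors.pkl")` would then produce the pickled dictionary for one category.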
This will create the files reviews_positive.txt, ratings_positive.txt, reviews_negative.txt and ratings_negative.txt in each category folder. The files contain the respective reviews and ratings, with one review or rating per line.
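To illustrate how reviews and ratings might be pulled out of a .review file, here is a sketch that assumes each review is wrapped in `<review>...</review>` and contains `<rating>` and `<review_text>` elements. That layout is an assumption about the raw data and is not described in this README, so check it against the actual files. The function names are hypothetical.

```python
import re

# Assumed layout: <review> ... <rating>5.0</rating> ... <review_text>...</review_text> ... </review>
REVIEW_RE = re.compile(r"<review>(.*?)</review>", re.DOTALL)

def _field(block, tag):
    """Return the whitespace-normalised content of <tag>...</tag>, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return " ".join(match.group(1).split()) if match else None

def extract_reviews_and_ratings(raw):
    """Return parallel lists: one review text and one numeric rating per review."""
    reviews, ratings = [], []
    for block in REVIEW_RE.findall(raw):
        text = _field(block, "review_text")
        rating = _field(block, "rating")
        if text is None or rating is None:
            continue  # skip incomplete entries
        reviews.append(text)
        ratings.append(float(rating))
    return reviews, ratings
```

Writing the two lists line by line would yield the reviews_*.txt and ratings_*.txt files; newlines inside a review are collapsed to spaces so that each review occupies a single line.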
This will merge the preprocessed reviews from all four categories into the all/ folder.
This will transform the text reviews into embedded reviews by converting each word into a vector using the dictionaries from the previous steps. The resulting matrices are stored in reviews_positive.npy and reviews_negative.npy.
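The embedding step can be sketched as a dictionary lookup per word. The handling of review length (truncating or zero-padding to `max_len`) and of unknown words (left as zero vectors) are assumptions made here so that all reviews fit into one array. The real script may treat them differently.

```python
import numpy as np

def embed_review(review, vectors, dim, max_len):
    """Map one whitespace-tokenised review to a (max_len, dim) matrix."""
    matrix = np.zeros((max_len, dim), dtype=np.float32)
    for i, word in enumerate(review.split()[:max_len]):
        vec = vectors.get(word)
        if vec is not None:  # unknown words stay as zero rows (assumption)
            matrix[i] = vec
    return matrix

def embed_reviews(reviews, vectors, dim, max_len):
    """Stack all embedded reviews into an array of shape (n, max_len, dim)."""
    if not reviews:
        return np.zeros((0, max_len, dim), dtype=np.float32)
    return np.stack([embed_review(r, vectors, dim, max_len) for r in reviews])
```

The resulting array can be written with `np.save("reviews_positive.npy", embedded)` and read back with `np.load`.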
This will train a neural network to classify the sentiment of the reviews in each category.
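The network architecture is not described here, so the following is only a generic stand-in: a one-hidden-layer classifier in NumPy, trained with full-batch gradient descent on a binary cross-entropy loss. It assumes each review has already been reduced to a fixed-length feature vector (for example by averaging its word vectors), and all names and hyperparameters are illustrative.

```python
import numpy as np

def train_classifier(X, y, hidden=16, lr=0.1, epochs=300, seed=0):
    """Train a small tanh network; X is (n, d), y holds labels 0.0 or 1.0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                    # hidden activations, (n, hidden)
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # predicted P(positive), (n,)
        grad_logit = (p - y) / n                    # gradient of mean BCE w.r.t. logits
        grad_h = np.outer(grad_logit, W2) * (1.0 - h ** 2)  # backprop through tanh
        W2 -= lr * (h.T @ grad_logit)
        b2 -= lr * grad_logit.sum()
        W1 -= lr * (X.T @ grad_h)
        b1 -= lr * grad_h.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    """Return 1 for predicted positive sentiment and 0 for negative."""
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    return (h @ W2 + b2 > 0).astype(int)
```

With features built from reviews_positive.npy (label 1) and reviews_negative.npy (label 0), `predict(train_classifier(X, y), X_test)` would give the predicted sentiment labels.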