A large-scale sentiment analysis dataset in French
All scraping functions are defined in allocine_scraper.py. With these, extracting reviews from the whole Allociné.fr website can be done with only a few lines of code (here limiting to 30 reviews per movie):
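from allocine_scraper import get_film_urls, get_film_reviews

ROOT_URL = "http://www.allocine.fr"
MAX_REVIEWS_PER_MOVIE = 30

# Collect the URL of every film page, then scrape the reviews of each film.
urls = get_film_urls(ROOT_URL)
dic = get_film_reviews(ROOT_URL, urls, MAX_REVIEWS_PER_MOVIE)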
Getting all the data at once can become awfully time-consuming.
The scrape_allocine.ipynb notebook illustrates my strategy, which is to process batches of URLs instead of the full list.
I also used this notebook to generate intermediate .pickle files that compile all the extracted data in a pandas DataFrame.
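The batching idea can be sketched roughly as follows. This is not the notebook's actual code: `chunked`, `scrape_in_batches`, the `fetch_reviews` callable and the file naming scheme are all hypothetical, and to stay dependency-free the sketch pickles whatever the callable returns rather than building a pandas DataFrame.

```python
import pickle
from pathlib import Path
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def scrape_in_batches(
    urls: Sequence[str],
    fetch_reviews: Callable[[Sequence[str]], dict],
    batch_size: int,
    out_dir: Path,
) -> list:
    """Scrape `urls` batch by batch, pickling each batch's result to disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch_urls in enumerate(chunked(urls, batch_size)):
        # Hypothetical naming scheme: one pickle file per batch.
        path = out_dir / f"reviews_batch_{index:04d}.pickle"
        if not path.exists():  # skip batches already saved by an earlier run
            with path.open("wb") as handle:
                pickle.dump(fetch_reviews(batch_urls), handle)
        paths.append(path)
    return paths
```

With the scraper functions, `fetch_reviews` could be something like `lambda batch_urls: get_film_reviews(ROOT_URL, batch_urls, MAX_REVIEWS_PER_MOVIE)`. Saving one file per batch means an interrupted run can resume without scraping the finished batches again.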
The following results and images were generated by the create_dataset.ipynb notebook.
User ratings range from 0.5 to 5 in steps of 0.5. As the following graph shows, there are more positive reviews than negative ones, with a marked peak at 4.
In order to build a binary sentiment analysis dataset, we need to assign to each review its corresponding polarity.
In the notebook, all reviews with a rating <= 2 are labeled as negative, while those with a rating >= 4 are positive.
In-between reviews are considered neutral.
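The thresholding rule described above fits in a few lines. The function name `rating_to_polarity` and the string labels are assumptions made for illustration; the notebook may encode polarity differently.

```python
def rating_to_polarity(rating: float) -> str:
    """Map a user rating (0.5 to 5, in steps of 0.5) to a polarity label."""
    if rating <= 2:
        return "negative"
    if rating >= 4:
        return "positive"
    return "neutral"  # ratings of 2.5, 3 and 3.5


# Label every possible rating value, from 0.5 to 5.
labels = {step / 2: rating_to_polarity(step / 2) for step in range(1, 11)}
```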
With these empirical thresholds, we obtain the following distribution:
The following graph depicts the distribution of review lengths (in number of characters). Most reviews are shorter than 5000 characters, with a long tail of longer reviews.
In the notebook, only reviews with fewer than 2000 characters are kept. This removes 6% of the data and leads to the following distribution:
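A minimal sketch of this length filter, assuming each review is a dict whose text lives under a `"review"` key (a hypothetical field name, as is the helper `keep_short_reviews`):

```python
MAX_REVIEW_LENGTH = 2000  # character limit stated above


def keep_short_reviews(reviews, max_length=MAX_REVIEW_LENGTH):
    """Keep only the reviews whose text has fewer than `max_length` characters."""
    return [review for review in reviews if len(review["review"]) < max_length]


def removed_fraction(reviews, kept):
    """Share of reviews dropped by the filter (0.0 for an empty input)."""
    return 1 - len(kept) / len(reviews) if reviews else 0.0
```

Calling `removed_fraction(reviews, keep_short_reviews(reviews))` on the scraped data is one way to check how much the filter discards.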
In order to build the dataset, we then randomly sample 100k negative and 100k positive reviews from the initial data, and split them into three subsets: train (80% of data), validation (10%) and test (10%).
We make sure that these subsets contain disjoint sets of movies, while being as balanced as possible.
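One possible way to obtain movie-disjoint subsets is to assign whole movies, rather than individual reviews, to each subset. The sketch below is an assumption about how this could be done, not the notebook's implementation: the `"film_url"` field and the greedy assignment are hypothetical, and it only targets the subset sizes, without trying to balance positive and negative reviews inside each subset.

```python
import random
from collections import defaultdict


def split_by_movie(reviews, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split reviews into train/validation/test so that no movie is shared."""
    # Group reviews by movie so that a movie is never cut across subsets.
    by_movie = defaultdict(list)
    for review in reviews:
        by_movie[review["film_url"]].append(review)

    movies = sorted(by_movie)
    random.Random(seed).shuffle(movies)  # reproducible random order

    names = ("train", "validation", "test")
    targets = [fraction * len(reviews) for fraction in fractions]
    subsets = {name: [] for name in names}
    for movie in movies:
        # Greedy rule: give the movie to the subset furthest below its target.
        deficits = [
            target - len(subsets[name]) for name, target in zip(names, targets)
        ]
        chosen = names[deficits.index(max(deficits))]
        subsets[chosen].extend(by_movie[movie])
    return subsets
```

Because movies have different numbers of reviews, the resulting sizes only approximate the 80/10/10 target.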
The final results of the create_dataset.ipynb notebook are as follows:
The resulting dataset is then exported as .jsonl files, as well as a .pickle file, and archived into data.tar.bz2.
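The export step could look something like the sketch below, which writes one JSON Lines file per subset and bundles them into a bzip2-compressed tar archive using only the standard library. The function name `export_dataset` and the per-subset file names are assumptions, and the .pickle export is left out.

```python
import json
import tarfile
from pathlib import Path


def export_dataset(subsets, out_dir, archive_path):
    """Write each subset as <name>.jsonl and add it to a .tar.bz2 archive."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "w:bz2") as archive:
        for name, rows in subsets.items():
            path = out_dir / f"{name}.jsonl"
            with path.open("w", encoding="utf-8") as handle:
                for row in rows:
                    # One JSON object per line; keep French accents readable.
                    handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            archive.add(str(path), arcname=path.name)
    return Path(archive_path)
```

Passing `ensure_ascii=False` keeps accented characters as-is in the output instead of escaping them, which makes the files easier to inspect by hand.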
Théophile Blard
If you use this work (code or dataset), please cite as:
Théophile Blard, French sentiment analysis with BERT, (2020), GitHub repository, https://github.com/TheophileBlard/french-sentiment-analysis-with-bert