Allociné Dataset

A large-scale sentiment analysis dataset in French language

Scraping

All scraping functions are defined in allocine_scraper.py. With them, reviews can be extracted from the whole Allociné.fr website in only a few lines of code (here limited to 30 reviews per movie):

ROOT_URL = "http://www.allocine.fr"
MAX_REVIEWS_PER_MOVIE = 30

urls = get_film_urls(ROOT_URL)
dic = get_film_reviews(ROOT_URL, urls, MAX_REVIEWS_PER_MOVIE)

Getting all the data at once can become awfully time-consuming. The scrape_allocine.ipynb notebook outlines my strategy: processing batches of URLs instead of the full list. I also used this notebook to generate intermediate .pickle files that compile all the extracted data into a pandas DataFrame.
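The batching strategy can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the scraper is passed in as a function, and the checkpoint file naming is an assumption.

```python
import os

import pandas as pd


def scrape_in_batches(urls, scrape_fn, batch_size=100, prefix="reviews_batch"):
    """Scrape URLs in batches and checkpoint each batch to a .pickle file,
    so an interrupted run never loses more than one batch of work.

    scrape_fn is a hypothetical stand-in for the real scraper
    (e.g. a wrapper around get_film_reviews)."""
    paths = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        df = pd.DataFrame(scrape_fn(batch))  # one DataFrame per batch
        path = f"{prefix}_{i // batch_size:04d}.pickle"
        df.to_pickle(path)
        paths.append(path)
    return paths
```

The intermediate .pickle files can later be concatenated with `pd.concat(map(pd.read_pickle, paths))`.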

Exploring data

The following results and images were generated by the create_dataset.ipynb notebook.

Rating counts

User ratings range from 0.5 to 5, in steps of 0.5. As the following graph shows, there are more positive reviews than negative ones, with a pronounced peak at 4.

In order to build a binary sentiment analysis dataset, we need to assign each review its corresponding polarity. In the notebook, all reviews with a rating <= 2 are labeled as negative, while those with a rating >= 4 are labeled as positive. In-between reviews are considered neutral. With these empirical thresholds, we obtain the following distribution:
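The thresholding above can be sketched in a few lines of pandas. The column names (`rating`, `polarity`) are assumptions for illustration, not necessarily those used in the notebook.

```python
import pandas as pd


def label_polarity(df, neg_max=2.0, pos_min=4.0):
    """Label reviews: rating <= neg_max -> 0 (negative),
    rating >= pos_min -> 1 (positive); neutral reviews are dropped."""
    df = df.copy()
    df.loc[df["rating"] <= neg_max, "polarity"] = 0
    df.loc[df["rating"] >= pos_min, "polarity"] = 1
    # rows that matched neither condition stay NaN and are discarded
    return df.dropna(subset=["polarity"]).astype({"polarity": int})
```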

Review lengths

The following graph depicts the distribution of review lengths (in characters). Most reviews are shorter than 5,000 characters, but there is a long tail of much longer reviews.

In the notebook, only reviews with fewer than 2,000 characters are kept. This filter removes 6% of the data and leads to the following distribution:
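The length filter amounts to a one-line pandas operation; a minimal sketch, assuming the review text lives in a `review` column:

```python
import pandas as pd


def filter_by_length(df, max_chars=2000):
    """Keep only reviews strictly shorter than max_chars characters."""
    lengths = df["review"].str.len()
    return df[lengths < max_chars]
```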

Building the dataset

In order to build the dataset, we then randomly sample 100k negative and 100k positive reviews from the initial data and split them into three subsets: train (80% of the data), validation (10%) and test (10%). We make sure that these subsets contain disjoint sets of movies, while remaining as balanced as possible. The final results of the create_dataset.ipynb notebook are as follows:
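The movie-disjoint split can be sketched by shuffling movie identifiers rather than individual reviews. This is an illustration of the idea, not the notebook's code; the `film_url` column name and the balancing details are assumptions.

```python
import pandas as pd


def split_by_movie(df, seed=42, train_frac=0.8, val_frac=0.1):
    """Split reviews into train/val/test so that each movie
    appears in exactly one subset."""
    # shuffle the unique movies, then partition the movie list
    movies = df["film_url"].drop_duplicates().sample(frac=1.0, random_state=seed)
    n = len(movies)
    n_train, n_val = int(n * train_frac), int(n * val_frac)
    train_movies = set(movies.iloc[:n_train])
    val_movies = set(movies.iloc[n_train:n_train + n_val])
    train = df[df["film_url"].isin(train_movies)]
    val = df[df["film_url"].isin(val_movies)]
    test = df[~df["film_url"].isin(train_movies | val_movies)]
    return train, val, test
```

Because movies have varying review counts, the resulting split is only approximately 80/10/10, which matches the "as balanced as possible" caveat above.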

The resulting dataset is then exported as .jsonl files, as well as a .pickle file, and archived into data.tar.bz2.
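The export step can be sketched as follows, assuming one DataFrame per subset. File names inside the archive are illustrative; only the data.tar.bz2 name comes from the text above.

```python
import tarfile
from pathlib import Path

import pandas as pd


def export_dataset(splits, out_dir="."):
    """Write each subset as a .jsonl file (one JSON review per line)
    and bundle them into data.tar.bz2."""
    out_dir = Path(out_dir)
    paths = []
    for name, df in splits.items():
        path = out_dir / f"{name}.jsonl"
        df.to_json(path, orient="records", lines=True, force_ascii=False)
        paths.append(path)
    archive = out_dir / "data.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        for p in paths:
            tar.add(p, arcname=p.name)
    return archive
```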

Author

Théophile Blard – 📧 [email protected]

If you use this work (code or dataset), please cite as:

Théophile Blard, French sentiment analysis with BERT, (2020), GitHub repository, https://github.com/TheophileBlard/french-sentiment-analysis-with-bert