A coding competition focused on gender inequalities in cinema, organized by DataForGood and Eleven.
This team was awarded first place.
Given basic data about a movie (its IMDb ID), determine whether it passes the Bechdel test. The classes to predict are as follows:
Level | Description |
---|---|
0 | (None of the criteria below) |
1 | There are two women characters in the movie. |
2 | They talk to each other. |
3 | They talk to each other about something other than a man. |
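The levels in the table are cumulative: a movie reaches a level only if it also meets every criterion before it, and it passes the test only at level 3. A minimal sketch of that mapping, assuming the three criteria are available as booleans (the function and argument names are hypothetical, not taken from the project code):

```python
def bechdel_level(two_women: bool, talk_to_each_other: bool, not_about_a_man: bool) -> int:
    """Return the Bechdel level (0 to 3) for the three criteria.

    The criteria are cumulative, so counting stops at the first one that fails.
    Argument names are illustrative assumptions.
    """
    level = 0
    for criterion_met in (two_women, talk_to_each_other, not_about_a_man):
        if not criterion_met:
            break
        level += 1
    return level


def passes_bechdel(level: int) -> bool:
    """A movie passes the test only when all three criteria are met."""
    return level == 3
```

For example, a movie with two women who only talk to each other about a man gets level 2 and does not pass.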
- We gathered data from The Movie Database (TMDB) to get information about each movie's actors, producers, writers and so on.
- We used the movies' posters to determine the number of women on them and the relative size of women on the posters, using DeepFace, in particular with the RetinaNet backend and FaceNet128.
- We ran audio recognition on YouTube videos of the movies' trailers to get the proportion of women's speech in them. The audio was scraped using a fixed version of pytube, and the audio segmentation was done with inaSpeechSegmenter.
- We used NLP on the movies' synopses, followed by PCA, to get insights from the plots.
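The poster and trailer steps above boil down to aggregating detector outputs into numeric features. The sketch below shows one way this could look. It rests on several assumptions: the `Face` record, the `"woman"` label and both function names are hypothetical; `area_women` is taken to be relative to the whole poster area; the speech share is taken relative to detected speech only; and the segments are assumed to arrive as `(label, start, end)` tuples in seconds. It does not call DeepFace or inaSpeechSegmenter.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Face(NamedTuple):
    """One detected face (hypothetical record, not a DeepFace type)."""
    gender: str  # assumed labels: "woman" or "man"
    width: int   # bounding-box width in pixels
    height: int  # bounding-box height in pixels


def poster_features(faces: Iterable[Face], poster_width: int, poster_height: int) -> Dict[str, float]:
    """Count women on a poster and the share of the poster their faces cover."""
    women = [face for face in faces if face.gender == "woman"]
    poster_area = poster_width * poster_height
    women_area = sum(face.width * face.height for face in women)
    return {
        "nb_women": len(women),
        # Assumption: area is normalised by the full poster area.
        "area_women": women_area / poster_area if poster_area else 0.0,
    }


def female_speech_share(segments: Iterable[Tuple[str, float, float]]) -> float:
    """Share of speech time attributed to women.

    Segments are assumed to be (label, start_seconds, end_seconds) tuples;
    anything not labelled "female" or "male" (music, silence, noise) is ignored.
    """
    female = 0.0
    male = 0.0
    for label, start, end in segments:
        if label == "female":
            female += end - start
        elif label == "male":
            male += end - start
    speech = female + male
    return female / speech if speech else 0.0
```

Normalising by speech time rather than trailer length keeps trailers with a lot of music from looking artificially low in women's speech, though the original project may have chosen differently.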
We used XGBoost for the final model, with hyperparameter tuning in scikit-learn.
Figure: confusion matrix of the XGBClassifier on the test set (20% of the 8000+ movies).
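For readers unfamiliar with the metric, a confusion matrix for the four Bechdel levels is tallied by counting (true level, predicted level) pairs. The sketch below is a plain-Python illustration; the function names are hypothetical and the labels are made up for the example, they are not the competition results.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int = 4) -> List[List[int]]:
    """Rows are true levels, columns are predicted levels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_level, predicted_level in zip(y_true, y_pred):
        matrix[true_level][predicted_level] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels for illustration only.
example_true = [0, 1, 1, 2, 3, 3, 3, 3]
example_pred = [0, 1, 3, 2, 3, 3, 1, 3]
example_matrix = confusion_matrix(example_true, example_pred)
example_accuracy = accuracy(example_matrix)
```

Off-diagonal cells show which levels get confused with each other, which is more informative than accuracy alone when the classes are imbalanced.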
Most relevant features:
Feature name | Description | Importance |
---|---|---|
PCA_0 | First principal component of the PCA on the NLP analysis of the synopsis | 0.05241 |
Is_War | Dummy variable: is the movie a war movie? | 0.04633 |
writers_female | Number of female writers | 0.04234 |
cast_female | Number of female actors | 0.03934 |
area_women | Proportional area of women's faces in the poster | 0.03709 |
Is_Horror | Dummy variable: is the movie a horror movie? | 0.03567 |
nb_women | Number of women on the poster | 0.02856 |
Is_Romance | Dummy variable: is the movie a romance movie? | 0.02456 |
cast_male | Number of male actors | 0.02075 |