This repository contains the code for the student project in TDT4310 Intelligent Text Analytics and Language Processing. The goal of the project is to develop a classifier that, based on 100 tweets per user, detects whether the user is a bot and classifies the gender of human users.
This project uses the Bots and Gender Profiling 2019 dataset. For copyright reasons, the dataset is not included in this public repository; it can be found here: https://pan.webis.de/clef19/pan19-web/author-profiling.html
The words are tokenized and embedded with GloVe (Global Vectors for Word Representation), using the matrix pretrained on 2 billion tweets from Stanford. The pretrained data is downloaded automatically the first time training is run. Stopwords and punctuation are not removed, as they are part of the GloVe vocabulary. Some tokens, such as user mentions, numbers, hashtags and links, are generalized to placeholder labels present in the GloVe matrix. The GloVe files are available with 25 to 200 dimensions, but the 200d file is so large that the code is likely to crash due to insufficient memory. As an alternative, words can be embedded using a PyTorch embedding layer. This allows an arbitrary number of dimensions and ensures that all words are embedded, but increases training time considerably.
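The token generalization described above could look roughly like the following sketch. It is not the repository's actual preprocessing: the function name `normalize_tweet`, the regular expressions, and the whitespace tokenization are assumptions for illustration, and the placeholders `<url>`, `<user>`, `<hashtag>` and `<number>` are assumed to match the special tokens in the GloVe Twitter vocabulary. This sketch also replaces a whole hashtag with the placeholder, which may differ from what the project does.

```python
import re

# Hypothetical patterns; the project's own rules may be stricter or looser.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"[-+]?\d+(?:[.,]\d+)*")


def normalize_tweet(text):
    """Replace links, mentions, hashtags and numbers with placeholder tokens.

    URLs are handled first so that digits inside a link are not turned
    into separate <number> tokens.
    """
    text = URL_RE.sub(" <url> ", text)
    text = USER_RE.sub(" <user> ", text)
    text = HASHTAG_RE.sub(" <hashtag> ", text)
    text = NUMBER_RE.sub(" <number> ", text)
    # Lowercase and split on whitespace; stopwords and punctuation are kept.
    return text.lower().split()


tokens = normalize_tweet("@bob check https://t.co/abc #nlp 42 times")
print(tokens)
```

Each resulting token can then be looked up in the GloVe matrix. Words that are missing from the matrix need separate handling, for example a zero vector, which is one reason the trainable embedding layer is offered as an alternative.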
Following the approach of the paper by Wei and Nguyen, a bidirectional LSTM with 3 layers and 100 units is used to classify the input (all tweets of one user) as bot, female or male. Two models are available: one that takes GloVe inputs and one that embeds the tokens during training. Both models can also be used for binary classification tasks such as bot/human and female/male.
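As a rough illustration of the architecture described above, the following minimal PyTorch sketch builds a 3-layer bidirectional LSTM with 100 hidden units on pre-embedded input. The class name `TweetBiLSTM`, the use of the final hidden states as the user representation, and the single linear output layer are assumptions for this example and may not match the models in this repository or in the paper.

```python
import torch
import torch.nn as nn


class TweetBiLSTM(nn.Module):
    """Bidirectional LSTM classifier over pre-embedded tokens (hypothetical name)."""

    def __init__(self, embedding_dim=25, hidden_size=100, num_layers=3, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            bidirectional=True,
            batch_first=True,
        )
        # Forward and backward states are concatenated, hence 2 * hidden_size.
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, embedded):
        # embedded: (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(embedded)
        # h_n: (num_layers * 2, batch, hidden_size); the last two entries are
        # the top layer's forward and backward final hidden states.
        final = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.classifier(final)  # unnormalized class scores (logits)


# Random data standing in for 4 users, 50 tokens each, 25-dimensional GloVe vectors.
batch = torch.randn(4, 50, 25)
model = TweetBiLSTM()
logits = model(batch)
print(logits.shape)
```

Setting `num_classes=2` gives the binary variants (bot/human or female/male). For the variant that embeds tokens during training, an `nn.Embedding` layer could be placed in front of the LSTM so that the model takes token indices instead of vectors.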
The hyperparameters and the number of embedding dimensions can be specified in the main script, which also starts training and plots the loss. It is also possible to train the model on the Spanish dataset.