TDT4310-Project

This repository contains all the code for the student project in TDT4310 Intelligent Text Analytics and Language Processing. The goal of this project is to develop a classifier that, given 100 tweets from a user, detects whether the user is a bot and, for human users, classifies their gender.

Data

This project uses the Bots and Gender Profiling 2019 dataset. For copyright reasons, the dataset is not included in this public repository; it can be found here: https://pan.webis.de/clef19/pan19-web/author-profiling.html

Preprocessing

The words are tokenized and embedded with GloVe (Global Vectors for Word Representation), using the matrix pretrained on 2 billion tweets by Stanford. The pretrained vectors are downloaded automatically the first time training is run. Stopwords and punctuation are not removed because they are part of the GloVe vocabulary. Some tokens, such as user mentions, numbers, hashtags and links, are replaced with the generic labels present in the GloVe matrix. The GloVe files are available with 25 to 200 dimensions, but the 200d file is so large that the code is likely to crash due to insufficient memory. As an alternative, words can be embedded using a PyTorch embedding layer. This allows an arbitrary number of dimensions and ensures that all words are embedded, but increases training time considerably.
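
The token generalization and lookup described above could look roughly like the following sketch. It is not the code from preprocess.py: the function names, the regular expressions and the zero-vector fallback for unknown words are assumptions made for illustration. The placeholder labels `<user>`, `<url>`, `<hashtag>` and `<number>` are the ones used in the GloVe Twitter vocabulary.

```python
import re

# Patterns for tokens that are generalized to GloVe Twitter labels.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NUMBER_RE = re.compile(r"(?<!\w)[-+]?\d+(?:[.,]\d+)*(?!\w)")
# A token is a label, a word (optionally with an apostrophe), or one punctuation mark.
TOKEN_RE = re.compile(r"<[a-z]+>|\w+(?:'\w+)?|[^\w\s]")


def normalize_tweet(text):
    """Replace special tokens with labels, lowercase, and tokenize (hypothetical helper)."""
    text = URL_RE.sub(" <url> ", text)
    text = USER_RE.sub(" <user> ", text)
    text = HASHTAG_RE.sub(r" <hashtag> \1 ", text)
    text = NUMBER_RE.sub(" <number> ", text)
    # Punctuation is kept as separate tokens because it is part of the GloVe vocabulary.
    return TOKEN_RE.findall(text.lower())


def parse_glove(lines):
    """Parse GloVe text format: one word per line followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def embed(tokens, vectors, dim):
    """Look up each token; unknown tokens get a zero vector (an assumption of this sketch)."""
    zero = [0.0] * dim
    return [vectors.get(token, zero) for token in tokens]


tokens = normalize_tweet("Hi @bob, check https://t.co/abc #NLP 42 times!")
toy_vectors = parse_glove(["<user> 0.1 0.2", "hi 0.3 0.4"])
embedded = embed(tokens, toy_vectors, dim=2)
```

With real data, `parse_glove` would be given an open file handle for one of the GloVe Twitter text files instead of the two toy lines used here.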

Model

Following the approach of this paper by Wei and Nguyen, a bidirectional LSTM with 3 layers and 100 units is used to classify the input (all tweets of one user) as bot, female or male. Two models are available: one that takes GloVe inputs and one that embeds the tokens during training. Both models can also be used for binary classification tasks such as bot/human and female/male.
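
As a rough illustration of the architecture described above, a 3-layer bidirectional LSTM with 100 units per direction can be written in PyTorch as follows. This is a minimal sketch and not the code from model.py: the class name, the default embedding dimension of 25 and the choice to classify from the final hidden states of the last layer are assumptions.

```python
import torch
from torch import nn


class BiLSTMClassifier(nn.Module):
    """Hypothetical sketch of a BiLSTM classifier over pre-embedded tokens."""

    def __init__(self, embedding_dim=25, hidden_size=100, num_layers=3, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Forward and backward states are concatenated, hence 2 * hidden_size.
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):
        # x has shape (batch, seq_len, embedding_dim).
        _, (h_n, _) = self.lstm(x)
        # h_n has shape (num_layers * 2, batch, hidden_size); the last two
        # entries are the forward and backward states of the top layer.
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(final)  # unnormalized class scores (logits)


model = BiLSTMClassifier()
batch = torch.randn(4, 50, 25)  # 4 users, 50 tokens each, 25-dimensional embeddings
logits = model(batch)
```

Setting `num_classes=2` gives the binary variant for bot/human or female/male. For the variant that embeds tokens during training, an `nn.Embedding` layer would be placed in front of the LSTM and the input would be token indices instead of vectors.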

Training

The hyperparameters and the number of embedding dimensions can be specified in the main script, which also starts training and plots the loss. It is also possible to train the model on the Spanish dataset.
