Dataset One

Seid Muhie Yimam edited this page Sep 22, 2023 · 2 revisions

Exploring Amharic Hate Speech Data Collection and Classification Approaches

RANLP-2023 Dataset: This dataset was collected with the WebAnno annotation tool in a controlled environment for the paper "Exploring Amharic Hate Speech Data Collection and Classification Approaches", presented at Recent Advances in Natural Language Processing (RANLP 2023), Varna, Bulgaria. The dataset consists of 15.1k tweets, each annotated by two native speakers; the gold labels were determined by more experienced curators.

Abstract

In this paper, we present a study of efficient data selection and annotation strategies for Amharic hate speech. We also build various classification models and investigate the challenges of hate speech data selection, annotation, and classification for the Amharic language. From over 18 million tweets in our Twitter corpus, 15.1k tweets are annotated by two independent native speakers, and a Cohen's kappa score of 0.48 is achieved. A third annotator, a curator, is also employed to decide on the final gold labels. We use both classical machine learning and deep learning approaches, which include fine-tuning AmFLAIR and AmRoBERTa contextual embedding models. AmFLAIR achieves the best performance among all the models with an F1 score of 72%. We publicly release the annotation guidelines, keywords/lexicon entries, datasets, models, and associated scripts with a permissive license.
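For readers unfamiliar with the agreement measure quoted in the abstract, the following is a minimal sketch of how Cohen's kappa is computed for two annotators. It is not the authors' script: the function name `cohen_kappa` and the toy labels `"hate"` and `"normal"` are assumptions made for illustration only, and the four example items are invented, not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented toy annotations (hypothetical labels, not dataset content).
annotator_1 = ["hate", "hate", "normal", "normal"]
annotator_2 = ["hate", "normal", "normal", "normal"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the annotators agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5. Kappa corrects raw agreement for chance, which is why the reported 0.48 is lower than the raw fraction of matching labels would be.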