Missing data is an inevitable aspect of all empirical research. Researchers have developed several techniques for handling missing data to avoid information loss and bias. Over the past 50 years, these methods have become more efficient and also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016. JSTOR provided the data in text format, and we used a text-mining approach to extract the necessary information from the corpus. Our results show that the use of advanced missing data handling methods, such as Multiple Imputation and Full Information Maximum Likelihood estimation, grew steadily over the examination period. At the same time, simpler methods, such as listwise and pairwise deletion, are still in widespread use.
There are two ways to replicate our findings:
- Replicate every step, from basic cleaning to modelling
- Use the cleaned and preprocessed data for modelling (recommended)
preprocessing.R
- (output: jstor_corpus.rds)
corpus_cleaning.R
- (output: jstor_df_trim_22_02.rds)
snip_window.R
- (output: jstor_snipped.csv)
- data for classification: jstor_df_snipped_TO_USE.csv
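The name snip_window.R suggests that each article is reduced to short text windows around mentions of missing data before classification. The script itself is not reproduced here, so the following Python sketch only illustrates that general idea. The keyword pattern, the window width, and the function name `snip_windows` are assumptions made for illustration, not the values used in the study.

```python
import re

# Hypothetical keyword pattern; the search terms actually used in
# snip_window.R are not listed in this README.
KEYWORD_PATTERN = re.compile(r"missing (data|values?)", re.IGNORECASE)


def snip_windows(text, pattern=KEYWORD_PATTERN, width=5):
    """Return one snippet per keyword match: the match itself plus up to
    `width` whitespace-separated words on each side."""
    snippets = []
    for match in pattern.finditer(text):
        # Guard width == 0, because words[-0:] would return the whole list.
        before = text[:match.start()].split()[-width:] if width else []
        after = text[match.end():].split()[:width]
        snippets.append(" ".join(before + [match.group(0)] + after))
    return snippets


if __name__ == "__main__":
    sentence = ("We excluded cases with missing data using listwise "
                "deletion before fitting the models.")
    for snippet in snip_windows(sentence, width=3):
        print(snippet)
```

A word-based window keeps the snippets short enough for manual annotation while preserving the context in which the handling method is usually named.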
Annotated files for training and validation:
- jstor_class_first.csv
- jstor_class_first.txt
Preparing training and validation files
fasttext_about_missing.R
- (outputs: jstor.train, jstor.valid)
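For orientation, fastText's supervised mode expects one example per line, with the label written as `__label__<name>` in front of the text. The preparation step in this repository is done in R. The Python sketch below shows the same kind of formatting and split under stated assumptions: the label names, the 80/20 split, and the helper names are hypothetical and not taken from fasttext_about_missing.R.

```python
import random


def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    clean = " ".join(text.split())  # collapse newlines and repeated whitespace
    return f"__label__{label} {clean}"


def split_train_valid(rows, valid_share=0.2, seed=42):
    """Shuffle (label, text) rows and return (train_lines, valid_lines).

    The 80/20 split and the fixed seed are assumptions for this sketch.
    """
    lines = [to_fasttext_line(label, text) for label, text in rows]
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_share)
    return lines[n_valid:], lines[:n_valid]


def write_lines(path, lines):
    """Write one example per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical annotated snippets; the real ones come from jstor_class_first.csv.
    rows = [
        ("about_missing", "cases with missing data were removed listwise"),
        ("other", "the missing link between theory and practice"),
    ] * 5
    train, valid = split_train_valid(rows)
    write_lines("jstor.train", train)
    write_lines("jstor.valid", valid)
```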
Training classification model
first_level_train.py
- (output: jstor_model.bin)
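The contents of first_level_train.py are not shown in this README. As a rough sketch of what a training step with the `fasttext` Python package can look like, the function below trains a supervised model, scores it on the validation file, and saves it. The function name `train_classifier` is hypothetical, and no hyperparameters are set because the values used in the study are not stated here.

```python
def train_classifier(train_path="jstor.train", valid_path="jstor.valid",
                     model_path="jstor_model.bin", **hyperparams):
    """Train a supervised fastText model, score it, and save it.

    Extra keyword arguments (for example epoch or lr) are passed straight
    to fasttext.train_supervised; none are assumed by this sketch.
    """
    # Imported here so the module loads even where the third-party
    # package is not installed (pip install fasttext).
    import fasttext

    model = fasttext.train_supervised(input=train_path, **hyperparams)
    # model.test returns (number of examples, precision at 1, recall at 1).
    n_examples, precision, recall = model.test(valid_path)
    model.save_model(model_path)
    return {"n": n_examples, "precision": precision, "recall": recall}


if __name__ == "__main__":
    print(train_classifier())
```

Every level in this pipeline writes its model to jstor_model.bin, so passing a level-specific `model_path` is one way to keep the models side by side.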
Evaluating model
confusion_matrix.py
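To make the evaluation step concrete, the sketch below counts (true, predicted) label pairs and derives precision and recall for one label. It is a minimal standard-library illustration, not the content of confusion_matrix.py. The label names and helper names are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def precision_recall(matrix, label):
    """Precision and recall for one label, read off the pair counts."""
    true_positive = matrix[(label, label)]
    predicted = sum(n for (_, pred), n in matrix.items() if pred == label)
    actual = sum(n for (true, _), n in matrix.items() if true == label)
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical validation labels and model predictions.
    true = ["yes", "yes", "no", "no", "yes"]
    pred = ["yes", "no", "no", "yes", "yes"]
    matrix = confusion_matrix(true, pred)
    for pair, count in sorted(matrix.items()):
        print(pair, count)
    print(precision_recall(matrix, "yes"))
```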
Predicting first level
first_level_pred.py
- (output: jstor_first_output.csv)
Annotated files for training and validation:
- jstor_class_second.csv
- jstor_class_second.txt
Preparing training and validation files
fasttext_imputation.R
- (outputs: jstor.train, jstor.valid)
Training classification model
second_level_train.py
- (output: jstor_model.bin)
Evaluating model
confusion_matrix.py
Predicting second level
second_level_pred.py
- (output: jstor_second_output.csv)
Annotated files for training and validation:
- jstor_class_third.csv
- jstor_class_third.txt
Preparing training and validation files
fasttext_advanced.R
- (outputs: jstor.train, jstor.valid)
Training classification model
third_level_train.py
- (output: jstor_model.bin)
Evaluating model
confusion_matrix.py
Predicting third level
third_level_pred.py
- (output: jstor_third_output.csv)
Annotated files for training and validation:
- jstor_class_third_deletion.csv
- jstor_class_third_deletion.txt
Preparing training and validation files
fasttext_deletion.R
- (outputs: jstor.train, jstor.valid)
Training classification model
third_level_train.py
- (output: jstor_model.bin)
Evaluating model
confusion_matrix.py
Predicting third level
third_level_pred.py
- (output: jstor_third_deletion_output.csv)
log_models.R