I'm not sure, but it looks like ClassifierReborn uses either a Gaussian or a Multinomial model for classification. I think a Complement model is better for our purposes. Essentially, a complement model also uses the absence of features when classifying, while the other models only look at features that are present.

Complement models also work better with skewed training datasets: the decision boundaries generated by the other models get biased toward the more frequent classes. The paper linked here goes into more of the advantages. A minimal sketch of the idea is below.
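Roughly, a complement model scores each class using the token statistics of every *other* class's documents, then picks the class with the least complement evidence. A minimal sketch in plain Ruby (class and method names here are hypothetical, not ClassifierReborn's or the naivebayes gem's API):

```ruby
# Sketch of complement naive Bayes: for each label, build token counts from
# the documents that do NOT have that label, then assign the label whose
# complement score is lowest.
class ComplementNB
  def initialize(alpha: 1.0)
    @alpha = alpha                                # Laplace smoothing
    @docs  = Hash.new { |h, k| h[k] = [] }        # label => array of token arrays
  end

  def train(label, tokens)
    @docs[label] << tokens
  end

  def classify(tokens)
    vocab = @docs.values.flatten(2).uniq
    scores = @docs.keys.map do |label|
      counts = Hash.new(0)
      total  = 0
      @docs.each do |other, doc_list|
        next if other == label                    # complement: skip this label's docs
        doc_list.each { |doc| doc.each { |t| counts[t] += 1; total += 1 } }
      end
      score = tokens.sum do |t|
        Math.log((@alpha + counts[t]) / (@alpha * vocab.size + total))
      end
      [label, score]
    end
    scores.min_by { |_, s| s }.first              # least complement evidence wins
  end
end
```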
For testing, I randomly divided the sample dataset into quartiles, trained on the first three, and tested classification on the last, with the above results. I only ran this once; a better estimate would come from aggregating four runs with each quartile taking a turn as the test set, i.e. 4-fold cross-validation.
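For reference, a minimal sketch of that 4-fold setup, assuming `dataset` is an array of `[label, tokens]` pairs; `evaluate` is a hypothetical stand-in for the actual train/classify cycle:

```ruby
# Split the shuffled dataset into quartiles and let each quartile take a
# turn as the test set, collecting per-fold metrics.
def cross_validate(dataset, folds: 4)
  shuffled  = dataset.shuffle(random: Random.new(42))  # fixed seed for repeatability
  quartiles = shuffled.each_slice((shuffled.size / folds.to_f).ceil).to_a
  quartiles.each_index.map do |i|
    test  = quartiles[i]
    train = (quartiles - [test]).flatten(1)
    evaluate(train, test)  # hypothetical; returns e.g. { precision: ..., recall: ... }
  end
end
```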
100% precision means that everything classified as relevant was actually cancer relevant. The recall figure means that ~15% of the cancer-relevant articles were classified as not relevant. Accuracy comes out quite low because our data set is very skewed; that's also why accuracy isn't factored into the F1 score, which combines only precision and recall.
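For concreteness, here is how those metrics relate; the counts below are illustrative, chosen to match the 100% precision / ~85% recall above:

```ruby
# tp/fp/fn are confusion-matrix counts for the "cancer relevant" class.
def metrics(tp, fp, fn)
  precision = tp.to_f / (tp + fp)  # fraction of predicted-relevant that really are
  recall    = tp.to_f / (tp + fn)  # fraction of relevant articles we found
  f1        = 2 * precision * recall / (precision + recall)
  { precision: precision, recall: recall, f1: f1 }
end

metrics(85, 0, 15)
# => { precision: 1.0, recall: 0.85, f1: ~0.92 }
```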
Results are very good already, and I just tokenized individual words in the combined title and abstract. They could be improved with some pre-processing for things like chemical formulas, or by switching to an n-gram tokenization process (sketched below). If we consistently get a precision of 1, we can automatically classify those papers as cancer relevant and only have to manually adjudicate the remaining ~15%.
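A minimal sketch of what word n-gram tokenization over the combined title and abstract might look like; the function and field names are assumptions, not our actual schema:

```ruby
# Lowercase the combined text, split on alphanumeric runs, and emit
# overlapping n-word grams as tokens.
def ngram_tokens(title, abstract, n: 2)
  words = "#{title} #{abstract}".downcase.scan(/[[:alnum:]]+/)
  words.each_cons(n).map { |gram| gram.join(" ") }
end

ngram_tokens("BRCA1 mutations", "Observed in breast cancer cohorts")
# => ["brca1 mutations", "mutations observed", "observed in",
#     "in breast", "breast cancer", "cancer cohorts"]
```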
A Ruby naive Bayes implementation that may be worth looking at: https://github.com/id774/naivebayes