Switch to classifier that supports Complement Model #2

Open
dgourd opened this issue Oct 9, 2019 · 1 comment

Comments

dgourd commented Oct 9, 2019

I'm not sure, but it looks like ClassifierReborn uses either a Gaussian or Multinomial model for classification. I think a Complement model is better for our purposes. Essentially, a complement model estimates each class's parameters from the training documents that are not in that class, while the other models only look at the documents that are in it.

Also, complement models work better with skewed training datasets. The decision boundaries generated by other models get biased toward whichever class appears most frequently in the training data. The paper linked here goes into more of the advantages.

https://github.com/id774/naivebayes
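To make the idea concrete, here is a minimal sketch of a complement-style classifier in plain Ruby. It is not ClassifierReborn's API and not the code from the linked repository; the class name `ComplementNB`, its methods, and the smoothing constant `alpha` are all assumptions for illustration. It also leaves out class priors and the weight normalization that fuller implementations usually add.

```ruby
# Hypothetical, simplified Complement Naive Bayes (illustration only).
class ComplementNB
  def initialize(alpha: 1.0)
    @alpha  = alpha                                   # additive smoothing constant (assumed value)
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # class => token => count
    @vocab  = {}                                      # set of all tokens seen in training
  end

  def train(label, tokens)
    tokens.each do |t|
      @counts[label][t] += 1
      @vocab[t] = true
    end
  end

  # Log-probability of a token estimated from every class EXCEPT `label`.
  def complement_log_prob(label, token)
    others = @counts.keys - [label]
    num = others.sum { |c| @counts[c][token] } + @alpha
    den = others.sum { |c| @counts[c].values.sum } + @alpha * @vocab.size
    Math.log(num / den)
  end

  # Choose the class whose complement fits the document worst.
  # Tokens never seen in training are ignored.
  def classify(tokens)
    @counts.keys.min_by do |label|
      tokens.sum { |t| @vocab[t] ? complement_log_prob(label, t) : 0.0 }
    end
  end
end
```

Because each class is scored using the pooled counts of all other classes, a rare class is not penalized for having little training data of its own, which is the usual argument for this model on skewed data.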

dgourd commented Oct 9, 2019

{:Accuracy=>0.2127193721157409, :Precision=>1.0, :Recall=>0.8509634820105108, :F1=>0.9194816540477577}

For testing, I randomly divided the sample dataset into four equal parts. I used the first three for training and the last for classification testing, with the above results. I just ran this once; a more reliable estimate would come from aggregating 4 runs, with each part used once as the classification set (4-fold cross-validation).
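The split-and-rotate evaluation described above could be sketched roughly like this. The helper names `k_folds` and `each_split`, the fixed seed, and the round-robin assignment are assumptions for illustration, not the code actually used for the run reported here.

```ruby
# Shuffle the records with a fixed seed and deal them into k folds.
def k_folds(records, k: 4, seed: 42)
  shuffled = records.shuffle(random: Random.new(seed))
  folds = Array.new(k) { [] }
  shuffled.each_with_index { |record, i| folds[i % k] << record }
  folds
end

# Yield one (training, test) pair per fold, holding out a different fold each time.
def each_split(folds)
  folds.each_index do |i|
    test  = folds[i]
    train = folds.each_with_index.reject { |_, j| j == i }.flat_map(&:first)
    yield train, test
  end
end
```

Each yielded pair would be used to train a fresh classifier and score it on the held-out fold, and the four sets of metrics would then be averaged.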

100% precision means that everything classified as relevant was actually cancer relevant. A recall of ~0.85 means that ~15% of the cancer relevant articles were classified as not relevant. Accuracy is very low, which I think is because our data set is very skewed. F1 is computed from precision and recall only, so accuracy doesn't factor into it.
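For reference, here is a small sketch of how these four numbers are conventionally derived from confusion-matrix counts. The function name `binary_metrics` and the counts in the usage note are hypothetical, not the counts behind the results above, and the script that produced those results may define accuracy differently.

```ruby
# Standard binary-classification metrics from confusion counts.
# tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
def binary_metrics(tp:, fp:, fn:, tn:)
  precision = (tp + fp).zero? ? 0.0 : tp.to_f / (tp + fp)
  recall    = (tp + fn).zero? ? 0.0 : tp.to_f / (tp + fn)
  f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)
  accuracy  = (tp + tn).to_f / (tp + fp + fn + tn)
  { Accuracy: accuracy, Precision: precision, Recall: recall, F1: f1 }
end
```

With made-up counts such as `binary_metrics(tp: 85, fp: 0, fn: 15, tn: 100)`, precision is 1.0 and recall is 0.85. Under this definition, when there are no false positives, accuracy cannot be lower than recall, so an accuracy far below recall would suggest it was computed with a different formula.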

Results are very good already, and I just tokenized individual words in the combined title and abstract. They could be improved if we did some pre-processing for things like chemical formulas or switched to an n-gram tokenization process. If we consistently get a precision of 1, we can automatically accept everything classified as cancer relevant and only have to adjudicate the papers classified as not relevant, which is where the missed ~15% of relevant papers end up.
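As a rough illustration of the n-gram idea, the sketch below turns a title or abstract into word n-grams. The function name `ngrams` and the tokenizing regex are assumptions; in particular, this simple regex would split chemical formulas apart, so it does not cover the pre-processing mentioned above.

```ruby
# Lowercase the text, split it into alphanumeric words, and emit word n-grams.
def ngrams(text, n: 2)
  words = text.downcase.scan(/[a-z0-9]+/)
  words.each_cons(n).map { |gram| gram.join(" ") }
end
```

For example, `ngrams("Breast cancer cell lines")` yields the bigrams "breast cancer", "cancer cell", and "cell lines", which could be fed to the classifier alongside or instead of the single-word tokens.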
