This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is the source code repository for the FAIR project.
pip install nltk numpy scikit-learn scikit-image matplotlib torchtext
# requirements from pytorch-transformers/wiki
pip install transformers pymediawiki
Get the pre-defined Wikipedia categories (we call these the candidate categories, or the candidate list). These are the categories we use to summarize or label a given abstract or paper. We also manually reviewed the list and removed categories that are not relevant.
For finding similar and related topics:
- get a ClinicalBERT embedding for each category in the candidate list
- given a category, retrieve the most similar categories by computing the cosine similarity between its embedding and the embedding of every other category
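The similarity lookup described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `most_similar`, the category names, and the 2-dimensional toy vectors are all assumptions standing in for real ClinicalBERT embeddings of the candidate categories.

```python
import numpy as np


def most_similar(query, names, embeddings, top_k=3):
    """Return the top_k categories closest to `query` by cosine similarity.

    `names` and `embeddings` are parallel: embeddings[i] is the vector
    for names[i]. Both are hypothetical inputs for this sketch.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalise each row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    scores = unit @ unit[names.index(query)]
    # Sort by descending similarity and skip the query category itself.
    order = np.argsort(-scores)
    ranked = [(names[i], float(scores[i])) for i in order if names[i] != query]
    return ranked[:top_k]


# Toy data (assumed): real vectors would come from ClinicalBERT.
toy_names = ["Cardiology", "Heart diseases", "Dermatology"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
similar = most_similar("Cardiology", toy_names, toy_embeddings, top_k=2)
```

With these toy vectors, "Heart diseases" ranks above "Dermatology" for the query "Cardiology", because its vector points in nearly the same direction.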
For labelling a paper:
- extract the unigrams, bigrams and trigrams from the abstract (step 2).
- keep the n-grams that also appear in the candidate list (step 2).
- get all nouns in the abstract (step 3).
- retrieve the related categories of nouns, and save the related categories that also show up in the candidate list (step 3).
- combine the lists from steps 2 and 3 (step 4).
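Steps 2 to 4 can be sketched in plain Python as below. This is a hedged illustration under stated assumptions: the function names `ngrams` and `label_abstract`, the regex tokenizer, and the `nouns` and `related` inputs are hypothetical. In practice the nouns would presumably come from a part-of-speech tagger and the related categories from a Wikipedia lookup; here both are passed in directly so the example stays self-contained.

```python
import re


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def label_abstract(abstract, candidates, nouns=(), related=None):
    """Label an abstract with categories from the candidate list.

    `nouns` (nouns found in the abstract) and `related` (noun -> related
    categories) are assumed to be computed elsewhere.
    """
    related = related or {}
    candidate_set = {c.lower() for c in candidates}
    # Simple lowercase tokenizer (an assumption for this sketch).
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())

    # Step 2: unigrams, bigrams and trigrams that appear in the candidate list.
    ngram_hits = {g for n in (1, 2, 3) for g in ngrams(tokens, n)
                  if g in candidate_set}

    # Step 3: related categories of each noun that appear in the candidate list.
    noun_hits = {cat.lower()
                 for noun in nouns
                 for cat in related.get(noun.lower(), [])
                 if cat.lower() in candidate_set}

    # Step 4: combine both lists, without duplicates.
    return sorted(ngram_hits | noun_hits)


# Toy inputs (assumed, for illustration only).
candidates = ["Machine learning", "Diabetes", "Public health",
              "Endocrinology", "Cardiology"]
abstract = "Machine learning improves diabetes care in public health."
labels = label_abstract(
    abstract,
    candidates,
    nouns=["diabetes"],
    related={"diabetes": ["Endocrinology", "Nutrition"]},
)
```

Here "Nutrition" is dropped because it is not in the candidate list, while "Endocrinology" is kept even though the word never occurs in the abstract.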
PPlus_classifier contains two PROGRESS-Plus classification models.