This package has been developed to annotate mentions of pathogens in the scientific literature.
Pathogen identification and characterization algorithms have been developed using readbiomed-pathogens-dataset.
Identification of pathogens is based on dictionaries built from National Center for Biotechnology Information (NCBI) resources and matched against text with ConceptMapper. This package has code to generate the dictionaries and to use them for annotation.
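The idea behind dictionary-based identification can be illustrated with a minimal Python sketch. This is not the package's implementation: ConceptMapper is a token-based UIMA component, whereas the sketch uses plain regular expressions, and the dictionary entries and identifiers (`toy:1`, `toy:2`) are hypothetical placeholders.

```python
import re

# Hypothetical toy dictionary: surface form -> identifier.
# The identifiers are placeholders, not values taken from the package.
TOY_DICTIONARY = {
    "escherichia coli": "toy:1",
    "e. coli": "toy:1",
    "zika virus": "toy:2",
}


def annotate(text, dictionary):
    """Return (start, end, matched_text, identifier) for each dictionary hit."""
    hits = []
    # Try longer terms first so a longer variant wins over a shorter overlapping one.
    for term in sorted(dictionary, key=len, reverse=True):
        # Require non-word characters (or string boundaries) around the term.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.span()
            # Skip spans that overlap an already accepted hit.
            if any(start < e and s < end for s, e, _, _ in hits):
                continue
            hits.append((start, end, text[start:end], dictionary[term]))
    return sorted(hits)
```

Calling `annotate("Samples of E. coli and Zika virus were sequenced.", TOY_DICTIONARY)` returns one hit per mention, each carrying the character offsets and the identifier of the matched entry.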
Characterization of pathogens relies on machine learning algorithms. There is Python code to train and evaluate traditional machine learning methods and deep learning models, including Longformer.
This package has been tested with Java 11 and Maven 3.6.3.
Prior to installing this package, you need to manually install MTIMLExtension and readbiomed-ncbi-pathogen-dataset-generation.
Then, once this repository is cloned, run mvn install from the directory it was cloned into.
The libraries required to run the Python code are listed in requirements.txt and can be installed with the command below.
For optimal performance, first check which PyTorch installation option best matches your hardware setup.
pip install -r requirements.txt
The class PathogenCharacterizationAnnotator defines the UIMA annotator for the characterization of pathogens.
An example that uses the PathogenCharacterizationAnnotator is available here. The ground truth is available from the manual annotation from this repository.
mvn exec:java -Dexec.mainClass="readbiomed.annotators.characterization.PathogenExperimenter" -Dexec.args="[ConceptMapper_Dictionary] [ground_truth_csv] [articles-txt-format]"
A dictionary can be built using the class DictionaryBuilder. It needs the ncbitaxon.owl file available from the OBO repositories. The output is an XML dictionary suitable for UIMA's ConceptMapper.
mvn exec:java -Dexec.mainClass="readbiomed.annotators.dictionary.pathogens.build.DictionaryBuilder" -Dexec.args="[NCBI_owl_file] [ConceptMapper_dictionary_output_file] [taxonomic_pathogens]"
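To give an idea of what a ConceptMapper dictionary looks like, the Python sketch below writes a small one. The `synonym` / `token` / `variant` layout with `canonical` and `base` attributes follows ConceptMapper's general dictionary format. Which additional attributes DictionaryBuilder writes is not stated here, so the function name `build_dictionary_xml` and the sample entries are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET


def build_dictionary_xml(entries):
    """Build a ConceptMapper-style dictionary.

    entries: mapping of canonical name -> list of surface variants.
    Returns the XML document as a string.
    """
    root = ET.Element("synonym")
    for canonical, variants in entries.items():
        # One <token> per concept, identified by its canonical form.
        token = ET.SubElement(root, "token", {"canonical": canonical})
        for variant in variants:
            # One <variant> per surface form that should be matched in text.
            ET.SubElement(token, "variant", {"base": variant})
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample entry, for illustration only.
SAMPLE_ENTRIES = {"Zika virus": ["Zika virus", "ZIKV"]}
```

`build_dictionary_xml(SAMPLE_ENTRIES)` yields a single `token` element with two `variant` children. The real dictionary is generated from ncbitaxon.owl by the Java class above.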
MTIMLExtension implements some text classifiers in Java, using fast and memory-efficient methods for training and annotation.
It extends the MTIML package.
Training and testing classifiers is explained in great detail here.
The training process generates a serialised classifier in a compressed file.
The location of the file can be specified in the pathogen characterization class.
The Java code then needs to be recompiled by running mvn install, as explained in the installation step.
There are several data sets that have been used for pathogen characterization.
A data set to train a classifier that predicts whether a citation discusses pathogens is available here.
A large data set to train a classifier that predicts pathogen relevance is available here.
To train and use the BERT-like models, set the constants as needed. BERT-like models are connected to the Java classes of the pathogen annotator using a server/client architecture. Once a model is trained using BERT or Longformer, a server needs to be started. The same data sets used with MTIMLExtension can be used to train the BERT-like classifiers.
By default the server uses port 5000; the server code can be edited to change it.
If the port is changed, the Java annotator needs to be updated and the Java code recompiled by running mvn install, as explained in the installation step.
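The server/client arrangement can be pictured with the following self-contained Python sketch. It is an assumption-laden illustration and not the package's server: the JSON payload (`text` in, `label` out), the root endpoint, and the keyword-based `toy_classifier` standing in for a trained model are all hypothetical. Only the default port 5000 comes from the description above.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DEFAULT_PORT = 5000  # default port mentioned in this README


def toy_classifier(text):
    # Hypothetical stand-in for a trained BERT-like model.
    return "relevant" if "infect" in text.lower() else "not_relevant"


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and classify its "text" field.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"label": toy_classifier(payload["text"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=DEFAULT_PORT):
    """Start the sketch server in a background thread and return it."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def classify(text, port=DEFAULT_PORT):
    """Client side: send text to the server and return the predicted label."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Ignore any proxy settings, since this is a local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())["label"]
```

In this sketch the port is a parameter on both sides, so client and server can be changed together. In the package itself the port is set in the server code and in the Java annotator, which is why a change requires recompiling the Java code.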
If you use this work in your research, remember to cite it:
@article{jimeno2023classifying,
title={Classifying literature mentions of biological pathogens as experimentally studied using natural language processing},
author={Jimeno Yepes, Antonio and Verspoor, Karin},
journal={Journal of Biomedical Semantics},
year={2023},
volume={14},
number={1},
doi={10.1186/s13326-023-00282-y},
url={https://doi.org/10.1186/s13326-023-00282-y}
}