Background. Discovering relevant biomedical interactions in the literature is crucial for advancing biology research. It plays an essential role in studying the reported processes and interactions that affect biological processes (e.g., at the genome, metabolome, and transcriptome levels). The objective of this work is therefore twofold: to reduce the manual effort required to curate and review the biochemical interactions reported in the gluten-related bibliome, and to propose a novel deep learning approach to relation extraction that assists in a real curation task by learning from the curators' previous decisions.
Methods. Compared to previous works, the main contribution of this work is a deep learning model that incorporates a novel vector space combining (i) high-level lexical and syntactic inference features such as wordnets and health-related domain ontologies, (ii) unsupervised domain syntactic and semantic resources such as word embeddings, (iii) semantic and sentence-structure knowledge (e.g., part of speech, negation information, verb information), (iv) abbreviation resolution support, (v) several state-of-the-art named-entity recognition methods, and (vi) different feature construction and optimization techniques to support a semi-automatic curation workflow.
Results. Applying the semi-automatic curation workflow to a classified set of 2,451 relevant gluten-related documents produced a total of 8,349 relevant relations and 471,813 irrelevant relations across the following relation categories: (i) Related health issue, (ii) Improve, (iii) Aggravate, (iv) Stimulation, (v) Inhibition, (vi) Activation, (vii) Deactivation, (viii) Downregulation, (ix) Upregulation, (x) increase symptoms, (xi) decrease symptoms, (xii) weak relation and (xiii) no effect. The mean F-score across these relation categories was 0.731, with the lowest F-score at 0.47 (with 200 positive identified relations) and the highest F-score at 0.929 (with 2,129 positive identified relations).
Experimental results showed that the presented workflow is an effective approach for a semi-automatic relation extraction (RE) task. It obtained satisfactory results in the early stages of a real-world curation task and reduced manual annotation effort by learning from the decisions made by manual curators. In addition, the presented sentence vector space can be integrated into several state-of-the-art machine learning models to recognize relevant relations with satisfactory results. Finally, this work highlights the benefit of using domain knowledge, such as ontologies and entity recognizers, to improve the automatic recognition of health-related interactions in the literature.
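
As a reading aid for the F-scores reported above, the following minimal sketch shows how a per-category F-score and its mean over categories can be computed from raw counts. It assumes the balanced F1 variant and an unweighted (macro) mean over categories, neither of which is stated in the text. The function names and the counts are hypothetical and are not results from this work.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from raw counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_f_score(counts):
    """Unweighted mean of the per-category F-scores (assumed averaging scheme)."""
    scores = [f_score(tp, fp, fn) for tp, fp, fn in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts for two categories, for illustration only.
counts = {
    "Improve": (80, 20, 20),
    "Inhibition": (30, 30, 10),
}
mean_f = macro_f_score(counts)
```

A macro mean weights every category equally, so a category with few positive relations influences the mean as much as a large one.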
- AbbreviationDictionary_domain: Contains the abbreviation resolution rules
- englishCommonStopWords: A simple list of common stop words used to annotate the sentences
- lexicon: The full parsed lexicon used to discover relevant entities, and the normalized identified words
- relation_rulesV10.json: Relation probabilities taking into account the possible entities that interact
- relationDatasetBaseline: Curated relation dataset. Contains the identified positive and negative sentences, i.e., sentences that do or do not contain a possibly relevant interaction
- relationDataset: Pre-processed curated relation dataset, before applying the proposed vector-space construction techniques. It contains the sentence text, the annotations recognized by the different named-entity recognisers, the verb information, the PoS information, etc., so that the baseline sentence dataset can be processed without third-party libraries such as DNorm, LINNAEUS, or Stanford CoreNLP
- stopWords: A curated list of the task stop words
- word2vec_MyDictionary: TSV serialization of the Word2Vec space used to normalize the sentences
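
The following minimal sketch illustrates how files such as word2vec_MyDictionary and stopWords might be consumed. It assumes each TSV row holds a token followed by its tab-separated vector components; the actual column layout is not documented here. Averaging word vectors is one simple way to obtain a sentence vector and is used only for illustration, not as the method of this work. The function names and the toy data are hypothetical.

```python
import csv
import io


def load_embeddings(handle):
    """Parse rows of the ASSUMED form: token<TAB>v1<TAB>v2<TAB>..."""
    embeddings = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        embeddings[row[0]] = [float(x) for x in row[1:]]
    return embeddings


def mean_vector(tokens, embeddings, stop_words=frozenset()):
    """Average the vectors of known tokens that are not stop words."""
    vectors = [
        embeddings[t] for t in tokens
        if t in embeddings and t not in stop_words
    ]
    if not vectors:
        return []  # no known token: no sentence vector
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy in-memory stand-in for the TSV file (hypothetical content).
sample = io.StringIO(
    "gluten\t1.0\t0.0\n"
    "the\t0.5\t0.5\n"
    "inflammation\t0.0\t1.0\n"
)
emb = load_embeddings(sample)
vec = mean_vector(["the", "gluten", "inflammation"], emb, stop_words={"the"})
```

To read the real file, pass an open file handle instead of the `io.StringIO` object, after checking that the layout matches the assumption above.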
This folder contains the workflow and the code blocks implemented in RapidMiner, as an example of how to carry out the proposed steps:
- Sentence processing
- Vector-space transformation (one-hot encoding, normalisation, etc.)
- Feature selection
- Deep learning construction (uses the H2O library https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/neural_nets/deep_learning.html)
- Deep learning layer grid search
- Deep learning L1 and dropout optimization
- Deep learning cross-validation evaluation
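
The workflow itself is built from RapidMiner operators. As a rough orientation for readers who do not use that tool, the plain-Python sketch below illustrates three of the listed steps: one-hot encoding, min-max normalisation, and k-fold splitting, plus the enumeration of a hyperparameter grid. It is not the RapidMiner or H2O implementation. All function names, the choice of min-max scaling, and the grid values (layer sizes, L1, dropout) are assumptions made for illustration.

```python
from itertools import product


def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def min_max(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical search space: hidden-layer sizes x L1 penalty x dropout ratio.
grid = list(product([(50,), (100, 50)], [0.0, 1e-5], [0.0, 0.5]))

# Toy inputs, for illustration only.
categories, encoded = one_hot(["NOUN", "VERB", "NOUN"])
scaled = min_max([2.0, 4.0, 6.0])
folds = list(k_fold_indices(5, 2))
```

In a full run, each grid entry would be trained and scored on every fold, and the configuration with the best mean score would be kept. Contiguous folds are used here for brevity; shuffled or stratified folds are usually preferable when classes are as imbalanced as in this dataset.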
- WordNet: https://wordnet.princeton.edu/