
Verifying Experimental Analysis Design #28

Open
danich1 opened this issue Nov 1, 2017 · 1 comment

danich1 commented Nov 1, 2017

I talked with @dhimmel yesterday and we came up with a design for determining whether adding input from a deep learning model (an LSTM) is beneficial for predicting relationships between Diseases and Genes.

Background:

[image: project overview]

In the image above we have all disease-gene pair mappings, where some edges are mentioned in PubMed abstracts (noted by the black dashes) and the majority of edges aren't mentioned at all. The edges in green are considered true edges, as they are currently contained in hetnet v1; the other edges (not highlighted) have the potential to be true Disease-Gene relationships. We aim to classify each edge as either positive (true edge) or negative (false edge), under the hypothesis that using NLP and deep learning (long short-term memory networks, or LSTMs for short) will provide better accuracy than standard methods.

Analysis Design:
To test this hypothesis, we plan to use the following design:

| Prior | Co-occurrences | Natural Language Processing (NLP) |
| --- | --- | --- |
| 1 model | 1 model with sentences<br>1 model w/o sentences | 1 model with sentences<br>1 model w/o sentences |
| Literature unaware | LSTM unaware | LSTM aware |

The Prior category is where we plan to use a model that classifies each disease-gene edge without using any information from biomedical literature (hence literature unaware).

The Co-occurrence category is where we plan to use a model that combines the prior model with information obtained from biomedical literature (e.g. the expected number of sentences that mention a given disease-gene pair, the p-value for each disease-gene edge, how many unique abstracts mention a given disease-gene pair, etc.). Note that this model doesn't use the LSTM and relies only on features extracted from the literature itself. A challenge here will be handling the edges that aren't mentioned in the literature at all (the model w/o sentences in the table above).

Lastly, the NLP category combines the other two models and adds input from a deep learning model (the probability that a sentence is evidence for a true disease-gene relationship). We expect the NLP category model to outperform the models from the other two categories.
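One way to picture how the three categories nest is as incremental feature sets, where each category adds features on top of the previous one. A minimal sketch in Python; the names below are hypothetical placeholders, not the project's actual feature columns:

```python
# Hypothetical feature sets for the three model categories.
# All names are illustrative placeholders, not the project's
# actual feature columns.
FEATURES = {
    # Prior: literature unaware, degree-based prior only
    "prior": ["degree_prior"],
    # Co-occurrences: literature aware but LSTM unaware
    "co_occurrence": ["degree_prior", "sentence_count",
                      "unique_abstract_count", "cooccurrence_p_value"],
    # NLP: everything above plus the LSTM's sentence-level probability
    "nlp": ["degree_prior", "sentence_count", "unique_abstract_count",
            "cooccurrence_p_value", "lstm_sentence_probability"],
}
```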

Challenges:

  1. What is a fair prior model to use for this analysis?
  2. What do we do about edges that are in the hetnet but aren't mentioned in the literature? How can we classify these edges?

dhimmel commented Nov 3, 2017

Great summary of our brainstorm @danich1!

> What is a fair prior model to use for this analysis?

The prior should just be the probability that the disease is associated with the gene based only on the degree of the gene and disease (in the training network). See this notebook, which computes these prior probabilities and should only need minimal modifications. Note that for this analysis you won't be fitting any classifier model... you will use the prior probability directly to rank the observations for the ROC curve.
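To make the "no classifier" point concrete, here is a minimal sketch of that evaluation, assuming a hypothetical table of disease-gene pairs with a `prior_prob` column (the degree-based prior) and a binary `label` column (1 = edge in the hetnet). The file and column names are assumptions for illustration:

```python
# Minimal sketch: rank observations by the precomputed prior probability
# and evaluate with an ROC curve. No classifier is fit; the prior itself
# is the ranking score. File and column names are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

pairs = pd.read_csv("disease_gene_pairs.csv")  # hypothetical input file

fpr, tpr, thresholds = roc_curve(pairs["label"], pairs["prior_prob"])
auc = roc_auc_score(pairs["label"], pairs["prior_prob"])
print(f"Prior-only AUROC: {auc:.3f}")
```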

> What do we do about edges that are in the hetnet but aren't mentioned in the literature? How can we classify these edges?

Relationships without any sentences will still have some features:

  • a prior probability as computed above
  • marginal sentence counts (how many documents/sentences the gene appears in, and how many the disease appears in)
  • potentially a co-occurrence p-value of 1

As a result, for the NLP predictions, you will have to fit a fallback model for observations with no sentences. Therefore, not all predictions in the NLP stage will use NLP info (only the ones that have sentences). Of course, an important limitation of NLP is that it only works for observations with sentences and the ROC curve should reflect that.
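A rough sketch of that fallback setup, under the same hypothetical column names as above. The split on sentence availability and the choice of logistic regression are assumptions for illustration, not the project's actual models:

```python
# Sketch: fit the full NLP model on pairs that have sentences and a
# reduced fallback model on pairs that don't, then combine the scores
# so every observation gets a prediction for one ROC curve.
import pandas as pd
from sklearn.linear_model import LogisticRegression

pairs = pd.read_csv("disease_gene_pairs.csv")  # hypothetical input file
has_sentences = pairs["sentence_count"] > 0

full_features = ["prior_prob", "sentence_count", "cooccurrence_p_value",
                 "lstm_sentence_probability"]
# The fallback uses only features that exist without sentences.
fallback_features = ["prior_prob", "disease_doc_count", "gene_doc_count"]

full_model = LogisticRegression().fit(
    pairs.loc[has_sentences, full_features],
    pairs.loc[has_sentences, "label"])
fallback_model = LogisticRegression().fit(
    pairs.loc[~has_sentences, fallback_features],
    pairs.loc[~has_sentences, "label"])

pairs.loc[has_sentences, "score"] = full_model.predict_proba(
    pairs.loc[has_sentences, full_features])[:, 1]
pairs.loc[~has_sentences, "score"] = fallback_model.predict_proba(
    pairs.loc[~has_sentences, fallback_features])[:, 1]
```

The combined `score` column could then feed the overall ROC curve, while a sentences-only ROC curve would reflect the limitation noted above.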
