Entity Mention Linker #3388

Merged
merged 58 commits into master from bf/bio-entity-normalization on Feb 8, 2024

Conversation

helpmefindaname (Collaborator) commented:

via @mariosaenger in #3180

"The main contribution is a entity linking model which uses dense (transformer-based) embeddings and (optionally) sparse character-based representations, for normalizing an entity mention to specific identifiers in a knowledge base / dictionary. To this end, the model embeds the entity mention text and all concept names from the knowledge base and outputs the k best-matching concepts based on embedding similarity."

for each "gene", "disease", "chemical" & "species" I created and uploaded a model to hf,
Those models can be loaded via EntityMentionLinker.load("bio-{label_type}" or EntityMentionLinker.load("bio-{label_type}-exact-match" The first represents the recommended default configuration (note for species it is currently exact match due to lack of alternative, while the latter represents the most simple model to use.
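For reference, loading the two variants for the gene type would look like this (a minimal sketch; it assumes the model names described above are available on the Hugging Face hub):

from flair.models import EntityMentionLinker

# recommended default configuration for genes
gene_linker = EntityMentionLinker.load("bio-gene")

# simplest variant: exact string matching against the dictionary
gene_linker_exact = EntityMentionLinker.load("bio-gene-exact-match")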

I suppose the recommended models will change soon, but I would recommend not making that part of this PR and rather changing it afterwards.

@alanakbik (Collaborator) left a comment:

Thanks for this PR @helpmefindaname and @WangXII!

It was a bit hard to put together the code to actually tag a sentence, for several reasons:

  • It is not obvious that one first needs to run a biomedical NER tagger
  • The entity linker models have a cryptic entity_tag_type which needs to be overwritten
  • The predictions are just integer codes, and it was not obvious to me to which knowledge base they refer to
  • If I interpret correctly, the tested model seems to get even simple things wrong

Here is the code used for testing:

from flair.data import Sentence
from flair.datasets import NCBI_GENE_HUMAN_DICTIONARY
from flair.models import EntityMentionLinker
from flair.nn import Classifier

# Example sentence
sentence = Sentence("We observed A2M, CDKN1A and alpha-2-macroglobulin in the specimen.")

# instantiate NER tagger for genes
ner_tagger = Classifier.load("hunflair-gene")

# instantiate gene linker (the entity_label_type needs to be set to "ner")
gene_linker = EntityMentionLinker.load("bio-gene")
gene_linker.entity_label_type = "ner"

# use both taggers to predict
ner_tagger.predict(sentence)
gene_linker.predict(sentence)

# interpret results using dictionary
dictionary = NCBI_GENE_HUMAN_DICTIONARY()

print(sentence)

for entity in sentence.get_labels("gene"):
    print(entity)
    link = dictionary[entity.value]
    print(f" -> linked to: '{link.concept_name}'")

In this snippet, even though they are mentioned explicitly, the genes A2M and alpha-2-macroglobulin are linked to seemingly random entries. The "exact match" model performs only marginally better.

@WangXII - am I doing something wrong or why is the accuracy of the linking in such cases so low?

Some suggestions / questions:

  1. You could prepare MultitaskModels for each type (gene, disease, etc.) that combine the necessary NER tagger and linker and correctly set the entity_tag_type. This would allow users to easily instantiate a single model that works directly (see the sketch below this list).
  2. The label_type of all linking models could be "link" - not sure what is gained from having different label_type names for different biomedical NER classes.
  3. Is there some way of including the dictionary into the linker and preparing convenience functions to interpret the links?
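
To illustrate suggestion 1: a bundled model would essentially spare users from wiring the two components together by hand, roughly like this (a minimal sketch based on the snippet above, using the same model names and attributes):

from flair.data import Sentence
from flair.models import EntityMentionLinker
from flair.nn import Classifier

def load_gene_pipeline():
    """What a bundled 'gene' model could do internally: load tagger and linker and wire the label type."""
    ner_tagger = Classifier.load("hunflair-gene")
    gene_linker = EntityMentionLinker.load("bio-gene")
    gene_linker.entity_label_type = "ner"  # point the linker at the tagger's output layer
    return ner_tagger, gene_linker

ner_tagger, gene_linker = load_gene_pipeline()

sentence = Sentence("We observed A2M in the specimen.")
ner_tagger.predict(sentence)
gene_linker.predict(sentence)
print(sentence)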

flair/models/entity_mention_linking.py (review thread resolved)
flair/datasets/entity_linking.py (review thread resolved)
flair/models/entity_mention_linking.py (review thread outdated, resolved)
@WangXII (Collaborator) commented Jan 2, 2024:

Thanks for creating the updated pull request @helpmefindaname and for pointing out the low accuracy @alanakbik!

I've looked at the low accuracies and we indeed had a bug with linking to the correct knowledge base identifiers. This part should be fixed now and A2M and alpha-2-macroglobulin point to the correct NCBI Gene identifier number 2.

@alanakbik (Collaborator) commented:

@WangXII thanks for the update! Is there a way I can compute evaluation numbers in Flair, i.e. load a gold dataset, load the model, make predictions, and evaluate? Could you post a snippet for this?

@WangXII (Collaborator) commented Jan 3, 2024:

@sg-wbi can answer this best. I think we have yet to update the evaluation script to the revised Flair API.

@alanakbik (Collaborator) commented:

@sg-wbi can you share an evaluation script? I would like to use it to test the models for accuracy before merging.

@sg-wbi (Collaborator) commented Jan 4, 2024:

@sg-wbi can you share an evaluation script? I would like to use it to test the models for accuracy before merging.

Unfortunately, this is not straightforward. When we developed this, our efforts to integrate it into Flair stopped at the model level, because we had a very specific use case. At the current state, our evaluation scripts require multiple preprocessing steps.

We (@mariosaenger, @WangXII) will get back to you asap with a workable solution (to your suggestions as well).

@sg-wbi (Collaborator) commented Jan 12, 2024:

@alanakbik

Tests for accuracy

Here's the script that will give you accuracy results for 3 commonly used datasets.

from collections import defaultdict

from datasets import load_dataset

from flair.models import EntityMentionLinker
from flair.models.entity_mention_linking import BioSynEntityPreprocessor

ENTITY_TYPE_TO_MODEL = {
    "diseases": "dmis-lab/biosyn-sapbert-ncbi-disease",
    "chemical": "dmis-lab/biosyn-sapbert-bc5cdr-chemical",
    "genes": "dmis-lab/biosyn-sapbert-bc2gn",
}

ENTITY_TYPE_TO_DATASET = {"diseases": "ncbi_disease", "chemical": "bc5cdr", "genes": "gnormplus"}

ENTITY_TYPE_TO_DICTIONARY = {"diseases": "ctd-diseases", "chemical": "ctd-chemicals", "genes": "ncbi-gene"}


def main():
    for entity_type, model in ENTITY_TYPE_TO_MODEL.items():
        ds_name = ENTITY_TYPE_TO_DATASET[entity_type]
        dictionary = ENTITY_TYPE_TO_DICTIONARY[entity_type]

        ds = load_dataset(f"bigbio/{ds_name}", f"{ds_name}_bigbio_kb", trust_remote_code=True)
        print(f"Loaded corpus: `{ds_name}`")

        annotations = [a for d in ds["test"] for a in d["entities"]]
        if ds_name == "bc5cdr":
            annotations = [a for a in annotations if a["type"].lower() == entity_type]

        mention_to_uids = defaultdict(list)
        uid_to_link = {}
        for a in annotations:
            # skip mentions without normalization
            if len(a["normalized"]) == 0:
                continue

            if ds_name == "gnormplus":
                # no prefix for NCBI Gene
                uid_to_link[a["id"]] = [n["db_id"] for n in a["normalized"]]
            else:
                uid_to_link[a["id"]] = [":".join((n["db_name"], n["db_id"])) for n in a["normalized"]]

            for t in a["text"]:
                mention_to_uids[t].append(a["id"])

        linker = EntityMentionLinker.build(
            model,
            entity_type,
            dictionary_name_or_path=dictionary,
            hybrid_search=True,
            preprocessor=BioSynEntityPreprocessor(),
            batch_size=1024,
        )

        mentions = sorted(linker.preprocessor.process_mention(m) for m in set(mention_to_uids))
        results = linker.candidate_generator.search(entity_mentions=mentions, top_k=1)

        hits = 0
        total = 0
        for m, r in zip(mentions, results):
            for uid in mention_to_uids[m]:
                y_true = uid_to_link[uid]
                y_pred, _ = r[0]
                total += 1
                if y_pred in y_true:
                    hits += 1

        accuracy = round(hits / total * 100, 2)

        print(f"EVALUATION |  MODEL: `{model}`, CORPUS: `{ds_name}`,  ACCURACY@1: {accuracy}")


if __name__ == "__main__":
    main()

You should get the following results:

EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-ncbi-disease`, CORPUS: `ncbi_disease`,  ACCURACY@1: 84.3
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc5cdr-chemical`, CORPUS: `bc5cdr`,  ACCURACY@1: 93.85
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc2gn`, CORPUS: `gnormplus`,  ACCURACY@1: 74.26

NOTE: the script reports accuracy on the gold mentions of the dataset (i.e. no NER), since this is how these models are commonly evaluated. Tests that EntityMentionLinker correctly handles the output of a Classifier are in "test_biomedical_entity_linking.py".

NOTE: links to fetch models from the huggingface hub need to be updated. Let us know where we should put pre-trained models.

Suggestions

Is there some way of including the dictionary into the linker and preparing convenience functions to interpret the links?

Here we still have to find a nice solution. As you and @mariosaenger discussed last week, we may opt to build a distinct NENLabel class which inherits from Label and extends it with two additional attributes: dictionary and canonical_name.
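
A minimal sketch of what such a class could look like (the (data_point, value, score) constructor of Label is assumed here; the attribute names follow the proposal above):

from typing import Optional

from flair.data import DataPoint, Label

class NENLabel(Label):
    """Proposed label type that also carries the source dictionary and the canonical concept name."""

    def __init__(self, data_point: DataPoint, value: str, score: float = 1.0,
                 dictionary: Optional[str] = None, canonical_name: Optional[str] = None):
        super().__init__(data_point, value, score)
        self.dictionary = dictionary          # e.g. "ncbi-gene"
        self.canonical_name = canonical_name  # e.g. "alpha-2-macroglobulin"

    def __str__(self) -> str:
        return f"{super().__str__()} [{self.dictionary}: {self.canonical_name}]"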

The label_type of all linking models could be "link" - not sure what is gained from having different label_type names for different biomedical NER classes.

For this, we adapted the implementation following the approach taken in the RelationClassifier model. Users are now able to define specifically on which label and entity types the linker model should be applied, e.g. by passing "ner" for all labels, or {"ner": ["genes", "diseases"]} to restrict the linker to certain entity types for a given label type.
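
For illustration, the two accepted value forms look like this (where exactly the parameter is passed and its final name are assumptions here, modeled after RelationClassifier's entity_label_types):

# apply the linker to every span of the "ner" annotation layer
entity_label_types = "ner"

# or restrict it to selected entity types within that layer
entity_label_types = {"ner": ["genes", "diseases"]}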

You could prepare MultitaskModels for each type (gene, disease, etc.) that combine the necessary NER tagger and linker, and correctly set the entity_tag_type.

Since this PR is already quite substantial, if that's ok with you, we would provide bundled models which correctly link the new NER and NEN models together when all code changes are done.

@helpmefindaname (Collaborator, Author) commented:

@sg-wbi

I am working on integrating your evaluation script into this PR.
While debugging, I think I found a bug:

        ....

-       mentions = sorted(linker.preprocessor.process_mention(m) for m in set(mention_to_uids))
-       results = linker.candidate_generator.search(entity_mentions=mentions, top_k=1)
+       mentions = sorted(mention_to_uids.keys())
+       preproc_mentions = [linker.preprocessor.process_mention(m) for m in mentions]
+       results = linker.candidate_generator.search(entity_mentions=preproc_mentions, top_k=1)

        hits = 0
        total = 0
        for m, r in zip(mentions, results):
        ....

IMO, the previous version ignored all labels for mentions that were changed by preprocessing (e.g. the hard ones).

With the initial script I get the following results:

EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-ncbi-disease`, CORPUS: `ncbi_disease`,  ACCURACY@1: 83.91
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc5cdr-chemical`, CORPUS: `bc5cdr`,  ACCURACY@1: 94.0
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc2gn`, CORPUS: `gnormplus`,  ACCURACY@1: 69.31

After the change I get somewhat worse results, with the exception of the genes corpus, which improves a lot:

EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-ncbi-disease`, CORPUS: `ncbi_disease`,  ACCURACY@1: 76.35
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc5cdr-chemical`, CORPUS: `bc5cdr`,  ACCURACY@1: 84.62
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc2gn`, CORPUS: `gnormplus`,  ACCURACY@1: 79.58

I also noticed that some labels in the dataset do not exist in the dictionary and therefore cannot be predicted correctly.
I found the following upper bounds for the respective dictionary-dataset pairs:

EVALUATION |  NOT_IN_DATA: 81, MAX_ACCURACY@1: 91.56
EVALUATION |  NOT_IN_DATA: 271, MAX_ACCURACY@1: 95.06
EVALUATION |  NOT_IN_DATA: 227, MAX_ACCURACY@1: 92.93

Can you confirm this?
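
For reference, such an upper bound can be computed on top of the evaluation script above roughly like this (a sketch; how to collect the set of identifiers contained in the dictionary depends on the dictionary class and is left as an input here):

from typing import Dict, List, Set

def linking_upper_bound(mention_to_uids: Dict[str, List[str]],
                        uid_to_link: Dict[str, List[str]],
                        dictionary_ids: Set[str]) -> None:
    """Count gold annotations whose identifiers are all missing from the dictionary
    and report the resulting accuracy ceiling."""
    not_in_dict = 0
    total = 0
    for uids in mention_to_uids.values():
        for uid in uids:
            total += 1
            if not any(gold_id in dictionary_ids for gold_id in uid_to_link[uid]):
                not_in_dict += 1

    max_accuracy = round((total - not_in_dict) / total * 100, 2)
    print(f"EVALUATION |  NOT_IN_DATA: {not_in_dict}, MAX_ACCURACY@1: {max_accuracy}")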

@sg-wbi (Collaborator) commented Jan 22, 2024:

I am working on integrating your evaluation script into this PR.

Thanks for taking care of this.

While debugging, I think I found a bug:
IMO, the previous version ignored all labels for mentions that were changed by preprocessing (e.g. the hard ones).

Yes, you are right. In my version, only mentions which did not change after preprocessing were evaluated.

I also noticed that some labels in the dataset do not exist in the dictionary and therefore cannot be predicted correctly.
Can you confirm this?

Yes, this is to be expected. Dictionary labels change over time (they become obsolete or are merged), and some of these corpora were created 10 years ago.

Mario Sänger and others added 19 commits February 4, 2024 13:15
…t basic text and Ab3P pre-processing to the new structure; fix bug in Ab3P abbreviation detection
…m text and (2) entity / concept names from an knowledge base or ontology
- improve name consistency

- make code more pythonic

- dictionaries always do lazy loading

- consistency in dictionary parsing: always yield (cui,name)

- clean up loading w/ CONSTANTS (easily swap models)

- allow access to sparse and dense search
- yet better naming

- add batched search

- fix dictionary loading
- predict only on mentions of given entity type
- fix mypy typing

- fix typos

- update docstrings

- rm faiss from requirements

- better naming

- allow user to specify annotation layer in predict

- allow no mentions
- better naming

- unique cache name
- add option to time search

- change error to warning if pre-trained model is not hybrid

- check if there are mentions to predict
def __init__(
    self,
    candidates: Iterable[EntityCandidate],
    dataset_name: Optional[str] = None,
@alanakbik (Collaborator) commented Feb 7, 2024:

Please add a comment explaining what dataset_name is.

@alanakbik (Collaborator) left a comment:

Looks good! Just a few requests for more docstrings in some new classes.

Additionally, the evaluate method is quite bare-bones, but I have no good idea how to better reuse existing evaluation code for more informative evaluation, so this issue should not block a merge.

return InMemoryEntityLinkingDictionary(list(self._idx_to_candidates.values()), self._dataset_name)


class InMemoryEntityLinkingDictionary(EntityLinkingDictionary):

Please add a comment explaining this class.

@@ -1760,3 +2210,398 @@ def __init__(
banned_sentences=banned_sentences,
sample_missing_splits=sample_missing_splits,
)


class BigbioCorpus(Corpus, abc.ABC):

Please add a comment explaining this class.

return FlairDatapointDataset(all_sentences)


class BIGBIO_NCBI_DISEASE(BigbioCorpus):

Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.

yield unified_example


class BIGBIO_BC5CDR_CHEMICAL(BigbioCorpus):

Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.

yield data


class BIGBIO_GNORMPLUS(BigbioCorpus):

Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.

log.error("-" * 80)
Path(flair.cache_root / "models" / model_folder).rmdir() # remove folder again if not valid
raise
model_path = hf_download(model_name)

Good change!

Comment on lines +1131 to +1141
self,
data_points: Union[List[Sentence], Dataset],
gold_label_type: str,
out_path: Optional[Union[str, Path]] = None,
embedding_storage_mode: str = "none",
mini_batch_size: int = 32,
main_evaluation_metric: Tuple[str, str] = ("accuracy", "f1-score"),
exclude_labels: List[str] = [],
gold_label_dictionary: Optional[Dictionary] = None,
return_loss: bool = True,
k: int = 1,

Many of these parameters are unused (like out_path, embedding_storage_mode, etc.). This is a limitation of the current evaluate signature of the Model class.

Some possibilities (not in this PR):

  • Have the EntityMentionLinker inherit from Classifier instead of Model and reuse its evaluate function. This would also entail adapting the predict method and determining how many of these parameters make sense for an untrained model. For instance, is there a batch size?
  • Implement functionality for as many parameters as possible. For instance, we could implement something for out_path (see the sketch below this list).
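
For the second point, a write-out of predictions could look roughly like this (an illustrative sketch, not part of this PR; it only relies on the standard Sentence/Span API):

from pathlib import Path
from typing import List, Union

from flair.data import Sentence

def write_linking_predictions(sentences: List[Sentence], gold_label_type: str,
                              pred_label_type: str, out_path: Union[str, Path]) -> None:
    """Write one line per gold span: mention text, gold identifier, predicted identifier."""
    with open(out_path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for span in sentence.get_spans(gold_label_type):
                gold = span.get_label(gold_label_type).value
                pred = span.get_label(pred_label_type).value
                f.write(f"{span.text}\t{gold}\t{pred}\n")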

@sg-wbi (Collaborator) commented Feb 7, 2024:

@alanakbik I have added the requested docstrings/comments.

@alanakbik (Collaborator) commented:

@sg-wbi @helpmefindaname @mariosaenger thanks a lot for adding this major new feature to Flair!

@alanakbik merged commit 17e2895 into master on Feb 8, 2024 (1 check passed).
@alanakbik deleted the bf/bio-entity-normalization branch on February 8, 2024, 11:15.