Entity Mention Linker #3388

Merged
merged 58 commits into master from bf/bio-entity-normalization on Feb 8, 2024

Conversation

helpmefindaname (Collaborator) commented:

via @mariosaenger in #3180

"The main contribution is a entity linking model which uses dense (transformer-based) embeddings and (optionally) sparse character-based representations, for normalizing an entity mention to specific identifiers in a knowledge base / dictionary. To this end, the model embeds the entity mention text and all concept names from the knowledge base and outputs the k best-matching concepts based on embedding similarity."

for each "gene", "disease", "chemical" & "species" I created and uploaded a model to hf,
Those models can be loaded via EntityMentionLinker.load("bio-{label_type}" or EntityMentionLinker.load("bio-{label_type}-exact-match" The first represents the recommended default configuration (note for species it is currently exact match due to lack of alternative, while the latter represents the most simple model to use.
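For reference, loading the two variants for the gene type would look like this (a minimal sketch; it assumes the model names described above are available on the Hugging Face hub):

from flair.models import EntityMentionLinker

# recommended default configuration for genes
gene_linker = EntityMentionLinker.load("bio-gene")

# simplest variant: exact string matching against the dictionary
gene_linker_exact = EntityMentionLinker.load("bio-gene-exact-match")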

I suppose the recommended models will change soon, but I would recommend not making that part of this PR and rather changing it afterwards.

@alanakbik (Collaborator) left a comment:

Thanks for this PR @helpmefindaname and @WangXII!

It was a bit hard to put together the code to actually tag a sentence, for several reasons:

  • It is not obvious that one first needs to run a biomedical NER tagger
  • The entity linker models have a cryptic entity_tag_type which needs to be overwritten
  • The predictions are just integer codes, and it was not obvious to me to which knowledge base they refer to
  • If I interpret correctly, the tested model seems to get even simple things wrong

Here is the code used for testing:

from flair.data import Sentence
from flair.datasets import NCBI_GENE_HUMAN_DICTIONARY
from flair.models import EntityMentionLinker
from flair.nn import Classifier

# Example sentence
sentence = Sentence("We observed A2M, CDKN1A and alpha-2-macroglobulin in the specimen.")

# instantiate NER tagger for genes
ner_tagger = Classifier.load("hunflair-gene")

# instantiate gene linker (the entity_label_type needs to be set to "ner")
gene_linker = EntityMentionLinker.load("bio-gene")
gene_linker.entity_label_type = "ner"

# use both taggers to predict
ner_tagger.predict(sentence)
gene_linker.predict(sentence)

# interpret results using dictionary
dictionary = NCBI_GENE_HUMAN_DICTIONARY()

print(sentence)

for entity in sentence.get_labels("gene"):
    print(entity)
    link = dictionary[entity.value]
    print(f" -> linked to: '{link.concept_name}'")

In this snippet, even though they are mentioned explicitly, the genes A2M and alpha-2-macroglobulin are linked to seemingly random entries. The "exact match" model performs only marginally better.

@WangXII - am I doing something wrong or why is the accuracy of the linking in such cases so low?

Some suggestions / questions:

  1. You could prepare MultitaskModels for each type (gene, disease, etc.) that combine the necessary NER tagger and linker and correctly set the entity_tag_type. This would allow users to easily instantiate a single model that works directly (see the sketch below this list).
  2. The label_type of all linking models could be "link" - not sure what is gained from having different label_type names for different biomedical NER classes.
  3. Is there some way of including the dictionary into the linker and preparing convenience functions to interpret the links?
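
To illustrate suggestion 1: a bundled model would essentially spare users from wiring the two components together by hand, roughly like this (a minimal sketch based on the snippet above, using the same model names and attributes):

from flair.data import Sentence
from flair.models import EntityMentionLinker
from flair.nn import Classifier

def load_gene_pipeline():
    """What a bundled 'gene' model could do internally: load tagger and linker and wire the label type."""
    ner_tagger = Classifier.load("hunflair-gene")
    gene_linker = EntityMentionLinker.load("bio-gene")
    gene_linker.entity_label_type = "ner"  # point the linker at the tagger's output layer
    return ner_tagger, gene_linker

ner_tagger, gene_linker = load_gene_pipeline()

sentence = Sentence("We observed A2M in the specimen.")
ner_tagger.predict(sentence)
gene_linker.predict(sentence)
print(sentence)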

flair/models/entity_mention_linking.py (review thread resolved)
flair/datasets/entity_linking.py (review thread resolved)
flair/models/entity_mention_linking.py (review thread outdated, resolved)
@WangXII (Collaborator) commented Jan 2, 2024:

Thanks for creating the updated pull request @helpmefindaname and for pointing out the low accuracy @alanakbik!

I've looked at the low accuracies and we indeed had a bug with linking to the correct knowledge base identifiers. This part should be fixed now and A2M and alpha-2-macroglobulin point to the correct NCBI Gene identifier number 2.

@alanakbik (Collaborator) commented:

@WangXII thanks for the update! Is there a way I can compute evaluation numbers in Flair, i.e. load a gold dataset, load the model, make predictions, and evaluate? Could you post a snippet for this?

@WangXII (Collaborator) commented Jan 3, 2024:

@sg-wbi can answer this best. I think we have yet to update the evaluation script to the revised Flair API.

@alanakbik (Collaborator) commented:

@sg-wbi can you share an evaluation script? I would like to use it to test the models for accuracy before merging.

@sg-wbi (Collaborator) commented Jan 4, 2024:

@sg-wbi can you share an evaluation script? I would like to use it to test the models for accuracy before merging.

Unfortunately, this is not straightforward. When we developed this, our efforts to integrate it into Flair stopped at the model level, because we had a very specific use case. At the current state, our evaluation scripts require multiple preprocessing steps.

We (@mariosaenger, @WangXII) will get back to you asap with a workable solution (to your suggestions as well).

@sg-wbi (Collaborator) commented Jan 12, 2024:

@alanakbik

Tests for accuracy

Here's the script that will give you accuracy results for 3 commonly used datasets.

from collections import defaultdict

from datasets import load_dataset

from flair.models import EntityMentionLinker
from flair.models.entity_mention_linking import BioSynEntityPreprocessor

ENTITY_TYPE_TO_MODEL = {
    "diseases": "dmis-lab/biosyn-sapbert-ncbi-disease",
    "chemical": "dmis-lab/biosyn-sapbert-bc5cdr-chemical",
    "genes": "dmis-lab/biosyn-sapbert-bc2gn",
}

ENTITY_TYPE_TO_DATASET = {"diseases": "ncbi_disease", "chemical": "bc5cdr", "genes": "gnormplus"}

ENTITY_TYPE_TO_DICTIONARY = {"diseases": "ctd-diseases", "chemical": "ctd-chemicals", "genes": "ncbi-gene"}


def main():
    for entity_type, model in ENTITY_TYPE_TO_MODEL.items():
        ds_name = ENTITY_TYPE_TO_DATASET[entity_type]
        dictionary = ENTITY_TYPE_TO_DICTIONARY[entity_type]

        ds = load_dataset(f"bigbio/{ds_name}", f"{ds_name}_bigbio_kb", trust_remote_code=True)
        print(f"Loaded corpus: `{ds_name}`")

        annotations = [a for d in ds["test"] for a in d["entities"]]
        if ds_name == "bc5cdr":
            annotations = [a for a in annotations if a["type"].lower() == entity_type]

        mention_to_uids = defaultdict(list)
        uid_to_link = {}
        for a in annotations:
            # skip mentions without normalization
            if len(a["normalized"]) == 0:
                continue

            if ds_name == "gnormplus":
                # no prefix for NCBI Gene
                uid_to_link[a["id"]] = [n["db_id"] for n in a["normalized"]]
            else:
                uid_to_link[a["id"]] = [":".join((n["db_name"], n["db_id"])) for n in a["normalized"]]

            for t in a["text"]:
                mention_to_uids[t].append(a["id"])

        linker = EntityMentionLinker.build(
            model,
            entity_type,
            dictionary_name_or_path=dictionary,
            hybrid_search=True,
            preprocessor=BioSynEntityPreprocessor(),
            batch_size=1024,
        )

        mentions = sorted(linker.preprocessor.process_mention(m) for m in set(mention_to_uids))
        results = linker.candidate_generator.search(entity_mentions=mentions, top_k=1)

        hits = 0
        total = 0
        for m, r in zip(mentions, results):
            for uid in mention_to_uids[m]:
                y_true = uid_to_link[uid]
                y_pred, _ = r[0]
                total += 1
                if y_pred in y_true:
                    hits += 1

        accuracy = round(hits / total * 100, 2)

        print(f"EVALUATION |  MODEL: `{model}`, CORPUS: `{ds_name}`,  ACCURACY@1: {accuracy}")


if __name__ == "__main__":
    main()

You should get the following results:

EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-ncbi-disease`, CORPUS: `ncbi_disease`,  ACCURACY@1: 84.3
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc5cdr-chemical`, CORPUS: `bc5cdr`,  ACCURACY@1: 93.85
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc2gn`, CORPUS: `gnormplus`,  ACCURACY@1: 74.26

NOTE: the script reports accuracy on the gold mentions of the dataset (i.e. no NER), since this is how these models are commonly evaluated. Tests that EntityMentionLinker correctly handles the output of a Classifier are in "test_biomedical_entity_linking.py".

NOTE: links to fetch models from the huggingface hub need to be updated. Let us know where we should put pre-trained models.

Suggestions

Is there some way of including the dictionary into the linker and preparing convenience functions to interpret the links?

Here we still have to find a nice solution. As you and @mariosaenger discussed last week, we may opt to build a distinct NENLabel class which inherits from Label and extends it with two additional attributes: dictionary and canonical_name.
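
A minimal sketch of what such a class could look like (the (data_point, value, score) constructor of Label is assumed here; the attribute names follow the proposal above):

from typing import Optional

from flair.data import DataPoint, Label

class NENLabel(Label):
    """Proposed label type that also carries the source dictionary and the canonical concept name."""

    def __init__(self, data_point: DataPoint, value: str, score: float = 1.0,
                 dictionary: Optional[str] = None, canonical_name: Optional[str] = None):
        super().__init__(data_point, value, score)
        self.dictionary = dictionary          # e.g. "ncbi-gene"
        self.canonical_name = canonical_name  # e.g. "alpha-2-macroglobulin"

    def __str__(self) -> str:
        return f"{super().__str__()} [{self.dictionary}: {self.canonical_name}]"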

The label_type of all linking models could be "link" - not sure what is gained from having different label_type names for different biomedical NER classes.

For this, we adapted the implementation following the approach taken in the RelationClassifier model. Users are now able to define specifically on which label and entity types the linker model should be applied, e.g. by passing "ner" for all labels, or {"ner": ["genes", "diseases"]} to restrict the linker to certain entity types for a given label type.
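
For illustration, the two accepted value forms look like this (where exactly the parameter is passed and its final name are assumptions here, modeled after RelationClassifier's entity_label_types):

# apply the linker to every span of the "ner" annotation layer
entity_label_types = "ner"

# or restrict it to selected entity types within that layer
entity_label_types = {"ner": ["genes", "diseases"]}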

You could prepare MultitaskModels for each type (gene, disease, etc.) that combine the necessary NER tagger and linker, and correctly set the entity_tag_type.

Since this PR is already quite substantial, if that's ok with you, we would provide bundled models which correctly link the new NER and NEN models together when all code changes are done.

@helpmefindaname (Collaborator, Author) commented:

@sg-wbi

I am working on integrating your evaluation script into this PR.
While debugging, I think I found a bug:

        ....

-       mentions = sorted(linker.preprocessor.process_mention(m) for m in set(mention_to_uids))
-       results = linker.candidate_generator.search(entity_mentions=mentions, top_k=1)
+       mentions = sorted(mention_to_uids.keys())
+       preproc_mentions = [linker.preprocessor.process_mention(m) for m in mentions]
+       results = linker.candidate_generator.search(entity_mentions=preproc_mentions, top_k=1)

        hits = 0
        total = 0
        for m, r in zip(mentions, results):
        ....

IMO, the previous version ignored all labels for mentions that were changed by preprocessing (e.g. the hard ones).

With the initial script I get the following results:

EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-ncbi-disease`, CORPUS: `ncbi_disease`,  ACCURACY@1: 83.91
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc5cdr-chemical`, CORPUS: `bc5cdr`,  ACCURACY@1: 94.0
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc2gn`, CORPUS: `gnormplus`,  ACCURACY@1: 69.31

After the change I get somewhat worse results, with the exception of the genes corpus, which improves a lot:

EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-ncbi-disease`, CORPUS: `ncbi_disease`,  ACCURACY@1: 76.35
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc5cdr-chemical`, CORPUS: `bc5cdr`,  ACCURACY@1: 84.62
EVALUATION |  MODEL: `dmis-lab/biosyn-sapbert-bc2gn`, CORPUS: `gnormplus`,  ACCURACY@1: 79.58

I also noticed that some labels in the dataset do not exist in the dictionary and therefore cannot be predicted correctly.
I found the following upper bounds for the respective dictionary-dataset pairs:

EVALUATION |  NOT_IN_DATA: 81, MAX_ACCURACY@1: 91.56
EVALUATION |  NOT_IN_DATA: 271, MAX_ACCURACY@1: 95.06
EVALUATION |  NOT_IN_DATA: 227, MAX_ACCURACY@1: 92.93

Can you confirm this?
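
For reference, such an upper bound can be computed on top of the evaluation script above roughly like this (a sketch; how to collect the set of identifiers contained in the dictionary depends on the dictionary class and is left as an input here):

from typing import Dict, List, Set

def linking_upper_bound(mention_to_uids: Dict[str, List[str]],
                        uid_to_link: Dict[str, List[str]],
                        dictionary_ids: Set[str]) -> None:
    """Count gold annotations whose identifiers are all missing from the dictionary
    and report the resulting accuracy ceiling."""
    not_in_dict = 0
    total = 0
    for uids in mention_to_uids.values():
        for uid in uids:
            total += 1
            if not any(gold_id in dictionary_ids for gold_id in uid_to_link[uid]):
                not_in_dict += 1

    max_accuracy = round((total - not_in_dict) / total * 100, 2)
    print(f"EVALUATION |  NOT_IN_DATA: {not_in_dict}, MAX_ACCURACY@1: {max_accuracy}")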

@sg-wbi (Collaborator) commented Jan 22, 2024:

I am working on integrating your evaluation script into this PR.

Thanks for taking care of this.

While debugging, I think I found a bug:
IMO, the previous version ignored all labels for mentions that were changed by preprocessing (e.g. the hard ones).

Yes, you are right. In my version, only mentions which did not change after preprocessing were evaluated.

I also noticed that some labels in the dataset do not exist in the dictionary and therefore cannot be predicted correctly.
Can you confirm this?

Yes, this is to be expected. Dictionary labels change over time (they become obsolete or are merged), and some of these corpora were created 10 years ago.

Mario Sänger and others added 19 commits February 4, 2024 13:15
…t basic text and Ab3P pre-processing to the new structure; fix bug in Ab3P abbreviation detection
…m text and (2) entity / concept names from an knowledge base or ontology
- improve name consistency

- make code more pythonic

- dictionaries always do lazy loading

- consistency in dictionary parsing: always yield (cui,name)

- clean up loading w/ CONSTANTS (easily swap models)

- allow access to sparse and dense search
- yet better naming

- add batched search

- fix dictionary loading
- predict only on mentions of given entity type
- fix mypy typing

- fix typos

- update docstrings

- rm faiss from requirements

- better naming

- allow user to specify annotation layer in predict

- allow no mentions
- better naming

- unique cache name
- add option to time search

- change error to warning if pre-trained model is not hybrid

- check if there are mentions to predict
def __init__(
    self,
    candidates: Iterable[EntityCandidate],
    dataset_name: Optional[str] = None,
@alanakbik (Collaborator) commented Feb 7, 2024:

Please add a comment explaining what dataset_name is.

@alanakbik (Collaborator) left a comment:

Looks good! Just a few requests for more docstrings in some new classes.

Additionally, the evaluate method is quite bare-bones, but I have no good idea how to better reuse existing evaluation code for more informative evaluation, so this issue should not block a merge.

return InMemoryEntityLinkingDictionary(list(self._idx_to_candidates.values()), self._dataset_name)


class InMemoryEntityLinkingDictionary(EntityLinkingDictionary):

Please add a comment explaining this class.

@@ -1760,3 +2210,398 @@ def __init__(
banned_sentences=banned_sentences,
sample_missing_splits=sample_missing_splits,
)


class BigbioCorpus(Corpus, abc.ABC):

Please add a comment explaining this class.

return FlairDatapointDataset(all_sentences)


class BIGBIO_NCBI_DISEASE(BigbioCorpus):

Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.

yield unified_example


class BIGBIO_BC5CDR_CHEMICAL(BigbioCorpus):

Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.

yield data


class BIGBIO_GNORMPLUS(BigbioCorpus):

Also, for each of the BigbioCorpus subclasses, please add a brief description of the dataset and a link.

log.error("-" * 80)
Path(flair.cache_root / "models" / model_folder).rmdir() # remove folder again if not valid
raise
model_path = hf_download(model_name)

Good change!

Comment on lines +1131 to +1141
self,
data_points: Union[List[Sentence], Dataset],
gold_label_type: str,
out_path: Optional[Union[str, Path]] = None,
embedding_storage_mode: str = "none",
mini_batch_size: int = 32,
main_evaluation_metric: Tuple[str, str] = ("accuracy", "f1-score"),
exclude_labels: List[str] = [],
gold_label_dictionary: Optional[Dictionary] = None,
return_loss: bool = True,
k: int = 1,

Many of these parameters are unused (like out_path, embedding_storage_mode, etc.). This is a limitation of the current evaluate signature of the Model class.

Some possibilities (not in this PR):

  • Have the EntityMentionLinker inherit from Classifier instead of Model and reuse its evaluate function. This would also entail adapting the predict method and determining how many of these parameters make sense for an untrained model. For instance, is there a batch size?
  • Implement functionality for as many parameters as possible. For instance, we could implement something for out_path (see the sketch below this list).
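
For the second point, a write-out of predictions could look roughly like this (an illustrative sketch, not part of this PR; it only relies on the standard Sentence/Span API):

from pathlib import Path
from typing import List, Union

from flair.data import Sentence

def write_linking_predictions(sentences: List[Sentence], gold_label_type: str,
                              pred_label_type: str, out_path: Union[str, Path]) -> None:
    """Write one line per gold span: mention text, gold identifier, predicted identifier."""
    with open(out_path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for span in sentence.get_spans(gold_label_type):
                gold = span.get_label(gold_label_type).value
                pred = span.get_label(pred_label_type).value
                f.write(f"{span.text}\t{gold}\t{pred}\n")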

@sg-wbi (Collaborator) commented Feb 7, 2024:

@alanakbik I have added the requested docstrings/comments.

@alanakbik (Collaborator) commented:

@sg-wbi @helpmefindaname @mariosaenger thanks a lot for adding this major new feature to Flair!

@alanakbik merged commit 17e2895 into master on Feb 8, 2024 (1 check passed).
@alanakbik deleted the bf/bio-entity-normalization branch on February 8, 2024, 11:15.