Add biomedical entity normalization #3180
…t basic text and Ab3P pre-processing to the new structure; fix bug in Ab3P abbreviation detection
…m text and (2) entity / concept names from a knowledge base or ontology
- improve name consistency - make code more pythonic - dictionaries always do lazy loading - consistency in dictionary parsing: always yield (cui,name) - clean up loading w/ CONSTANTS (easily swap models) - allow access to sparse and dense search
- yet better naming - add batched search - fix dictionary loading
- predict only on mentions of given entity type
- fix mypy typing - fix typos - update docstrings - rm faiss from requirements - better naming - allow user to specify annotation layer in predict - allow no mentions
    import faiss
except ImportError as error:
    raise ImportError(
        f"You need to install to run the biomedical entity linking: `pip faiss faiss-cpu=={FAISS_VERSION}`"
Typo: the install command should be "pip install faiss-cpu.." (instead of "pip faiss faiss-cpu").
Moreover, I would recommend adjusting the message to mention the GPU version of faiss too, i.e. add "pip install faiss-gpu..."
I have removed the option to place the index on the GPU.
Large dictionaries require a lot of GPU RAM and unless we offer some compression it does not make too much sense.
We can leave it as a future feature in a separate PR.
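For a rough sense of scale: an uncompressed (flat) float32 index stores one vector per concept name, so its size is roughly the number of names times the embedding dimension times 4 bytes. The sketch below only does that arithmetic; the dictionary size and embedding dimension are hypothetical values, not measurements from this PR.

```python
def flat_index_bytes(num_names: int, dim: int, bytes_per_value: int = 4) -> int:
    """Approximate size of an uncompressed dense index (float32 by default)."""
    return num_names * dim * bytes_per_value


# Hypothetical dictionary: 1,000,000 concept names, 768-dimensional embeddings.
size = flat_index_bytes(num_names=1_000_000, dim=768)
print(f"{size / 1024**3:.2f} GiB")
```

The estimate ignores any overhead from the index structure itself, so it is a lower bound.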
- better naming - unique cache name
- add option to time search - change error to warning if pre-trained model is not hybrid - check if there are mentions to predict
This PR implements a named entity recognition model focussing on the biomedical domain.
The main contribution is an entity linking model that uses dense (transformer-based) embeddings and, optionally, sparse character-based representations to normalize an entity mention to specific identifiers in a knowledge base / dictionary. To this end, the model embeds the entity mention text and all concept names from the knowledge base and outputs the k best-matching concepts based on embedding similarity.
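The retrieval step described above can be illustrated with a minimal sketch. It is not the code from this PR: the `embed` function is a toy stand-in (hashed character trigrams) for the transformer encoder, the `link` helper is a hypothetical name, and the dictionary entries and their identifiers are made up. Only the overall shape follows the description: embed the mention and all `(cui, name)` pairs, then return the k most similar concepts.

```python
import zlib

import numpy as np


def embed(texts, dim=256, n=3):
    """Toy stand-in for an encoder: hashed character n-gram counts, L2-normalized."""
    vectors = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        padded = f" {text.lower()} "
        for i in range(len(padded) - n + 1):
            gram = padded[i : i + n]
            vectors[row, zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def link(mention, dictionary, k=2):
    """Return the k (cui, name, score) triples most similar to the mention."""
    names = [name for _, name in dictionary]
    # Cosine similarity reduces to a dot product because vectors are normalized.
    scores = embed(names) @ embed([mention])[0]
    top = np.argsort(-scores)[:k]
    return [(dictionary[i][0], dictionary[i][1], float(scores[i])) for i in top]


# Hypothetical dictionary of (cui, name) pairs with made-up identifiers.
dictionary = [
    ("X:1", "breast cancer"),
    ("X:2", "lung cancer"),
    ("X:3", "diabetes mellitus"),
]
candidates = link("breast cancers", dictionary, k=2)
for cui, name, score in candidates:
    print(cui, name, round(score, 3))
```

For large dictionaries, the exhaustive dot product would normally be replaced by an index such as faiss, which is what the import guard discussed in the review is for.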