- Corrected training with `train_in_candidates` set to False.
- We made an architectural change and switched to a biencoder model. This changes our task flow and data prep. The new model uses less CPU storage and uses the standard BERT architecture. Our entity encoder now takes a textual input for each entity that contains its title, description, KG relationships, and types (see the sketch after this list).
- To support larger files for dumping predictions over, we support adding an `entity_emb_file` to the model (extracted from `extract_all_entities.py`). This will make evaluation faster. Further, we added `dump_preds_num_data_splits` to split a file before dumping. As each file pass gets a new dataloader object, this can mitigate any torch dataloader memory issues that happen over large files. See the embedding sketch after this list.
- Renamed `eval_accumulation_steps` to `dump_preds_accumulation_steps`.
- Removed option to `dump_embs`. Users should use `dump_preds` instead. The output file will have an `entity_ids` attribute that indexes into the extracted entity embeddings.
- Restructured our `entity_db` data for faster loading. It uses Tries rather than JSONs to store the data in read-only mode. The KG relations are not backwards compatible.
- Moved to character spans for input data. Added `utils.preprocessing.convert_to_char_spans` as a helper function to convert from word offsets to character offsets (see the sketch after this list).
- Added `BOOTLEG_STRIP` and `BOOTLEG_LOWER` environment variables for `get_lnrm`.
- Added `extract_all_entities.py` as a way to extract all entity embeddings. These entity embeddings can be used in eval and downstream. Users can use `get_eid` from the `EntityProfile` to extract the row id for a specific entity (see the sketch after this list).
- Fixed -1 command line argparse error
- Adjusted requirements
- Added a tutorial to generate contextualized entity embeddings that perform better downstream
- Bumped version of Pydantic to 1.7.4
- Corrected how custom candidates were handled in the BootlegAnnotator when using `extracted_examples`
- Fixed memory leak in the BootlegAnnotator due to missing `torch.no_grad()`
- Support for passing `min_alias_len` to `extract_mentions` and the `BootlegAnnotator`.
- `return_embs` flag to pass into `BootlegAnnotator` that will return the contextualized embeddings of the predicted entity (under key `embs`) and the entity candidates (under key `cand_embs`). See the usage sketch after this list.
- Removed condition that aliases for eval must appear in candidate lists. We now allow eval data to contain unknown aliases and always mark these as incorrect. When dumping predictions, these get "-1" candidates and null probabilities.
- Corrected `fit_to_profile` to rebuild the title embeddings for the new entities.
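
The sketch below makes the biencoder change above concrete: it shows one plausible way an entity could be flattened into the textual input the entity encoder consumes (title, description, KG relationships, and types). The field order and separator here are assumptions for illustration, not Bootleg's exact serialization format.

```python
def entity_to_text(title, description, types, relations, sep=" [SEP] "):
    """Flatten entity metadata into a single string for a BERT-style encoder.

    Hypothetical serialization: Bootleg's entity encoder consumes a textual
    input built from these fields, but the exact format, order, and
    separators used by the library may differ.
    """
    parts = [
        title,
        description,
        "types: " + ", ".join(types),
        "relations: " + ", ".join(relations),
    ]
    return sep.join(p for p in parts if p)


print(
    entity_to_text(
        title="Abraham Lincoln",
        description="16th president of the United States",
        types=["politician", "lawyer"],
        relations=["position held: President of the United States"],
    )
)
```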
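
A rough sketch of the entity-embedding workflow above: run `extract_all_entities.py` to produce a matrix of entity embeddings, then use `get_eid` from the `EntityProfile` to index into it. The file paths, the NumPy output format, and the `load_from_cache` call are assumptions here; check the docs for the exact interface.

```python
import numpy as np

from bootleg.symbols.entity_profile import EntityProfile

# Assumed paths: adjust to wherever your entity_db and extracted
# embeddings actually live.
ENTITY_DB_DIR = "data/entity_db"
ENTITY_EMB_FILE = "entity_embeddings.npy"  # hypothetical output of extract_all_entities.py

# load_from_cache and the .npy format are assumptions for this sketch.
ep = EntityProfile.load_from_cache(ENTITY_DB_DIR)
entity_embs = np.load(ENTITY_EMB_FILE)

# get_eid maps an entity QID to its row in the extracted embedding matrix.
qid = "Q95"
row_id = ep.get_eid(qid)
print(qid, entity_embs[row_id].shape)
```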
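
The move to character spans means mention boundaries are now expressed as character offsets into the raw sentence rather than word offsets. Bootleg ships `utils.preprocessing.convert_to_char_spans` for this; the snippet below is a stand-alone illustration of the same word-offset-to-character-offset conversion, not the library helper itself.

```python
def word_to_char_span(sentence: str, word_span: tuple) -> tuple:
    """Convert a [start, end) word-offset span into a [start, end) character span.

    Illustrative only: Bootleg provides utils.preprocessing.convert_to_char_spans
    for this conversion; this function just shows the underlying idea for a
    whitespace-tokenized sentence.
    """
    words = sentence.split()
    start_word, end_word = word_span
    # Character index where each word begins (whitespace tokenization assumed).
    char_starts, offset = [], 0
    for w in words:
        char_starts.append(sentence.index(w, offset))
        offset = char_starts[-1] + len(w)
    char_start = char_starts[start_word]
    char_end = char_starts[end_word - 1] + len(words[end_word - 1])
    return char_start, char_end


sentence = "Abraham Lincoln was born in Kentucky"
print(word_to_char_span(sentence, (0, 2)))  # (0, 15) -> "Abraham Lincoln"
print(word_to_char_span(sentence, (5, 6)))  # (28, 36) -> "Kentucky"
```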
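
A usage sketch for the annotator items above. The `min_alias_len` and `return_embs` arguments and the `embs`/`cand_embs` output keys come from the items in this list; the import path and the shape of the rest of the output are assumptions to verify against your installed version.

```python
# Sketch only: import path and output structure beyond embs/cand_embs are
# assumptions and should be double-checked against your Bootleg version.
from bootleg.end2end.bootleg_annotator import BootlegAnnotator

ann = BootlegAnnotator(
    min_alias_len=1,   # minimum alias length to extract (assumed unit: words)
    return_embs=True,  # also return contextualized entity embeddings
)

out = ann.label_mentions("Lincoln was the president during the Civil War")

# With return_embs=True, the output includes the contextualized embedding of
# the predicted entity ("embs") and of all entity candidates ("cand_embs").
print(out["embs"])
print(out["cand_embs"])
```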
Note
If upgrading to 1.0.1 from 1.0.0, you will need to re-download our models using the links in the README.md. We altered what keys were saved in the state dict, but the model weights are unchanged.
- `data_config.print_examples_prep` flag to toggle data example printing during data prep.
- `data_config.dump_preds_accumulation_steps` to support sub-batched dumping of predictions. We save outputs to separate files of size approximately `data_config.dump_preds_accumulation_steps * data_config.eval_batch_size` and merge them into a final file at the end. See the config sketch after this list.
- Entity Profile API. See the docs. This allows for modifying entity metadata as well as adding and removing entities. We provide methods for refitting a model with a new profile for immediate inference, no finetuning needed. A sketch of the API appears after this list.
- Support for not using multiprocessing if the user sets `data_config.dataset_threads` to 1.
- Added better argument parsing to check for arguments that were misspelled or otherwise wouldn't trigger anything.
- Code is now Flake8 compatible.
- Fixed readthedocs so the BootlegAnnotator was loaded correctly.
- Fixed logging in BootlegAnnotator.
- Fixed `use_exact_path` argument in Emmental.
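
A hypothetical config fragment tying together the `data_config` options mentioned in this list, shown as a Python dict mirroring the YAML layout. Only the keys named above come from the changelog; the values and surrounding structure are placeholders.

```python
# Illustrative config fragment. Only the keys mentioned in the items above
# are real options; values are placeholders.
config = {
    "data_config": {
        "print_examples_prep": False,           # toggle example printing during data prep
        "dump_preds_accumulation_steps": 1000,  # number of eval batches per dumped sub-file
        "eval_batch_size": 32,                  # sub-file size is roughly 1000 * 32 examples
        "dataset_threads": 1,                   # 1 disables multiprocessing during data prep
    },
}
```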
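
A minimal sketch of the Entity Profile API described above. The method names (`load_from_cache`, `add_entity`, `save`) and the shape of the entity record are assumptions drawn from the description; see the docs for the exact interface.

```python
from bootleg.symbols.entity_profile import EntityProfile

# Assumed paths and method names -- consult the Bootleg docs for the exact API.
ep = EntityProfile.load_from_cache("data/entity_db")

# Add a new entity with its metadata (record schema is illustrative only).
ep.add_entity(
    {
        "entity_id": "Q123456",
        "mentions": [["new gadget", 1.0]],
        "title": "New Gadget",
        "description": "A hypothetical entity added for illustration.",
        "types": {"wiki": ["product"]},
        "relations": [],
    }
)

# Persist the modified profile; a model can then be refit to the new profile
# for immediate inference, no finetuning needed.
ep.save("data/entity_db_new")
```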
We did a major rewrite of our entire codebase and moved to using Emmental for training. Emmental allows for easy multi-task training, FP16, and support for both DataParallel and DistributedDataParallel.
The overall functionality of Bootleg remains unchanged. We still support the use of an annotator and bulk mention extraction and evaluation. The core Bootleg model has remained largely unchanged. Check out our documentation for more information on getting started. We have new models trained as described in our README.
Note
This branch is not backwards compatible with our old models or code base.
Some more subtle changes are below
- Support for data parallel and distributed data parallel training (through Emmental)
- FP16 (through Emmental)
- Easy install with `BootlegAnnotator` (see the sketch at the end of this list)
- Mention extraction code and alias map have been updated
- Models trained on October 2020 save of Wikipedia
- Both uncased and cased models are available
- Support for slice-based learning
- Removed support for `batch prepped` KG embeddings (only use `batch on the fly`)
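
For the "easy install" item above, a minimal end-to-end sketch: install the package, instantiate the annotator, and label mentions in a sentence. The import path and the default model-download behavior are assumptions to verify against the README for your installed version.

```python
# First: pip install bootleg
#
# Minimal sketch of the "easy install" annotator flow. Import path and
# default model download behavior should be checked against the README.
from bootleg.end2end.bootleg_annotator import BootlegAnnotator

ann = BootlegAnnotator()  # assumed to download and cache a pretrained model by default
out = ann.label_mentions("How many people are in Lincoln")

# Output typically includes the extracted mentions and predicted entities.
print(out)
```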