The PyTorch Machine Reading Comprehension (PyTorch-MRC) toolkit is a rewrite of the Sogou Machine Reading Comprehension (SMRC) toolkit. It is designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.
I write and maintain the whole project on my own, so I hope that others who enjoy NLP and are interested in MRC will help me maintain it. Please contact me by email at [email protected].
data
- vocabulary.py: Vocabulary building, word/char index mapping and pretrained word embedding building.
- batch_generator.py: Mapping words and tags to indices and building them by PyTorch Dataset, padding length-variable features dynamically, transforming all of the features into tensors, and batching them by PyTorch DataLoader.
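The dynamic padding described for batch_generator.py can be sketched roughly as follows. This is only a minimal illustration, not the toolkit's actual code: `ToyMRCDataset`, `pad_collate`, the feature names, and the padding index 0 are all assumptions made for the example.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyMRCDataset(Dataset):
    """Hypothetical dataset: each item is a dict of variable-length index lists."""

    def __init__(self, instances):
        self.instances = instances

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, idx):
        return self.instances[idx]


def pad_collate(batch, pad_id=0):
    """Pad each feature to the longest sequence in this batch only (assumed pad_id=0)."""
    out = {}
    for key in batch[0]:
        seqs = [item[key] for item in batch]
        max_len = max(len(s) for s in seqs)
        padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
        out[key] = torch.tensor(padded, dtype=torch.long)
        # Keep the true lengths so later layers can build masks.
        out[key + "_len"] = torch.tensor([len(s) for s in seqs], dtype=torch.long)
    return out


instances = [
    {"context_ids": [4, 9, 2, 7, 5], "question_ids": [3, 8]},
    {"context_ids": [6, 1, 3], "question_ids": [2, 5, 9, 4]},
]
loader = DataLoader(
    ToyMRCDataset(instances), batch_size=2, shuffle=False, collate_fn=pad_collate
)
batch = next(iter(loader))
print(batch["context_ids"].shape)
print(batch["question_ids"])
```

Padding per batch rather than to a global maximum keeps tensors small when most contexts are short, at the cost of batch shapes that vary from step to step.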
dataset
- squad.py: Dataset reader and evaluator (from official code) for SQuAD 1.1
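For readers unfamiliar with the SQuAD 1.1 metrics, the sketch below shows how exact match (EM) and token-level F1 are typically computed. It follows the logic of the official evaluation script as I understand it; it is not copied from squad.py, and the function names here are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """A question may have several reference answers; score against the best one."""
    return max(metric_fn(prediction, gt) for gt in ground_truths)


em = exact_match_score("The Eiffel Tower!", "eiffel tower")
f1 = metric_max_over_ground_truths(f1_score, "in Paris France", ["Paris", "Lyon"])
print(em, f1)
```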
examples
- Examples for running different models; the data path must be specified to run them
model
- Base class and subclasses of models, where any model should inherit the base class
- Built-in models such as BiDAF, R-Net and QANet
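To illustrate the idea of inheriting from a base class, here is a minimal sketch. `BaseModel`, `ToySpanModel`, `compute_loss`, and the batch keys are hypothetical names chosen for this example; the toolkit's real base class may expose a different interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BaseModel(nn.Module):
    """Hypothetical base class: subclasses implement forward() and return a dict."""

    def forward(self, batch):
        raise NotImplementedError

    def compute_loss(self, output, batch):
        # Span extraction loss: cross-entropy over start plus end positions.
        return F.cross_entropy(
            output["start_logits"], batch["answer_start"]
        ) + F.cross_entropy(output["end_logits"], batch["answer_end"])


class ToySpanModel(BaseModel):
    """Toy subclass: scores each context token independently (no masking of padding)."""

    def __init__(self, vocab_size, hidden_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
        self.start_proj = nn.Linear(hidden_size, 1)
        self.end_proj = nn.Linear(hidden_size, 1)

    def forward(self, batch):
        hidden = self.embedding(batch["context_ids"])  # (batch, time, hidden)
        return {
            "start_logits": self.start_proj(hidden).squeeze(-1),
            "end_logits": self.end_proj(hidden).squeeze(-1),
        }


batch = {
    "context_ids": torch.tensor([[4, 9, 2, 7, 5], [6, 1, 3, 0, 0]]),
    "answer_start": torch.tensor([1, 0]),
    "answer_end": torch.tensor([2, 1]),
}
model = ToySpanModel(vocab_size=10)
output = model(batch)
loss = model.compute_loss(output, batch)
print(output["start_logits"].shape, loss.item())
```

Keeping the loss in the base class means that, under this assumed design, a new model only needs to produce start and end logits.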
nn
- attention.py: Attention functions such as BiAttention, Trilinear and MultiHeadAttention
- layers: Commonly used layers in PyTorch Machine Reading Comprehension, such as VariationalDropout, Highway and PointerNetwork
- recurrent: Special wrappers for LSTM and GRU
- similarity_function.py: Similarity functions for attention, such as dot_product, trilinear, and symmetric_nolinear
- util: Utility functions such as sequence_mask, weighted_sum and masked_softmax
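The three utilities named under util can be sketched as below. The function names come from the list above, but the signatures and bodies are my own assumptions about what such helpers usually look like, not the toolkit's implementation.

```python
import torch


def sequence_mask(lengths, max_len=None):
    """Return a (batch, max_len) boolean mask that is True at real (non-padded) positions."""
    max_len = int(lengths.max()) if max_len is None else max_len
    positions = torch.arange(max_len, device=lengths.device)
    return positions.unsqueeze(0) < lengths.unsqueeze(1)


def masked_softmax(logits, mask, dim=-1):
    """Softmax that assigns zero probability to masked-out positions.

    Assumes every row has at least one valid position; a fully masked row gives NaN.
    """
    return torch.softmax(logits.masked_fill(~mask, float("-inf")), dim=dim)


def weighted_sum(values, weights):
    """values: (batch, time, hidden), weights: (batch, time) -> (batch, hidden)."""
    return torch.bmm(weights.unsqueeze(1), values).squeeze(1)


lengths = torch.tensor([3, 1])
mask = sequence_mask(lengths, max_len=4)
probs = masked_softmax(torch.zeros(2, 4), mask)
pooled = weighted_sum(torch.ones(2, 4, 5), probs)
print(mask)
print(probs)
print(pooled.shape)
```

With uniform logits, each row's probability mass is spread evenly over its valid positions only, which is the behaviour attention layers rely on when batches contain padding.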
utils
- tokenizer.py: Tokenizers that can be used for both English and Chinese
- feature_extractor: Extracting linguistic features used in some papers, e.g., POS, NER, and Lemma
Model | Toolkit implementation (F1/EM) | Original paper (F1/EM) |
---|---|---|
BiDAF | 77.8/68.1 | 77.3/67.7 |
R-Net(sogou) | 79.0/70.5 | 79.5/71.1 |
R-Net(hkust) | 78.3/69.8 | 79.5/71.1 |
IARNN-Word | - | - |
IARNN-hidden | - | - |
DrQA | - | 78.8/69.5 |
FusionNet | - | 82.5/74.1 |
QANet | - | 82.7/73.6 |
BERT-Base | - | 88.5/80.8 |
For help or issues using this toolkit, please submit a GitHub issue or contact me by email at [email protected].
When implementing the MRC models, I sometimes did not follow the papers exactly, either because some parts of a paper were not clear to me or because I did not think they played a decisive role. The changes I have made are listed below.