Access
- Data Set on Zenodo: full / permissively licensed subset
- Data Sample
- ML Data on Hugging Face: citation recommendation / IMRaD classification
Documentation
- Publications
- Scientometrics (author copy) (2020)
- JCDL 2023 (author copy) (2023)
- Data Format
- Usage
- Development
- Cite
unarXive contains
- 1.9 M structured paper full-texts, containing
- 63 M references (28 M linked to OpenAlex)
- 134 M in-text citation markers (65 M linked)
- 9 M figure captions
- 2 M table captions
- 742 M pieces of mathematical notation preserved as LaTeX
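The counts above imply how much of the citation data is linked and how dense it is per paper. The following is a rough back-of-the-envelope sketch that uses only the rounded figures from the list, so the results are approximate; the names `PAPERS_M`, `COUNTS_M`, `linked_share`, and `per_paper` are illustrative and not part of unarXive.

```python
# Counts in millions, copied from the list above (rounded figures).
PAPERS_M = 1.9
COUNTS_M = {
    # name: (total, linked)
    "references": (63, 28),
    "in-text citation markers": (134, 65),
}


def linked_share(total, linked):
    """Fraction of items that are linked."""
    return linked / total


def per_paper(total, papers=PAPERS_M):
    """Average number of items per paper."""
    return total / papers


for name, (total, linked) in COUNTS_M.items():
    share = linked_share(total, linked)
    print(f"{name}: {share:.1%} linked, ~{per_paper(total):.0f} per paper")
```

With these rounded inputs, roughly 44% of references and 49% of in-text citation markers are linked, at about 33 references and 71 citation markers per paper.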
Comprehensive documentation of the data format can be found here.
You can find a data sample here.
If you want to use unarXive for citation recommendation or IMRaD classification, you can use our Hugging Face datasets (see ML Data under Access above).
For example, for citation recommendation:
```python
from datasets import load_dataset

citrec_data = load_dataset('saier/unarxive_citrec')
citrec_data = citrec_data.class_encode_column('label')  # assign target label column
citrec_data = citrec_data.remove_columns('_id')  # remove sample ID column
```
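The two preparation steps above can be wrapped in a small helper so they are easy to reuse. This is a minimal sketch, not part of unarXive or the `datasets` library: the function names `prepare_citrec` and `load_prepared_citrec` are hypothetical, and it assumes the dataset keeps the `label` and `_id` columns used above.

```python
def prepare_citrec(dataset, label_column="label", id_column="_id"):
    """Encode the target column as class labels and drop the sample ID column."""
    # class_encode_column casts the column to a ClassLabel feature.
    encoded = dataset.class_encode_column(label_column)
    # remove_columns returns a copy without the given column.
    return encoded.remove_columns(id_column)


def load_prepared_citrec(name="saier/unarxive_citrec"):
    """Download the dataset from the Hugging Face Hub and prepare it."""
    # Imported lazily so that prepare_citrec can be used without a download;
    # requires the `datasets` package to be installed.
    from datasets import load_dataset

    return prepare_citrec(load_dataset(name))
```

Calling `load_prepared_citrec()` should yield the same object as the snippet above; the column names are parameters only so that the helper can be adapted if a different target or ID column is needed.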
For instructions on how to re-create or extend unarXive, see src/.
Versions
- Current release (1991–2022): see Access section above
- Previous releases (old format):
Development Status
See issues.
Current version
```bibtex
@inproceedings{Saier2023unarXive,
  author = {Saier, Tarek and Krause, Johan and F{\"{a}}rber, Michael},
  title = {{unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network}},
  booktitle = {2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  year = {2023},
  pages = {66--70},
  month = jun,
  doi = {10.1109/JCDL57899.2023.00020},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
```
Initial publication
```bibtex
@article{Saier2020unarXive,
  author = {Saier, Tarek and F{\"{a}}rber, Michael},
  title = {{unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata}},
  journal = {Scientometrics},
  year = {2020},
  volume = {125},
  number = {3},
  pages = {3085--3108},
  month = dec,
  issn = {1588-2861},
  doi = {10.1007/s11192-020-03382-z}
}
```