The work in this repo is part of the "Networked Mathematics" project at the Topos Institute.
You can read about the project in the blog posts:
- Introducing the MathFoldr Project (11 Jul 2021)
- The many facets of Networked Mathematics (18 Apr 2022)
- Mathematical concepts: how do you recognize them? (16 Nov 2022)
and in our paper “Extracting Mathematical Concepts from Text”, presented at the 8th Workshop on Noisy User-generated Text (W-NUT 2022), associated with COLING 2022. Here's a short video about the work.
You can also "eyeball" the first two corpora we investigated in the Parmesan 0.11 and Parmesan 0.12 prototypes.
For the unmodified results of our 2022 paper, see this branch.
A dev version of Parmesan 0.2 is now up at http://www.jacobcollard.com/parmesan2/
This repository contains a corpus based on the contents of abstracts in Theory and Applications of Categories (TAC) as of c. December 2020. This can be used as a training/testing corpus for mathematical NLP and machine learning projects.
The corpus contains the following data files:
- tac.conll contains an automatically annotated version of the corpus, with dependency structures and POS tags.
- tac.json contains the original corpus, in JSON format.
- tac_metadata.json contains the original corpus, in JSON format, with additional metadata such as authors and keywords.
- tac_stats.json contains some basic statistics about the corpus, including frequency of common words and parts of speech.
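
As a rough illustration of how the annotated file might be read, here is a minimal sketch of a CoNLL-style reader. It assumes the usual conventions (one token per line, tab-separated columns, blank lines between sentences, lines starting with `#` treated as comments). The exact column layout of tac.conll is not documented here, so the function `parse_conll` and the sample data are hypothetical and keep every column as a raw string.

```python
def parse_conll(text):
    """Split CoNLL-style text into sentences, each a list of token rows.

    Assumptions (not confirmed against tac.conll): columns are
    tab-separated, sentences are separated by blank lines, and lines
    starting with '#' are comments.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Skip comment/metadata lines.
            continue
        else:
            current.append(line.split("\t"))
    if current:
        # Keep a final sentence that is not followed by a blank line.
        sentences.append(current)
    return sentences


# Hand-written sample with made-up columns (id, form, coarse POS).
SAMPLE = "1\tEvery\tDET\n2\ttopos\tNOUN\n\n1\tLimits\tNOUN\n2\texist\tVERB\n"
sentences = parse_conll(SAMPLE)

# To try it on the real file (path assumed relative to the repo root):
# with open("tac.conll", encoding="utf-8") as f:
#     sentences = parse_conll(f.read())
```

Keeping the rows as plain lists of strings avoids committing to a particular column scheme; once the actual columns are known, each row can be mapped to named fields.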
The tac-experiments
folder contains a series of simple experiments evaluating
various automatic terminology extraction methods on the TAC corpus. To fully
run these experiments, you will need an installation of both
DyGIE++ and
Parmenides.
The latter is unfortunately not freely available, but you can contact the
authors for distribution.
There are two types of part-of-speech tags in the corpus statistics, both
generated by spaCy. The first tagset, labeled "pos" in nlab_stats.json,
represents coarse-grained part of speech and is taken from the
Universal POS tag set. The
second tagset, "tag", is specific to spaCy's pretrained English model.
Details about the different tagsets, as well as other label schemes for this model, can be found on spaCy's website.
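
To make the difference between the two tagsets concrete, here is a small sketch that counts coarse-grained and fine-grained labels separately. The token triples are hand-written for illustration, not actual spaCy output, and the `tag_frequencies` helper and the shape of its result are assumptions, not the real layout of the statistics file. With spaCy and an English model installed, comparable triples could be built from each token's `pos_` and `tag_` attributes.

```python
from collections import Counter


def tag_frequencies(tokens):
    """Count coarse ("pos") and fine-grained ("tag") labels separately.

    `tokens` is an iterable of (text, pos, tag) triples. The returned
    dictionary layout is hypothetical, chosen only for this example.
    """
    tokens = list(tokens)
    pos_counts = Counter(pos for _, pos, _ in tokens)
    tag_counts = Counter(tag for _, _, tag in tokens)
    return {"pos": dict(pos_counts), "tag": dict(tag_counts)}


# Illustrative triples: Universal POS label, then a Penn Treebank-style tag.
# With spaCy these could come from:
#   [(t.text, t.pos_, t.tag_) for t in nlp("Every topos has finite limits.")]
SAMPLE_TOKENS = [
    ("Every", "DET", "DT"),
    ("topos", "NOUN", "NN"),
    ("has", "VERB", "VBZ"),
    ("finite", "ADJ", "JJ"),
    ("limits", "NOUN", "NNS"),
    (".", "PUNCT", "."),
]
stats = tag_frequencies(SAMPLE_TOKENS)
```

In this sample the single coarse label NOUN covers two fine-grained tags (NN for singular and NNS for plural), which is the kind of distinction the second tagset preserves.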