Releases: KWARC/llamapun
upgrades and statement extraction for arXMLiv 08.2019
arXMLiv 08.2019 release
Tagged release used to extract the token model and embeddings for the arXMLiv 08.2019 corpus.
Paragraph dataset extraction, first public release
A derivative dataset from arXMLiv 08.2018, intended for "statement classification" of paragraphs, has been generated via what is now the 0.3.2 release of llamapun.
For details, see #34.
Parallel primitives
It is now possible to use llamapun while fully utilizing the available CPU cores (configurable in the same way as in rayon).
Most of the examples have been refactored to use the parallel primitives, and can see a 20x speedup on high-end chips with 16+ cores. A pass over arXMLiv 08.2018 now takes between 2 and 3 hours for a lightweight task (frequency reports, token models, etc.) on such hardware.
The library also uses the parallel-friendly RoNode libxml struct, which allows for additional gains when iterating over the DOM.
Example from corpus_mathml_stats:
use llamapun::parallel_data::Corpus;
use rayon::prelude::*; // brings into_par_iter into scope
use std::collections::HashMap;
// ...
let corpus = Corpus::new(corpus_path);
let catalog = corpus.catalog_with_parallel_walk(|document| {
    document
        .get_math_nodes()
        .into_par_iter()
        // map: build one frequency catalog per math node
        .map(|math| {
            let mut catalog = HashMap::new();
            dfs_record(math, &open_ended, &mut catalog);
            catalog
        })
        // reduce: merge the per-node catalogs by summing counts per key
        .reduce(HashMap::new, |mut map1, map2| {
            for (k, v) in map2 {
                let entry = map1.entry(k).or_insert(0);
                *entry += v;
            }
            map1
        })
});
For details, consult #29
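The map-and-merge pattern in the example above can also be tried without llamapun or rayon. The following is a minimal sketch, not the library's implementation: it swaps rayon's parallel iterators for `std::thread::scope` from the standard library so that it needs no external crates. The names `count_tokens`, `merge_counts` and `parallel_token_counts` are hypothetical and exist only for illustration.

```rust
use std::collections::HashMap;
use std::thread;

/// Reduce step: merge `from` into `into` by summing the counts per key.
fn merge_counts(
    mut into: HashMap<String, u64>,
    from: HashMap<String, u64>,
) -> HashMap<String, u64> {
    for (key, count) in from {
        *into.entry(key).or_insert(0) += count;
    }
    into
}

/// Map step: count the whitespace-separated tokens of a single document.
fn count_tokens(document: &str) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for token in document.split_whitespace() {
        *counts.entry(token.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical helper (not part of llamapun): split the documents into
/// roughly `workers` chunks, count each chunk on its own scoped thread,
/// then merge the partial catalogs on the calling thread.
fn parallel_token_counts(documents: &[&str], workers: usize) -> HashMap<String, u64> {
    let workers = workers.max(1);
    // Ceiling division, clamped to 1 because `chunks(0)` would panic.
    let chunk_size = ((documents.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = documents
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|doc| count_tokens(doc))
                        .fold(HashMap::new(), merge_counts)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .fold(HashMap::new(), merge_counts)
    })
}
```

Calling `parallel_token_counts(&["a b a", "b c"], 2)` counts each document on a separate thread and merges the two partial maps. Unlike this fixed chunking, rayon balances work dynamically, which is presumably why the library builds on it.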
AMS labeled dataset, 08.2018
Eliminated memory leaks related to libxml use. This release has been used to generate the AMS paragraph dataset derived from the arXMLiv 08.2018 HTML5 corpus.
arXMLiv 08.2018 release
Changes for generating the arXMLiv 08.2018 token models:
- Update dependencies
- Improve corpus_token_model generation to include math lexemes
- Improve paragraph iterator to skip over paragraphs containing ltx_ERROR markup
- Improve sentence tokenization to treat words with any capital letters as potential sentence breakers
- Word lexemes now properly attach 's possessives
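To make the tokenization changes listed above more concrete, here is a minimal sketch of the three heuristics, written against plain strings. It is an illustration under stated assumptions and not llamapun's actual code: all three function names are hypothetical, and the real library works on DOM nodes rather than raw strings.

```rust
/// Hypothetical heuristic: a word containing any uppercase letter is
/// treated as a potential start of a new sentence.
fn is_potential_sentence_start(word: &str) -> bool {
    word.chars().any(|c| c.is_uppercase())
}

/// Hypothetical helper: glue a standalone "'s" token onto the word
/// before it, so that the possessive stays part of one lexeme.
fn attach_possessives(tokens: &[&str]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for token in tokens {
        if *token == "'s" {
            if let Some(previous) = merged.last_mut() {
                previous.push_str(token);
                continue;
            }
        }
        // Either a regular token, or a "'s" with nothing to attach to.
        merged.push(token.to_string());
    }
    merged
}

/// Hypothetical filter: a paragraph is kept only if its serialized
/// markup does not mention the ltx_ERROR class.
fn is_clean_paragraph(paragraph_html: &str) -> bool {
    !paragraph_html.contains("ltx_ERROR")
}
```

For example, `attach_possessives(&["Euler", "'s", "formula"])` yields the two lexemes `Euler's` and `formula`. The capital-letter check deliberately over-generates candidates. The assumption is that a later step decides which candidates are real sentence breaks.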
Public arXMLiv dataset release
This release is tagged to mark the library version used for generating the Corpus Token Model for the arXMLiv 08.2017 dataset, to be released 02.2018.
A major upgrade is the merge of @jfschaefer's pattern-matching component, as described in #8.
The release also includes a refresh of the dependencies for 2018, and minor bug fixes. llamapun still requires a nightly release of Rust to build and run.