Releases · KWARC/llamapun

A derivative dataset from arXMLiv 08.2018, intended for "statement classification" of paragraphs, has been generated with what is now the 0.3.2 release of llamapun.

For details, see #34


16 Apr 19:28

dginev

0.3.0

9ddc453

Parallel primitives

It is now possible to use llamapun while fully utilizing the available CPU cores (the degree of parallelism is configurable in the same way as in rayon).

Most of the examples have been refactored to use the parallel primitives, and can see a 20x speedup on high-end chips with 16+ cores. A pass over arXMLiv 08.2018 now takes between 2 and 3 hours for a lightweight task (frequency reports, token models, etc.) on such hardware.

The library also uses the parallel-friendly RoNode libxml struct, which allows for additional gains when iterating over the DOM.

Example from corpus_mathml_stats:

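use llamapun::parallel_data::Corpus;
use rayon::prelude::*; // brings into_par_iter, map and reduce into scope
use std::collections::HashMap;
// ...
let corpus = Corpus::new(corpus_path);
// The closure runs once per document and returns that document's catalog.
let catalog = corpus.catalog_with_parallel_walk(|document| {
  document
    .get_math_nodes()
    .into_par_iter()
    // Record each math node into its own map via dfs_record.
    .map(|math| {
      let mut catalog = HashMap::new();
      dfs_record(math, &open_ended, &mut catalog);
      catalog
    })
    // Merge the per-node maps by summing the counts for each key.
    .reduce(HashMap::new, |mut map1, map2| {
      for (k, v) in map2 {
        let entry = map1.entry(k).or_insert(0);
        *entry += v;
      }
      map1
    })
});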
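
The map-then-merge shape of the example above can be tried without llamapun or rayon. The following is only a minimal sketch under stated assumptions: it uses scoped threads from the Rust standard library as a stand-in for rayon's parallel iterators, and `count_tokens`, `merge_counts` and `parallel_count` are hypothetical helpers (the first replaces `dfs_record`, which is not reproduced here).

```rust
use std::collections::HashMap;
use std::thread;

/// Merge `other` into `acc` by summing the counts for each key.
/// This mirrors the shape of the `reduce` closure in the example above.
fn merge_counts(
    mut acc: HashMap<String, u64>,
    other: HashMap<String, u64>,
) -> HashMap<String, u64> {
    for (key, count) in other {
        *acc.entry(key).or_insert(0) += count;
    }
    acc
}

/// Hypothetical stand-in for `dfs_record`: counts whitespace-separated tokens.
fn count_tokens(text: &str) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Count each input on its own scoped thread, then fold the partial maps.
/// (Assumption: one std thread per input, instead of a rayon thread pool.)
fn parallel_count(inputs: &[&str]) -> HashMap<String, u64> {
    thread::scope(|scope| {
        // Collect the handles first so that all workers are spawned
        // before any of them is joined.
        let handles: Vec<_> = inputs
            .iter()
            .map(|text| scope.spawn(move || count_tokens(text)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .fold(HashMap::new(), merge_counts)
    })
}

fn main() {
    let catalog = parallel_count(&["mi mo mi", "mn mi", "mo"]);
    println!("mi = {}", catalog["mi"]);
}
```

Because the merge only sums counts per key, the order in which partial maps are combined does not affect the result, which is what makes this kind of reduction safe to run in parallel.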

For details, consult #29


27 Sep 05:47

dginev

0.2.1

eedb507

AMS labeled dataset, 08.2018

Eliminated memory leaks related to libxml use. This release was used to generate the AMS paragraph dataset induced by the arXMLiv 08.2018 HTML5 corpus.


24 Sep 17:52

dginev

0.2.0

4078fc2

arXMLiv 08.2018 release

Changes for generating the arXMLiv 08.2018 token models:

Update dependencies
Improve corpus_token_model generation to include math lexemes
Improve paragraph iterator to skip over paragraphs containing ltx_ERROR markup
Improve sentence tokenization to treat words with any capital letters as potential sentence breakers
Word lexemes now properly attach 's possessives
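
To illustrate the sentence tokenization change listed above, here is a minimal sketch of a splitter that breaks after a full stop only when the next word contains a capital letter anywhere, not just in first position. It is an assumption-laden toy, not llamapun's tokenizer: `may_open_sentence` and `split_sentences` are hypothetical names, and the only punctuation handled is a trailing period.

```rust
/// A word is a candidate sentence opener if it contains any uppercase
/// letter, so "mRNA" and "pH" qualify as well as "The".
fn may_open_sentence(word: &str) -> bool {
    word.chars().any(char::is_uppercase)
}

/// Split whitespace-separated text into sentences. A break is placed after
/// a word ending in '.' only if the following word may open a sentence
/// (or if there is no following word).
fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    let mut words = text.split_whitespace().peekable();
    while let Some(word) = words.next() {
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
        let ends_with_stop = word.ends_with('.');
        let next_may_open = words.peek().map_or(true, |next| may_open_sentence(next));
        if ends_with_stop && next_may_open {
            sentences.push(std::mem::take(&mut current));
        }
    }
    // Keep any trailing text that did not end in a full stop.
    if !current.is_empty() {
        sentences.push(current);
    }
    sentences
}

fn main() {
    let text = "We use eq. 3 here. The pH value rises. mRNA decays.";
    for sentence in split_sentences(text) {
        println!("{sentence}");
    }
}
```

In this sketch the abbreviation "eq." does not end a sentence because "3" has no capital letter, while "mRNA" is accepted as a sentence opener even though its first letter is lowercase.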


22 Jan 21:55

dginev

0.1

9cb93d3

Public arXMLiv dataset release

This release is tagged to mark the library version used for generating the Corpus Token Model for the arXMLiv 08.2017 dataset, to be released 02.2018.

A major upgrade is the merge of @jfschaefer's pattern-matching component, as described in #8.

The release also includes a refresh of the dependencies for 2018, and minor bug fixes. llamapun still requires a nightly release of Rust to build and run.

