
Bookcorpus data contains pretokenized text #486

Closed
orsharir opened this issue Aug 9, 2020 · 8 comments

@orsharir
Contributor

orsharir commented Aug 9, 2020

It seems that the bookcorpus data downloaded through the library was pretokenized with NLTK's Treebank tokenizer, which changes the text in ways that are incompatible with how, for instance, BERT's WordPiece tokenizer works. For example, "didn't" becomes "did" + "n't", and double quotes are changed to `` and '' for start and end quotes, respectively.

On my own projects, I just run the data through NLTK's TreebankWordDetokenizer to reverse the tokenization (as best as possible). I think it would be beneficial to apply this transformation directly on your remote cached copy of the dataset. If you choose to do so, I would also suggest using my fork of NLTK, which fixes several bugs in their detokenizer (I've opened a pull request, but they've yet to respond): nltk/nltk#2575
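
For reference, here is a minimal sketch of that round trip using stock NLTK; the example sentence is illustrative, not taken from the corpus:

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tokenizer = TreebankWordTokenizer()
detokenizer = TreebankWordDetokenizer()

original = 'He said, "I didn\'t do it."'
tokens = tokenizer.tokenize(original)
# roughly: ['He', 'said', ',', '``', 'I', 'did', "n't", 'do', 'it', '.', "''"]
print(tokens)

# The corpus stores ' '.join(tokens); detokenizing brings it close to,
# but not always exactly back to, the original text.
print(detokenizer.detokenize(tokens))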

@lhoestq
Member

lhoestq commented Aug 31, 2020

Yes, indeed, it looks like some ' characters and spaces are missing (for example in dont or didnt).
Do you know if there exist any copies without this issue?
How would you fix this issue on the current data exactly? I can see that the data is raw text (not tokenized), so I'm not sure I understand how you would do it. Could you provide more details?

@orsharir
Contributor Author

orsharir commented Sep 1, 2020

I'm afraid that I don't know how to obtain the original BookCorpus data. I believe this version came from an anonymous Google Drive link posted in another issue.

Going through the raw text in this version, it's apparent that NLTK's TreebankWordTokenizer was applied on it (I gave some examples in my original post), followed by:
' '.join(tokens)
You can retrieve the tokenization by splitting on whitespace. You can then "detokenize" it with the TreebankWordDetokenizer class of NLTK (though, as I suggested, use the fixed version in my repo). This will bring the text closer to its original form, but some steps of TreebankWordTokenizer are destructive, so it wouldn't be one-to-one. Something along the lines of the following should work:

import nltk
import nlp
treebank_detokenizer = nltk.tokenize.treebank.TreebankWordDetokenizer()
db = nlp.load_dataset('bookcorpus', split=nlp.Split.TRAIN)
# map expects a dict, so return the detokenized text under the 'text' column
db = db.map(lambda x: {'text': treebank_detokenizer.detokenize(x['text'].split())})

Regarding other issues beyond the above, I'm afraid that I can't help with that.

@lhoestq
Member

lhoestq commented Sep 8, 2020

Ok I get it, that would be very cool indeed

What kinds of patterns can't the detokenizer retrieve?

@orsharir
Contributor Author

orsharir commented Sep 8, 2020

The TreebankWordTokenizer makes some assumptions about whitespace, parentheses, quotation marks, etc. For instance, tokenizing the following text:

Dwayne "The Rock" Johnson

will result in:

Dwayne `` The Rock '' Johnson

where the left and right quotation marks are turned into distinct symbols. Upon reconstruction, we can attach the left quotation mark to the token on its right, and the right quotation mark to the token on its left. However, the following texts would be tokenized exactly the same:

Dwayne " The Rock " Johnson
Dwayne " The Rock" Johnson
Dwayne     " The Rock" Johnson
...

In the above examples, the detokenizer would correct these inputs into the canonical text

Dwayne "The Rock" Johnson

However, there are cases where the solution cannot easily be inferred (at least without a true LM; this tokenizer is just a bunch of regexes). For instance, in cases where you have a fragment that contains the end of a quote, but not its beginning, plus an accidental space:

... and it sounds fantastic, " he said.

In the above case, the tokenizer would assume that the quote refers to the next token, and so upon detokenization it would produce the following mistake:

... and it sounds fantastic, "he said.

While these are all odd edge cases (the basic assumptions do make sense), in noisy data they can occur, which is why I mentioned that the detokenizer cannot restore the original perfectly.
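
Here is a small sketch of this ambiguity, again with stock NLTK and illustrative strings: several spacing variants collapse to the same token sequence, so the detokenizer can only recover one canonical form:

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tokenizer = TreebankWordTokenizer()
detokenizer = TreebankWordDetokenizer()

variants = [
    'Dwayne "The Rock" Johnson',
    'Dwayne " The Rock" Johnson',
    'Dwayne     " The Rock" Johnson',
]
for text in variants:
    tokens = tokenizer.tokenize(text)
    # these variants should yield the same tokens, so they detokenize identically
    print(tokens, '->', detokenizer.detokenize(tokens))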

@arvieFrydenlund

To confirm, since this is preprocessed, this was not the exact version of BookCorpus used to actually train the models described here (particularly DistilBERT)? https://huggingface.co/datasets/bookcorpus

Or does this preprocessing exactly match that of the papers?

@orsharir
Contributor Author

I believe these are just artifacts of this particular source. It might be better to crawl it again, or use another preprocessed source, as found here: https://github.com/soskek/bookcorpus

@richarddwang
Contributor

richarddwang commented Sep 30, 2020

Yes, actually the BookCorpus on Hugging Face is based on this. And I kind of regret naming it "BookCorpus" instead of something like "BookCorpusLike".

But there is good news! @shawwn has replicated BookCorpus in his own way, and also provided a link to download the plain text files; see here. There is a chance we can have an "OpenBookCorpus"!

@mariosasko
Collaborator

Resolved via #856
