
Bookcorpus data contains pretokenized text #486

Closed
orsharir opened this issue Aug 9, 2020 · 8 comments

@orsharir
Contributor

orsharir commented Aug 9, 2020

It seems that the bookcorpus data downloaded through the library was pretokenized with NLTK's Treebank tokenizer, which changes the text in ways that are incompatible with how, for instance, BERT's WordPiece tokenizer works. For example, "didn't" becomes "did" + "n't", and double quotes are changed to `` and '' for start and end quotes, respectively.

On my own projects, I just run the data through NLTK's TreebankWordDetokenizer to reverse the tokenization (as best as possible). I think it would be beneficial to apply this transformation directly on your remote cached copy of the dataset. If you choose to do so, I would also suggest using my fork of NLTK, which fixes several bugs in their detokenizer (I've opened a pull request, but they've yet to respond): nltk/nltk#2575
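
For reference, here is a minimal sketch of that round trip using stock NLTK; the example sentence is illustrative, not taken from the corpus:

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tokenizer = TreebankWordTokenizer()
detokenizer = TreebankWordDetokenizer()

original = 'He said, "I didn\'t do it."'
tokens = tokenizer.tokenize(original)
# roughly: ['He', 'said', ',', '``', 'I', 'did', "n't", 'do', 'it', '.', "''"]
print(tokens)

# The corpus stores ' '.join(tokens); detokenizing brings it close to,
# but not always exactly back to, the original text.
print(detokenizer.detokenize(tokens))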

@lhoestq
Member

lhoestq commented Aug 31, 2020

Yes, indeed, it looks like some ' characters and spaces are missing (for example in dont or didnt).
Do you know if there exist any copies without this issue?
How would you fix this issue on the current data exactly? I can see that the data is raw text (not tokenized), so I'm not sure I understand how you would do it. Could you provide more details?

@orsharir
Contributor Author

orsharir commented Sep 1, 2020

I'm afraid that I don't know how to obtain the original BookCorpus data. I believe this version came from an anonymous Google Drive link posted in another issue.

Going through the raw text in this version, it's apparent that NLTK's TreebankWordTokenizer was applied on it (I gave some examples in my original post), followed by:
' '.join(tokens)
You can retrieve the tokenization by splitting on whitespace. You can then "detokenize" it with the TreebankWordDetokenizer class of NLTK (though, as I suggested, use the fixed version in my repo). This will bring the text closer to its original form, but some steps of TreebankWordTokenizer are destructive, so it wouldn't be one-to-one. Something along the lines of the following should work:

import nltk
import nlp
treebank_detokenizer = nltk.tokenize.treebank.TreebankWordDetokenizer()
db = nlp.load_dataset('bookcorpus', split=nlp.Split.TRAIN)
# map expects a dict, so return the detokenized text under the 'text' column
db = db.map(lambda x: {'text': treebank_detokenizer.detokenize(x['text'].split())})

Regarding other issues beyond the above, I'm afraid that I can't help with that.

@lhoestq
Member

lhoestq commented Sep 8, 2020

Ok I get it, that would be very cool indeed

What kinds of patterns can't the detokenizer retrieve?

@orsharir
Contributor Author

orsharir commented Sep 8, 2020

The TreebankWordTokenizer makes some assumptions about whitespace, parentheses, quotation marks, etc. For instance, tokenizing the following text:

Dwayne "The Rock" Johnson

will result in:

Dwayne `` The Rock '' Johnson

where the left and right quotation marks are turned into distinct symbols. Upon reconstruction, we can attach the left quotation mark to the token on its right, and the right quotation mark to the token on its left. However, the following texts would be tokenized exactly the same:

Dwayne " The Rock " Johnson
Dwayne " The Rock" Johnson
Dwayne     " The Rock" Johnson
...

In the above examples, the detokenizer would correct these inputs into the canonical text

Dwayne "The Rock" Johnson

However, there are cases where the solution cannot easily be inferred (at least without a true LM; this tokenizer is just a bunch of regexes). For instance, in cases where you have a fragment that contains the end of a quote, but not its beginning, plus an accidental space:

... and it sounds fantastic, " he said.

In the above case, the tokenizer would assume that the quote refers to the next token, and so upon detokenization it would produce the following mistake:

... and it sounds fantastic, "he said.

While these are all odd edge cases (the basic assumptions do make sense), in noisy data they can occur, which is why I mentioned that the detokenizer cannot restore the original perfectly.
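
Here is a small sketch of this ambiguity, again with stock NLTK and illustrative strings: several spacing variants collapse to the same token sequence, so the detokenizer can only recover one canonical form:

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tokenizer = TreebankWordTokenizer()
detokenizer = TreebankWordDetokenizer()

variants = [
    'Dwayne "The Rock" Johnson',
    'Dwayne " The Rock" Johnson',
    'Dwayne     " The Rock" Johnson',
]
for text in variants:
    tokens = tokenizer.tokenize(text)
    # these variants should yield the same tokens, so they detokenize identically
    print(tokens, '->', detokenizer.detokenize(tokens))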

@arvieFrydenlund

To confirm, since this is preprocessed, this was not the exact version of BookCorpus used to actually train the models described here (particularly DistilBERT)? https://huggingface.co/datasets/bookcorpus

Or does this preprocessing exactly match that of the papers?

@orsharir
Contributor Author

I believe these are just artifacts of this particular source. It might be better to crawl it again, or use another preprocessed source, as found here: https://github.com/soskek/bookcorpus

@richarddwang
Contributor

richarddwang commented Sep 30, 2020

Yes, actually the BookCorpus on Hugging Face is based on this. And I kind of regret naming it "BookCorpus" instead of something like "BookCorpusLike".

But there is good news! @shawwn has replicated BookCorpus in his own way, and also provided a link to download the plain text files; see here. There is a chance we can have an "OpenBookCorpus"!

@mariosasko
Collaborator

Resolved via #856
