wrong token #744

CnsAd · 2017-01-16T05:58:38Z

INPUT:

import spacy
nlp = spacy.load('en')
doc1 = nlp(u"They have killed the bat last night. We were so scared!")
for token in doc1:
    print(token)

OUTPUT:
They
have
killed
the
bat
last
night
.
We
we
re
so
scared

The word "were" has been tokenized incorrectly: it is split into "we" and "re".


keotic · 2017-01-16T10:01:55Z

Noticed that as well. So far I have only spotted this bug with "were".

ines · 2017-01-16T11:54:56Z

Ah, this seems to be a mistake in the tokenizer exceptions. It adds all contractions both with and without apostrophes, but "were" and "Were" should obviously have been excluded (as is currently done for "well", "hell", "ill", etc.).

This is easy to fix – will do this now and add a regression test.
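
To illustrate the kind of fix described above, here is a minimal, self-contained sketch. It is not spaCy's actual source: the names `PAIRS`, `AMBIGUOUS`, `build_exceptions` and `tokenize` are hypothetical, and the whitespace tokenizer is a toy stand-in. It only shows the idea of generating apostrophe-less contraction variants while skipping forms that collide with ordinary words.

```python
# Hypothetical sketch, not spaCy's real exception tables or API.

# (stem, suffix) pairs that form contractions such as "we're" or "don't".
PAIRS = [
    ("we", "'re"),
    ("we", "'ll"),
    ("he", "'ll"),
    ("i", "'ll"),
    ("they", "'ve"),
    ("do", "n't"),
]

# Apostrophe-less forms that are ordinary English words and must not be split.
AMBIGUOUS = {"were", "well", "hell", "ill"}


def build_exceptions(pairs, exclude):
    """Map each contraction (with and without apostrophe) to its pieces."""
    exceptions = {}
    for stem, suffix in pairs:
        for s in (stem, stem.capitalize()):
            # The form with the apostrophe is always safe to split.
            exceptions[s + suffix] = [s, suffix]
            # Add the apostrophe-less form only if it is not a real word.
            bare_suffix = suffix.replace("'", "")
            bare = s + bare_suffix
            if bare.lower() not in exclude:
                exceptions[bare] = [s, bare_suffix]
    return exceptions


def tokenize(text, exceptions):
    """Toy whitespace tokenizer that applies the exception table."""
    tokens = []
    for word in text.split():
        tokens.extend(exceptions.get(word, [word]))
    return tokens


EXCEPTIONS = build_exceptions(PAIRS, AMBIGUOUS)
print(tokenize("We were so scared", EXCEPTIONS))
print(tokenize("We're so scared", EXCEPTIONS))
```

In this sketch, a regression test for the bug would simply assert that "were" and "Were" are absent from the generated table, while "we're" is still split.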

lock · 2018-05-09T04:38:45Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "lang / en" (English language data and models) and "performance" labels Jan 16, 2017

ines closed this as completed in 50878ef Jan 16, 2017

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
