Save Japanese NER model by using nlp.to_disk #1557
Comments
Thanks for the report! The reason this happens is that the Japanese tokenizer is a custom implementation via the Janome library and doesn't use spaCy's serializable tokenizer API. Possible solutions for now:
We should probably allow disabling the tokenizer via the `disable` keyword argument. Btw, curious to hear about your results on training Japanese NER – sounds very exciting!
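Until that lands, another way around the error is to give the custom tokenizer the serialization hooks that `nlp.to_disk` expects of a component. The sketch below is illustrative only: `SerializableTokenizerWrapper` and its `settings` dict are hypothetical, not part of spaCy or Janome; it only shows the shape of the `to_disk`/`from_disk` contract.

```python
import json
import tempfile
from pathlib import Path

class SerializableTokenizerWrapper:
    """Toy wrapper sketching the to_disk/from_disk hooks that
    nlp.to_disk expects of a pipeline component (illustrative only)."""

    def __init__(self, settings=None):
        # Hypothetical config needed to rebuild the underlying
        # tokenizer (e.g. Janome) on load.
        self.settings = settings or {"mode": "default"}

    def to_disk(self, path, **kwargs):
        # Persist only the rebuild config, not the tokenizer object.
        Path(path).write_text(json.dumps(self.settings))

    def from_disk(self, path, **kwargs):
        self.settings = json.loads(Path(path).read_text())
        return self

# Round-trip check
tok = SerializableTokenizerWrapper({"mode": "wakati"})
with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "tokenizer.json"
    tok.to_disk(target)
    restored = SerializableTokenizerWrapper().from_disk(target)

print(restored.settings)  # -> {'mode': 'wakati'}
```

The point of the pattern is that only the *config* needed to rebuild the tokenizer is written to disk; the tokenizer object itself is reconstructed on load.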
@ines Thanks so much for your quick reply. I'll try your solution and give you my feedback on training Japanese NER :)
Hi @ines,
Hmm, this is strange! I think the difference between Japanese and Thai/Chinese is that the Japanese one provides a custom tokenizer class. What happens if you don't use pickle and instead use the regular

```python
nlp.to_disk('/path/to/model', disable=['tokenizer'])
```

If this works, the only problem here is that you'll also need to set the tokenizer manually after loading the model back. We'll think about a good way to solve this in the future. When saving out a model, spaCy should probably check if the tokenizer is serializable and, if not, show a warning but serialize anyway. Nice to hear that Chinese and Thai worked well – this is really cool!
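The "warn but serialize anyway" behaviour suggested here can be reduced to a small check. This is a sketch, not spaCy's actual code; `save_pipeline` and `JanomeLikeTokenizer` are made-up stand-ins:

```python
import warnings

def save_pipeline(components, tokenizer, path):
    """Sketch of the suggested behaviour: warn about a tokenizer that
    cannot be serialized, but save the rest of the pipeline anyway."""
    tokenizer_saved = hasattr(tokenizer, "to_disk")
    if not tokenizer_saved:
        warnings.warn(
            "Tokenizer has no to_disk method and will not be saved; "
            "reattach it manually after loading."
        )
    # ... here the real code would write `components` (and the
    # tokenizer, if serializable) out to `path` ...
    return tokenizer_saved

class JanomeLikeTokenizer:
    """Stand-in for a tokenizer without serialization hooks."""
    def __call__(self, text):
        return text.split()

saved = save_pipeline([], JanomeLikeTokenizer(), "/tmp/model")
print(saved)  # -> False (a UserWarning is emitted)
```

The duck-typing check (`hasattr(..., "to_disk")`) mirrors how the save call fails today: the `AttributeError` in the original report is exactly a missing `to_disk` on `JapaneseTokenizer`.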
@ines
Just pushed a fix. Just tested it locally, and both to/from disk and to/from bytes now work correctly. This means you should also be able to package your Japanese model as a Python package using the `spacy package` command.
I have a similar problem that I could not fix: I've trained a custom NER model that I'd like to save to disk, and since I'm using a custom tokenizer I don't want to save the tokenizer. Here's what I did:

```python
import spacy

nlp = spacy.load("en")
nlp.tokenizer = some_custom_tokenizer

# Train the NER model...

nlp.tokenizer = None
nlp.to_disk('/tmp/my_model', disable=['tokenizer'])
```

(Due to this thread I did not package the model.) Loading it back:

```python
nlp = spacy.blank('en').from_disk('/tmp/model', disable=['tokenizer'])
```

I need to load the model without the tokenizer but with the full pipeline. Any ideas? Thanks.
More about this issue: when I tried to load the model like this:

```python
loaded_nlp = spacy.load('/model/directory', disable=['tokenizer'])
```

I got an error:
I looked at the code of `spacy.load`, which ends with:

```python
return nlp.from_disk(model_path)
```

If the `disable` argument were forwarded, i.e.

```python
return nlp.from_disk(model_path, disable)
```

then this would work.
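The bug described here, reduced to its essence: a keyword argument accepted at the top level but never passed down. The functions below are toy stand-ins, not spaCy's source; they only model which pipeline components end up loaded.

```python
def from_disk(model_path, disable=()):
    # Stand-in for nlp.from_disk: returns the components it would load.
    pipeline = ["tokenizer", "tagger", "parser", "ner"]
    return [name for name in pipeline if name not in disable]

def load_broken(model_path, disable=()):
    # Mirrors the reported bug: `disable` is accepted but never forwarded.
    return from_disk(model_path)

def load_fixed(model_path, disable=()):
    # The suggested one-line fix: forward `disable` to from_disk.
    return from_disk(model_path, disable)

print(load_broken("/model/directory", disable=["tokenizer"]))
# -> ['tokenizer', 'tagger', 'parser', 'ner']  (tokenizer still loaded)
print(load_fixed("/model/directory", disable=["tokenizer"]))
# -> ['tagger', 'parser', 'ner']
```

This is also why the `spacy.blank('en').from_disk(..., disable=['tokenizer'])` route in the earlier comment behaves differently from `spacy.load(..., disable=['tokenizer'])`: the former passes `disable` to `from_disk` directly.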
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I got this error "AttributeError: 'JapaneseTokenizer' object has no attribute 'to_disk'" when trying to save Japanese NER model in spaCy 2.0.2. Can you guys help me to fix this error? Thanks so much