Fix doctag unicode problem. Fix 1543 #1544

englhardt · 2017-08-21T15:17:20Z

gojomo · 2017-08-21T22:57:43Z

Can you add a unit test in test_doc2vec.py that fails before the fix, and works thereafter?

englhardt · 2017-08-22T08:48:04Z

I have added a test.

On develop branch:

Launching unittests with arguments python -m unittest test_doc2vec.TestDoc2VecModel.test_unicode_in_doctag in .../gensim/gensim/test

Using TensorFlow backend.

Error
Traceback (most recent call last):
  File "/usr/lib64/python2.7/unittest/case.py", line 329, in run
    testMethod()
  File ".../gensim/gensim/test/test_doc2vec.py", line 106, in test_unicode_in_doctag
    model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=True, binary=True)
  File ".../gensim/gensim/models/doc2vec.py", line 850, in save_word2vec_format
    doctag = prefix + str(self.docvecs.index_to_doctag(i))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 1: ordinal not in range(128)

menshikh-iv · 2017-08-25T11:17:54Z

gensim/test/test_doc2vec.py


    def _tag(self, i):
-        return i if not self.string_tags else '_*%d' % i
+        if self.unicode_tags:
+            return u'_\xa1_%d' % i


Looks strange: why is it so different from line 41, and why do you use `\xa1_`?

I enforce a title that is not ASCII encodable, i.e. a unicode character '¡'.

@gojomo wdyt?

Testing with a non-ASCII character seems sensible/necessary to me!

menshikh-iv · 2017-08-25T11:19:22Z

gensim/test/test_doc2vec.py

@@ -95,6 +100,13 @@ def testPersistenceWord2VecFormat(self):
        binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(test_word, binary=True)
        self.assertEqual(len(model.wv.vocab), len(binary_model_dv.vocab))

+    def test_unicode_in_doctag(self):


Please add a test with an exception (and check it with assertRaises).

The test should rather be the inverse of assertRaises. The exception should not be thrown with this PR. I will add an inverse test.

It's a good idea to add 2 tests: the first with unicode_tags=False and assertRaises, and the second is the current test.

I think you are mixing up what really happens, let me explain:
In gensim 2.1, a modified save_word2vec_format for Doc2Vec was added in #1256. The code that writes the output can export any UTF-8 symbols (doc2vec.py#L853). Unfortunately, the str(..) call in doc2vec.py#L850 fails in Python 2.7 when the doctags contain characters that are not ASCII-encodable.
Currently the develop branch is broken and the added test will fail, i.e. UnicodeEncodeError is thrown for Python 2.7.
The test now checks whether it can successfully store unicode doctags and fails otherwise. I have to generate them (with the code above) since the default string doctags are ASCII encodable (-> addition of the '¡' character).
So, what should the second test actually be then?
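To make the failure mode above concrete, here is a minimal sketch. It assumes Python 3, where no implicit ASCII step exists, so the encoding that Python 2.7's `str()` performs on a unicode object is written out explicitly. The `prefix` value and the helper name `ascii_encoding_fails` are illustrative only and are not taken from gensim.

```python
# Illustrative values: the tag mirrors the shape of the test's unicode doctag,
# the prefix is a placeholder and not necessarily gensim's default.
tag = u'_\xa1_0'
prefix = u'doc_'


def ascii_encoding_fails(text):
    """Return True if `text` cannot be encoded as ASCII.

    In Python 2.7, str(unicode_object) performs this encoding implicitly,
    which is where the UnicodeEncodeError in the traceback comes from.
    """
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False


# Formatting into a unicode template avoids the ASCII step entirely;
# the result can then be encoded as UTF-8 when it is written out.
formatted = u"%s%s" % (prefix, tag)
utf8_bytes = formatted.encode('utf8')
```

The plain ASCII string tags used by the existing tests pass the ASCII step, which is why a tag containing '¡' is needed to trigger the error.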

menshikh-iv · 2017-08-25T11:20:14Z

gensim/test/test_doc2vec.py

+        model = doc2vec.Doc2Vec(DocsLeeCorpus(unicode_tags=True), min_count=1)
+        model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=True, binary=True)
+        binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
+        self.assertEqual(len(model.wv.vocab) + len(model.docvecs), len(binary_model_dv.vocab))


How is this assert related to your PR, or to the reason for the test?

It is true that lines 107/108 are not related to the PR. I can remove them.

englhardt · 2017-08-25T12:11:21Z

I have updated the test. Please have a look.

piskvorky

Thanks for the detailed report @englhardt ! Looks like a bug indeed.

piskvorky · 2017-08-28T17:30:54Z

gensim/models/doc2vec.py

@@ -847,7 +847,7 @@ def save_word2vec_format(self, fname, doctag_vec=False, word_vec=True, prefix='*
                    fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
                # store as in input order
                for i in range(len(self.docvecs)):
-                    doctag = prefix + str(self.docvecs.index_to_doctag(i))
+                    doctag = "%s%s" % (prefix, self.docvecs.index_to_doctag(i))


We should pick one type (byte string or unicode) and stick with it.

Supporting both types at the same time like this (doctag will be silently upcast to unicode in this solution, if the argument is unicode) looks very brittle.

I'm not as familiar with this codebase as @gojomo is, but converting the user input to one fixed type (either utf8 bytestring or unicode, but consistently) is preferable. At this point, during model save, the type ought to be fixed, no silent guessing needed.

I agree that the hidden cast to unicode is not optimal. The problem is that the method index_to_doctag can return a string as well as an int. The fix is similar to another place in the class here.
I am open to other suggestions though.

Aha, so the expected types are either unicode or int. A u"%s" % index_to_doctag should work either way, right? No change needed (beside dropping that str). IIRC that int was an optimization hack (uses less memory).

My suggestion is to keep the tags explicitly as "unicode or int", or "bytestring or int". And then during string formatting, rely on the fact it's unicode-or-int (or byte-or-int), possibly even with an assert to drive this contract home.

@gojomo What are the invariants here, how to do this cleanly? That unicode-or-str-or-int is really confusing, and apparently not encapsulated well (fixes in one place leave bugs in other places).
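One way to read the "unicode or int" contract suggested above is sketched below. This is only an assumption about how such a check could look, not code from gensim: `format_doctag` is a hypothetical helper, the example targets Python 3 (where `str` is the unicode type), and an explicit `TypeError` plays the role of the assert mentioned in the comment.

```python
def format_doctag(prefix, tag):
    """Hypothetical helper: join a prefix and a doctag into one unicode token.

    The tag must be a unicode string or an int; anything else (for example
    a byte string) is rejected instead of being silently converted.
    """
    if not isinstance(tag, (str, int)):
        raise TypeError("doctag must be unicode or int, got %r" % type(tag))
    # u"%s" formatting handles both allowed types without an ASCII step.
    return u"%s%s" % (prefix, tag)


int_token = format_doctag(u'doc_', 3)
unicode_token = format_doctag(u'doc_', u'_\xa1_3')
```

With such a check in place, a byte-string tag fails loudly at the formatting step rather than producing a mixed-type token.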

IMO the type of tags should be as permissive as possible – conceivably even anything that works as a dict key would work everywhere in Doc2Vec except in this new bit of convenience for shoehorning doc-vecs into the (limited, somewhat-'legacy') word2vec.c format. For people who want to use this format, they should be sure to use tags that can write as a nice string-token – I'd think it'd be OK here if this code is tolerant of any reasonable such choice. (They may also need to make sure, themselves, that their tags don't include internal spaces if using the 'text' word2vec.c format.)
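The caveat about internal spaces can be illustrated with a simplified sketch of a space-delimited text line (token followed by vector components). The functions `as_text_line` and `parse_text_line` are hypothetical stand-ins written for this example; they are not gensim's actual writer or reader.

```python
def as_text_line(token, vector):
    """Write one line in a simplified word2vec-style text layout."""
    return u"%s %s" % (token, u" ".join(u"%f" % v for v in vector))


def parse_text_line(line):
    """Split a line on single spaces: the first field is the token."""
    parts = line.rstrip().split(u" ")
    return parts[0], parts[1:]


# A tag without spaces survives the round trip as one token.
good = parse_text_line(as_text_line(u"doc_1", [0.5, 0.25]))

# A tag with an internal space is cut in two, and the second half
# ends up where a vector component is expected.
bad = parse_text_line(as_text_line(u"doc 1", [0.5, 0.25]))
```

Under these assumptions, the tag `doc 1` is read back as the token `doc` with three trailing fields instead of two, which is the kind of corruption users of the text format would need to guard against themselves.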

@piskvorky I changed it to u"%s%s" % .. so that the cast is less hidden.
@gojomo Your last point is especially important, yeah.

menshikh-iv · 2017-09-18T10:09:48Z

@gojomo should we merge this now?

gojomo · 2017-09-19T04:26:30Z

I haven't independently exercised the bug or confirmed the fix, but as all old tests and the well-focused new test pass, and the changes are tiny/well-focused, it looks OK to me!

menshikh-iv · 2017-09-19T05:07:48Z

Thanks @englhardt, congrats on your first PR 👍
Please follow the instructions in #1587 to receive your gifts!

Fix doctag unicode

d956669

Add test for unicode doctags.

a67c07d

menshikh-iv suggested changes Aug 25, 2017


Fix doc2vec unicode title test.

109f956

piskvorky reviewed Aug 28, 2017


Make the unicode tag cast less hidden.

89e3eb8

menshikh-iv approved these changes Sep 19, 2017


menshikh-iv changed the title ~~Fix doctag unicode~~ Fix doctag unicode problem. Fix 1543 Sep 19, 2017

menshikh-iv merged commit 5a49a79 into piskvorky:develop Sep 19, 2017

englhardt deleted the fix-doctag-unicode branch September 19, 2017 06:44
