Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing PR 1326 and providing some tests for unicode wiki corpora #1333

Merged
merged 6 commits into from
May 27, 2017

Conversation

alekol
Copy link
Contributor

@alekol alekol commented May 17, 2017

A few things:

  • Fixed a problem in the wikicorpus tokenizer that was causing problems on unicode corpora
  • Added a test with a unicode corpora to verify the fix above
  • Fixed a couple of issues in word2vec that could result in concatenation of byte array and strings

This is my first pull request, so.. I may have missed smth :)

@@ -1112,7 +1112,7 @@ def update_weights(self):
# randomize the remaining words
for i in xrange(len(self.wv.syn0), len(self.wv.vocab)):
# construct deterministic seed from word AND seed argument
newsyn0[i-len(self.wv.syn0)] = self.seeded_vector(self.wv.index2word[i] + str(self.seed))
newsyn0[i-len(self.wv.syn0)] = self.seeded_vector(utils.to_unicode(self.wv.index2word[i]) + str(self.seed))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change inside word2vec.py, and similar below, absolutely necessary? It seems if the caller has provided unicode strings in their corpus, the contents of index2word() will already be unicode. But it seems unless seeded_vector() is otherwise throwing an exception, it really doesn't matter whether this string-used-for-deterministic-randomization is unicode or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about the same, though it seemed nice to be explicit about this. But you're right - once the problem in the corpus is addressed, this is not causing issues anymore.

I've reverted the changes.

@alekol
Copy link
Contributor Author

alekol commented May 18, 2017

Guys, I'd appreciate if someone hints me on how to debug this job: https://travis-ci.org/RaRe-Technologies/gensim/jobs/233524516

It seems that for some reason on py2.7 an existing test takes too long to complete. Given this one is unchanged, I'm inclined to think it's a build-related hiccup, but.. then again - I'm not used to CI Travis and I'm not sure on how to respin the build without changing the pull request..

Thoughts/hints on how to proceed?

First unicode article in this sample is
1) папа
"""
if sys.version_info < (2, 7, 0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we dropped support for 2.6 , we can also remove this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll clean this up too

@tmylk
Copy link
Contributor

tmylk commented May 18, 2017

@alekol This test is known to be flaky on Travis. It hasn't been investigated. Restarted the build and hope it passes.

@alekol
Copy link
Contributor Author

alekol commented May 18, 2017

@tmylk How can I restart a build without making fake commits?

@tmylk
Copy link
Contributor

tmylk commented May 18, 2017

It requires admin privileges to the project unfortunately so there isn't a way.

@alekol
Copy link
Contributor Author

alekol commented May 18, 2017

Ok, any idea how to run these tests locally or debug them? For some reason on py 2.7 one of the tests seems to take too long (> 10min) and I really have no idea how to troubleshoot this..

@menshikh-iv
Copy link
Contributor

@alekol you can run specific test with this command
python gensim/test/<SOME_TEST> <TEST_CLASS>.<TEST_METHOD>
for example
python gensim/test/test_coherencemodel.py TestCoherenceModel.testCv

@gojomo
Copy link
Collaborator

gojomo commented May 18, 2017

I hit the restart on the timed-out test, and it appears to be hanging on the test_first_element (gensim.test.test_wikicorpus.TestWikiCorpus) ... test again.

Has this been a problem before this fix? If so, it's a shame it's throwing uncertainty into other fixes. It may be related to the attempt to use multiprocessing (in a Python process on a minimal build container that's already thick with prior test state). It may be a good idea to explicitly specify processes=1 in the test_first_element() instantiation of WikiCorpus.

@alekol
Copy link
Contributor Author

alekol commented May 19, 2017

@gojomo Thanks for the hint! Seems that this was indeed the cause as applying it resulted in a test pass! :)

Being the github noob that I am, what're the next steps to get this into develop? I expect that someone with sufficient privileges will do the merge or should I?

@gojomo
Copy link
Collaborator

gojomo commented May 19, 2017

Glad to hear! At this point it'd be up to a project committer to merge the PR. I'll leave that to @tmylk , but everything looks good to me. (One-line fix, test case, tweak to make existing tests less hang-prone.)

@menshikh-iv
Copy link
Contributor

Thank you @alekol

@menshikh-iv menshikh-iv merged commit ade3e02 into piskvorky:develop May 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants