[MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining #2944

gojomo · 2020-09-12T20:05:00Z

Fixes #2853 & fixes #2943.

Refactors a bunch of FastText/Word2Vec/Doc2Vec initialization & update code for clarity & less-duplication. Each of these classes now share the faster bulk style of vector-random-initalization previously only FastText used, a slight change in prior per-vector seeding behavior.

Seems to fix the test code for triggering ancient Doc2Vec update crash in #1019 - but there may still be problems there, and there's been no design/testing for build_vocab(..., update=True) on Doc2Vec, so current status there is still "not supported".

There was some purported (but possibly not useful, and not covered by unit-tests) mmap_path-specifying code in KeyedVectors that's been stubbed out here with FIXMEs/comments, pending intended (#2955) repair/rationalization of that functionality before 4.0.0 final release.

gojomo · 2020-09-14T07:11:25Z

The paths for first, or post-vocab-expansion, allocation/initialization of vectors/vectors_vocab were more complex & redundant than necessary - streamlining into fewer methods also cleared up the discrepancy causing the save_facebook_model error.

There's a chance this also resolves very-old Doc2Vec-segfault #1019 - the test code there doesn't trigger a segfault anymore, but maybe it was other recent changes that fixed it, or maybe it's just become harder to trigger.

Making the Word2Vec & Doc2Vec vector-randomizations share the FastText code-path means they all now do the randomization in one bulk call, as FT has been doing, instead of the vector-at-a-time-seeded-by-key that W2V/D2V were doing. This should be faster.

But, at the moment the prior support for memmapped vector-arrays hasn't yet been maintained in this refactor, and the .uniform() call used (following the FT precedent) will 1st generate oversized np.float64numbers. Fixing those is next.

gensim/models/word2vec.py

gojomo · 2020-09-23T19:45:00Z

When the larger #2939 is merged, I propose rebasing this on that & applying - it fixes the original goal, and then some.

But, mmap functionality is currently broken in KeyedVectors, and likely other 2Vec classes as well - though in ways not covered by tests. That would best be addressed in a separate focused issue/PR - for which I've started #2955.

piskvorky · 2020-09-28T12:19:32Z

#2939 is merged now. @gojomo will you be able to finish this & the broken mmap?

I'd love to release this week. Only the documentation items in #2960 (for me) and this ticket (for you, hopefully) are left now.

Plus the decision around supporting pre-3.8.3 models, but that's best discussed live – see email.

gojomo · 2020-09-29T01:17:09Z

Subject to the provisos...

this either breaks or simply abandons the pretense that mmapping works for KeyedVectors instances (which will be fixed in a separate PR)
I had to disable the pytest-rerunfailures modules (which I don't think was active) to work around _pytest.resultlog was removed in pytest 6.1.0 pytest-dev/pytest-rerunfailures#128

...this is ready-for-final-review & merging!

gensim/models/doc2vec.py

gensim/models/fasttext.py

gensim/models/keyedvectors.py

mpenkov · 2020-09-29T07:20:29Z

gensim/models/keyedvectors.py

+
+def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL):
+    """Return a numpy array of the given shape. Reuse prior_vectors values instance or values
+    to extent possible. Initialize new values randomly if requested."""


This seems to be a public API function. We should document the parameters so they appear in the documentation.

It ultimately may not be, pending the mmap work & other initialization clean-up, which might also jostle the internal names a bit. (At the moment, this is only called from resize_vectors() which may be the preferable public-entry point, because outside callers are less-likely to have a prior_vectors.)

OK, but this stuff will show up in the docs, right?

I can think of several better ways forward:

Make the docstring a code-comment so it doesn't show up in the docs

Mark the function as internal

piskvorky

Is the PR description up to date? W.r.t. functionality contained within, and issues fixed by it.

PR titles and descriptions is what we have to go with for release notes. Plus that's what interested users see when they click through for more details on a change.

piskvorky · 2020-10-04T16:28:29Z

@gojomo is it still the case this doesn't fix #2853? Is the PR description up-to-date w.r.t. all the changes within?

gojomo · 2020-10-06T08:26:12Z

Title & top-comment updated to describe what's addressed. Any lingering TODOs/doc-comment/naming stuff will be addressed in a followup PR related to #2955 or others.

gojomo · 2020-10-06T19:44:09Z

So, that one distant test-failing is a real head-scratcher, as noted in the code-comment I've added in 1edbb4c:

    rng = np.random.default_rng(seed=seed)  # use new instance of numpy's recommended generator/algorithm               
    # FIXME: `uniform` passes all tests, but generates temporary double-sized np.float64 array,                         
    # then cast-down ito right-sized np.float32, which means momentary 3x RAM usage on the model's                      
    # largest structure (often GB in size)                                                                              
    new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(dtype)
    # Meanwhile, this alternative, which by docs/reasoning/visual-inspection should be equivalent                       
    # while never creating the unneeded oversized np.float64 array, passes all *2Vec class                              
    # functional tests, but mysteriously (but reliably!) fails one obscure barely-sensible test                         
    # of a fringe downstream functionality: `TestBackMappingTranslationMatric.test_infer_vector`.                       
    # I'd adjust or jettison that test entirely *except* that the failure is *so* reliable, and                         
    # *so* mysterious, that it may be warning of something very subtle. So for now, very briefly,                       
    # sticking with the RAM-wasteful-but-all-tests-passing approach above, TODO debug/fix ASAP.                         
    # new_vectors = rng.random(target_shape, dtype=dtype)  # [0.0, 1.0)                                                 
    # new_vectors *= 2.0  # [0.0, 2.0)                                                                                  
    # new_vectors -= 1.0  # [-1.0, 1.0)                                                                                 
    new_vectors /= vector_size

So, in order to get this PR mergeable, I've reverted the code to what works-but-is-inefficient, with that blaring comment commitment to remedy as part of the next PR. (The inefficiency is what FastText has been doing all along, and what passes all tests, so no little harm to develop, but still something I'd like to fix before 4.0.0-final.)

gensim/models/keyedvectors.py

…mment

gojomo · 2020-10-08T23:00:11Z

On further investigation, the flimsiness of the failing test is the real problem, and it was only passing in the 1st place via lucky seeding. (Over a range of seeds, either code alternative makes that test fail 20-30% of the time.) So I'll disable that test, & created #2977 to eventually – not at all urgently – clean up the related classes/docs.

piskvorky · 2020-10-09T10:23:48Z

Thanks! Merge will have to wait for @mpenkov's review, but I already branched off here for my tickets, so no more rebasing / force-pushing please (normal merges are OK).

mpenkov

Left some minor comments. Please have a look @gojomo .

gensim/models/doc2vec.py

mpenkov · 2020-10-09T23:57:15Z

gensim/models/keyedvectors.py

+            Vector dimensions will default to `np.float32` (AKA `REAL` in some Gensim code) unless
+            another type is provided here.
+        mapfile_path : string, optional
+            TODO: UNDER CONSTRUCTION / SUBJECT TO CHANGE - pending mmap work


What should we do about this parameter? I don't think leaving a TODO in publicly visible documentation is a good look.

Suggested change

TODO: UNDER CONSTRUCTION / SUBJECT TO CHANGE - pending mmap work

Currently unimplemented.

But really, if we're not currently using this parameter, let's just get rid of it.

Per my comments 4 days ago, I'm not proposing any of these comments become part of the permanent codebase - they're all just temporary until #2955 is addressed, to prevent this PR from growing to become a diverse grab-bag of things unrelated to its title and initial focus. As @piskvorky has already started building on this in another PR, and I've got other functional things that depend on this, polishing comments here just risks making the tangle of PRs-on-PRs worse.

gensim/models/keyedvectors.py

mpenkov · 2020-10-10T00:01:56Z

gensim/models/keyedvectors.py

+
+def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL):
+    """Return a numpy array of the given shape. Reuse prior_vectors values instance or values
+    to extent possible. Initialize new values randomly if requested."""


OK, but this stuff will show up in the docs, right?

I can think of several better ways forward:

Make the docstring a code-comment so it doesn't show up in the docs

Mark the function as internal

gensim/models/word2vec.py

gensim/test/test_translation_matrix.py

Co-authored-by: Michael Penkov <[email protected]>

gojomo · 2020-10-10T01:27:23Z

For some reason, github isn't showing any option to reply to @mpenkov's comment about prep_vectors. To that:

it's likely to change as part of Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955 / Restore/improve/streamline hooks for controlling/reusing build_vocab() steps #2975 further initialization cleanup
in its final form, it might still be a useful vector-resize-and-reinitialize routine for outside use, so visibility in API docs may not be any concern

So, I'd say: let's see how it settles in that other work.

gojomo · 2020-10-15T03:05:32Z

Does the approval mean it's OK for me to push the Github 'rebase & merge' button?

piskvorky · 2020-10-15T08:46:02Z

OK from me. Don't know about @mpenkov . Definitely no rebase though – just merge!

gojomo · 2020-10-15T22:14:06Z

I'm just looking at the default button in the Github interface (which I suspect results in the cleanest trunk history), and might not foul any history elsewhere... nor foul other similar merges). But: if you or @mpenkov can just merge this as soon as it's OK, that'll be quickest - in my opinion, this PR has long been functionally done, and I'd prefer for simplicity to base the followup PR off develop.

piskvorky · 2020-10-15T22:18:50Z

Okay. Merging here – @mpenkov please add any additional comments here / into @gojomo 's new PR.

gojomo mentioned this pull request Sep 12, 2020

save_facebook_model() - AssertionError #2853

Closed

piskvorky added this to the 4.0.0 milestone Sep 13, 2020

gojomo force-pushed the ft_save_after_update_vocab branch 4 times, most recently from cd2e326 to 52adad8 Compare September 14, 2020 03:54

piskvorky requested changes Sep 15, 2020

View reviewed changes

gensim/models/word2vec.py Show resolved Hide resolved

gensim/models/word2vec.py Outdated Show resolved Hide resolved

gensim/models/word2vec.py Outdated Show resolved Hide resolved

gojomo mentioned this pull request Sep 16, 2020

Fix doc entry for corpus_iterable in train #2945

Closed

This was referenced Sep 23, 2020

[MRG] *2Vec SaveLoad improvements #2939

Merged

Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955

Open

gojomo force-pushed the ft_save_after_update_vocab branch from 2387b5d to a21cad6 Compare September 28, 2020 22:59

gojomo changed the title ~~[WIP] save_facebook_model after update-vocab~~ [MRG] save_facebook_model after update-vocab Sep 29, 2020