
Using pre-trained word2vec models in doc2vec #1338

Closed · bgokden opened this issue May 19, 2017 · 3 comments

bgokden commented May 19, 2017

Is there a practical way of using pre-trained word2vec models in doc2vec?

There is a forked version of Gensim that does this, but it is quite old.
Referenced here: https://github.com/jhlau/doc2vec
Forked Gensim here: https://github.com/jhlau/gensim

Otherwise, I would like to add this feature, as jhlau did, and merge it back.

gojomo (Collaborator) commented May 19, 2017

You can manually patch up a model to insert word-vectors from elsewhere before training. The existing intersect_word2vec_format() may be useful, directly or as an example: it assumes you've already created a model with its own vocabulary (including the frequency info needed for negative sampling or frequent-word downsampling), but then want to use some external source to replace some or all of the word-vector values.
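For illustration, a minimal sketch of that flow. It assumes a gensim version where `intersect_word2vec_format()` is reachable from the `Doc2Vec` model (older releases inherited it from `Word2Vec`; in gensim 4.x it lives on `model.wv` instead), and the vector file name is a placeholder:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; each document needs at least one unique tag.
corpus = [
    TaggedDocument(words="the quick brown fox".split(), tags=[0]),
    TaggedDocument(words="jumped over the lazy dog".split(), tags=[1]),
]

# vector_size must match the dimensionality of the external vectors.
model = Doc2Vec(vector_size=300, dm=1, min_count=1, epochs=20)

# Build the vocabulary (and the frequency info mentioned above) first...
model.build_vocab(corpus)

# ...then overwrite vectors for in-vocabulary words from the external file.
# lockf=1.0 lets training keep adjusting the imported vectors; 0.0 freezes them.
model.intersect_word2vec_format("external_word2vec.bin", binary=True, lockf=1.0)

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```

Words in your vocabulary that are missing from the external file simply keep their random initialization.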

I personally don't think the case for such re-use is yet strong; indeed, in some of the often top-performing Doc2Vec training modes (like pure PV-DBOW), input word-vectors aren't trained or used at all, so loading them would be completely superfluous. You can see some discussion of related issues, including links to messages elsewhere, in the GitHub issue thread: #1270 (comment)

maohbao commented Dec 17, 2019

This fork supports the latest gensim (3.8) and can train a doc2vec model with pre-trained word2vec vectors.

https://github.com/maohbao/gensim

gojomo (Collaborator) commented Dec 18, 2019

As per above, I think the evidence for the benefit of such a technique is muddled.

Also: it should be possible simply by poking/prodding a standard model at the right points between instantiation and training, without any major changes or new parameters to the relevant models, and without using a forked version of gensim (which will drift further away from other changes/fixes over time).
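As a rough illustration of that kind of poking/prodding, here is a sketch using gensim 4.x names (`index_to_key`, `get_index`; gensim 3.x used `index2word` and `vocab[word].index`), with a placeholder path for the pre-trained vectors:

```python
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Load external word-vectors (placeholder path).
pretrained = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

corpus = [
    TaggedDocument(words="the quick brown fox".split(), tags=[0]),
    TaggedDocument(words="jumped over the lazy dog".split(), tags=[1]),
]

# Instantiate a standard model and build its vocabulary as usual;
# the model's vector_size must match the pre-trained vectors.
model = Doc2Vec(vector_size=pretrained.vector_size, dm=1, min_count=1, epochs=20)
model.build_vocab(corpus)

# Poke the pre-trained values into the model's word-vectors before training;
# words absent from the external file keep their random initialization.
for word in model.wv.index_to_key:
    if word in pretrained:
        model.wv.vectors[model.wv.get_index(word)] = pretrained[word]

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```

No forked gensim or new parameters are needed; everything above is the stock API.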
