Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Phrases optimizations #837

Closed
wants to merge 8 commits into from
Closed

[WIP] Phrases optimizations #837

wants to merge 8 commits into from

Conversation

gojomo
Copy link
Collaborator

@gojomo gojomo commented Aug 25, 2016

  • avoids an unnecessary duplicate giant dictionary in common case of a single add_vocab() call
  • adds new Phraser helper class & supporting options on existing methods

Phraser takes a Phrases and does a single (time-consuming) pass to discover all the Phrases that it would want to create, saving those into a much much smaller (and somewhat faster) helper object. This can be saved & used separately.

Needs more testing.

@piskvorky
Copy link
Owner

Note to self: use the faster detection loop in Python.

@lev good student project: implement the inner loops as C (via Cython) optional extension (optional ala word2vec extension).

@piskvorky
Copy link
Owner

@gojomo @lev ready to merge? What's missing?

@tmylk tmylk mentioned this pull request Oct 4, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 4, 2016

Created #918 asking for volunteers to create tests

@tmylk tmylk added the difficulty easy Easy issue: required small fix label Oct 4, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 16, 2016

Merged in #954

@tmylk tmylk closed this Oct 16, 2016
@tmylk tmylk deleted the Phrases__opt branch January 13, 2017 02:23
@tmylk
Copy link
Contributor

tmylk commented Jan 13, 2017

@gojomo Should Phrases be deprecated? The existing warning about its slowness doesn't seem to deter users.

@gojomo
Copy link
Collaborator Author

gojomo commented Jan 14, 2017

@tmylk It's necessary to use a Phrases to construct a Phraser. Phraser takes a bunch of extra time to create, but then is slightly faster but much more compact. (The main benefit is memory, not speed, and certainly not speed-to-first-use, which is worse.)

Also, a Phraser essentially locks in the effective min_count and threshold values at the time of its creation – so if you want to try other values, you need to go back to a Phrases with the full frequency data.

So Phrases can't quite be deprecated or hidden... perhaps some renaming/refactoring and updated docs/examples would guide people to the best options for them. There's also the potential for using more memory-efficient approximate counts, to use less memory during initial Phrases analysis. I believe one or two folks have started down that road before, but their PRs never reached mergeable state. If there's someone with the interest/skills to tackle that, they might want to, at the same time, try to rationalize the interfaces/names for better clarity.

@piskvorky
Copy link
Owner

piskvorky commented Jan 15, 2017

@tmylk a fast C/Cython implementation of Phrases, ideally multicore, would be a great student project.

The entire "training" is essentially incrementing a counter, no reason it shouldn't be as fast as your input iterator provides.

Plus, phrases (collocations) are not going anywhere, so it's a stable module to invest more time into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants