[WIP] Phrases optimizations #837

gojomo · 2016-08-25T09:11:42Z

avoids an unnecessary duplicate giant dictionary in the common case of a single add_vocab() call
adds new Phraser helper class & supporting options on existing methods

Phraser takes a Phrases and does a single (time-consuming) pass to discover all the phrases it would create, saving those into a much smaller (and somewhat faster) helper object. This can be saved & used separately.

Needs more testing.

piskvorky · 2016-08-25T12:31:39Z

Note to self: use the faster detection loop in Python.

@lev good student project: implement the inner loops as an optional C extension (via Cython), à la the word2vec extension.

piskvorky · 2016-09-02T05:25:58Z

@gojomo @lev ready to merge? What's missing?

tmylk · 2016-10-04T14:01:17Z

Created #918 asking for volunteers to create tests

tmylk · 2016-10-16T09:53:46Z

Merged in #954

tmylk · 2017-01-13T02:27:28Z

@gojomo Should Phrases be deprecated? The existing warning about its slowness doesn't seem to deter users.

gojomo · 2017-01-14T21:47:03Z

@tmylk It's necessary to use a Phrases to construct a Phraser. A Phraser takes extra time to create, but once built it is slightly faster and much more compact. (The main benefit is memory, not speed, and certainly not speed-to-first-use, which is worse.)

Also, a Phraser essentially locks in the effective min_count and threshold values at the time of its creation – so if you want to try other values, you need to go back to a Phrases with the full frequency data.

So Phrases can't quite be deprecated or hidden... perhaps some renaming/refactoring and updated docs/examples would guide people to the best options for them. There's also the potential for using more memory-efficient approximate counts, to use less memory during initial Phrases analysis. I believe one or two folks have started down that road before, but their PRs never reached mergeable state. If there's someone with the interest/skills to tackle that, they might want to, at the same time, try to rationalize the interfaces/names for better clarity.
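A minimal pure-Python sketch of the trade-off described above, not gensim's actual code: `count_vocab`, `score`, `freeze` and `raise_threshold` are hypothetical names, and the scoring formula is assumed to be the word2vec-style `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. It shows why the compact object can only be made stricter: once the full counts are dropped, bigrams that failed the original cut cannot be recovered.

```python
from collections import Counter


def count_vocab(sentences):
    """Full-statistics stage: count every unigram and adjacent bigram."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def score(unigrams, bigrams, pair, min_count):
    """word2vec-style collocation score (assumed formula, see lead-in)."""
    a, b = pair
    vocab_size = len(unigrams) + len(bigrams)
    return (bigrams[pair] - min_count) * vocab_size / (unigrams[a] * unigrams[b])


def freeze(unigrams, bigrams, min_count, threshold):
    """Compact stage: keep only the bigrams that pass, with their scores."""
    frozen = {}
    for pair, count in bigrams.items():
        if count < min_count:
            continue
        s = score(unigrams, bigrams, pair, min_count)
        if s > threshold:
            frozen[pair] = s
    return frozen


def raise_threshold(frozen, new_threshold):
    """Raising the threshold needs only the stored scores.

    Lowering it is impossible here: bigrams dropped by freeze() are gone.
    """
    return {pair: s for pair, s in frozen.items() if s > new_threshold}


sentences = [
    ["new", "york", "is", "big"],
    ["new", "york", "city"],
    ["a", "big", "city"],
]
unigrams, bigrams = count_vocab(sentences)
frozen = freeze(unigrams, bigrams, min_count=1, threshold=1.0)
stricter = raise_threshold(frozen, 5.0)
```

Here `frozen` holds a single entry while `bigrams` still holds all six counted pairs, which is the memory saving; trying a lower `threshold` or `min_count` means going back to `unigrams` and `bigrams`.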

piskvorky · 2017-01-15T02:36:35Z

@tmylk a fast C/Cython implementation of Phrases, ideally multicore, would be a great student project.

The entire "training" is essentially incrementing a counter, so there is no reason it shouldn't run as fast as your input iterator can supply data.

Plus, phrases (collocations) are not going anywhere, so it's a stable module to invest more time into.
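To illustrate the "incrementing a counter" point and why a multicore version looks feasible, here is a rough sketch in plain Python (hypothetical helper names `count_chunk` and `merge`, not part of gensim). Because counts are additive, each chunk of the corpus could in principle be counted by a separate worker and the partial results summed; the sketch does the chunks sequentially to stay simple.

```python
from collections import Counter


def count_chunk(sentences):
    """Count unigrams and adjacent bigrams for one chunk of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)                      # unigram counts
        counts.update(zip(sentence, sentence[1:]))   # bigram counts
    return counts


def merge(partials):
    """Sum partial counters; the order of merging does not matter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "city"],
    ["a", "big", "city"],
]

single_pass = count_chunk(corpus)

# Each chunk could be handed to a separate worker process.
chunks = [corpus[:2], corpus[2:]]
merged = merge(count_chunk(chunk) for chunk in chunks)
```

The merged result matches the single-pass count, so the remaining cost is the merge step and the memory held by each partial counter.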

gojomo added 6 commits August 23, 2016 22:05

reuse dict if possible; configurable log-freq

fe17c10

really reuse (missing line)

d7685db

initial 'Phraser': tiny/faster post-analysis phrasing

98dc82a

log load-finished time

7b45071

Phraser fully functional

2235799

rm trailing spaces

75d2344

gojomo added 2 commits August 25, 2016 18:22

fix test; match prior ' '-joined phrases

9752a15

check threshold; allows adjust up (not down) after initial build

9a0070c

tmylk mentioned this pull request Oct 4, 2016

Add tests for #837 #918

Closed

tmylk added the "difficulty easy" label (Easy issue: required small fix) Oct 4, 2016

anujkhare mentioned this pull request Oct 16, 2016

[WIP] Phrases optimizations with tests #954

Merged

tmylk pushed a commit that referenced this pull request Oct 16, 2016

[WIP] Phrases optimizations with tests. Includes #837 (#954)

dd10cc1

tmylk closed this Oct 16, 2016

tmylk deleted the Phrases__opt branch January 13, 2017 02:23
