Option to pass Dictionary.docFreq to TfidfModel.__init__() #8
Comments
Ok, I added the option to initialize This can help if a) we have the dictionary in the first place (the corpus was constructed through I also PEP8-fied the tfidf code while I was at it.
I find it interesting that your commit message implies this change is only beneficial "if your corpus is super slow". Either way, I just did one test run, and I'm not seeing a noticeable speedup yet, which surprises me...
Yeah, it will always be faster, no question about it. That comment meant that in a broader perspective (computing similarities, LSI, LDA, ...), the one extra pass that increments a few numbers is negligible. Unless the pass (corpus iteration) itself is very expensive -- then it matters. In some cases, it might matter a lot, which in my opinion outweighs the added complexity of the code; that's why I accepted this feature.
Aha. I see. Thanks.
TfidfModel.initialize() calculates document frequencies for tokens. However, these are also calculated when creating a Dictionary object. TfidfModel.__init__() can therefore take a keyword argument that supplies Dictionary.docFreq, so the document frequencies do not have to be recalculated.
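To make the request concrete, here is a minimal standard-library sketch of the idea. It is not the project's actual code: `SketchTfidf`, `count_doc_freqs`, and the `doc_freqs` / `num_docs` keyword names are hypothetical, and the weighting (raw term frequency times log2 of inverse document frequency, no normalization) is an assumption chosen for brevity. It only shows how precomputed document frequencies could let the constructor skip its pass over the corpus.

```python
import math
from collections import Counter


def count_doc_freqs(corpus):
    """One pass over a bag-of-words corpus.

    Returns (number of documents, {token_id: number of documents containing it}).
    This is the same bookkeeping a dictionary object could already have done.
    """
    doc_freqs = Counter()
    num_docs = 0
    for doc in corpus:
        num_docs += 1
        # A set, so each token counts at most once per document.
        doc_freqs.update({token_id for token_id, _ in doc})
    return num_docs, dict(doc_freqs)


class SketchTfidf:
    """Hypothetical stand-in for a tf-idf model; names are illustrative only."""

    def __init__(self, corpus=None, doc_freqs=None, num_docs=None):
        if doc_freqs is not None and num_docs is not None:
            # Precomputed statistics supplied: no corpus iteration needed.
            self.num_docs, self.doc_freqs = num_docs, dict(doc_freqs)
        elif corpus is not None:
            # Fallback: the extra pass discussed in the comments.
            self.num_docs, self.doc_freqs = count_doc_freqs(corpus)
        else:
            raise ValueError("pass either corpus or doc_freqs with num_docs")
        self.idfs = {
            token_id: math.log2(self.num_docs / df)
            for token_id, df in self.doc_freqs.items()
        }

    def __getitem__(self, doc):
        # Weight each (token_id, term_frequency) pair by its idf.
        return [(t, tf * self.idfs[t]) for t, tf in doc if t in self.idfs]


# Usage: both construction paths yield the same weights.
corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)], [(1, 1)]]
num_docs, doc_freqs = count_doc_freqs(corpus)
from_corpus = SketchTfidf(corpus=corpus)
from_stats = SketchTfidf(doc_freqs=doc_freqs, num_docs=num_docs)
assert from_corpus.idfs == from_stats.idfs
```

The saving is exactly one corpus iteration, which matches the point made in the comments: it only matters when iterating the corpus is itself expensive.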