Further focus/slim keyedvectors.py module #2873
Comments
@gojomo do you see this as essential for 4.0.0 = API breaking? Or can we leave it for a later release?
This is low-risk and not-hard, but also low-priority. Making the decision that some things should be relocated would be nicer to do in 4.0.0, along with other "update your imports/code/function-names" changes, but could wait.
My 1st thoughts would be:
Next, but lower-priority: some of the other utility methods may be amenable to more reuse. (Perhaps, the [...])

Finally, after everything else has settled, the methods should be reordered by importance and grouped by role, so the autogenerated documentation has the most-used stuff up top, and a casual top-to-bottom read makes more sense.
@piskvorky Sorry I missed this. I'm in the middle of a house move, so I'd rather not get involved until things have settled down. If this can wait a couple of weeks, then I'd be happy to pick it up then. It looks like my sort of thing, and I've done it a couple of times with gensim already.
Yes, it can. In order of urgency:
How about moving the high-level methods from keyedvectors.py to a separate wordtasks.py submodule? They could be pure functions there. For example:
None of the above operations modify the keyedvectors model; they are read-only. This would leave the lower-level IO stuff in keyedvectors.py, so serialization (loading/saving) shouldn't be affected, as far as I understand. The name wordtasks comes from the keyedvectors docstring:
WDYT?
@gojomo Let's remove this from the 4.0 milestone and deal with it later.
Pre-#2698, `keyedvectors.py` was 2500+ lines, including functionality over-specific to other models, and redundant classes. Post-#2698, with some added generic functionality, it's still over 1800 lines.

It should shed some other grab-bag utility functions that have accumulated and don't logically fit inside the `KeyedVectors` class.

In particular, the evaluation (analogies, word_ranks) helpers could move to their own module that takes a KV instance as an argument. (If other more-sophisticated evaluations can be contributed, as would be welcome, they should also live alongside those, rather than bloating `KeyedVectors`.)

The `get_keras_embedding` method, as its utility is narrow to very specific uses and it is conditional on a not-necessarily-installed package, could go elsewhere too: either a keras-focused utilities module, or even just documentation/example code about how to convert to/from keras from `KeyedVectors`.

Some of the more advanced word-vector-using calculations, like 'Word Mover's Distance' or 'Soft Cosine Similarity', could move to method-specific modules that are then better documented/self-contained/optimized, without bloating the generic 'set of vectors' module. (They might be more discoverable there, as well.)

And finally, some of the existing calculations could be unified/streamlined (especially the two variants of `most_similar()`, and some of the steps shared by multiple operations). My hope would be the module is eventually <1000 lines.
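As a rough illustration of the "own module that takes a KV instance as an argument" idea, an evaluation helper might look something like the sketch below. The function name `analogy_accuracy` and the `ToyVectors` class are hypothetical; the only thing assumed of `kv` is a `most_similar(positive=..., negative=..., topn=...)` method returning `(key, score)` pairs, as `KeyedVectors` provides. The toy stand-in does not reproduce gensim's normalization details.

```python
import math


def analogy_accuracy(kv, quads):
    """Fraction of (a, b, c, expected) analogies where the top hit equals `expected`.

    Reads from `kv` only through `kv.most_similar(...)`.
    """
    if not quads:
        return 0.0
    correct = 0
    for a, b, c, expected in quads:
        predicted = kv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        correct += predicted == expected
    return correct / len(quads)


def _cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class ToyVectors:
    """Hypothetical stand-in for a KV instance, for demonstration only."""

    def __init__(self, vectors):
        self.vectors = vectors

    def most_similar(self, positive, negative, topn=10):
        dim = len(next(iter(self.vectors.values())))
        target = [0.0] * dim
        for key in positive:
            target = [t + x for t, x in zip(target, self.vectors[key])]
        for key in negative:
            target = [t - x for t, x in zip(target, self.vectors[key])]
        used = set(positive) | set(negative)  # never return the query keys
        scored = [(key, _cosine(vec, target))
                  for key, vec in self.vectors.items() if key not in used]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:topn]


toy = ToyVectors({
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0], "apple": [0.0, 1.0],
})
print(analogy_accuracy(toy, [("man", "woman", "king", "queen")]))
```

With this shape, additional evaluations could be added next to `analogy_accuracy` without growing the `KeyedVectors` class itself.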