Speed up NdMapping.groupby #349
    try:
        import pandas
        ndmapping_groupby = ndmapping_groupby_pandas
    except:
        ndmapping_groupby = ndmapping_groupby_python
I would make this a parameterized function (i.e. a class) that uses one of two possible methods in `__call__`. My only other comment is that we need some docstrings here...
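To make the suggestion concrete, here is a minimal sketch of a callable class that picks one of two grouping methods in `__call__`. It is an illustration only, not the code in this PR: the class name `GroupBy`, the `backend` argument, and the `(key_tuple, value)` item layout are all assumptions, and a plain Python class stands in for a real parameterized function.

```python
from collections import OrderedDict

try:
    import pandas as pd
except ImportError:  # pandas is optional in this sketch
    pd = None


class GroupBy:
    """Hypothetical callable grouping tuple-keyed items by some dimensions.

    `items` is a list of (key_tuple, value) pairs and `dimensions` names
    each position of the key tuple. Not the HoloViews implementation.
    """

    def __init__(self, backend=None):
        # Default to pandas when it is importable, otherwise pure Python.
        self.backend = backend or ("pandas" if pd is not None else "python")

    def __call__(self, items, dimensions, by):
        """Dispatch to one of the two grouping methods."""
        if self.backend == "pandas":
            return self._groupby_pandas(items, dimensions, by)
        return self._groupby_python(items, dimensions, by)

    @staticmethod
    def _split(key, idx):
        """Split a key into (grouped part, remaining part)."""
        group_key = tuple(key[i] for i in idx)
        rest = tuple(k for i, k in enumerate(key) if i not in idx)
        return group_key, rest

    def _groupby_python(self, items, dimensions, by):
        """Group with a single pass over the items."""
        idx = [dimensions.index(d) for d in by]
        groups = OrderedDict()
        for key, value in items:
            gkey, rest = self._split(key, idx)
            groups.setdefault(gkey, []).append((rest, value))
        return groups

    def _groupby_pandas(self, items, dimensions, by):
        """Let pandas partition the row positions, then collect the items."""
        idx = [dimensions.index(d) for d in by]
        df = pd.DataFrame([key for key, _ in items], columns=dimensions)
        groups = OrderedDict()
        # sort=False keeps groups in order of first appearance.
        for _, frame in df.groupby(list(by), sort=False):
            for pos in frame.index:
                key, value = items[pos]
                gkey, rest = self._split(key, idx)
                groups.setdefault(gkey, []).append((rest, value))
        return groups
```

With this shape, callers only ever see one entry point, e.g. `GroupBy()(items, ['x', 'y'], ['x'])`, and the choice of implementation stays an internal detail that can be overridden with `GroupBy(backend='python')`.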
Definitely a very valuable PR: it cleans up ... I'm happy to implement my suggestion of turning ...
Great! The key thing is that all the tests are passing now... My only comment now is whether you are happy for me to make this into a parameterized function? Or do you object to having a single parameterized function for groupby? Shouldn't take long to do and I am happy to change it if you are busy.
Must have missed my comment somehow:
Sorry yes - I skimmed your reply too quickly. I'll do the final refactor now.
Avoids hardcoding the 'Index' dimension used for NdElement types
If the tests pass now, I'll go ahead and merge.
Significant speed up of NdMapping.groupby
Excellent!
This PR refactors the NdMapping.groupby operation into a separate function and provides an alternative implementation based on Pandas, which is significantly faster for large datasets. You can see the linear performance scaling of the old implementation and might just make out the sublinear scaling of the pandas version, which becomes very significant for large datasets of more than 10,000 items. This is a temporary workaround until we come up with a general data API for NdMapping types, which is being discussed in #347.
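For anyone wanting to check the scaling claim themselves, a rough sketch of how such a measurement could be set up is shown below. It times only a simplified pure-Python grouping; the function names, the synthetic `(key_tuple, value)` data, and the chosen sizes are assumptions for illustration and do not reproduce the benchmark behind this PR.

```python
import timeit
from collections import defaultdict


def groupby_python(items, index):
    """Pure-Python grouping: one pass over (key_tuple, value) pairs."""
    groups = defaultdict(list)
    for key, value in items:
        groups[key[index]].append((key, value))
    return groups


def time_groupby(sizes, n_groups=10, repeat=3):
    """Return {size: best wall-clock seconds} for groupby_python."""
    timings = {}
    for n in sizes:
        # Synthetic items: first key component cycles through n_groups values.
        items = [((i % n_groups, i), i) for i in range(n)]
        best = min(timeit.repeat(lambda: groupby_python(items, 0),
                                 number=1, repeat=repeat))
        timings[n] = best
    return timings


if __name__ == "__main__":
    for n, seconds in time_groupby([1000, 10000, 100000]).items():
        print(f"{n:>7} items: {seconds:.4f} s")
```

Swapping another implementation into `time_groupby` and plotting both sets of timings against the item count would give the kind of comparison described above.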