Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve non-GCC implementation of statistics functions Quantile/Rank #77

Closed
AdamGlustein opened this issue Feb 13, 2024 · 1 comment · Fixed by #333
Closed

Improve non-GCC implementation of statistics functions Quantile/Rank #77

AdamGlustein opened this issue Feb 13, 2024 · 1 comment · Fixed by #333
Labels
good first issue Good issue for first-time contributors lang: c++ Issues and PRs related to the C++ codebase type: feature Issues and PRs related to new features

Comments

@AdamGlustein
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
Due to the policy-based data structures (PBDS) library being gcc specific, I had to change the implementation of Quantile and Rank in cpp/cppnodes/statsimpl.cpp to not use an order statistics tree. Instead, they use a std::multiset to store sorted values, and iterate on that set in linear time to find each quantile/rank when triggered. This is much more inefficient than the O(log n) lookup in the PBDS tree.

Describe the solution you'd like
It would be helpful if someone implemented a balanced (red-black) order statistics tree (OST) natively in csp. The OST is a binary search tree with the additional invariant that each node contains the size of its own subtree (https://en.wikipedia.org/wiki/Order_statistic_tree). This allows for efficient (log n time) insert, find, delete, and index-based lookup.

Describe alternatives you've considered
There are actually 2 data structures that will achieve O(log n) insert, delete and index-based lookup

  1. The order statistics tree mentioned above
  2. An indexable skiplist. This is what pandas uses for their rolling quantiles. However, pandas has the advantage of static data, so they can size it up front knowing the maximum size of the data interval. In a time-windowed csp stats node, we do not know the maximum number of ticks which will fall in an interval, so sizing becomes a bit trickier (I think we'd have to resize/add links dynamically at powers of 2). Here is a link explaining the impl https://rhettinger.wordpress.com/2010/02/06/lost-knowledge/.

Additional context
This would be helpful for any non gcc users and would also make the code more readable since we're no longer using 2 different approaches for the same stats calculation.

@AdamGlustein AdamGlustein added good first issue Good issue for first-time contributors type: feature Issues and PRs related to new features lang: c++ Issues and PRs related to the C++ codebase labels Feb 13, 2024
@AdamGlustein
Copy link
Collaborator Author

May be able to leverage the boost multindex datat structure for this:
https://www.boost.org/doc/libs/1_75_0/libs/multi_index/doc/tutorial/indices.html#rnk_indices

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good issue for first-time contributors lang: c++ Issues and PRs related to the C++ codebase type: feature Issues and PRs related to new features
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant