Improve non-GCC implementation of statistics functions Quantile/Rank #77

AdamGlustein · 2024-02-13T16:24:09Z

Is your feature request related to a problem? Please describe.
Due to the policy-based data structures (PBDS) library being gcc specific, I had to change the implementation of Quantile and Rank in cpp/cppnodes/statsimpl.cpp to not use an order statistics tree. Instead, they use a std::multiset to store sorted values, and iterate on that set in linear time to find each quantile/rank when triggered. This is much more inefficient than the O(log n) lookup in the PBDS tree.

Describe the solution you'd like
It would be helpful if someone implemented a balanced (red-black) order statistics tree (OST) natively in csp. The OST is a binary search tree with the additional invariant that each node contains the size of its own subtree (https://en.wikipedia.org/wiki/Order_statistic_tree). This allows for efficient (log n time) insert, find, delete, and index-based lookup.

Describe alternatives you've considered
There are actually 2 data structures that will achieve O(log n) insert, delete and index-based lookup

The order statistics tree mentioned above
An indexable skiplist. This is what pandas uses for their rolling quantiles. However, pandas has the advantage of static data, so they can size it up front knowing the maximum size of the data interval. In a time-windowed csp stats node, we do not know the maximum number of ticks which will fall in an interval, so sizing becomes a bit trickier (I think we'd have to resize/add links dynamically at powers of 2). Here is a link explaining the impl https://rhettinger.wordpress.com/2010/02/06/lost-knowledge/.

Additional context
This would be helpful for any non gcc users and would also make the code more readable since we're no longer using 2 different approaches for the same stats calculation.

AdamGlustein · 2024-02-13T17:29:18Z

May be able to leverage the boost multindex datat structure for this:
https://www.boost.org/doc/libs/1_75_0/libs/multi_index/doc/tutorial/indices.html#rnk_indices

fixes Point72#77 Signed-off-by: Mohit Chhaya <[email protected]> Signed-off-by: Tim Paine <[email protected]>

AdamGlustein added good first issue Good issue for first-time contributors type: feature Issues and PRs related to new features lang: c++ Issues and PRs related to the C++ codebase labels Feb 13, 2024

mrchhaya mentioned this issue Jul 11, 2024

Improve non-GCC implementation of statistics functions Quantile/Rank #322

Closed

timkpaine mentioned this issue Jul 15, 2024

Improve implementation of statistics functions Quantile/Rank using boost::multi_index #333

Merged

timkpaine closed this as completed in #333 Jul 17, 2024

timkpaine closed this as completed in c2be3cf Jul 17, 2024

wrieg123 pushed a commit to wrieg123/csp that referenced this issue Jul 20, 2024

Improve non-GCC implementation of statistics functions Quantile/Rank,

c640f04

fixes Point72#77 Signed-off-by: Mohit Chhaya <[email protected]> Signed-off-by: Tim Paine <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve non-GCC implementation of statistics functions Quantile/Rank #77

Improve non-GCC implementation of statistics functions Quantile/Rank #77

AdamGlustein commented Feb 13, 2024

AdamGlustein commented Feb 13, 2024

Improve non-GCC implementation of statistics functions Quantile/Rank #77

Improve non-GCC implementation of statistics functions Quantile/Rank #77

Comments

AdamGlustein commented Feb 13, 2024

AdamGlustein commented Feb 13, 2024