
Look into contributing / integrating with polars? #19

Closed
jvdd opened this issue Jan 19, 2023 · 15 comments
Labels
enhancement New feature or request

Comments

@jvdd
Member

jvdd commented Jan 19, 2023

Integration / operability with polars can be realized in two manners: tsdownsample could support polars data types directly (avoiding any conversion on the user's side), and/or the underlying Rust algorithms could be integrated into polars itself.

This issue is, of course, open for discussion about how users / contributors see this.


On another note: perhaps argminmax's runtime speed / algorithms could be contributed to various parts of polars?

@jvdd jvdd added the enhancement New feature or request label Jan 19, 2023
@jvdd jvdd changed the title Look into contributing / integrating this to polars? Look into contributing / integrating with polars? Jan 19, 2023
@jmakov

jmakov commented Feb 12, 2023

I'd be interested in both points, since it currently seems impossible to use tsdownsample with polars without conversion. Also, related to the second point: it would be really great to see a comparison between argminmax and the current polars implementation, especially since I seem to have stumbled upon a regression (fixed for now): pola-rs/polars#6794 (my code calls polars_expr.arg_max() a lot in a for loop).

@jvdd
Member Author

jvdd commented Feb 13, 2023

Hi @jmakov, I'll touch upon both points here. But first, some questions.

  • Are you using polars in Python (and thus not in Rust)?
    -> If you are using Python polars, I suppose that by "conversion" you mean casting the data to a numpy array? Does this result in a large runtime and/or memory overhead?

Below I'll share a bit more about the underlying Rust code and how I envision this integration.


tsdownsample integration with polars

Some background 🙃: At the time of creating this library I was mainly focused on creating Python bindings for (highly optimized) downsampling algorithms written in Rust. Back then I was only concerned with supporting numpy arrays. As a result, the implemented algorithms in this library are very tightly coupled to ndarray::ArrayView1<T>. This currently limits the flexibility / other use cases that the underlying Rust code could serve (e.g., integrating with polars).
=> So, the current tsdownsample architecture does not (conveniently) allow integration with other (array-like) data types.

How can we solve this?

Changing the implementation to slices (&[T]) and adding the downsampling algorithms to one or multiple traits should conveniently allow (i.e., with very little additional code) using tsdownsample's downsampling algorithms on different data types (e.g., Vec, arrow::PrimitiveArray, polars data types, ...).

With this in mind, I refactored argminmax (which is truly the core of the downsampling algorithms) to operate on &[T] and conveniently added support for Vec and Apache Arrow, see jvdd/argminmax#13.
Somewhere in the future (can't really give a proper timeline) I plan to do the same for tsdownsample.

argminmax and polars

Some related issues can be found in the argminmax repo.

(very basic) benchmarks

When comparing it with polars, I notice a ~10x performance gain for smaller arrays (e.g., 100k values); for larger arrays (e.g., 50M values) the gain is around 3x.

Results for 100k values ⬇️

argminmax: 0.000556642 sec
polars: 0.005885739 sec

My benchmarking code is below.

Disclaimer: this is certainly not best practice!! - but the results are still quite indicative.

main.rs ⬇️

use argminmax::ArgMinMax;
use polars::prelude::*;
use rand::Rng;
use std::time::Instant;

fn get_random_array(n: usize) -> Vec<i32> {
    let mut rng = rand::thread_rng();
    let mut v = Vec::with_capacity(n);
    for _ in 0..n {
        v.push(rng.gen_range(0, 100));
    }
    v[0] = i32::MAX;
    v
}


fn main() {
    let n: usize = 100_000;
    let n_repeat: usize = 50;
    let v: Vec<i32> = get_random_array(n);

    let start = Instant::now();
    for _ in 0..n_repeat {
        let (_, max_index) = v.argminmax();
        assert_eq!(max_index, 0);
    }
    let elapsed = start.elapsed();
    println!("argminmax: {}.{:09} sec", elapsed.as_secs(), elapsed.subsec_nanos());


    // Create a polars Series
    let s = Series::new("foo", v);

    let start = Instant::now();
    for _ in 0..n_repeat {
        let max_index = s.arg_max();
        assert_eq!(max_index, Some(0));
    }
    let elapsed = start.elapsed();
    println!("polars: {}.{:09} sec", elapsed.as_secs(), elapsed.subsec_nanos());
}

Cargo.toml ⬇️

[package]
name = "polars_bench"
version = "0.1.0"
edition = "2021"

[dependencies]
argminmax = { git = "https://github.com/jvdd/argminmax" }
polars = { version = "0.27.2", features = ["simd"] }
rand = { version = "0.7.2" }
rand_distr = { version = "0.2.2", default-features = false }

Hope this answer helps! I'm interested in hearing what you think of this.

Cheers,
Jeroen

P.S. you are (of course) always welcome to start contributing to either package or further extend (e.g., in Python) the above benchmarks and share your findings here 😊

@jmakov

jmakov commented Feb 13, 2023

Thanks for the quick response. I'm using polars from Python. I guess I could try it with numpy arrays only.

Another question: in my code I'm looking for the first occurrence (index) where a value is greater than a given value (in the same column). I wanted to call argminmax from tsdownsample, but having a quick look at the tsdownsample code I don't find an obvious way to call it on my numpy array. Am I missing something? I only checked the Python code since I'm not familiar with Rust.

@jvdd
Member Author

jvdd commented Feb 13, 2023

Indeed! I haven't created any proper numpy (or Python) bindings for argminmax.

One (somewhat hacky) solution is to use argminmax through tsdownsample's MinMaxDownsampler (which uses argminmax to return the indices of the min & max values) -> calling MinMaxDownsampler with n_out=2 calls argminmax exactly once for the whole array.

Here is a reusable Python implementation ⬇️

from typing import Tuple

import numpy as np
from tsdownsample import MinMaxDownsampler

downsampler = MinMaxDownsampler()

def argminmax(arr: np.ndarray) -> Tuple[int, int]:
    """Return (argmin_index, argmax_index) for the given array."""
    # Call argminmax exactly once on the data (by downsampling it to 2 datapoints)
    idx1, idx2 = downsampler.downsample(arr, n_out=2)
    # Return the (argmin_index, argmax_index)
    if arr[idx1] < arr[idx2]:
        # idx1 is the argmin_index
        return idx1, idx2
    else:
        # idx1 is the argmax_index
        return idx2, idx1

But if you will be working with numpy arrays, you might as well just call np.argmin / np.argmax 🤔 - should give +/- the same performance (except for f16).
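For completeness, the plain-numpy equivalents of the helper above are one-liners (the example data here is mine):

```python
import numpy as np

arr = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.6])

min_idx = int(np.argmin(arr))  # index of the smallest value
max_idx = int(np.argmax(arr))  # index of the largest value
# min_idx == 1, max_idx == 4 for this array
```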

@jmakov

jmakov commented Feb 13, 2023

I was already calling np.argmax, then refactored to polars and saw a big perf increase (15 min on a 116-core cluster down to 5 min on one 24-core node). Oh, OK - I thought argminmax would give an additional boost. Thank you very much for the clarifications!

@jvdd
Member Author

jvdd commented Feb 13, 2023

Can you share some (minimal) Python code of how you use polars for this? Do I understand correctly that polars performs argmin / argmax multi-threaded?

=> If polars calls argmin / argmax multi-threaded, I am quite confident that integrating argminmax in polars will result in a further performance boost (the 3-10x performance gain vs. polars I talked about above is just single-threaded execution).

argminmax scales really well multi-threaded (as argmin / argmax is a CPU- and memory-bound problem) - this is exactly what we observed in the multi-threaded MinMaxDownsampler!

@jmakov

jmakov commented Feb 13, 2023

I'm using ray.io to distribute work and make polars use only 1 process. The code is taken from pola-rs/polars#6794, where I first reported the performance regression and started looking for alternatives:

test.py ⬇️
import numpy
import polars

numpy.random.seed(42)


df_size = 10**5

df = polars.DataFrame({"idx": range(0, df_size), "col1": numpy.random.randint(1, 1_000, df_size)})

def get_idx_where_first_value_greater(df):
    for row in df.iter_rows():
        idx = row[0]
        (df[idx, 1] < df[idx:, 1]).arg_max()

get_idx_where_first_value_greater(df)

You can run it like this to prevent multithreading:
export POLARS_MAX_THREADS=1 && time python test.py
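For readers without a polars install, the loop above can be mimicked in plain numpy to see what it computes: for each row i, the first index (relative to i) at which a later value exceeds col[i]. A sketch with small made-up data (note the caveat that np.argmax on a boolean mask returns 0 when there is no match):

```python
import numpy as np

col = np.array([5, 1, 9, 2])

def get_idx_where_first_value_greater(col):
    # Mirrors the polars loop: for each i, the first index (relative
    # to i) where col[i] < col[i:]. np.argmax on a bool mask returns
    # the first True - and 0 when no element matches.
    return [int(np.argmax(col[i] < col[i:])) for i in range(len(col))]

# get_idx_where_first_value_greater(col) == [2, 1, 0, 0]
```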

@jvdd
Member Author

jvdd commented Feb 13, 2023

I monitored the runtime of get_idx_where_first_value_greater and observed that by far the largest chunk (95+%) of the time goes to the (df[idx, 1] < df[idx:, 1]) operation (and not the arg_max).

See my experiment below ⬇️

import polars
import numpy as np
import timeit

np.random.seed(42)

df_size = 10**4  # set this to 10x lower (otherwise the benchmark takes way too long)

df = polars.DataFrame({"idx": range(0, df_size), "col1": np.random.randint(1, 1_000, df_size)})

# Only create a bool Series
def get_idx_where_first_value_greater_cmp(df):
    for row in df.iter_rows():
        idx = row[0]
        (df[idx, 1] < df[idx:, 1]) #.arg_max()

# Original code
def get_idx_where_first_value_greater_arg_max(df):
    for row in df.iter_rows():
        idx = row[0]
        (df[idx, 1] < df[idx:, 1]).arg_max()

print("polars diff")
res = timeit.timeit('get_idx_where_first_value_greater_cmp(df)', globals=globals(), number=7)
print(res, "s")

print("polars diff argmax")
res = timeit.timeit('get_idx_where_first_value_greater_arg_max(df)', globals=globals(), number=7)
print(res, "s")

Can you confirm this?

@jmakov

jmakov commented Feb 13, 2023

In the issue I referenced, somebody ran py-spy, and from that it seemed most of the time was spent in .arg_max(); the fix also addressed that and the regression went away.
I ran my original example with and without .arg_max() commented out. The original script takes about 6.14 s; with .arg_max() commented out, about 5.7 s. Looks like you're right - I didn't consider that at all; most of the time is taken by the "<" operation...

@jvdd
Member Author

jvdd commented Feb 13, 2023

I ran line_profiler and this seems to confirm that (df[idx, 1] < df[idx:, 1]) is the bottleneck:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    10                                           def get_idx_where_first_value_greater(df):
    11    100000  114806212.0   1148.1      0.4      for row in df.iter_rows():
    12    100000   48883868.0    488.8      0.2          idx = row[0]
    13    100000 25935368268.0 259353.7     98.4         d = (df[idx, 1] < df[idx:, 1])
    14    100000  265593691.0   2655.9      1.0          d.arg_max()
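As an aside (not something discussed in this thread): the quadratic cost of redoing the comparison for every row can be avoided entirely for this particular problem. "First later index with a greater value" is the classic next-greater-element question, solvable in a single O(n) pass with a stack. A hypothetical sketch:

```python
def next_greater_index(col):
    """res[i] = smallest j > i with col[j] > col[i] (absolute index),
    or -1 if no later value is greater - one O(n) pass instead of the
    O(n^2) per-row comparisons profiled above."""
    res = [-1] * len(col)
    stack = []  # indices whose "next greater" has not been found yet
    for j, v in enumerate(col):
        # Every index on the stack with a smaller value has just
        # found its next greater element at position j
        while stack and col[stack[-1]] < v:
            res[stack.pop()] = j
        stack.append(j)
    return res

# next_greater_index([5, 1, 9, 2]) == [2, 2, -1, -1]
```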

@jmakov

jmakov commented Feb 13, 2023

Thanks! I think that solves the issue with looking for faster argmax :)

@jvdd
Member Author

jvdd commented Feb 13, 2023

I'm not really sure about this, but I think using arg_where might suit your problem better than arg_max.
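In plain numpy terms (hedging here, since I'm guessing the exact lookup), "first index where the value exceeds a threshold" boils down to argmax over a boolean mask - which is roughly the lookup arg_where would express more directly:

```python
import numpy as np

arr = np.array([3, 7, 2, 9, 9])
threshold = 5

mask = arr > threshold
# np.argmax returns the first True; guard against the "no match"
# case, where it would silently return 0.
first_idx = int(np.argmax(mask)) if mask.any() else -1
# first_idx == 1 here
```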

Was fun toying around with some polars code! 👌

@jmakov

jmakov commented Feb 13, 2023

Didn't know that existed, will check it out, thanks!

@jvdd
Member Author

jvdd commented Jul 5, 2023

Closing as pola-rs/polars#8074 brings argminmax to polars

@jvdd jvdd closed this as completed Jul 5, 2023
@kszlim

kszlim commented Jan 4, 2024

It would be great to support this package's high-level functionality with Arrow arrays and/or polars DataFrames/Series directly too.
