
Add method fold_with to Producer and UnindexedProducer #184

Merged: 2 commits into rayon-rs:master, Dec 21, 2016

Conversation

@bluss (Contributor) commented Dec 19, 2016

This method allows the producers to customize the inner loop. This is
useful for iterators that can provide a more efficient inner loop than
the default for loop.

Example: multidimensional arrays.
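
(For context, the default added here is just a consuming loop over the producer; roughly the following, pieced together from the review snippets further down, with the full() early-exit an approximation of the break shown there.)

fn fold_with<F>(self, mut folder: F) -> F
    where F: Folder<Self::Item>
{
    for item in self {
        folder = folder.consume(item);
        if folder.full() {
            break;
        }
    }
    folder
}

A producer that overrides this can walk larger contiguous chunks instead, which is the point of the PR.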

@bluss (Contributor, Author) commented Dec 19, 2016

This allows parallel producers to customize the inner loop. This is a temporary measure until iterators gain this natively (one proposal is the fold_ok method).

This is marked WIP because I haven't been able to use this on UnindexedProducer parallel iterators to make a real difference.

This is because for very lightweight operations, I need to set the weight very low; but the unindexed producer has no way of communicating the weight to the splitter (?), so it always splits the producer down to the limit used by fn can_split().

I do have success using ParallelIterator with UnindexedProducer if I do something heavy enough. For example: if the lower limit for can_split is 128 elements and the operation is f64::exp it's a win.

Replacing that expensive function call with something that autovectorizes has big gains:

fn fastexp(x: f64) -> f64 {
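    // (1 + x/1024)^1024 approximates e^x, and powi is simple enough to auto-vectorize.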
    let x = 1. + x/1024.;
    x.powi(1024)
}

However, that only holds if the can_split threshold is hardcoded to a larger limit (at least around 10 000 elements, it turns out, for “fastexp”).

@bluss (Contributor, Author) commented Dec 19, 2016

As an example of potential gains, the PR enables fastexp to be autovectorized in ndarray (with the parallel traits implemented, not yet integrated into ndarray), for a case like this:

    let mut a = Array2::<f64>::zeros((M, N));
    let mut a = a.slice_mut(s![.., ..-1]);
    a.view_mut().into_par_iter().for_each(|x| *x = fastexp(*x));

The inner loop customization is needed because the array is not contiguous (due to the slicing), yet each row is contiguous, so the inner loop can still be subject to loop optimizations if we circumvent the general Iterator::next.
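
(As a rough illustration only: a row-aware producer could override fold_with along these lines, where rows() is an assumed helper yielding contiguous rows, not ndarray's actual parallel code.)

fn fold_with<F>(self, mut folder: F) -> F
    where F: Folder<Self::Item>
{
    // Walk one contiguous row at a time so the per-row loop can be
    // unrolled/vectorized, instead of going through the general
    // Iterator::next for every element of the non-contiguous view.
    for row in self.rows() {
        for elt in row {
            folder = folder.consume(elt);
        }
    }
    folder
}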

.weight() seems to have no effect here? I think a way to influence the UnindexedProducer's splitting is needed.

The WIP parallel iterator implementation is here: https://github.com/bluss/rust-ndarray/compare/par-iter...par-iter-elements?expand=1

@nikomatsakis (Member) commented

Hi @bluss -- very cool! I will attempt to take a look soon. :)

            break;
        }
    }
    folder.complete()
Member:

I think it may be more useful to leave complete out of this. At least for ChainProducer, that would let it call fold_using separately on its two parts, recursively if there are multiple levels of chains, then just let the bridge functions complete it at the end.
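
(For illustration, with complete left to the bridge, a chain-style producer could thread one folder through both halves roughly like this; left/right are made-up field names and the full() check is assumed.)

fn fold_with<F>(self, folder: F) -> F
    where F: Folder<Self::Item>
{
    // Fold the left half, then keep going into the right half with the
    // same folder; the bridge calls complete() once at the very end.
    let folder = self.left.fold_with(folder);
    if folder.full() {
        folder
    } else {
        self.right.fold_with(folder)
    }
}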

@bluss (Contributor, Author):

Neat. I know that idea :D rust-lang/rust#37315

Member:

I knew I had seen it somewhere! 😄

@cuviper (Member) commented Dec 19, 2016

I might bikeshed this to something like fold_with, not sure, but anyway I like the idea. It's also interesting that the full concept could be applied to the short-circuiting fold_ok -- very nice!

As for tuning parameters, note that we're trying to move away from weight -- see #111. This is why I didn't bother adding the cost function to the new UnindexedProducer at all. But we probably do want some kind of tunables, and I think a minimum split size would address what you're wanting here. My rough idea is that we need both a minimum (jobs should be at least this big) and a maximum (jobs should be no bigger, e.g. used to fit the working size in cache).
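
(Purely to illustrate the minimum-split-size idea; min_len below is a hypothetical tunable on a toy slice producer, not an existing rayon knob.)

struct SliceProducer<'a, T: 'a> {
    data: &'a [T],
    min_len: usize, // hypothetical tunable: smallest job we're willing to create
}

impl<'a, T> SliceProducer<'a, T> {
    fn can_split(&self) -> bool {
        // Only offer a split while both halves would stay at or above min_len.
        self.data.len() >= 2 * self.min_len
    }

    fn split(self) -> (Self, Self) {
        let mid = self.data.len() / 2;
        let (left, right) = self.data.split_at(mid);
        (SliceProducer { data: left, min_len: self.min_len },
         SliceProducer { data: right, min_len: self.min_len })
    }
}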

> so it always splits the producer down to the limit used by fn can_split().

This shouldn't always happen unless your workload is very small. The "thief" splitter works by first splitting roughly to the number of job threads, and then each time a job is stolen, that job gets split to the number of threads again. Towards the end of a task there will probably be a lot of eager stealing and thus splitting, maybe down to the minimum, but for most of a busy workload the pieces should stay relatively large.
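
(A much-simplified sketch of that behaviour, not rayon's actual splitter: the split budget starts near the thread count and is re-armed whenever the job is seen to have been stolen.)

struct Splitter {
    // Remaining number of splits this job is willing to make.
    splits: usize,
}

impl Splitter {
    fn new(num_threads: usize) -> Self {
        Splitter { splits: num_threads }
    }

    // Called before each potential split; `stolen` means another thread
    // picked this job up since the last check.
    fn try_split(&mut self, stolen: bool, num_threads: usize) -> bool {
        if stolen {
            // A thief took this job: re-arm the budget so it splits
            // toward the number of threads again.
            self.splits = num_threads;
        }
        if self.splits > 0 {
            // Each half inherits a copy of this splitter, so halving the
            // budget yields roughly `num_threads` leaf jobs per arming.
            self.splits /= 2;
            true
        } else {
            false
        }
    }
}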

Automatic tuning is hard -- if you have input on manual tuning parameters, please chime in!

@bluss changed the title from "WIP: Add method fold_using to Producer and UnindexedProducer" to "Add method fold_using to Producer and UnindexedProducer" on Dec 19, 2016
@bluss changed the title from "Add method fold_using to Producer and UnindexedProducer" to "Add method fold_with to Producer and UnindexedProducer" on Dec 19, 2016
@bluss (Contributor, Author) commented Dec 19, 2016

One important piece of tuning turns out to be (maybe not surprisingly) using an explicit configuration to use one thread per core (so not counting hyperthreading).
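
(For reference, with today's rayon API that explicit configuration looks like the snippet below; the 0.x releases discussed here spelled it differently, and the core count is just a placeholder.)

fn main() {
    // Build the global pool with one worker per physical core rather than
    // one per hardware thread (hyperthread).
    let physical_cores = 8; // placeholder: detect with e.g. the num_cpus crate
    rayon::ThreadPoolBuilder::new()
        .num_threads(physical_cores)
        .build_global()
        .expect("global thread pool already initialized");
}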

@cuviper (Member) left a comment

I'm happy with this, but I'll leave it open for @nikomatsakis to take a look.

@bluss (Contributor, Author) commented Dec 19, 2016

Sure. I'll squash this. I removed the WIP tag since it is an improvement when using a thread per core.

@bluss (Contributor, Author) commented Dec 19, 2016

I'll squash when you think it's ready, I mean.

    where F: Folder<Self::Item>,
{
    for item in self {
        folder = folder.consume(item);
Member:

At some point, we discussed whether we should have Producer: UnindexedProducer, but we didn't have a good reason for it (even though it logically makes sense). The existence of this duplicate code might provide such a reason.

@bluss (Contributor, Author) commented Dec 20, 2016

I'm new to the internals; maybe split_at can be removed and only split kept?

bridge_producer_consumer (indexed case) is this today:

            let mid = len / 2;
            let (left_producer, right_producer) = producer.split_at(mid);
            let (left_consumer, right_consumer, reducer) = consumer.split_at(mid);

It could change to something like:

            let (left_producer, right_producer) = producer.split();
            let mid = left_producer.len();
            let (left_consumer, right_consumer, reducer) = consumer.split_at(mid);

So the difference is that an indexed producer can tell its length (and that it has a meaningful length and a before/after ordering of its halves).

ndarray's unindexed producer can tell the length of each part exactly. What sets it apart is that (1) it can't split exactly at len / 2, since it must split in whole chunks of some axis' size, and (2) it doesn't want a split that yields a "before" and an "after" half; it prefers an unordered split for performance.

@bluss (Contributor, Author):

About order:

Take a 3 × 4 array

[[1, 2, 3, 4],
 [5, 6, 7, 8],
 [9, 0, 1, 2]];

The logical order of the regular .iter() is defined to be "logical" row major, meaning the elements in the order we read them: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2. The memory order of the array is independent of the logical order of .iter().

For parallelism, we guarantee no order and the unindexed producer for an array view will pick either of these two splits, depending on which preserves memory locality:

[[1, 2, | 3, 4],
 - - - - - - - - - v.split_at(Axis(0), 1) // best for row major memory order
 [5, 6, | 7, 8],
 [9, 0, | 1, 2]];
        \ v.split_at(Axis(1), 2)  // best for column major memory order
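
(A hypothetical sketch of such a locality-aware split; axis_with_best_locality, len_of and the producer-level split_at are assumed helpers, not ndarray's actual implementation.)

fn split(self) -> (Self, Self) {
    // Pick whichever axis keeps each half made of contiguous runs:
    // the row axis for row-major memory, the column axis for column-major.
    let axis = self.axis_with_best_locality();
    // Split in whole chunks of that axis rather than at exactly len / 2.
    let mid = self.len_of(axis) / 2;
    self.split_at(axis, mid)
}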

Member:

No, you need split_at in order to handle zip.

Member:

(That said, I did want to consider changing the structure so that the producer can offer a suggestion about where to split instead. But I haven't decided on the best way to do it.)

Member:

So the key distinctions are:

  • can split at an arbitrary point
    • needed for zip
  • knows exactly how many elements it will produce
    • needed for precise consumers

Today we tie those together, but they could be separated. Something like:

  • UnindexedProducer
    • can split itself at some point
  • KnownLengthProducer: UnindexedProducer
    • can tell you the number of items it will produce
  • Producer: KnownLengthProducer
    • can further be split at an arbitrary point

we might then have two bridge methods:

  • one takes a KnownLengthProducer and a Consumer
    • this is the one that today requires a Producer
    • it would instead invoke producer.split(), get resulting length, and invoke consumer.split_at(), as you suggested
  • one takes a UnindexedProducer and a UnindexedConsumer
    • same as today

We would probably have to tweak the ParallelIterator hierarchy a bit too, but the idea is that zip() would require that its sub-iterators can produce a full Producer (or, potentially, that the LHS can produce a KnownLengthProducer and the RHS a full Producer). Its split() implementation can then call left.split() and right.split_at(left_split.len()).

Make sense?
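
(Sketching that hierarchy, with trait names taken from the comment above and the signatures assumed rather than copied from rayon:)

trait UnindexedProducer: Sized {
    type Item;
    // Can split itself at some point of its own choosing.
    fn can_split(&self) -> bool;
    fn split(self) -> (Self, Self);
}

trait KnownLengthProducer: UnindexedProducer {
    // Can tell exactly how many items it will produce.
    fn len(&self) -> usize;
}

trait Producer: KnownLengthProducer {
    // Can further be split at an arbitrary index (needed for zip).
    fn split_at(self, index: usize) -> (Self, Self);
}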

@bluss (Contributor, Author) commented Dec 20, 2016

Yes, that makes sense. I'm happy with the two traits as they are now; the UnindexedProducer here didn't want to participate in zip in any way, even though it can tell its exact length, because it doesn't produce elements in any particular order.

Member:

> At some point, we discussed whether we should have Producer: UnindexedProducer, but we didn't have a good reason for it (even though it logically makes sense). The existence of this duplicate code might provide such a reason.

Yes, although IIRC we were also wanting impl<T: Producer> UnindexedProducer for T to avoid duplicating can_split/split boilerplate, but that would really want specialization. Plus that boilerplate would never actually be invoked.

In the case of just fold_with, perhaps we could make a BaseProducer common to both, so the boilerplate is a one-line empty impl in most cases. Or we could extract this body to a shared free function just to avoid the duplication. But this is only 6 lines, let's not overthink it. :)

@bluss (Contributor, Author):

I think I'd vote against more trivial traits to implement, unless they do something noticeable.

Member:

Yet another option is to add something like:

trait Folder {
    // ...
    fn consume_iter<I:IntoIterator>(self, iterator: I) -> Self {
        // ...
    }
}

... which might be useful to folks specializing fold_with too.
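
(One way such a default body might look, assuming the folder exposes the same consume/full methods and an item type Self::Item used elsewhere in this thread:)

fn consume_iter<I>(mut self, iterator: I) -> Self
    where I: IntoIterator<Item = Self::Item>
{
    for item in iterator {
        self = self.consume(item);
        if self.full() {
            // Stop early once the folder reports it doesn't need more items.
            break;
        }
    }
    self
}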

@bluss (Contributor, Author) commented Dec 20, 2016

Squashed commits together.

@nikomatsakis (Member) commented

OK, I propose we merge this as is, but experiment with the producer hierarchy a bit.

@bluss (Contributor, Author) commented Dec 21, 2016

Will this be in a point release, like 0.6.1?

@nikomatsakis (Member) commented

@bluss presumably

@nikomatsakis merged commit a2f2cc2 into rayon-rs:master on Dec 21, 2016
@bluss deleted the fold-using branch on December 21, 2016 at 17:51
@bluss (Contributor, Author) commented Dec 21, 2016

Rayon is making big strides anyway, and that's exciting.

@nikomatsakis (Member) commented

@bluss I'll push a new release ASAP, no reason to wait I guess
