
Add method fold_with to Producer and UnindexedProducer #184

Merged: 2 commits into rayon-rs:master, Dec 21, 2016

Conversation

@bluss (Contributor) commented Dec 19, 2016

This method allows the producers to customize the inner loop. This is
useful for iterators that can provide a more efficient inner loop than
the default for loop.

Example: multidimensional arrays.
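
(For context, the default added here is just a consuming loop over the producer; roughly the following, pieced together from the review snippets further down, with the full() early-exit an approximation of the break shown there.)

fn fold_with<F>(self, mut folder: F) -> F
    where F: Folder<Self::Item>
{
    for item in self {
        folder = folder.consume(item);
        if folder.full() {
            break;
        }
    }
    folder
}

A producer that overrides this can walk larger contiguous chunks instead, which is the point of the PR.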

@bluss (Contributor, Author) commented Dec 19, 2016

This allows parallel producers to customize the inner loop. This is a temporary measure until iterators gain this natively (one proposal is the fold_ok method).

This is marked WIP because I haven't been able to use this on UnindexedProducer parallel iterators to make a real difference.

This is because for very lightweight operations, I need to set the weight very low; but the unindexed producer has no way of communicating the weight to the splitter (?), so it always splits the producer down to the limit used by fn can_split().

I do have success using ParallelIterator with UnindexedProducer if I do something heavy enough. For example: if the lower limit for can_split is 128 elements and the operation is f64::exp it's a win.

Replacing that expensive function call with something that autovectorizes has big gains:

fn fastexp(x: f64) -> f64 {
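    // (1 + x/1024)^1024 approximates e^x, and powi is simple enough to auto-vectorize.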
    let x = 1. + x/1024.;
    x.powi(1024)
}

However, that only holds if the can_split threshold is hardcoded to a larger limit (at least around 10 000 elements, it turns out, for “fastexp”).

@bluss (Contributor, Author) commented Dec 19, 2016

As an example of potential gains, the PR enables fastexp to be autovectorized in ndarray (with the parallel traits implemented, not yet integrated into ndarray), for a case like this:

    let mut a = Array2::<f64>::zeros((M, N));
    let mut a = a.slice_mut(s![.., ..-1]);
    a.view_mut().into_par_iter().for_each(|x| *x = fastexp(*x));

The inner loop customization is needed because the array is not contiguous (due to the slicing), yet each row is contiguous, so the inner loop can still be subject to loop optimizations if we circumvent the general Iterator::next.
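
(As a rough illustration only: a row-aware producer could override fold_with along these lines, where rows() is an assumed helper yielding contiguous rows, not ndarray's actual parallel code.)

fn fold_with<F>(self, mut folder: F) -> F
    where F: Folder<Self::Item>
{
    // Walk one contiguous row at a time so the per-row loop can be
    // unrolled/vectorized, instead of going through the general
    // Iterator::next for every element of the non-contiguous view.
    for row in self.rows() {
        for elt in row {
            folder = folder.consume(elt);
        }
    }
    folder
}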

.weight() seems to have no effect here? I think a way to influence the UnindexedProducer's splitting is needed.

The WIP parallel iterator implementation is here: https://github.com/bluss/rust-ndarray/compare/par-iter...par-iter-elements?expand=1

@nikomatsakis (Member) commented

Hi @bluss -- very cool! I will attempt to take a look soon. :)

            break;
        }
    }
    folder.complete()
Member:

I think it may be more useful to leave complete out of this. At least for ChainProducer, that would let it call fold_using separately on its two parts, recursively if there are multiple levels of chains, then just let the bridge functions complete it at the end.
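
(For illustration, with complete left to the bridge, a chain-style producer could thread one folder through both halves roughly like this; left/right are made-up field names and the full() check is assumed.)

fn fold_with<F>(self, folder: F) -> F
    where F: Folder<Self::Item>
{
    // Fold the left half, then keep going into the right half with the
    // same folder; the bridge calls complete() once at the very end.
    let folder = self.left.fold_with(folder);
    if folder.full() {
        folder
    } else {
        self.right.fold_with(folder)
    }
}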

@bluss (Contributor, Author):

Neat. I know that idea :D rust-lang/rust#37315

Member:

I knew I had seen it somewhere! 😄

@cuviper (Member) commented Dec 19, 2016

I might bikeshed this to something like fold_with, not sure, but anyway I like the idea. It's also interesting that the full concept could be applied to the short-circuiting fold_ok -- very nice!

As for tuning parameters, note that we're trying to move away from weight -- see #111. This is why I didn't bother adding the cost function to the new UnindexedProducer at all. But we probably do want some kind of tunables, and I think a minimum split size would address what you're wanting here. My rough idea is that we need both a minimum (jobs should be at least this big) and a maximum (jobs should be no bigger, e.g. used to fit the working size in cache).
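
(Purely to illustrate the minimum-split-size idea; min_len below is a hypothetical tunable on a toy slice producer, not an existing rayon knob.)

struct SliceProducer<'a, T: 'a> {
    data: &'a [T],
    min_len: usize, // hypothetical tunable: smallest job we're willing to create
}

impl<'a, T> SliceProducer<'a, T> {
    fn can_split(&self) -> bool {
        // Only offer a split while both halves would stay at or above min_len.
        self.data.len() >= 2 * self.min_len
    }

    fn split(self) -> (Self, Self) {
        let mid = self.data.len() / 2;
        let (left, right) = self.data.split_at(mid);
        (SliceProducer { data: left, min_len: self.min_len },
         SliceProducer { data: right, min_len: self.min_len })
    }
}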

> so it always splits the producer down to the limit used by fn can_split().

This shouldn't always happen unless your workload is very small. The "thief" splitter works by first splitting roughly to the number of job threads, and then each time a job is stolen, that job gets split to the number of threads again. Towards the end of a task there will probably be a lot of eager stealing and thus splitting, maybe down to the minimum, but for most of a busy workload the pieces should stay relatively large.
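
(A much-simplified sketch of that behaviour, not rayon's actual splitter: the split budget starts near the thread count and is re-armed whenever the job is seen to have been stolen.)

struct Splitter {
    // Remaining number of splits this job is willing to make.
    splits: usize,
}

impl Splitter {
    fn new(num_threads: usize) -> Self {
        Splitter { splits: num_threads }
    }

    // Called before each potential split; `stolen` means another thread
    // picked this job up since the last check.
    fn try_split(&mut self, stolen: bool, num_threads: usize) -> bool {
        if stolen {
            // A thief took this job: re-arm the budget so it splits
            // toward the number of threads again.
            self.splits = num_threads;
        }
        if self.splits > 0 {
            // Each half inherits a copy of this splitter, so halving the
            // budget yields roughly `num_threads` leaf jobs per arming.
            self.splits /= 2;
            true
        } else {
            false
        }
    }
}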

Automatic tuning is hard -- if you have input on manual tuning parameters, please chime in!

@bluss changed the title from "WIP: Add method fold_using to Producer and UnindexedProducer" to "Add method fold_using to Producer and UnindexedProducer" on Dec 19, 2016
@bluss changed the title from "Add method fold_using to Producer and UnindexedProducer" to "Add method fold_with to Producer and UnindexedProducer" on Dec 19, 2016
@bluss (Contributor, Author) commented Dec 19, 2016

One important piece of tuning turns out to be (maybe not surprisingly) using an explicit configuration to use one thread per core (so not counting hyperthreading).
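
(For reference, with today's rayon API that explicit configuration looks like the snippet below; the 0.x releases discussed here spelled it differently, and the core count is just a placeholder.)

fn main() {
    // Build the global pool with one worker per physical core rather than
    // one per hardware thread (hyperthread).
    let physical_cores = 8; // placeholder: detect with e.g. the num_cpus crate
    rayon::ThreadPoolBuilder::new()
        .num_threads(physical_cores)
        .build_global()
        .expect("global thread pool already initialized");
}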

@cuviper (Member) left a comment

I'm happy with this, but I'll leave it open for @nikomatsakis to take a look.

@bluss (Contributor, Author) commented Dec 19, 2016

Sure. I'll squash this. I removed the WIP tag since it is an improvement when using a thread per core.

@bluss (Contributor, Author) commented Dec 19, 2016

I'll squash when you think it's ready, I mean.

    where F: Folder<Self::Item>,
{
    for item in self {
        folder = folder.consume(item);
Member:

At some point, we discussed whether we should have Producer: UnindexedProducer, but we didn't have a good reason for it (even though it logically makes sense). The existence of this duplicate code might provide such a reason.

@bluss (Contributor, Author) commented Dec 20, 2016

I'm new to the internals; maybe split_at can be removed and only split kept?

bridge_producer_consumer (indexed case) is this today:

            let mid = len / 2;
            let (left_producer, right_producer) = producer.split_at(mid);
            let (left_consumer, right_consumer, reducer) = consumer.split_at(mid);

It could change to something like:

            let (left_producer, right_producer) = producer.split();
            let mid = left_producer.len();
            let (left_consumer, right_consumer, reducer) = consumer.split_at(mid);

So the difference is that an indexed producer can tell its length (and that it has a meaningful length and a before/after ordering of its halves).

ndarray's unindexed producer can tell the length of each part exactly. What sets it apart is that (1) it can't split exactly at len / 2, since it must split in whole chunks of some axis' size, and (2) it doesn't want a split that yields a "before" and an "after" half; it prefers an unordered split for performance.

@bluss (Contributor, Author):

About order:

Take a 3 × 4 array

[[1, 2, 3, 4],
 [5, 6, 7, 8],
 [9, 0, 1, 2]];

The logical order of the regular .iter() is defined to be "logical" row major, meaning the elements in the order we read them: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2. The memory order of the array is independent of the logical order of .iter().

For parallelism, we guarantee no order and the unindexed producer for an array view will pick either of these two splits, depending on which preserves memory locality:

[[1, 2, | 3, 4],
 - - - - - - - - - v.split_at(Axis(0), 1) // best for row major memory order
 [5, 6, | 7, 8],
 [9, 0, | 1, 2]];
        \ v.split_at(Axis(1), 2)  // best for column major memory order
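
(A hypothetical sketch of such a locality-aware split; axis_with_best_locality, len_of and the producer-level split_at are assumed helpers, not ndarray's actual implementation.)

fn split(self) -> (Self, Self) {
    // Pick whichever axis keeps each half made of contiguous runs:
    // the row axis for row-major memory, the column axis for column-major.
    let axis = self.axis_with_best_locality();
    // Split in whole chunks of that axis rather than at exactly len / 2.
    let mid = self.len_of(axis) / 2;
    self.split_at(axis, mid)
}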

Member:

No, you need split_at in order to handle zip.

Member:

(That said, I did want to consider changing the structure so that the producer can offer a suggestion about where to split instead. But I haven't decided on the best way to do it.)

Member:

So the key distinctions are:

  • can split at an arbitrary point
    • needed for zip
  • knows exactly how many elements it will produce
    • needed for precise consumers

Today we tie those together, but they could be separated. Something like:

  • UnindexedProducer
    • can split itself at some point
  • KnownLengthProducer: UnindexedProducer
    • can tell you the number of items it will produce
  • Producer: KnownLengthProducer
    • can further be split at an arbitrary point

we might then have two bridge methods:

  • one takes a KnownLengthProducer and a Consumer
    • this is the one that today requires a Producer
    • it would instead invoke producer.split(), get resulting length, and invoke consumer.split_at(), as you suggested
  • one takes a UnindexedProducer and a UnindexedConsumer
    • same as today

We would probably have to tweak the ParallelIterator hierarchy a bit too, but the idea is that zip() would require that its sub-iterators can produce a full Producer (or, potentially, that the LHS can produce a KnownLengthProducer and the RHS a full Producer). Its split() implementation can then call left.split() and right.split_at(left_split.len()).

Make sense?
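
(Sketching that hierarchy, with trait names taken from the comment above and the signatures assumed rather than copied from rayon:)

trait UnindexedProducer: Sized {
    type Item;
    // Can split itself at some point of its own choosing.
    fn can_split(&self) -> bool;
    fn split(self) -> (Self, Self);
}

trait KnownLengthProducer: UnindexedProducer {
    // Can tell exactly how many items it will produce.
    fn len(&self) -> usize;
}

trait Producer: KnownLengthProducer {
    // Can further be split at an arbitrary index (needed for zip).
    fn split_at(self, index: usize) -> (Self, Self);
}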

@bluss (Contributor, Author) commented Dec 20, 2016

Yes, that makes sense. I'm happy with the two traits as they are now; the UnindexedProducer here didn't want to participate in zip in any way, even though it can tell its exact length, because it doesn't produce elements in any particular order.

Member:

> At some point, we discussed whether we should have Producer: UnindexedProducer, but we didn't have a good reason for it (even though it logically makes sense). The existence of this duplicate code might provide such a reason.

Yes, although IIRC we were also wanting impl<T: Producer> UnindexedProducer for T to avoid duplicating can_split/split boilerplate, but that would really want specialization. Plus that boilerplate would never actually be invoked.

In the case of just fold_with, perhaps we could make a BaseProducer common to both, so the boilerplate is a one-line empty impl in most cases. Or we could extract this body to a shared free function just to avoid the duplication. But this is only 6 lines, let's not overthink it. :)

@bluss (Contributor, Author):

I think I'd vote against more trivial traits to implement, unless they do something noticeable.

Member:

Yet another option is to add something like:

trait Folder {
    // ...
    fn consume_iter<I:IntoIterator>(self, iterator: I) -> Self {
        // ...
    }
}

... which might be useful to folks specializing fold_with too.
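
(One way such a default body might look, assuming the folder exposes the same consume/full methods and an item type Self::Item used elsewhere in this thread:)

fn consume_iter<I>(mut self, iterator: I) -> Self
    where I: IntoIterator<Item = Self::Item>
{
    for item in iterator {
        self = self.consume(item);
        if self.full() {
            // Stop early once the folder reports it doesn't need more items.
            break;
        }
    }
    self
}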

@bluss (Contributor, Author) commented Dec 20, 2016

Squashed commits together.

@nikomatsakis (Member) commented

OK, I propose we merge this as is, but experiment with the producer hierarchy a bit.

@bluss (Contributor, Author) commented Dec 21, 2016

Will this be in a point release, like 0.6.1?

@nikomatsakis (Member) commented

@bluss presumably

@nikomatsakis merged commit a2f2cc2 into rayon-rs:master on Dec 21, 2016
@bluss deleted the fold-using branch on December 21, 2016 at 17:51
@bluss (Contributor, Author) commented Dec 21, 2016

Rayon is making big strides anyway, and that's exciting.

@nikomatsakis (Member) commented

@bluss I'll push a new release ASAP, no reason to wait I guess
