Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParallelEnumerable.ToDictionary does not parallelize #96262

Open
madelson opened this issue Dec 22, 2023 · 8 comments · May be fixed by #100590
Open

ParallelEnumerable.ToDictionary does not parallelize #96262

madelson opened this issue Dec 22, 2023 · 8 comments · May be fixed by #100590
Labels
area-System.Linq.Parallel help wanted [up-for-grabs] Good issue for external contributors in-pr There is an active PR which will close this issue when it is merged
Milestone

Comments

@madelson
Copy link
Contributor

madelson commented Dec 22, 2023

Description

I would expect the following to be a valid pattern, but it does not lead to parallel execution of ComputeExpensiveThing():

var resultsByItem = items.AsParallel().ToDictionary(i => i, i => ComputeExpensiveThing());

Reproduction Steps

    var items = Enumerable.Range(0, 100).AsParallel();
    var stopwatch = Stopwatch.StartNew();
    
    // completes in ~10s because there is no parallelism
    var dictionary = items.ToDictionary(i => i, i => { Thread.Sleep(100); return i; });
    
    // workaround completes in ~1s
    /*var dictionary = items.Select(i =>
        {
            Thread.Sleep(100);
            return new KeyValuePair<int, int>(i, i);
        })
        .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);*/

    Console.WriteLine(stopwatch.Elapsed);

Expected behavior

From a parallelism perspective, I would expect

items.AsParallel().ToDictionary(i => ..., i => ...)

to be equivalent to:

items.AsParallel().Select(i => KeyValuePair.Create(..., ...)).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

For example, ParallelEnumerable.Min() parallelizes its selector:

Enumerable.Range(1, 10).AsParallel().Min(i => { Thread.Sleep(1000); return i; }); // completes in ~1s

Actual behavior

keySelector and elementSelector execute serially

Regression?

No response

Known Workarounds

As shown above, pushing any non-trivial selectors up one layer into a Select call gets around the issue.

Configuration

.NET 8 Windows 11 x64

Other information

ParallelEnumerable.ToDictionary has a serial loop in which the selectors are executed.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 22, 2023
@ghost
Copy link

ghost commented Dec 22, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I would expect the following to be a valid pattern, but it does not lead to parallel execution of ComputeExpensiveThing():

var resultsByItem = items.AsParallel().ToDictionary(i=> i, i => ComputeExpensiveThing());

Reproduction Steps

    var items = Enumerable.Range(0, 100).AsParallel();
    var stopwatch = Stopwatch.StartNew();
    
    // completes in ~10s because there is no parallelism
    var dictionary = items.ToDictionary(i => i, i => { Thread.Sleep(100); return i; });
    
    // workaround completes in ~1s
    /*var dictionary = items.Select(i =>
        {
            Thread.Sleep(100);
            return new KeyValuePair<int, int>(i, i);
        })
        .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);*/

    Console.WriteLine(stopwatch.Elapsed);

Expected behavior

From a parallelism perspective, I would expect

items.AsParallel().ToDictionary(i => ..., i => ...)

to be equivalent to:

items.AsParallel().Select(i => KeyValuePair.Create(..., ...)).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

For example, ParallelEnumerable.Min() parallelizes it's selector:

Enumerable.Range(1, 10).AsParallel().Min(i => { Thread.Sleep(1000); return i; }); // completes in ~1s

Actual behavior

keySelector and elementSelector execute serially

Regression?

No response

Known Workarounds

As shown above, pushing any non-trivial selectors up one layer into a Select call gets around the issue.

Configuration

.NET 8 Windows 11 x64

Other information

ParallelEnumerable.ToDictionary has a serial loop in which the selectors are executed.

Author: madelson
Assignees: -
Labels:

area-System.Collections, untriaged

Milestone: -

@ghost
Copy link

ghost commented Dec 22, 2023

Tagging subscribers to this area: @dotnet/area-system-linq-parallel
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I would expect the following to be a valid pattern, but it does not lead to parallel execution of ComputeExpensiveThing():

var resultsByItem = items.AsParallel().ToDictionary(i=> i, i => ComputeExpensiveThing());

Reproduction Steps

    var items = Enumerable.Range(0, 100).AsParallel();
    var stopwatch = Stopwatch.StartNew();
    
    // completes in ~10s because there is no parallelism
    var dictionary = items.ToDictionary(i => i, i => { Thread.Sleep(100); return i; });
    
    // workaround completes in ~1s
    /*var dictionary = items.Select(i =>
        {
            Thread.Sleep(100);
            return new KeyValuePair<int, int>(i, i);
        })
        .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);*/

    Console.WriteLine(stopwatch.Elapsed);

Expected behavior

From a parallelism perspective, I would expect

items.AsParallel().ToDictionary(i => ..., i => ...)

to be equivalent to:

items.AsParallel().Select(i => KeyValuePair.Create(..., ...)).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

For example, ParallelEnumerable.Min() parallelizes it's selector:

Enumerable.Range(1, 10).AsParallel().Min(i => { Thread.Sleep(1000); return i; }); // completes in ~1s

Actual behavior

keySelector and elementSelector execute serially

Regression?

No response

Known Workarounds

As shown above, pushing any non-trivial selectors up one layer into a Select call gets around the issue.

Configuration

.NET 8 Windows 11 x64

Other information

ParallelEnumerable.ToDictionary has a serial loop in which the selectors are executed.

Author: madelson
Assignees: -
Labels:

area-System.Linq.Parallel, untriaged

Milestone: -

@jeffhandley jeffhandley added this to the Future milestone Mar 21, 2024
@jeffhandley jeffhandley added help wanted [up-for-grabs] Good issue for external contributors and removed untriaged New issue has not been triaged by the area owner labels Mar 21, 2024
@lateapexearlyspeed lateapexearlyspeed linked a pull request Apr 3, 2024 that will close this issue
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Apr 3, 2024
@stephentoub
Copy link
Member

stephentoub commented Apr 3, 2024

If memory serves, the decision not to parallelize the selectors is because they're almost always trivial, e.g. just accessing a property from the element to select, and parallelizing it would make the 99% case slower.

What led to this issue? I realize it's of course possible for a non-trivial selector to be used, but could you highlight real-world examples of this being done with PLINQ? As is highlighted, a developer can always choose to use a Select instead if they in fact have an expensive selector.

@madelson
Copy link
Contributor Author

madelson commented Apr 4, 2024

@stephentoub A pattern I’ve found myself wanting to use fairly often with PLINQ is that I have N items and for each item I want to compute some expensive value. It’s convenient to build a dictionary that maps from the items to the values for use in downstream computations.

It’s been a while but I believe the case that prompted me to post this was related to processing large vectors. I had a set of N long vectors (represented as float arrays) and for each one I needed to compute a statistic that required comparing that vector to M other vectors in a larger set. It felt natural to PLINQ a dictionary mapping the vectors to the statistics and then process from there.

Since then, another case I ran into was implementing a generic algorithm style optimization where I had N chromosomes and for each one I needed to run a fitness function. And associate that result wi the the chromosome.

For both of these, there are obviously other ways to do it but I believe it would be preferable from a “least surprise” perspective if the various PLINQ operators were consistent in terms of parallelism.

How much overhead would consistent parallelism really add in this case (especially given that we can assume the query involves parallelism already)? I would assume that this would just tack one more selector to the chain of operations that is already running in parallel.

@stephentoub
Copy link
Member

How much overhead would consistent parallelism really add in this case (especially given that we can assume the query involves parallelism already)? I would assume that this would just tack one more selector to the chain of operations that is already running in parallel.

We could test the overhead when there's already parallel work happening and this would indeed be appending an additional item. Possibly we could allow the ToDictionary selector to be parallel in that case but not in the case where there's not already parallel work happening. But it would really come down to measurements... there's going to be significant overhead just in the AsParallel().ToDictionary(...) case if the selectors are trivial, as is often the case (I understand it's not true in your cited case).

@madelson
Copy link
Contributor Author

there's going to be significant overhead just in the AsParallel().ToDictionary(...) case if the selectors are trivial, as is often the case (I understand it's not true in your cited case).

If a user calls .AsParallel().ToDictionary(...) can't we assume that their intent is parallelism? If someone does someCollection.AsParallel().ToDictionary(trivial, trivial) and that adds overhead over not having AsParallel, would that really be so bad?

@stephentoub
Copy link
Member

There's a ton of inappropriate use of AsParallel out there, where folks add AsParallel to say "make this fast" even if parallelism is the wrong answer. Making such use way slower is indeed problematic.

@madelson
Copy link
Contributor Author

Is it at all convincing that as far as I’ve tested every other PLINQ operator does parallelize its selector (Select, Where, Sum, Min, Max, Any, All, Average, Count, etc), and does incur significant, measurable overhead on someone who just throws in AsParallel() when it isn’t needed?

I’m also curious whether there is any usage data that would inform whether the impact of penalizing folks who are misusing AsParallel().ToDictionary() would be counterbalanced by helping those who are using it “correctly” and silently not getting the parallelism they’d expect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Linq.Parallel help wanted [up-for-grabs] Good issue for external contributors in-pr There is an active PR which will close this issue when it is merged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants