This process graph has a final step that results in only 25 tasks, with a single task taking up to 1 hour.
The statistics being computed are binary and associative, so we do not necessarily need to group over the full time dimension. Avoiding that grouping could speed things up considerably.
Another option is to parallelize over the different bands, but I'm not sure how that could be implemented efficiently. Maybe by using multithreading at the executor level?
I made a first commit to parallelize computation of the statistics across multiple bands. This would be the 'quick win' solution.
There's also a more complex solution, which would involve recognizing that the statistics we are computing can be implemented as reducers at the Spark level. This would avoid grouping on the spatial key, and thus allow much more parallelization.
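To illustrate the per-band idea, here is a minimal sketch using only the Python standard library. It is not the code from the commit: the names `band_stats` and `stats_per_band`, the band names and the choice of statistics are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def band_stats(values):
    """Compute statistics over the values of a single band."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def stats_per_band(bands, max_workers=4):
    """Compute statistics for every band, submitting one band per thread-pool task.

    `bands` is a hypothetical mapping of band name -> list of values.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the band names.
        results = pool.map(band_stats, bands.values())
        return dict(zip(bands.keys(), results))


if __name__ == "__main__":
    print(stats_per_band({"B02": [1, 2, 3], "B03": [4.0, 6.0]}))
```

In pure Python, threads only help when the per-band work releases the GIL (for example inside native array code); on the JVM side of an executor that restriction does not apply.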
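The reducer idea can be sketched as follows, assuming the statistics are things like count, sum, min and max (the thread does not list them). Each value becomes a small partial aggregate, and two partials are combined with an associative `merge`, so no step ever needs all values for a key in one place. The `Stats` class, the `reduce_by_key` helper and the keys are hypothetical; the helper is a local stand-in for what a Spark reduce-by-key would do across partitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Partial aggregate that can be merged in any grouping order."""
    count: int
    total: float
    minimum: float
    maximum: float

    @staticmethod
    def of(value):
        """Wrap a single observation as a partial aggregate."""
        return Stats(1, value, value, value)

    def merge(self, other):
        """Associative combine: merge(merge(a, b), c) == merge(a, merge(b, c))."""
        return Stats(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count


def reduce_by_key(records):
    """Fold (key, value) pairs into one Stats per key, one record at a time."""
    partials = {}
    for key, value in records:
        single = Stats.of(value)
        partials[key] = partials[key].merge(single) if key in partials else single
    return partials


if __name__ == "__main__":
    records = [("B02", 1.0), ("B03", 10.0), ("B02", 3.0), ("B02", 2.0)]
    for key, stats in reduce_by_key(records).items():
        print(key, stats, stats.mean)
```

With PySpark, the same shape would presumably be `rdd.mapValues(Stats.of).reduceByKey(Stats.merge)`, which combines partials within each partition before shuffling instead of grouping all values first. Statistics such as the median do not fit this pattern.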