
Fix quadratic algorithm in merge_streams #314

Open
lars-t-hansen opened this issue Dec 14, 2023 · 4 comments
Labels
component:sonalyze sonalyze/* pri:low task:enhancement New feature or request

Comments

@lars-t-hansen
Collaborator

lars-t-hansen commented Dec 14, 2023

For the longer load reports, more than 50% of the running time is spent in merge_streams, though perf is not much help in pinpointing anything more specific than that.

Adding a little debug printing we find:

ml1.hpc.uio.no: MERGING 222 STREAMS, OUTER ITER 6936
ml8.hpc.uio.no: MERGING 9746 STREAMS, OUTER ITER 8619
ml4.hpc.uio.no: MERGING 1476 STREAMS, OUTER ITER 7461
ml9.hpc.uio.no: MERGING 61 STREAMS, OUTER ITER 8631
ml7.hpc.uio.no: MERGING 661 STREAMS, OUTER ITER 8632
ml6.hpc.uio.no: MERGING 42846 STREAMS, OUTER ITER 8631
ml3.hpc.uio.no: MERGING 629 STREAMS, OUTER ITER 8631
ml2.hpc.uio.no: MERGING 24 STREAMS, OUTER ITER 8631

That's more streams than I expected, but there is one stream per job and this is data for a full month, so in hindsight it makes sense.

The merging is a nested loop operating on the set of streams, each sorted by ascending time. The outer loop first finds the lowest timestamp among the timestamps at the heads of the streams, and then it collects and removes all records, across all the streams, with roughly that time. It repeats until all streams are empty. The number of iterations of the outer loop corresponds roughly to the number of time steps in the window of data we're looking at - 30 days with 5-minute intervals would yield exactly 8640 time steps.
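To make the shape of the problem concrete, here is a minimal sketch of the structure described above. The names and the fixed tolerance are hypothetical - the real merge_streams has much more involved selection logic - and streams are reduced to plain vectors of timestamps:

```rust
// Minimal sketch of the nested-loop merge (hypothetical names; the real
// code merges full sample records, not bare timestamps). Each stream is a
// Vec of timestamps sorted ascending; idx[i] is the read cursor into stream i.
fn merge_streams(streams: &[Vec<u64>]) -> Vec<Vec<u64>> {
    let mut idx = vec![0usize; streams.len()];
    let mut out = Vec::new();
    loop {
        // Outer loop, part 1: find the lowest head timestamp - an O(k) scan.
        let mut min_t = u64::MAX;
        for (i, s) in streams.iter().enumerate() {
            if idx[i] < s.len() {
                min_t = min_t.min(s[idx[i]]);
            }
        }
        if min_t == u64::MAX {
            break; // all streams exhausted
        }
        // Part 2: collect records from every stream whose head is "roughly"
        // min_t (here: within a fixed tolerance) and advance those cursors.
        const TOLERANCE: u64 = 150; // hypothetical, e.g. half a 5-minute step
        let mut group = Vec::new();
        for (i, s) in streams.iter().enumerate() {
            while idx[i] < s.len() && s[idx[i]] <= min_t + TOLERANCE {
                group.push(s[idx[i]]);
                idx[i] += 1;
            }
        }
        out.push(group);
    }
    out
}
```

Both parts of the outer loop touch all k streams, so the cost is O(t·k) over t time steps - and since the number of streams grows with the window size, effectively quadratic.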

The data structure is a vector of vectors for the streams, and a vector of ints for indices into those streams.

The first loop is therefore a linear search across all the vectors, but it could probably be replaced by some kind of priority queue - we're only looking for a single minimum value, yet we have to inspect the head of every stream.

The second loop is trickier because the selection logic is far from straightforward, but again, if the streams were kept sorted with the lowest leading timestamp first, we could probably pick off the streams at the head and ignore the rest instead of looping over everything. That said, sample times will be highly correlated, and in practice the second loop may pick up records from many or indeed all of the streams.
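The priority-queue idea for the first loop can be sketched with std::collections::BinaryHeap (names hypothetical; this shows only the min-extraction structure, not the grouping of roughly-equal timestamps, which is where the caveat above bites):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Sketch: keep (head timestamp, stream index) pairs in a min-heap so that
// finding the lowest timestamp is O(log k) instead of an O(k) scan.
fn merge_with_heap(streams: &[Vec<u64>]) -> Vec<u64> {
    let mut idx = vec![0usize; streams.len()];
    // Reverse turns Rust's max-heap into a min-heap.
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> = BinaryHeap::new();
    for (i, s) in streams.iter().enumerate() {
        if !s.is_empty() {
            heap.push(Reverse((s[0], i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((t, i))) = heap.pop() {
        out.push(t);
        // Advance the cursor of the stream we popped from and re-insert
        // its new head, if any.
        idx[i] += 1;
        if idx[i] < streams[i].len() {
            heap.push(Reverse((streams[i][idx[i]], i)));
        }
    }
    out
}
```

Note that if each time step really does touch most streams, the heap pays O(log k) per record touched and the win over the linear scan shrinks accordingly.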

@lars-t-hansen lars-t-hansen added task:enhancement New feature or request component:sonalyze sonalyze/* labels Dec 14, 2023
@lars-t-hansen lars-t-hansen self-assigned this Dec 14, 2023
@lars-t-hansen lars-t-hansen mentioned this issue Dec 14, 2023
@lars-t-hansen
Collaborator Author

lars-t-hansen commented Dec 14, 2023

While there are micro-optimizations that apply in the inner loops of merge_streams, the essential insight is probably that there is usually a smallish number of jobs - really O(1) - active at any one time, and the trick to performance is to manage the set of sample streams so that we only look at the ones that are running: by looking at all streams at every time step we make the algorithm O(t^2), where t is the number of time steps. If we can concentrate on the live streams we can use the naive algorithms.

The set of streams can probably be divided into three: those in the past, those currently running, and those in the future. Streams can be moved from one set to another (this is awkward because of Rust's ownership rules and because these data structures are just Vec<Box>, not Vec<Rc>), and the scans would only process the running set.
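A sketch of the three-set idea, under simplifying assumptions (streams as sorted timestamp vectors, a fixed step size, hypothetical names): streams sorted by their first timestamp serve as the "future" set, a list of indices holds the "running" set, and exhausted streams simply drop out into the "past". Using indices sidesteps the ownership awkwardness of moving the streams themselves between sets:

```rust
// Sketch of the past/running/future partition. Only running streams are
// scanned at each time step; the future set is consumed front-to-back.
fn merge_partitioned(streams: &mut Vec<Vec<u64>>, step: u64) -> Vec<Vec<u64>> {
    // Sort so streams[next_future..] is the future set; empties sort last.
    streams.sort_by_key(|s| s.first().copied().unwrap_or(u64::MAX));
    let mut idx = vec![0usize; streams.len()];
    let mut next_future = 0;
    let mut active: Vec<usize> = Vec::new(); // indices of running streams
    let mut out = Vec::new();
    let mut t = match streams.first().and_then(|s| s.first()) {
        Some(&t0) => t0,
        None => return out, // no streams, or all empty
    };
    loop {
        // Move streams whose first timestamp has arrived into the running set.
        while next_future < streams.len()
            && streams[next_future].first().map_or(true, |&t0| t0 < t + step)
        {
            if !streams[next_future].is_empty() {
                active.push(next_future);
            }
            next_future += 1;
        }
        if active.is_empty() && next_future == streams.len() {
            break; // everything has moved into the past
        }
        // Collect this time step's records from running streams only.
        let mut group = Vec::new();
        active.retain(|&i| {
            while idx[i] < streams[i].len() && streams[i][idx[i]] < t + step {
                group.push(streams[i][idx[i]]);
                idx[i] += 1;
            }
            idx[i] < streams[i].len() // exhausted streams fall into the past
        });
        if !group.is_empty() {
            out.push(group);
        }
        t += step;
    }
    out
}
```

Per time step this touches only the running streams plus any newly activated ones, so if the number of concurrently live jobs really is O(1), the whole merge is roughly O(t + k log k) for the sort rather than O(t·k).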

All attempts at implementing this efficiently enough to beat the highly optimized naive loops have so far met with failure. I think in some cases the resulting code is so complicated that it confuses Rust's bounds check optimization, but that is really TBD.

I'll test the optimizations for the naive loops, which make a fair amount of difference on their own, and then land those and leave this bug open.

@lars-t-hansen
Collaborator Author

I merged the micro-optimizations. They give us about a 25% improvement on 30 days of data and 10% on 90 days (the big-Oh strikes back); the new output was bitwise identical to the old and there didn't seem to be any reason to hold this back.

@lars-t-hansen lars-t-hansen removed their assignment Dec 18, 2023
@lars-t-hansen
Collaborator Author

I think we'll back-burner what remains here until we have more data from Fox, Saga et al. We want to know what a typical input size is. It could well be that the number of streams on the ML systems is very high because these are systems without a batch queue.

@lars-t-hansen lars-t-hansen changed the title Speed up merge_streams Fix quadratic algorithm in merge_streams Dec 18, 2023
@lars-t-hansen
Collaborator Author

Finally, some credible data from Fox. Running a 90-day report remotely with the sonalyzed cache enabled takes about 3 minutes - quite a long time, and much longer than for the ML systems (about 20s for the same report, iirc). For Betzy this will not do. We could hack around it for Betzy by partitioning in various ways - entire nodes are allocated to jobs, so if a node is involved with one job at one time it is not involved with the others - but it would be better to fix the algorithm somehow.
