Fix quadratic algorithm in merge_streams #314
While there are micro-optimizations that apply in the inner loops of merge_streams, the essential insight is probably that there is usually a smallish number of jobs - really O(1) - that are active at any one time, and the trick to performance is to manage the set of sample streams so that we only look at the ones that are running: by looking at all streams at every time step we make the algorithm O(t^2), where t is the number of time steps. If we can concentrate on the live streams we can use the naive algorithms.

The set of streams can probably be divided into three: the ones that are in the past, the ones that are running, and the ones that are in the future. Streams can be moved from one set to the other (this is awkward, both because of Rust and because these data structures are just vectors).

All attempts at implementing this efficiently enough to beat the highly optimized naive loops have so far met with failure. I think in some cases the resulting code is so complicated that it confuses Rust's bounds check optimization, but that is really TBD. I'll test the optimizations for the naive loops, which make a fair amount of difference on their own, then land those and leave this bug open.
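Roughly, the bookkeeping could look something like the following. This is a minimal sketch, not the sonarlog code: the i64 timestamps, the names, and the idea of identifying a stream by its index into the vector of streams are all assumptions. The point is only that per time step the merge would touch the running set (plus whatever gets promoted) instead of every stream.

```rust
// Sketch only: past/running/future partition of streams, with streams
// identified by index. Exhausted ("past") streams are simply dropped
// from `running` by the merge loop and need no tracking here.
use std::cmp::Reverse;
use std::collections::BinaryHeap;

pub struct Partition {
    /// Streams whose first remaining sample lies in the future,
    /// keyed by that timestamp (min-heap via `Reverse`).
    future: BinaryHeap<Reverse<(i64, usize)>>,
    /// Indices of streams with samples at or near the current time step.
    running: Vec<usize>,
}

impl Partition {
    pub fn new(stream_start_times: &[i64]) -> Partition {
        Partition {
            future: stream_start_times
                .iter()
                .enumerate()
                .map(|(ix, &t)| Reverse((t, ix)))
                .collect(),
            running: Vec::new(),
        }
    }

    /// Promote streams whose head timestamp has come due; only `running`
    /// is scanned by the merge, so the per-step cost is bounded by the
    /// number of live streams rather than the total number of streams.
    pub fn promote(&mut self, now: i64) {
        while let Some(&Reverse((t, ix))) = self.future.peek() {
            if t > now {
                break;
            }
            self.future.pop();
            self.running.push(ix);
        }
    }
}
```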
For #314 (sonarlog) - Micro-optimize merge_streams
I merged the micro-optimizations. They give us about a 25% improvement on 30 days of data and 10% on 90 days (the big-Oh strikes back); the new output was bitwise identical to the old and there didn't seem to be any reason to hold this back.
I think we'll back-burner what remains here until we have more data from Fox, Saga et al. We want to know what a typical input size is. It could well be that the number of streams on the ML systems is very high because these are systems without a batch queue.
Finally some credible data from Fox. Running a 90-day report remotely with the sonalyzed cache enabled takes about 3 minutes - quite a long time, and much longer than for the ML systems (about 20s for the same report, iirc). For Betzy this will not do. We could hack around it for Betzy by partitioning in various ways - entire nodes are allocated to jobs, so if a node is involved with one job at one time it is not involved with the others - but it would be better to fix the algorithm somehow.
For the longer load reports, more than 50% of the running time is spent in merge_streams, though perf is not terribly helpful at pinpointing more than that. Adding a little debug printing we find:
That's probably more streams than I expected, but there is one stream per job and this is from data for one month, so what was I thinking.
The merging is a nested loop operating on the set of sorted-by-ascending-time streams. The outer loop first finds the lowest timestamp among the timestamps at the heads of the streams, and then it collects all records across all the streams with roughly that time and removes them. It repeats until all streams are empty. The number of iterations of the outer loop will correspond roughly to the number of time steps in the window of data we're looking at - 30 days exactly with 5 minute intervals would yield 8640 time steps.
The data structure is a vector of vectors for the streams, and a vector of ints for indices into those streams.
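In sketch form, the current shape is roughly the following. This is a minimal sketch, not the actual merge_streams: a bare i64 timestamp stands in for the real sample record, and NEAR is an assumed tolerance for "roughly that time"; the real selection logic is more involved.

```rust
type Timestamp = i64;

/// Sketch of the naive merge: a slice of sorted streams plus a parallel
/// vector of cursor indices; both inner loops scan every stream on every
/// iteration of the outer loop.
fn merge_streams_naive(streams: &[Vec<Timestamp>]) -> Vec<Vec<Timestamp>> {
    const NEAR: Timestamp = 150; // assumed tolerance for "roughly that time"
    let mut indices = vec![0usize; streams.len()];
    let mut merged = Vec::new();
    loop {
        // First inner loop: find the lowest head timestamp across all
        // streams - O(#streams) work per time step.
        let mut min_time: Option<Timestamp> = None;
        for (s, &i) in streams.iter().zip(&indices) {
            if i < s.len() {
                min_time = Some(min_time.map_or(s[i], |m| m.min(s[i])));
            }
        }
        let Some(t0) = min_time else { break };

        // Second inner loop: collect and remove all records roughly at
        // t0, again across all streams - O(#streams) per time step.
        let mut bucket = Vec::new();
        for (s, i) in streams.iter().zip(indices.iter_mut()) {
            if *i < s.len() && s[*i] - t0 <= NEAR {
                bucket.push(s[*i]);
                *i += 1;
            }
        }
        merged.push(bucket);
    }
    merged
}
```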
The first loop is therefore a search across all the vectors, but it could probably be replaced by some kind of priority queue - we're only looking for a single value, yet we have to examine the head of every stream.
The second loop is trickier because the selection logic is far from straightforward, but again, if the streams were sorted with lowest leading timestamp first we could probably pick off the streams at the head and ignore the rest, and not loop over everything. That said, sample times will be highly correlated and in fact the second loop may pick up records from many or indeed all of the streams.
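One possible shape for that, sketched under the same assumptions as above (bare timestamps, a made-up NEAR tolerance) and not a drop-in replacement for merge_streams: keep the stream heads in a min-heap keyed by their next timestamp, so finding the lowest timestamp is a peek and the collection step only pops streams whose heads fall within the tolerance window.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

type Timestamp = i64;

/// Sketch of a heap-based merge: the heap holds (head timestamp, stream
/// index) pairs, so only streams whose heads are near the current minimum
/// are ever inspected.
fn merge_streams_heap(streams: &[Vec<Timestamp>]) -> Vec<Vec<Timestamp>> {
    const NEAR: Timestamp = 150; // assumed tolerance for "roughly that time"

    // Min-heap of (head timestamp, stream index) for all nonempty streams.
    let mut heads: BinaryHeap<Reverse<(Timestamp, usize)>> = streams
        .iter()
        .enumerate()
        .filter(|(_, s)| !s.is_empty())
        .map(|(ix, s)| Reverse((s[0], ix)))
        .collect();
    let mut indices = vec![0usize; streams.len()];
    let mut merged = Vec::new();

    while let Some(&Reverse((t0, _))) = heads.peek() {
        let mut bucket = Vec::new();
        let mut readd = Vec::new();
        // Pop every stream whose head is roughly at t0; the rest of the
        // heap is never touched.
        while let Some(&Reverse((t, ix))) = heads.peek() {
            if t - t0 > NEAR {
                break;
            }
            heads.pop();
            bucket.push(t);
            indices[ix] += 1;
            if indices[ix] < streams[ix].len() {
                // Delay re-insertion so each stream contributes at most
                // one record per time step.
                readd.push(Reverse((streams[ix][indices[ix]], ix)));
            }
        }
        heads.extend(readd);
        merged.push(bucket);
    }
    merged
}
```

The caveat above still applies: when sample times are highly correlated, nearly every live stream falls within the window at every time step, so the collection step remains linear in the number of live streams. The heap mainly removes the full scan for the minimum and lets streams that are finished or not yet started be skipped entirely.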