SCP 3619: costing for serialiseData #4480
Conversation
I've asked the Hydra people to send you some.
I had a brief look at the R code and it looks sensible!
```diff
@@ -13,14 +13,14 @@ library(broom, quietly=TRUE, warn.conflicts=FALSE)
 ## At present, times in the becnhmarking data are typically of the order of
-## 10^(-6) seconds. WE SCALE THESE UP TO MILLISECONDS because the resulting
+## 10^(-6) seconds. WE SCALE THESE UP TO MICROSECONDS because the resulting
```
👍
Would be good for Nikos to look too.
Nice! It looks from the graphs as if adding another regressor with an exponent might fit better. Not sure if we want that complexity though.
Okay, this looks good for now; we can refine it later.
Costing `serialiseData` and `equalsData` is a little tricky because we measure the size of `Data` objects using only a single number, and execution times can be very different for objects of the same size. For example, here's a plot of serialisation times:

[Plot: serialisation time against `Data` object size, with a linear regression line in red.]

The red line is a regression line obtained by standard linear regression, and it clearly underestimates the serialisation times for many inputs. This PR attempts to fit a more conservative model. We do this by discarding everything below the line and fitting another linear model to the remaining data, repeating until we get a line which lies above at least 90% of the original data (or until we've performed twenty iterations, although with the data here we only require two iterations); a sketch of the procedure is given below. We also go to some trouble to force the fitted model to have a sensible intercept, partly because our benchmark results are biased towards small values.
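Roughly, the iteration might look like the following R sketch. This is an illustration of the scheme described above, not the code in this PR: the column names `t` and `size` are assumptions, and the intercept-forcing step is omitted.

```r
## Illustrative sketch: repeatedly refit on the points above the current
## line until the line lies above at least 90% of the original data, or
## until twenty iterations have been performed.
conservative.fit <- function(frame, threshold=0.9, max.iterations=20) {
    model <- lm(t ~ size, data=frame)   # initial ordinary least-squares fit
    kept <- frame                       # the points we are still fitting to
    for (i in 1:max.iterations) {
        ## Stop once the line lies above `threshold` of ALL the original data.
        if (mean(frame$t <= predict(model, newdata=frame)) >= threshold) break
        ## Otherwise discard everything below the current line and refit.
        kept <- kept[kept$t > predict(model, newdata=kept), ]
        model <- lm(t ~ size, data=kept)
    }
    model
}
```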
Here are the results of applying this method to the benchmark figures for `serialiseData` and `equalsData`.

SerialiseData
The bound for `serialiseData` underestimates 7.8% of the datapoints; for these points (the ones above the red line), the observed value exceeds the predicted value by a factor of up to 2.9x, with a mean of 2.08x (most of this happens for small sizes: see below). The prediction exceeds the observed values in the remaining 92.2% of the data, by a factor of up to 20.3x (i.e., the ratio (predicted time)/(observed time) is 20.3); the mean overestimate is 4.68x.

The graph above is for `Data` objects of size up to about 880,000, which is quite large (and look at the times!). If we zoom in on objects of size up to 5000 we get the following graph:

[Zoomed plot: serialisation times for `Data` objects of size up to 5000.]

Here we see a series of observations for small objects heading upwards at a very steep angle (you can just about see these in the previous graph if you look closely at the bottom left corner), and these account for most of the large underpredictions. I'm not sure whether these points represent a real trend (i.e., whether we could construct larger objects which fall on the same steep line) or whether they're just some peculiarity of small data. If we increase the gradient of the red line so that it lies above most of these points then we end up with a costing function which overestimates costs for larger data by a factor of 200 or more, so we probably don't want to do that unless we really can have larger `Data` objects which behave badly; if that is the case then a better generator would give us data which would lead to a better model without having to change any R code.
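For what it's worth, the under/over-estimation figures quoted in this section can be reproduced from a fitted model with a few lines of R along these lines (again a sketch, assuming the same hypothetical `t` and `size` columns as above):

```r
## Report how often, and by how much, the fitted bound under- and
## over-estimates the observed times.
fit.diagnostics <- function(model, frame) {
    predicted <- predict(model, newdata=frame)
    under <- frame$t > predicted   # observations the bound underestimates
    cat(sprintf("Underestimated: %.1f%% of datapoints\n", 100 * mean(under)))
    cat(sprintf("Underestimates: up to %.2fx, mean %.2fx\n",
                max(frame$t[under] / predicted[under]),
                mean(frame$t[under] / predicted[under])))
    cat(sprintf("Overestimates:  up to %.2fx, mean %.2fx\n",
                max(predicted[!under] / frame$t[!under]),
                mean(predicted[!under] / frame$t[!under])))
}
```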
EqualsData

The same method produces good results for `equalsData` as well. Here's what it does for the full dataset:

[Plot: `equalsData` execution times for the full dataset, with the fitted bound in red.]

Only 1.5% of the observations lie above the line; for these points, the observed value exceeds the predicted value by a factor of up to 1.27x, with a mean of 1.11x. The prediction exceeds the observed values in the remaining 98.5% of the data, by a factor of up to 12.9x; the mean overestimate is 2.11x.
If we zoom in on the bottom left we see that we don't get the apparently atypical observations that we got for `serialiseData`, even though the benchmarks for the two functions use exactly the same inputs.

Conclusion
This method appears to give us quite accurate upper bounds on execution times for functions which have to traverse entire `Data` objects. Because of the non-homogeneous nature of `Data`, these bounds are quite conservative. Note that the inferred costs are quite expensive: for example, the costing function for `serialiseData` would charge 20.7µs for serialising an object of size 50, 40.1µs for size 100, and 309.3µs for size 1000. For `equalsData` the costs would be 2.1µs, 3.02µs, and 19.7µs. We could decrease costs by reverting to a standard linear model, but then we'd end up undercharging for some inputs. It would be useful to know what sort of `Data` objects people will be serialising in practice, and how large they are likely to be. We could also do with a better generator for `Data`: see SCP-3653.
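For concreteness, the fitted models are linear, so the final costing function has the shape cost(n) = intercept + slope·n. The sketch below uses coefficients back-calculated from the `equalsData` figures quoted above; they are not the actual cost-model constants, and the small discrepancy at size 1000 presumably comes from rounding in the quoted figures.

```r
## Hypothetical coefficients recovered from the figures above: sizes are the
## single Data size measure, times are in microseconds.
equalsData.cost <- function(n) 1.18 + 0.0184 * n
equalsData.cost(c(50, 100, 1000))   # approx. 2.10, 3.02, 19.58 (19.7 quoted)
```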