I spiked on implementing the AZ balancing feature. Herein lies the summary of the spike. I'm putting this in a repo instead of a gist so it can have collaborators.
Start and Stop Auctions should take AZ (or Cluster) into account with multi-instance apps. Instances should be well-balanced across AZs so that the High Availability/Fault Tolerance a user gets by having multiple instances is less brittle in the face of an AZ going down. In that regard, this is to improve our SLA.
- Refactor simulation so it's easy to add more tests (Auction)
- Give Reps knowledge of what "AZ" they're on, and add statistics about AZ balancing before implementing AZ balancing, so we can see the before/after improvement (Rep | Auction)
- App-Manager knows how many AZs Reps/Executors are distributed across, and communicates this number to the Auctioneer via LRPStartAuction and LRPStopAuction (App-Manager | Auctioneer | Runtime-Schema)
- Auction combines NumberOfAZs from App-Manager and AZNumber from Rep to take AZ balancing into account when computing score/bid (Auction)
- Update Inigo, see it pass (Inigo)
- Bump Deps (App-Manager | Auctioneer | Inigo | Rep)
- Update Diego-Release to pass `numAZs` to App-Manager and `AZNumber` to each rep (Diego-Release)
- BOSH-Lite-AWS deploy `cf-release/acceptance-deployed` and `diego-release/az_balance` and see CATS pass. I've done this, and have an EC2 jumpbox you can go on and run CATS (and Inigo).
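
The step that combines NumberOfAZs with each Rep's AZNumber can be pictured with a small sketch. This is a hypothetical illustration in Python, not the spike's actual scoring code: the names `az_aware_score` and `pick_winner`, the `az_weight` coefficient, the penalty formula, and the "lower score wins" convention are all assumptions. The idea shown is that a Rep's resource-based score picks up a penalty when its AZ would end up holding more than an even share of the app's instances.

```python
from collections import Counter


def az_aware_score(base_score, rep_az, instance_azs, number_of_azs, az_weight=1.0):
    """Return a bid score for placing one more instance on a Rep (lower is better).

    base_score    -- hypothetical resource-based score for the Rep
    rep_az        -- the Rep's AZNumber
    instance_azs  -- AZNumbers of the app's already-running instances
    number_of_azs -- NumberOfAZs as reported by App-Manager
    az_weight     -- assumed tunable coefficient for the AZ penalty
    """
    counts = Counter(instance_azs)
    # Even share per AZ once the new instance has been placed.
    fair_share = (len(instance_azs) + 1) / number_of_azs
    # How far above its fair share the Rep's AZ would be after placement.
    overload = max(0.0, counts[rep_az] + 1 - fair_share)
    return base_score + az_weight * overload


def pick_winner(bids, instance_azs, number_of_azs):
    """bids is a list of (rep_id, base_score, rep_az); return the winning rep_id."""
    return min(
        bids,
        key=lambda b: az_aware_score(b[1], b[2], instance_azs, number_of_azs),
    )[0]
```

For example, with two instances already in AZ 0 and two AZs in total, a Rep in AZ 1 wins over a Rep in AZ 0 even if its resource-based score is somewhat worse. A Stop Auction could apply the same penalty in reverse, preferring to stop instances in the most crowded AZ.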
- Add useful auction simulations (including some simulations that are not AZ-related)
- Add some sorely-needed variance statistics for existing variables in the simulation report
- Add statistics for some important variables not previously covered in the simulation report (including many variables that are not AZ-related)
- Make simulation report generation code more sane
- Make simulation setup code more sane
- Bring a little more consistency between treatment of StopAuctions vis-à-vis StartAuctions where it improves sanity
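
As an illustration of the kind of AZ-balance statistic a simulation report could include, the sketch below computes per-AZ instance counts and their spread. It is a hypothetical Python sketch: `az_balance_stats` and its output fields are assumptions, not the actual report code.

```python
from collections import Counter
from statistics import mean, pvariance


def az_balance_stats(instance_azs, number_of_azs):
    """Summarise how evenly an app's instances are spread across AZs.

    instance_azs  -- AZ index (0..number_of_azs-1) of each running instance
    number_of_azs -- total number of AZs, including ones with no instances
    """
    counts = Counter(instance_azs)
    # Include empty AZs so that an AZ with zero instances counts against balance.
    per_az = [counts.get(az, 0) for az in range(number_of_azs)]
    return {
        "per_az": per_az,
        "mean": mean(per_az),
        "variance": pvariance(per_az),
        "max_minus_min": max(per_az) - min(per_az),
    }
```

A perfectly balanced app yields a variance of 0, so comparing this number before and after enabling AZ balancing gives a quick read on the improvement.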
- Right now only in-process simulation reps are balanced across multiple "AZs"; the corresponding work needs to be done when `communicationMode` is something other than "in-process", e.g. NATS.
- SVG (browser) reports for simulation have not been touched. All the improved statistics are in the CLI report. SVG reports need some love.
- Merge work onto `master` (or `develop` in the case of `diego-release`).
- Pre-filter Reps out of the auction according to whether they belong to / don't belong to a given Placement Pool / Tag
  - BUT: this will be easy to add on, and it's entirely separate since it's pre-filtering, not part of the main run of the auction
- Ensure that we can always handle apps with large memory or disk requirements (i.e. ensure that we don't distribute the small granular "sand" apps so well that we have no room for the "boulders")
  - BUT: a simulation has been added to try and capture this
- Fine-tune auction parameters to fully optimize app placement behaviour
  - BUT: it's totally straightforward to tweak the coefficients for Start and Stop Auctions
- Make auction parameters configurable via the BOSH deployment manifest
  - BUT: this would be very easy to do if desired
- Speed up time between a user pushing an app and it being started, e.g. by taking into account whether a Rep already has the droplet for the app cached locally
  - BUT: performance is just as good as before after this spike, and these performance enhancements can be added totally separately later
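
To illustrate why pre-filtering by Placement Pool / Tag is separable from the auction itself, here is a hypothetical sketch in Python. The `Rep` shape, its `tags` field, and `prefilter_reps` are assumptions made up for illustration; nothing like this was implemented in the spike.

```python
from dataclasses import dataclass, field


@dataclass
class Rep:
    rep_id: str
    az_number: int
    # Assumed representation of Placement Pool / Tag membership.
    tags: frozenset = field(default_factory=frozenset)


def prefilter_reps(reps, required_tag=None, excluded_tag=None):
    """Return only the Reps eligible to bid, before any scoring happens."""
    eligible = []
    for rep in reps:
        if required_tag is not None and required_tag not in rep.tags:
            continue  # Rep does not belong to the required pool
        if excluded_tag is not None and excluded_tag in rep.tags:
            continue  # Rep belongs to a pool the app must avoid
        eligible.append(rep)
    return eligible
```

Because the filter only narrows the list of Reps handed to the auction, the scoring and AZ-balancing logic would not need to change.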