Summary of Diego AZ Balancing Spike

I spiked on implementing the AZ balancing feature. Herein lies the summary of the spike. I'm putting this in a repo instead of a gist so it can have collaborators.

  1. Main Goal
  2. Completed Tasks
  3. Other Wins
  4. Remaining Work
  5. Non-Goals

Main Goal

Start and Stop Auctions should take AZ (or Cluster) into account for multi-instance apps. Instances should be well-balanced across AZs so that the High Availability/Fault Tolerance a user gets from running multiple instances is less brittle when an AZ goes down. In that regard, this work is meant to improve our SLA.
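
To make "well-balanced" concrete, here is a minimal Python sketch (not code from the spike) that counts an app's instances per AZ and reports the fraction lost if the most heavily loaded AZ goes down. The function names and the sample placements are hypothetical.

```python
from collections import Counter


def instances_per_az(placements, number_of_azs):
    """Count instances in each AZ; placements is a list of 0-based AZ numbers."""
    counts = Counter(placements)
    return [counts.get(az, 0) for az in range(number_of_azs)]


def worst_case_loss(placements, number_of_azs):
    """Fraction of instances lost if the most heavily loaded AZ goes down."""
    if not placements:
        return 0.0
    return max(instances_per_az(placements, number_of_azs)) / len(placements)


# Hypothetical placements of a 6-instance app across 3 AZs.
balanced = [0, 1, 2, 0, 1, 2]
lopsided = [0, 0, 0, 0, 0, 1]

print(worst_case_loss(balanced, 3))
print(worst_case_loss(lopsided, 3))
```

In this made-up example the balanced placement loses a third of its instances when one AZ fails, while the lopsided one can lose five out of six.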

Completed Tasks

  • Refactor simulation so it's easy to add more tests
    Auction
  • Give Reps knowledge of what "AZ" they're on, and add statistics about AZ balancing before implementing AZ balancing, so we can see the before/after improvement
    Rep | Auction
  • App-Manager knows how many AZs Reps/Executors are distributed across, and communicates this number to the Auctioneer via LRPStartAuction and LRPStopAuction
    App-Manager | Auctioneer | Runtime-Schema
  • Auction combines NumberOfAZs from App-Manager and AZNumber from Rep to take AZ balancing into account when computing score/bid
    Auction
  • Update Inigo, see it pass
    Inigo
  • Bump Deps
    App-Manager | Auctioneer | Inigo | Rep
  • Update Diego-Release to pass numAZs to App-Manager and AZNumber to each rep
    Diego-Release
  • Deploy cf-release/acceptance-deployed and diego-release/az_balance on BOSH-Lite-AWS and see CATS pass. I've done this, and have an EC2 jumpbox you can log in to and run CATS (and Inigo).
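
As an illustration of how NumberOfAZs and AZNumber could feed into a score, here is a minimal Python sketch. It is not the formula used in the spike: the `Bid` fields, the "lower score wins" convention, the fair-share penalty, and the `az_weight` coefficient are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    rep_id: str
    az_number: int      # which AZ the Rep reports it is in (0-based)
    base_score: float   # hypothetical resource-usage score; lower is better
    app_instances: int  # instances of this app already on the Rep


def pick_winner(bids, number_of_azs, az_weight=1.0):
    """Return the rep_id with the lowest AZ-adjusted score."""
    # Count the app's instances already running in each AZ.
    per_az = [0] * number_of_azs
    for bid in bids:
        per_az[bid.az_number] += bid.app_instances
    fair_share = sum(per_az) / number_of_azs

    def adjusted(bid):
        # Penalize Reps in AZs that already hold more than their fair share.
        return bid.base_score + az_weight * (per_az[bid.az_number] - fair_share)

    return min(bids, key=adjusted).rep_id


bids = [
    Bid("rep-a", az_number=0, base_score=0.2, app_instances=2),
    Bid("rep-b", az_number=0, base_score=0.1, app_instances=1),
    Bid("rep-c", az_number=1, base_score=0.5, app_instances=0),
]
print(pick_winner(bids, number_of_azs=2))
```

Here AZ 0 already holds all three instances, so `rep-c` in AZ 1 wins despite its worse base score. With `az_weight=0.0` the AZ term vanishes and `rep-b` wins on base score alone.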

Other Wins

  • Add useful auction simulations (including some simulations that are not AZ-related)
  • Add some sorely-needed variance statistics for existing variables in the simulation report
  • Add statistics for some important variables not previously covered in the simulation report (including many variables that are not AZ-related)
  • Make simulation report generation code more sane
  • Make simulation setup code more sane
  • Bring a little more consistency between treatment of StopAuctions vis-à-vis StartAuctions where it improves sanity
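
For readers unfamiliar with the kind of variance statistic mentioned above, a small Python sketch follows. It is only an illustration: the `summarize` helper and the per-Rep instance counts are made up and do not come from the simulation report.

```python
import statistics


def summarize(counts):
    """Return mean, population variance, and standard deviation of counts."""
    return {
        "mean": statistics.mean(counts),
        "variance": statistics.pvariance(counts),
        "std_dev": statistics.pstdev(counts),
    }


# Made-up instance counts for eight Reps after a simulated auction run.
instances_per_rep = [2, 4, 4, 4, 5, 5, 7, 9]

summary = summarize(instances_per_rep)
print(summary)
```

A low variance means instances are spread evenly across Reps, which the mean alone cannot show.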

Remaining Work

  • Right now only in-process simulation reps are balanced across multiple "AZs"; the corresponding work still needs to be done when communicationMode is something other than "in-process", e.g. NATS.
  • SVG (browser) reports for simulation have not been touched. All the improved statistics are in the CLI report. SVG reports need some love.
  • Merge work onto master (or develop in the case of diego-release).

Non-Goals

  • Pre-filter Reps out of the auction according to whether they belong to / don't belong to a given Placement Pool / Tag
    BUT: this will be easy to add on, and it's entirely separate since it's pre-filtering, not part of the main run of the auction
  • Ensure that we can always handle apps with large memory or disk requirements (i.e. ensure that we don't distribute the small granular "sand" apps so well that we have no room for the "boulders")
    BUT: a simulation has been added to try and capture this
  • Fine-tune auction parameters to fully optimize app placement behaviour
    BUT: it's totally straightforward to tweak the coefficients for Start and Stop Auctions
  • Make auction parameters configurable via the BOSH deployment manifest
    BUT: this would be very easy to do if desired
  • Speed up time between a user pushing an app and it being started, e.g. by taking into account whether a Rep already has the droplet for the app cached locally
    BUT: performance is just as good as before after this spike, and these performance enhancements can be added totally separately later
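
The pre-filtering non-goal above is separate from scoring because it only decides which Reps may bid at all. A minimal Python sketch of that idea, assuming each Rep carries a set of tags (the `prefilter_reps` helper, the dictionary layout, and the tag names are hypothetical):

```python
def prefilter_reps(reps, required_tag=None):
    """Return the Reps eligible to bid, before any scoring happens."""
    if required_tag is None:
        # No placement restriction: every Rep takes part in the auction.
        return list(reps)
    return [rep for rep in reps if required_tag in rep["tags"]]


# Hypothetical Reps, each with a set of placement tags.
reps = [
    {"rep_id": "rep-a", "tags": {"gpu"}},
    {"rep_id": "rep-b", "tags": set()},
    {"rep_id": "rep-c", "tags": {"gpu", "ssd"}},
]

eligible = prefilter_reps(reps, required_tag="gpu")
print([rep["rep_id"] for rep in eligible])
```

The auction would then run unchanged on the filtered list.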
