
ENG-700 #377

Open · wants to merge 11 commits into main
Conversation

ekalosak
Contributor

  • Made the servo issue a startup event when a run gets cancelled
  • Made PrometheusConnector.startup() idempotent (see the sketch below)
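
For context, an idempotent startup might look roughly like the following sketch; the _started flag and the class body are illustrative assumptions, not the PR's actual implementation:

    class PrometheusConnectorSketch:
        # Illustrative pattern only: make startup() safe to dispatch more than once.

        def __init__(self) -> None:
            self._started = False  # hypothetical guard flag

        async def startup(self) -> None:
            if self._started:
                # A second startup dispatch (e.g. after a cancelled run) is a no-op.
                return
            self._started = True
            # ... one-time startup work: launch metric observers, background tasks, etc.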

@linear

linear bot commented Dec 14, 2021

ENG-700 Don't hang on cancellation

This issue is mainly about the HPA+ ServoX connector failing to recover after the servo processes a servo.errors.EventCancelledError. As-is, the HPA+ connector simply fails to come back up, hangs, etc.

However, because debugging this issue requires generating EventCancelledErrors via the Opsani Console, I've put a lot of miles on that little "Abort" button, and it fails more often than it works.

Abort failure replication instructions

  1. Start a run on the Opsani Console
  2. Click abort
  3. Repeat steps 1-2; within roughly 5 attempts you'll see a run fail to abort.

Replication instructions

  1. Run servo run from your project directory (see step 4.2 below)
  2. Start a run in the Opsani Console
  3. Abort an ongoing run in Opsani Console
  4. Force the servo to hang in its cancellation and not send the back half of the GOODBYE handshake
    1. This happens "for free" in the current (Dec 07 2021) ghcr.io/opsani/connector-hpaplus-pvt:main(sha256:ac1a01df4efe5c4990bbd846542355e29a94eaa837ee3b21976582569d95fe87) Docker image.

    2. You can do this locally using

      git clone git@github.com:opsani/connector-hpaplus-pvt.git
      cd connector-hpaplus-pvt
      poetry install
      poetry run servo -l DEBUG run
      
  5. Observe that the run fails to abort in the Opsani Console

Objective

Don't hang while cleaning up the HPA+ connector from the servo when a cancellation event occurs.
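
One general way to keep this cleanup from hanging indefinitely is to put a timeout around it. The sketch below is illustrative only; the cleanup() coroutine and the 30-second budget are assumptions, not this PR's code:

    import asyncio

    async def cancel_run(connectors) -> None:
        # Bound each connector's cleanup so a stuck connector cannot hang
        # the whole cancellation path or the GOODBYE handshake.
        for connector in connectors:
            try:
                await asyncio.wait_for(connector.cleanup(), timeout=30.0)
            except asyncio.TimeoutError:
                print(f"cleanup of {connector!r} timed out; continuing shutdown")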

Follow-on issues

  1. ENG-10 or ENG-9 targets the failure of OCO to abort a run ("bad state" after clicking abort)
  2. About 1 in every 2 or 3 aborts fails to send the Cancel command to the servo, leaving the backend in a "bad state".

ekalosak marked this pull request as ready for review December 14, 2021 23:01
linkous8 self-requested a review January 13, 2022 18:26
Comment on lines 224 to +228
def run_main_loop(self) -> None:
    if self._main_loop_task:
        self._main_loop_task.cancel()
    loop = asyncio.get_event_loop()
    loop.create_task(self.servo.dispatch_event(servo.Events.startup))
Collaborator

Suggested change

    def run_main_loop(self) -> None:
        if self._main_loop_task:
            self._main_loop_task.cancel()
        loop = asyncio.get_event_loop()
        loop.create_task(self.servo.dispatch_event(servo.Events.startup))

    async def run_main_loop(self) -> None:
        if self._main_loop_task:
            self._main_loop_task.cancel()
        await self.servo.startup()
        self.logger.info(
            f"Servo started with {len(self.servo.connectors)} active connectors [{self.optimizer.id} @ {self.optimizer.url or self.optimizer.base_url}]"
        )

I agree we should update the startup/shutdown lifecycle given the implementation of cancel responses. However, we need to update a few more places for the sake of completeness:

  • We're moving this startup into run_main_loop(), so delete the old call:
    await self.servo.startup()
  • await run_main_loop now that it's async
  • Shut down the servo during cancellation handling (a sketch of the combined flow follows the excerpts below):
    • servox/servo/runner.py

      Lines 373 to 404 in 6ce0cbc

      if isinstance(error, (servo.errors.UnexpectedEventError, servo.errors.EventCancelledError)):
          if isinstance(error, servo.errors.UnexpectedEventError):
              self.logger.error(
                  "servo has lost synchronization with the optimizer: restarting"
              )
          elif isinstance(error, servo.errors.EventCancelledError):
              self.logger.error(
                  "optimizer has cancelled operation in progress: cancelling and restarting loop"
              )

              # Post a status to resolve the operation
              operation = progress['operation']
              status = servo.api.Status.from_error(error)
              self.logger.error(f"Responding with {status.dict()}")
              runner = self._runner_for_servo(servo.current_servo())
              await runner._post_event(operation, status.dict())

          tasks = [
              t for t in asyncio.all_tasks() if t is not asyncio.current_task()
          ]
          self.logger.info(f"Cancelling {len(tasks)} outstanding tasks")
          [self.logger.trace(f"\t{task.get_name()}") for task in tasks]
          [task.cancel() for task in tasks]
          await asyncio.gather(*tasks, return_exceptions=True)

          self.logger.trace("Cancelled tasks:")
          [self.logger.trace(f"\t{task.get_name()} {'cancelled' if task.cancelled() else 'not cancelled'}") for task in tasks]

          # Restart a fresh main loop
          if poll:
              runner = self._runner_for_servo(servo.current_servo())
              runner.run_main_loop()
    • servox/servo/runner.py

      Lines 518 to 527 in 6ce0cbc

      # Shut down the servo runners, breaking active control loops
      if len(self.runners) == 1:
          self.logger.info(f"Shutting down servo...")
      else:
          self.logger.info(f"Shutting down {len(self.runners)} running servos...")
      for fut in asyncio.as_completed(list(map(lambda r: r.shutdown(reason=reason), self.runners)), timeout=30.0):
          try:
              await fut
          except Exception as error:
              self.logger.critical(f"Failed servo runner shutdown with error: {error}")
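
To make the points above concrete, here is a rough method-shaped sketch (not code from this PR) of how the EventCancelledError branch could shut the servo down before restarting the main loop. The helpers _runner_for_servo and _post_event come from the excerpt above; Servo.shutdown() is assumed to mirror the Servo.startup() used in the suggested change:

    import asyncio

    import servo

    async def _handle_cancellation(self, error: Exception, progress: dict, poll: bool) -> None:
        runner = self._runner_for_servo(servo.current_servo())

        # Resolve the cancelled operation with the optimizer.
        status = servo.api.Status.from_error(error)
        await runner._post_event(progress["operation"], status.dict())

        # Cancel outstanding tasks, as the excerpt above already does.
        tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
        [task.cancel() for task in tasks]
        await asyncio.gather(*tasks, return_exceptions=True)

        # Shut the servo down so connectors release resources before restarting...
        await runner.servo.shutdown()  # assumption: symmetric to servo.startup()

        # ...then let the (now async) run_main_loop() dispatch a fresh startup.
        if poll:
            await runner.run_main_loop()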

@blakewatters
Contributor

What are we doing with this? Seems pretty reasonable...

Except I don't love that each connector has to be aware of being potentially restarted and guard against it. Seems like a lifecycle teardown of the channel, or even a blanket teardown of all channels, would keep it more straightforward.

It shouldn't be the connector's responsibility to handle this state. It would have to be replicated everywhere.
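
For illustration of that idea only (the Channel and Exchange names here are toy stand-ins, not this repo's pub/sub API), a blanket teardown at the exchange level keeps the restart guard out of every individual connector:

    from typing import Dict, List

    class Channel:
        # Toy stand-in for a pub/sub channel.
        def __init__(self, name: str) -> None:
            self.name = name
            self._subscribers: List[object] = []

        def close(self) -> None:
            self._subscribers.clear()

    class Exchange:
        # Toy exchange: tearing down every channel on shutdown means no
        # connector has to guard against being restarted mid-lifecycle.
        def __init__(self) -> None:
            self._channels: Dict[str, Channel] = {}

        def create_channel(self, name: str) -> Channel:
            return self._channels.setdefault(name, Channel(name))

        def shutdown(self) -> None:
            # Blanket teardown: close and forget every channel so a later
            # startup re-creates them from scratch.
            for channel in self._channels.values():
                channel.close()
            self._channels.clear()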

@linkous8
Collaborator

Looking into the relevant internal project management, it would seem @ekalosak reached a similar conclusion: this should be handled externally. I'm still looking into whether it impacts the new startup behavior for the k8s connector.
