
ENG-700 #377

Open · wants to merge 11 commits into main
Conversation

ekalosak
Contributor

  • Made the servo issue a startup event when a run gets cancelled
  • Made PrometheusConnector.startup() idempotent (see the sketch below)
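
For context, an idempotent startup might look roughly like the following sketch; the _started flag and the class body are illustrative assumptions, not the PR's actual implementation:

    class PrometheusConnectorSketch:
        # Illustrative pattern only: make startup() safe to dispatch more than once.

        def __init__(self) -> None:
            self._started = False  # hypothetical guard flag

        async def startup(self) -> None:
            if self._started:
                # A second startup dispatch (e.g. after a cancelled run) is a no-op.
                return
            self._started = True
            # ... one-time startup work: launch metric observers, background tasks, etc.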

@linear

linear bot commented Dec 14, 2021

ENG-700 Don't hang on cancellation

This issue is mainly about the HPA+ ServoX connector failing to recover after the servo processes a servo.errors.EventCancelledError. As-is, the HPA+ connector simply fails to come back up, hangs, etc.

However, because debugging this issue requires generating EventCancelledErrors via the Opsani Console, I've put a lot of miles on that little "Abort" button, and it fails more often than it works.

Abort failure replication instructions

  1. Start a run on the Opsani Console
  2. Click abort
  3. Repeat steps 1-2; within roughly 5 attempts you'll see a run fail to abort.

Replication instructions

  1. Run servo run from your project directory (see step 4.2 below)
  2. Start a run in the Opsani Console
  3. Abort an ongoing run in Opsani Console
  4. Force the servo to hang in its cancellation and not send the back half of the GOODBYE handshake
    1. This happens "for free" in the current (Dec 07 2021) ghcr.io/opsani/connector-hpaplus-pvt:main(sha256:ac1a01df4efe5c4990bbd846542355e29a94eaa837ee3b21976582569d95fe87) Docker image.

    2. You can do this locally using

      git clone git@github.com:opsani/connector-hpaplus-pvt.git
      cd connector-hpaplus-pvt
      poetry install
      poetry run servo -l DEBUG run
      
  5. Observe that the run fails to abort in the Opsani Console

Objective

Don't hang while cleaning up the HPA+ connector from the servo when a cancellation event occurs.
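
One general way to keep this cleanup from hanging indefinitely is to put a timeout around it. The sketch below is illustrative only; the cleanup() coroutine and the 30-second budget are assumptions, not this PR's code:

    import asyncio

    async def cancel_run(connectors) -> None:
        # Bound each connector's cleanup so a stuck connector cannot hang
        # the whole cancellation path or the GOODBYE handshake.
        for connector in connectors:
            try:
                await asyncio.wait_for(connector.cleanup(), timeout=30.0)
            except asyncio.TimeoutError:
                print(f"cleanup of {connector!r} timed out; continuing shutdown")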

Follow-on issues

  1. ENG-10 or ENG-9 targets the failure of OCO to abort a run ("bad state" after clicking abort)
  2. About 1 in every 2 or 3 aborts fails to send the Cancel command to the servo, leaving the backend in a "bad state".

ekalosak marked this pull request as ready for review December 14, 2021 23:01
linkous8 self-requested a review January 13, 2022 18:26
Comment on lines 224 to +228
def run_main_loop(self) -> None:
    if self._main_loop_task:
        self._main_loop_task.cancel()
    loop = asyncio.get_event_loop()
    loop.create_task(self.servo.dispatch_event(servo.Events.startup))
Collaborator

Suggested change

    def run_main_loop(self) -> None:
        if self._main_loop_task:
            self._main_loop_task.cancel()
        loop = asyncio.get_event_loop()
        loop.create_task(self.servo.dispatch_event(servo.Events.startup))

    async def run_main_loop(self) -> None:
        if self._main_loop_task:
            self._main_loop_task.cancel()
        await self.servo.startup()
        self.logger.info(
            f"Servo started with {len(self.servo.connectors)} active connectors [{self.optimizer.id} @ {self.optimizer.url or self.optimizer.base_url}]"
        )

I agree we should update the startup/shutdown lifecycle given the implementation of cancel responses. However, we need to update a few more places for the sake of completeness:

  • We're moving this startup into run_main_loop(), so delete the old call:
    await self.servo.startup()
  • await run_main_loop now that it's async
  • Shut down the servo during cancellation handling (a sketch of the combined flow follows the excerpts below):
    • servox/servo/runner.py

      Lines 373 to 404 in 6ce0cbc

      if isinstance(error, (servo.errors.UnexpectedEventError, servo.errors.EventCancelledError)):
          if isinstance(error, servo.errors.UnexpectedEventError):
              self.logger.error(
                  "servo has lost synchronization with the optimizer: restarting"
              )
          elif isinstance(error, servo.errors.EventCancelledError):
              self.logger.error(
                  "optimizer has cancelled operation in progress: cancelling and restarting loop"
              )

              # Post a status to resolve the operation
              operation = progress['operation']
              status = servo.api.Status.from_error(error)
              self.logger.error(f"Responding with {status.dict()}")
              runner = self._runner_for_servo(servo.current_servo())
              await runner._post_event(operation, status.dict())

          tasks = [
              t for t in asyncio.all_tasks() if t is not asyncio.current_task()
          ]
          self.logger.info(f"Cancelling {len(tasks)} outstanding tasks")
          [self.logger.trace(f"\t{task.get_name()}") for task in tasks]
          [task.cancel() for task in tasks]
          await asyncio.gather(*tasks, return_exceptions=True)

          self.logger.trace("Cancelled tasks:")
          [self.logger.trace(f"\t{task.get_name()} {'cancelled' if task.cancelled() else 'not cancelled'}") for task in tasks]

          # Restart a fresh main loop
          if poll:
              runner = self._runner_for_servo(servo.current_servo())
              runner.run_main_loop()
    • servox/servo/runner.py

      Lines 518 to 527 in 6ce0cbc

      # Shut down the servo runners, breaking active control loops
      if len(self.runners) == 1:
          self.logger.info(f"Shutting down servo...")
      else:
          self.logger.info(f"Shutting down {len(self.runners)} running servos...")
      for fut in asyncio.as_completed(list(map(lambda r: r.shutdown(reason=reason), self.runners)), timeout=30.0):
          try:
              await fut
          except Exception as error:
              self.logger.critical(f"Failed servo runner shutdown with error: {error}")
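
To make the points above concrete, here is a rough method-shaped sketch (not code from this PR) of how the EventCancelledError branch could shut the servo down before restarting the main loop. The helpers _runner_for_servo and _post_event come from the excerpt above; Servo.shutdown() is assumed to mirror the Servo.startup() used in the suggested change:

    import asyncio

    import servo

    async def _handle_cancellation(self, error: Exception, progress: dict, poll: bool) -> None:
        runner = self._runner_for_servo(servo.current_servo())

        # Resolve the cancelled operation with the optimizer.
        status = servo.api.Status.from_error(error)
        await runner._post_event(progress["operation"], status.dict())

        # Cancel outstanding tasks, as the excerpt above already does.
        tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
        [task.cancel() for task in tasks]
        await asyncio.gather(*tasks, return_exceptions=True)

        # Shut the servo down so connectors release resources before restarting...
        await runner.servo.shutdown()  # assumption: symmetric to servo.startup()

        # ...then let the (now async) run_main_loop() dispatch a fresh startup.
        if poll:
            await runner.run_main_loop()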

@blakewatters
Contributor

What are we doing with this? Seems pretty reasonable...

Except I don't love that each connector has to be aware of being potentially restarted and guard against it. Seems like a lifecycle teardown of the channel, or even a blanket teardown of all channels, would keep it more straightforward.

It shouldn't be the connector's responsibility to handle this state. It would have to be replicated everywhere.
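
For illustration of that idea only (the Channel and Exchange names here are toy stand-ins, not this repo's pub/sub API), a blanket teardown at the exchange level keeps the restart guard out of every individual connector:

    from typing import Dict, List

    class Channel:
        # Toy stand-in for a pub/sub channel.
        def __init__(self, name: str) -> None:
            self.name = name
            self._subscribers: List[object] = []

        def close(self) -> None:
            self._subscribers.clear()

    class Exchange:
        # Toy exchange: tearing down every channel on shutdown means no
        # connector has to guard against being restarted mid-lifecycle.
        def __init__(self) -> None:
            self._channels: Dict[str, Channel] = {}

        def create_channel(self, name: str) -> Channel:
            return self._channels.setdefault(name, Channel(name))

        def shutdown(self) -> None:
            # Blanket teardown: close and forget every channel so a later
            # startup re-creates them from scratch.
            for channel in self._channels.values():
                channel.close()
            self._channels.clear()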

@linkous8
Collaborator

Looking into the relevant internal project management, it would seem @ekalosak reached a similar conclusion: this should be handled externally. I'm still looking into whether it impacts the new startup behavior for the k8s connector.
