cloud environment may restart workflows on deploy #10472

Closed
jrhizor opened this issue Feb 18, 2022 · 5 comments
Assignees
Labels
area/platform issues related to the platform priority/high High priority team/compose team/platform-move type/bug Something isn't working

Comments

@jrhizor
Contributor

jrhizor commented Feb 18, 2022

Despite our tests for exactly this case in OSS, we are seeing workflows restart.

https://airbytehq-team.slack.com/archives/C02TXQ020QM/p1645216105079419 for an example.

I need to dig in and see exactly what is failing here and why the tests are not representative. Maybe the Helm deploy mechanism terminates pods differently than the scale down in the tests.

@jrhizor jrhizor added type/bug Something isn't working priority/high High priority area/platform issues related to the platform labels Feb 18, 2022
@jrhizor jrhizor self-assigned this Feb 18, 2022
@jrhizor
Contributor Author

jrhizor commented Feb 18, 2022

@cgardens fyi

@cgardens
Contributor

@jrhizor thanks for the heads up. agreed that this is a high prio thing that needs to be figured out by the launch.

@jrhizor
Contributor Author

jrhizor commented Feb 23, 2022

This morning I figured out why each of the test cases behaves the way we observed:

| Test Case | Passes (prod) | Passes (new) | Why? |
| --- | --- | --- | --- |
| KILL_ONLY_NON_SYNC | Yes | Yes | This is the easiest to handle, and what we were testing previously. If the sync workflow isn't terminated the connection manager resumes as expected. |
| KILL_BOTH_SAME_TIME | No | Yes | Here the manager is still killed slightly before the other, which means that depending on the sync retry policy it can "reconnect" to the sync workflow. |
| KILL_SYNC_FIRST | No | Yes | Here the sync is killed, but the manager doesn't detect it with the new configuration because it retries within the timeouts. |
| KILL_NON_SYNC_FIRST | No | Yes | The same thing happens in the opposite direction, it retries but within timeouts. |
| KILL_ONLY_SYNC | No | No | When you kill the sync only, there is no timeout-based grace period. Currently working on how to fix. |
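
For anyone tracking which scenarios still need work, the results above can be written down as a small lookup. This is only an illustrative sketch in Python, not the actual test code: the `KillScenario` enum, the `RESULTS` mapping, and both helper functions are hypothetical names, and the pass/fail values are copied from the table.

```python
from enum import Enum, auto


class KillScenario(Enum):
    """Hypothetical enum mirroring the test case names in the table."""
    KILL_ONLY_NON_SYNC = auto()
    KILL_BOTH_SAME_TIME = auto()
    KILL_SYNC_FIRST = auto()
    KILL_NON_SYNC_FIRST = auto()
    KILL_ONLY_SYNC = auto()


# (passes on prod config, passes on new config), copied from the table.
RESULTS = {
    KillScenario.KILL_ONLY_NON_SYNC: (True, True),
    KillScenario.KILL_BOTH_SAME_TIME: (False, True),
    KillScenario.KILL_SYNC_FIRST: (False, True),
    KillScenario.KILL_NON_SYNC_FIRST: (False, True),
    KillScenario.KILL_ONLY_SYNC: (False, False),
}


def fixed_by_new_config(results):
    """Scenarios that fail on prod but pass with the new configuration."""
    return [s.name for s, (prod, new) in results.items() if new and not prod]


def still_failing(results):
    """Scenarios that fail even with the new configuration."""
    return [s.name for s, (_, new) in results.items() if not new]


if __name__ == "__main__":
    print("fixed by new config:", fixed_by_new_config(RESULTS))
    print("still failing:", still_failing(RESULTS))
```

Read this way, the new configuration covers the three ordering-dependent cases, and `KILL_ONLY_SYNC` is the one remaining gap.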

@jrhizor
Contributor Author

jrhizor commented Feb 24, 2022

Updates are on #10565. It's pretty well tested at this point. Will do the release in a few hours and close this.

@jrhizor
Contributor Author

jrhizor commented Feb 25, 2022

Verified with https://cloud.airbyte.io/workspaces/6a9fca7e-8af0-40b6-95ea-b23efb3245b0/connections/00dfafc1-83d1-4ae1-b4cc-a4615e25cacd/status (a failure but the sync continued properly throughout).
