Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable leader election in ws-manager-mk2 (v3) #18539

Merged
merged 17 commits into from
Aug 26, 2023
Merged

Enable leader election in ws-manager-mk2 (v3) #18539

merged 17 commits into from
Aug 26, 2023

Conversation

aledbf
Copy link
Member

@aledbf aledbf commented Aug 17, 2023

Description

Reverts #18537

Change:

  • The previous revert was related to the disabled maintenance mode not detected by the stand-by replica.
  • Now, the issue is related to grpc request/s being processed by the stand-by replica and storing the state in memory.
  • The fix is to not bind the Subscribers helper to the grpc server instance connected to; instead, an informer is the source of truth.

Here we remove the activity and switch the state to the workspace status.

Summary generated by Copilot

🤖 Generated by Copilot at 27cc1a0

Refactor and improve the ws-manager-mk2 component, update its dependencies and tests, and add a new field LastActivity to the WorkspaceStatus struct.

Related Issue(s)

Fixes #

How to test

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • with-integration-tests=workspace
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

@aledbf
Copy link
Member Author

aledbf commented Aug 17, 2023

@aledbf
Copy link
Member Author

aledbf commented Aug 17, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5892091350

  • recreate_vm: true

@Furisto
Copy link
Member

Furisto commented Aug 17, 2023

The manifest for the CRD has not been updated. Run make manifests.

@WVerlaek
Copy link
Member

Integration tests https://github.com/gitpod-io/gitpod/actions/runs/5891737721/job/15979818090

This actually used a build from main (main-gha.15563) 😅, you can either specify the installer version from this PR or tick the checkbox in the PR description to run the tests from there instead

@aledbf
Copy link
Member Author

aledbf commented Aug 17, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5895723228

  • recreate_vm: true

@aledbf
Copy link
Member Author

aledbf commented Aug 18, 2023

The manifest for the CRD has not been updated. Run make manifests.

done

@aledbf
Copy link
Member Author

aledbf commented Aug 18, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5902316472

  • recreate_vm: true

@aledbf aledbf force-pushed the aledbf/leader-v3 branch 2 times, most recently from 36c751a to c73dc1d Compare August 18, 2023 20:14
@aledbf
Copy link
Member Author

aledbf commented Aug 19, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5912178632

  • recreate_vm: true

@aledbf aledbf force-pushed the aledbf/leader-v3 branch 4 times, most recently from db24b0f to cd14a5b Compare August 21, 2023 12:55
@aledbf aledbf marked this pull request as ready for review August 21, 2023 12:55
@csweichel
Copy link
Contributor

Change lgtm with some minor comments.
Where can I find the benchmarks/measurements for the additional load imposed by moving the activity to the CR?

wsMetrics *controllerMetrics
ctx context.Context
cancel context.CancelFunc
wsMetrics *controllerMetrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do the ws-manager metrics behave when two replicas are enabled? Do both report metrics, or only one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics will be scrapped from all available replicas

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will break some of the dashboards then (and maybe alerts too). For instance we sum all workspaces by cluster, this will then double the amount of reported workspaces

Copy link
Member Author

@aledbf aledbf Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check this gist (integration tests are running there)

Only the leader will have metrics related to the controller. If the current leader is not elected anymore, the pod is restarted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the gists. Both replicas report workspace metrics, e.g.:

gitpod_ws_manager_mk2_workspace_phase_total{class="g1-standard",phase="Running",type="Regular"} 1

is in both. Summing them as done in dashboards/alerts would give 2 replicas, while there's only 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gist updated after running the integration test

components/ws-manager-mk2/service/manager.go Outdated Show resolved Hide resolved
@WVerlaek
Copy link
Member

Where can I find the benchmarks/measurements for the additional load imposed by moving the activity to the CR?

I'd also like to see some benchmarks, as both a new replica is added watching for events, and the large increase in CR activity.

I don't think though that loadgen will be able to simulate the additional activity as the workspaces aren't marked as active. How would we test that?

@roboquat roboquat merged commit 687f337 into main Aug 26, 2023
16 checks passed
@roboquat roboquat deleted the aledbf/leader-v3 branch August 26, 2023 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants