Cloud migration April 2024 #195

Merged

Conversation

@sanjaysrikakulam (Member) commented Apr 26, 2024

This PR:

  1. adds the new state file based on the deployments in the new cloud
  2. disables the outputs of the upload instance, as that instance is not currently deployed in the new cloud
  3. adds new security groups, updates the existing ones, and migrates whatever was still in use in the old cloud
  4. removes empty files
  5. temporarily comments out/disables some user VMs, as they are still running in the old cloud for user data backup. Once the users complete their backups, we can remove the VMs in the old cloud, uncomment them here, and re-provision them.
  6. temporarily disables the beacon resources, as they are still running in the old cloud and need to be migrated. They are commented out, so the VMs won't be spawned in the new cloud until they are ready for migration.
  7. removes some of the user VMs. These exist only in the old cloud, and the users have confirmed that their VMs can be removed, so we do not need their config files in the new cloud.
  8. creates a new plausible Terraform config file that uses a snapshot as the image for the VM (a minimal sketch of this pattern follows the list). The snapshot was created in the old cloud and moved to the new cloud. Similarly, the DNS resource config was moved out of the dns.tf file.
  9. disables/comments out both the mq and the upload instance resources, as they will likely be moved to KVM. These VMs are still running in the old cloud.
  10. removes unused DNS CNAME records from the maintenance host
  11. disables/comments out the bronze and silver workers, as they are no longer needed. The silver worker in the old cloud is still running for the time being, and since the new cloud only supports V3 block storage, the block storage resource in the gold worker TF file was updated.
  12. moves the FTP DNS record from the dns.tf file into the instance TF file itself
  13. for dokku, influxdb, and stats, uses snapshots from the old cloud as the images and reattaches the same volumes from the old cloud. Because the existing volumes are reattached, the block volume creation was commented out; the new cloud only supports V3 block volumes from now on (see the volume-attachment sketch in the Volumes section below).
  14. for CVMFS stratum 0 and 1, likewise uses snapshots from the old cloud as the images and reattaches the same volumes, with block volume creation commented out for the same reason
  15. changes the flavor for celery. Bjoern suggested that this should be fine, as celery is more IO-bound than CPU/memory-bound.
  16. creates a new apollo Terraform config file that uses the snapshot from the old cloud as the image of the VM in the new cloud
  17. adds allow_overwrite to some DNS resources and moves the plausible, apollo, and FTP DNS resources into their own compute instance files
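
A minimal Terraform sketch of the snapshot-as-image and per-instance DNS pattern described in items 8, 16, and 17. All resource names, the image/flavor/network names, and the Route53-style DNS resource are assumptions for illustration; the actual values and the DNS provider used in this repo may differ.

resource "openstack_compute_instance_v2" "plausible" {
  name            = "plausible"
  image_name      = "plausible-snapshot-2024-04"   # snapshot uploaded from the old cloud
  flavor_name     = "m1.large"                     # placeholder flavor
  key_pair        = "cloud_keypair"                # placeholder key pair
  security_groups = ["default", "public-web"]      # placeholder security groups

  network {
    name = "public"                                # placeholder network
  }
}

# DNS record kept next to the instance instead of in dns.tf;
# allow_overwrite lets Terraform take over a record that already exists in the zone.
resource "aws_route53_record" "plausible" {
  zone_id         = "ZXXXXXXXXXXXXXX"              # placeholder hosted zone ID
  name            = "plausible.galaxyproject.eu"   # placeholder record name
  type            = "A"
  ttl             = 600
  records         = [openstack_compute_instance_v2.plausible.access_ip_v4]
  allow_overwrite = true
}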

Snapshots:
I created snapshots of the VMs, downloaded them, and uploaded them to the new cloud. Once the upload was done, a property (--property hw_video_model=cirrus) was added to each image; due to some config changes in the new cloud, this property is required for all images uploaded there. Manuel said he would investigate this.

How to add a property to an uploaded image

openstack image set --property hw_video_model=cirrus <image_name>

Volumes:

  1. Manually detach the volume from the VM in the old cloud
  2. Create a snapshot if needed
  3. Attach the same volume to the VM spawned in the new cloud (set the device path in the TF file, comment out the block volume resource creation, and keep in mind that only block volume V3 is supported in the new cloud; a sketch follows this list)
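
A minimal sketch of step 3 in Terraform, assuming the instance resource is defined in the same file; the resource names, volume ID, and device path are placeholders.

# The block volume resource stays commented out, because the existing volume
# from the old cloud is reattached instead of being created:
#
# resource "openstack_blockstorage_volume_v3" "influxdb_data" {
#   name = "influxdb-data"
#   size = 500
# }

resource "openstack_compute_volume_attach_v2" "influxdb_data" {
  instance_id = openstack_compute_instance_v2.influxdb.id
  volume_id   = "00000000-0000-0000-0000-000000000000"  # pre-existing volume from the old cloud
  device      = "/dev/vdb"                               # device path set explicitly in the TF file
}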

OS Images:
I downloaded the OS images currently in use from the old cloud, uploaded them to the new cloud, and set the property mentioned above on the images.
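
The uploads were done manually with the openstack CLI, but for illustration the same upload plus the required property could also be expressed in Terraform; the image name, file path, and formats below are assumptions.

resource "openstack_images_image_v2" "rockylinux9" {
  name             = "rockylinux-9-x86_64"          # placeholder image name
  local_file_path  = "/tmp/rockylinux-9.qcow2"      # image downloaded from the old cloud
  container_format = "bare"
  disk_format      = "qcow2"

  properties = {
    hw_video_model = "cirrus"                       # property required for all images in the new cloud
  }
}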

I encountered various issues while snapshotting and reattaching volumes, but I am not sure what, or how, to document from that troubleshooting.

Ref: https://github.com/usegalaxy-eu/issues/issues/533

@sanjaysrikakulam (Member, Author) commented:

I have now added all the deployments I have done manually so far to this PR to keep track of everything.

More changes are yet to come as we progress with the migration.

@sanjaysrikakulam (Member, Author) commented:

New cloud credentials (clouds.yaml) were created and added to the Jenkins credentials store. The Jenkins projects need to be reconfigured to use the new credentials; they are currently disabled because all the changes are custom and are being deployed locally.
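
For reference, a minimal sketch of how a Terraform provider block can select the new cloud's entry from the clouds.yaml by name; the entry name is an assumption.

provider "openstack" {
  cloud = "freiburg_galaxy"   # assumed clouds.yaml entry name; can also be set via OS_CLOUD
}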

@sanjaysrikakulam (Member, Author) commented May 21, 2024

I have added an openrc credential (new cloud; user: freiburg_galaxy (service user account)) to our Jenkins and reconfigured the infrastructure, infrastructure_pr, vgcn-generic-internal, vgcn-worker-gpu-internal, and vgcn-workers-internal Jenkins projects to use the new cloud credentials.

@sanjaysrikakulam marked this pull request as ready for review May 21, 2024 13:12
@sanjaysrikakulam (Member, Author) commented:

Does anyone want to have a look at this PR?

@mira-miracoli (Contributor) left a comment:

I haven't fully checked the terraform.tfstate, but I can if you think it should be checked.
The "ingress-from-proxy" secgroup was for flower, but we now access flower via Tailscale, which is more secure and means we don't have to manage login credentials.

Otherwise everything looks fine to me. Thank you, this looks like a lot of work!

Inline review threads (resolved): instance_core_stats.tf, secgroup_ingress-from-proxy.tf, secgroup_ufr-ingress.tf, secgroup_interactive_egress.tf
@sanjaysrikakulam (Member, Author) commented:

Thank you, @mira-miracoli, for reviewing this. :)

@sanjaysrikakulam merged commit fac482e into usegalaxy-eu:main on May 22, 2024
1 check passed