
feat: add docker autoscaler executor #1118

Merged
merged 35 commits into cattle-ops:main on Aug 3, 2024

Conversation

@mmoutama09
Contributor

mmoutama09 commented Apr 23, 2024

Description

Provides a new executor based on the new GitLab autoscaler. I have only used the fleeting plugin for AWS.

Prerequisite: Docker must already be installed on the AMI used by the worker machines (the Docker autoscaler does not install it, unlike the docker+machine executor). Additionally, the user used to connect to the workers must be added to the docker group.

Related to issue #624

Verification

Built an AMI with Docker based on Amazon Linux 2023. Set up the new executor according to the example. Works!
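For anyone reproducing this, here is a hedged Terraform sketch of the kind of module call used for the test. The module source and variable names are illustrative assumptions (later commits in this PR also split the docker-autoscaler variables from the docker+machine ones); the exact interface is documented in usage.md.

module "gitlab_runner" {
  # assumed registry address of this module; pin the version you actually use
  source = "cattle-ops/gitlab-runner/aws"

  # required networking and registration variables (VPC, subnets, GitLab token, ...) omitted

  # hypothetical executor selection; the variable name may differ in the merged module
  runner_worker = {
    type = "docker-autoscaler"
  }

  # the worker AMI must already contain Docker, with the connect user in the docker group
  runner_worker_docker_machine_ami_filter = {
    name = ["my-al2023-docker-*"] # placeholder for a custom Docker-enabled AMI
  }
  runner_worker_docker_machine_ami_owners = ["self"]
}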

Contributor

Hey @mmoutama09! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

  • the problem being solved
  • the best way a reviewer and you can test your changes

By submitting this PR, you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE.

The following ChatOps commands are supported:

  • /help: notifies a maintainer to help you out

Simply add a comment with the command on the first line. If you need to pass more information, separate it from the command with a blank line.

This message was generated automatically. You are welcome to improve it.

@Tiduster

Hi @kayman-mk.
I am a colleague of @mmoutama09 and @Kadeux.

This change could be the next major release of the module.

GitLab is still on track to make their plugin GA this summer: https://gitlab.com/groups/gitlab-org/-/epics/6995

We are still NOT using this version in our production setup, but we will deploy it on part of our runners in June.

What should be the next steps for this PR?

Best regards,

@kayman-mk
Collaborator

kayman-mk commented May 23, 2024

Sounds quite promising to get rid of the outdated docker machine. As soon as GitLab has published their module, we can integrate it here.

As far as I can see the docker machine can still be used, so we can create a feature release. Before the next major release I will check if we can get rid of docker machine to simplify the code.

Could you please post the settings to test this change?

At the moment I am working on #1117. That change will be merged first, as it supports zero downtime during deployment of a new version.

@Tiduster

Thanks @kayman-mk for your answer.

I was not aware of this zero-downtime PR; very interesting, we can test it in our environment as well.
I will take a look at it and give feedback if I find something interesting.

@kayman-mk
Collaborator

@Tiduster Could you please post a minimal configuration showing which AMIs to use to get this up and running?

mmoutama09 force-pushed the add_gitlab_docker_autoscaler branch from ea41c66 to 0226a05 on June 3, 2024
mmoutama09 force-pushed the add_gitlab_docker_autoscaler branch from 0226a05 to 9a967f6 on June 5, 2024
@kayman-mk
Collaborator

kayman-mk commented Jun 6, 2024

Just tried it, but with no success. The Runner is up and working, but when a job is processed, GitLab shows

Running with gitlab-runner 16.4.2 (e77af703)
  on prod-gitlab-ci-runner-test-Gitlab-Runner-TEST-A PsqsZYpLQ, system ID: s_0a07de49d04b
Resolving secrets 00:00
Preparing the "docker-autoscaler" executor 00:50
Dialing instance i-05caf9a7284ccaxxx...
Instance i-05caf9a7284ccaxxx connected
ERROR: Failed to remove network for build
ERROR: Preparation failed: error during connect: Get "http://internel.tunnel.invalid/v1.24/info": dialing environment connection: ssh: rejected: connect failed (open failed) (docker.go:826:0s)

The Runner logs in CloudWatch show

{
    "external-address": "",
    "instance-id": "i-05caf9a7284ccaxxx",
    "internal-address": "100.64.30.16",
    "job": 5314382,
    "level": "info",
    "msg": "Dialing instance",
    "project": 987,
    "runner": "PsqsZYpLQ",
    "time": "2024-06-06T07:40:32Z",
    "use-external-address": true
}
{"error":"networksManager is undefined","job":5314382,"level":"error","msg":"Failed to remove network for build","network":"","project":987,"runner":"PsqsZYpLQ","time":"2024-06-06T07:40:50Z"}

The first error seems to be related to use_external_addr = true in the config. Changed to false.

And I noticed that Docker was not installed on the Runner and the ubuntu user was not part of the docker group. After fixing that, my job was executed. The AMI is ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240531, but I have no idea how to change it.

@kayman-mk
Collaborator

kayman-mk commented Jun 6, 2024

  • We mustn't use runner_worker_docker_machine_instance for configuration as it is tied to the docker+machine executor

@mmoutama09
Contributor Author

@kayman-mk The installation of Docker is indeed now mandatory; I've mentioned it in the usage.md file (along with adding the user to the docker group).
I reused all the variables from runner_worker_docker_machine_instance to avoid duplicating multiple variables. However, we could do it differently if we don't mind having multiple conditionals in a locals block to determine which variable should be used. What do you think?
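For illustration, the alternative being discussed could look roughly like the locals block below. All variable names here are made up for the example and are not the module's actual interface.

locals {
  # illustrative only: pick the source variable depending on the executor type
  worker_ami_filter = (
    var.runner_worker.type == "docker-autoscaler"
    ? var.runner_worker_docker_autoscaler_ami_filter
    : var.runner_worker_docker_machine_ami_filter
  )

  worker_ami_owners = (
    var.runner_worker.type == "docker-autoscaler"
    ? var.runner_worker_docker_autoscaler_ami_owners
    : var.runner_worker_docker_machine_ami_owners
  )
}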

@Tiduster

@kayman-mk

And I noticed that Docker was not installed on the Runner and ubuntu was not part of the docker group. After fixing that, my job was executed. AMI is ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240531, but I have no idea how to change it.

On our side we build a custom AMI from Ubuntu and add the docker package manually. The Docker autoscaler does not do this by default, so it requires an AMI with the Docker engine installed to work.
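As a rough, hedged sketch of that kind of bake (not our actual build script; region, base image, and user are placeholders):

packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.2"
    }
  }
}

source "amazon-ebs" "docker_worker" {
  region        = "eu-central-1" # placeholder
  instance_type = "t3.micro"
  ssh_username  = "ubuntu"
  ami_name      = "gitlab-runner-docker-worker-${formatdate("YYYYMMDDhhmmss", timestamp())}"

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"
      virtualization-type = "hvm"
      root-device-type    = "ebs"
    }
    owners      = ["099720109477"] # Canonical
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.docker_worker"]

  provisioner "shell" {
    inline = [
      "curl -fsSL https://get.docker.com | sudo sh", # install the Docker engine
      "sudo usermod -aG docker ubuntu",              # the SSH user must be able to reach the Docker socket
    ]
  }
}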

We reused runner_worker_docker_machine_ami_filter and runner_worker_docker_machine_ami_owners for this to avoid duplicating variables. We can create new variables if you prefer.

@mmoutama09 added some information about this in usage.md.

Best regards,

mmoutama09 force-pushed the add_gitlab_docker_autoscaler branch from 20fd1b6 to bff273f on June 25, 2024
@mmoutama09
Contributor Author

@kayman-mk I've updated my code to separate docker-autoscaler from docker+machine.

To use the new docker-autoscaler, we must provide an AMI with Docker installed, and the user the autoscaler uses to connect to the workers must be added to the docker group.

The docker-registry-mirror variable is no longer provided, as it is not part of the runner autoscaler configuration, but the mirror could be configured directly in the AMI.
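As a hedged illustration of that AMI-side option, an extra provisioner inside the build block of the Packer sketch above could write the daemon.json (the mirror URL is a placeholder):

  # extra provisioner for the build block of the Packer sketch above (illustrative)
  provisioner "shell" {
    inline = [
      # bake the registry mirror directly into the image
      "echo '{\"registry-mirrors\":[\"https://mirror.example.internal\"]}' | sudo tee /etc/docker/daemon.json",
      "sudo systemctl restart docker",
    ]
  }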

@kayman-mk
Collaborator

Hmm, the need for a custom-built image doesn't sound good to me at first glance. Any chance to use a pre-existing AMI instead? Or can we install Docker on the fly?

In case we want to host this AMI: can you provide a build script (Packer?)?

@kayman-mk
Collaborator

@Tiduster Could you please share the Packer scripts to build the AMI? It would be a good idea to have them available and/or publish an AMI here.

@kayman-mk
Collaborator

Why do we have the sg_ingresses introduced here? Just asking because there is no way to do the same for docker+machine.

@kayman-mk
Collaborator

kayman-mk commented Jul 27, 2024

@Tiduster, @mmoutama09 How do I set all the Docker options (runner_worker_docker_options) and the registry mirror?

EDIT: Docker works via old Docker options.
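For reference, a hedged sketch of passing those existing Docker options to the module; the attribute names shown are examples and the full object schema lives in the module's variables.tf:

  # illustrative module arguments; only the Docker options are shown
  runner_worker_docker_options = {
    privileged = true                        # e.g. needed for docker-in-docker jobs
    volumes    = ["/cache", "/certs/client"]
  }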

@kayman-mk
Collaborator

kayman-mk commented Jul 28, 2024

Hm, tried again with version 1.0.0 of the plugin. Still have the connection issue described above.

Preparation failed: preparing environment: dial ssh: after retrying 30 times: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

Checked the security groups and did an SSH from the agent to the worker: same error. So at least there is no problem with establishing the connection itself. And I do not see any other errors in the log.

EDIT: Not sure how this works internally. I tried sending a temporary key and logging in with that one. Still no success.

[root@ip-192-64-74-36 bin]# aws ec2-instance-connect send-ssh-public-key --instance-id i-015020b95c394df73 --availability-zone "eu-central-1b" --instance-os-user ec2-user --ssh-public-key file:///root/.ssh/id_ed25519.pub --region eu-central-1
{
    "RequestId": "2c73837e-cf97-42f5-a415-0b0639a8b196",
    "Success": true
}
[root@ip-192-64-74-36 bin]# ssh [email protected]
The authenticity of host '192.64.35.158 (192.64.35.158)' can't be established.
ED25519 key fingerprint is SHA256:i8JpNRBtjgOCatirWNpfJgMbDQt9Yp5ZtgHHhMD0pvk.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '192.64.35.158' (ED25519) to the list of known hosts.
[email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

AMI is amazon/al2023-ami-ecs-hvm-2023.0.20240723-kernel-6.1-x86_64

@Tiduster

Tiduster commented Jul 28, 2024

It uses EC2 Instance Connect in the background, as visible in the recommended IAM policy: https://gitlab.com/gitlab-org/fleeting/plugins/aws#recommended-iam-policy
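For orientation, here is a hedged Terraform sketch of the kind of statements that recommended policy contains. The action list is abbreviated and the wildcard resources should be scoped down; the linked page is authoritative.

data "aws_iam_policy_document" "fleeting_aws" {
  statement {
    sid = "FleetingConnect"
    actions = [
      "ec2:DescribeInstances",
      "ec2-instance-connect:SendSSHPublicKey", # pushes the runner's SSH key to the worker
    ]
    resources = ["*"]
  }

  statement {
    sid = "FleetingScale"
    actions = [
      "autoscaling:DescribeAutoScalingGroups",
      "autoscaling:SetDesiredCapacity",
      "autoscaling:TerminateInstanceInAutoScalingGroup",
    ]
    resources = ["*"]
  }
}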

We didn't have this issue on our side; maybe something is missing in the module.

We can fix it on Monday :-).

@Tiduster

After more testing, the official ECS AMI is just not compatible.

We didn't dig into it too much, because since it doesn't work out of the box, this AMI is useless for our use case :-( .
We assume something in the default SSH configuration is preventing the fleeting plugin from connecting to the worker when this AMI is in use.

We DO NOT have this issue with a simple custom AMI pre-installed with Docker, based on Ubuntu or Amazon Linux 2023.

Unfortunately, baking a custom AMI is once again mandatory.

@kayman-mk
Collaborator

kayman-mk left a comment

@Tiduster @mmoutama09 Thanks for adding this function to the project. Great job!

@kayman-mk
Collaborator

To be merged now. Before the release I will do a last check. Everything is working out of the box. Provisioning new machines seems to be 25% faster. I love it!

kayman-mk merged commit 8aaad0c into cattle-ops:main on Aug 3, 2024
19 checks passed
kayman-mk pushed a commit that referenced this pull request Aug 4, 2024
🤖 I have created a release *beep* *boop*
---


## [7.11.0](7.10.0...7.11.0) (2024-08-03)


### Features

* add docker autoscaler executor ([#1118](#1118)) ([8aaad0c](8aaad0c))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: cattle-ops-releaser-2[bot] <134548870+cattle-ops-releaser-2[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@Tiduster

Tiduster commented Aug 5, 2024

Thanks for all the fixes.
We will switch part of our production setup to this new version ASAP to detect any remaining issues.

Hopefully it has fewer bugs than docker+machine ^^
