
Specifying a timeout option seems to be ignored by the runner #496

Closed
francisfuzz opened this issue May 21, 2020 · 7 comments

@francisfuzz

Describe the bug

A timeout option of 5 seconds doesn't seem to be applied when initializing a service or a container in an Actions workflow.

To Reproduce
Steps to reproduce the behavior:

  • Create an Actions workflow that initializes either a service or a container with the following health-check options, including a 5-second timeout. Here's an example:
    services:
      redis:
        image: redis:alpine
        options: --health-cmd "redis-cli ping" --health-interval 10s --health-timeout 5s --health-retries 5
        ports:
          - 6379:6379
  • Trigger a workflow.
  • Observe that the workflow run logs (see below) show an incremental backoff between checks, rather than checks 5 seconds apart as the timeout option would suggest.

Expected behavior


I expected to see the health check performed at 5-second intervals rather than the shorter 1-4 second intervals shown in the logs.

Runner Version and Platform

  • Latest in production as of May 20, 2020.
  • OS of the machine running the runner: Linux

What's not working?


See below. ✌️

Job Log Output

https://github.com/francisfuzz/molecula/runs/694847523?check_suite_focus=true#step:2:39
logs_2.zip

/usr/bin/docker create --name 19f22998701d46b9948ff54476cadd73_redisalpine_f9b603 --label be76db --network github_network_1977d79f05bb41db9d15cb4ed4c9de5f --network-alias redis -p 6379:6379 --health-cmd "redis-cli ping" --health-interval 10s --health-timeout 5s --health-retries 5 -e GITHUB_ACTIONS=true -e CI=true redis:alpine
84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
/usr/bin/docker start 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
/usr/bin/docker ps --all --filter id=84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c --filter status=running --no-trunc --format "{{.ID}} {{.Status}}"
84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c Up Less than a second (health: starting)
/usr/bin/docker port 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
6379/tcp -> 0.0.0.0:6379
/usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
starting
redis service is starting, waiting 2 seconds before checking again.
/usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c

Runner and Worker's Diagnostic Logs

logs_2.zip


👋 @actions/service-oncall for triage/prioritization

@francisfuzz added the "bug" label May 21, 2020
@thboop
Collaborator

thboop commented May 27, 2020

Hey @francisfuzz thanks for the report!

There are actually two different things going on here:

  • Docker's built in health check system
  • The runner polling the docker container health via the docker inspect command

The options --health-cmd "redis-cli ping" --health-interval 10s --health-timeout 5s --health-retries 5 apply to the docker container itself. Every 10 seconds, Docker will run redis-cli ping to determine whether the container is healthy.
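(An aside, purely for illustration and not something the runner does: you can watch Docker's own healthcheck running on that 10-second schedule by dumping the health log Docker keeps for the container. A rough Python sketch, with a placeholder container id:)

import json
import subprocess

# Placeholder: use the id printed by the `docker create` line in the job log.
CONTAINER_ID = "<redis container id>"

# Docker records each healthcheck probe (start time, exit code, output) under .State.Health.
result = subprocess.run(
    ["docker", "inspect", "--format", "{{json .State.Health}}", CONTAINER_ID],
    capture_output=True, text=True, check=True,
)
health = json.loads(result.stdout) or {}  # "null" if the container has no healthcheck
for probe in health.get("Log", []):
    # Consecutive Start timestamps should be roughly --health-interval (10s) apart.
    print(probe["Start"], "exit code", probe["ExitCode"])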

The runner uses the docker inspect command to get the status of the container. Once all containers are started and healthy, it continues with the job, completing the set-up steps and moving on to the actual job steps. We use an exponential backoff to drive this polling, and we do not use the docker container healthcheck values. The message you are seeing comes from here.

The mainline scenario for the latter is waiting for all containers to have started before continuing the job. I don't think we actually want to use the healthcheck values for this, as we aren't running a full healthcheck system, just querying the status.
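To make the distinction concrete, here is a rough Python sketch of what that polling loop looks like (the real runner is written in C#; the fixed 2-second sleep stands in for the exponential backoff described in a later comment, and the container id is a placeholder):

import subprocess
import time

# Placeholder id; the runner uses the id returned by `docker create`.
CONTAINER_ID = "<service container id>"

def container_status(container_id: str) -> str:
    # Same docker inspect format string that appears in the job log above.
    result = subprocess.run(
        ["docker", "inspect",
         "--format={{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}",
         container_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Keep asking Docker for the current status until it reports "healthy".
# The runner waits between polls with an exponential backoff, not a fixed sleep.
while container_status(CONTAINER_ID) != "healthy":
    print("service is starting, waiting before checking again.")
    time.sleep(2)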

That being said, is there something related to the healthcheck system causing issues with your job that we could help address? Or is this something you noticed and just wanted to make sure wasn't an issue?

@sfertman

@thboop, thanks for the clarification. I originally raised this question/issue with GitHub support. I think this is a miscommunication issue: the docs show an example with a Docker container health check, but the runner is actually doing docker inspect on its own, unrelated clock. It would be nice to have that clarified in the documentation, along with a clear specification of the schedule of this exponential backoff. This matters for optimizing job run time where service containers with long startup times are involved.

@francisfuzz
Author

The mainline scenario for the latter is waiting for all containers to have started before continuing the job. I don't think we actually want to use the healthcheck values for this, as we aren't running a full healthcheck system, just querying the status.

@thboop - Thanks for sharing that rich context here!

That being said, is there something related to the healthcheck system causing issues with your job that we could help address? Or is this something you noticed and just wanted to make sure wasn't an issue?

I opened this issue as a way of understanding this feature better as a follow-up from @sfertman's report (and their additional context in #496 (comment)).

@thboop
Collaborator

thboop commented Jun 26, 2020

Thank you for your patience; I've followed up with our docs team to get the docs updated!

The algorithm for the exponential backoff can be found here and we seed it with the inputs here

At a quick glance:

  • On the initial attempt, it always waits 2 seconds (the minimum).
  • On the second attempt, it waits somewhere between 3.6 and 4.4 seconds, following the formula min + ((2^attempt) - 1) * rand(1.6, 2.4), i.e. 2 + ((2^1) - 1) * ~2 (see the sketch after this list).
    • attempt starts at 0, so the 2nd attempt is index 1.
  • On the third attempt, it waits around 8 seconds (6.8 to 9.2), following the same formula.
  • On the fourth attempt, it waits around 16 seconds (13.2 to 18.8), following the same formula.
  • On the fifth attempt, it waits around 32 seconds (26 to 32), capped at 32 (the maximum).
  • After that, it tries again every 32 seconds.
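For anyone who wants the concrete numbers, here is a minimal Python sketch of that formula (not the runner's actual source; the constant names are made up):

import random

# Assumed constants taken from the description above, not the runner's real names.
MIN_DELAY = 2.0    # seconds
MAX_DELAY = 32.0   # seconds
COEFFICIENT_RANGE = (1.6, 2.4)

def backoff_delay(attempt: int) -> float:
    # attempt starts at 0; the first poll always waits the minimum.
    if attempt == 0:
        return MIN_DELAY
    delay = MIN_DELAY + ((2 ** attempt) - 1) * random.uniform(*COEFFICIENT_RANGE)
    return min(delay, MAX_DELAY)   # capped at the max, so later polls settle at ~32s

for attempt in range(6):
    print(f"attempt {attempt}: wait ~{backoff_delay(attempt):.1f}s")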

We also log the wait times, so you may be able to optimize based on the results you see in real builds!

If you have any more questions please let me know and we can reopen the issue!

@vinayakkulkarni

I am facing a similar issue. I have:

options: >-
          --health-start-period 30s
          --health-interval 30s
          --health-timeout 10s
          --health-retries 5
          --expose 8080
          --publish 8080:8080

Ideally, the first check should happen after the 30s start period, but the Actions runner dies after 2s (the default backoff period).

Screenshot 2022-09-21 at 3 52 04 PM

@mziyut

mziyut commented Oct 14, 2022

  • Then, it will try again every 32 seconds.

I checked the source code and could not confirm that this behavior is implemented.
Does it really work as described...?

@taylorthurlow

Even after reading this thread I was still incredibly confused about all this.

To clarify for anyone else who is in my position: The healthcheck commands being passed to Docker are still doing exactly what they are supposed to in order to determine the health of the container, but Actions has its own entirely separate mechanism for asking Docker what the current status of a given container is, and that is not configurable as far as I can tell.

As a result, you might see extra-long startup wait times if your container's healthcheck interval is particularly high - the container itself could become ready a few seconds after Actions last saw it as not ready, causing Actions to wait up to another 32 seconds before seeing the container as healthy.

My container's healthcheck command is super inexpensive so I just set the interval to 1 second and that avoids any extra wait on startup.
