
Specifying a timeout option seems to be ignored by the runner #496

Closed
francisfuzz opened this issue May 21, 2020 · 7 comments

@francisfuzz

Describe the bug

A timeout option of 5 seconds doesn't seem to be applied when initializing a service or a container in an Actions workflow.

To Reproduce
Steps to reproduce the behavior:

  • Create an Actions workflow that initializes either a service or a container with the following health-check options, including a 5-second timeout. Here's an example:
    services:
      redis:
        image: redis:alpine
        options: --health-cmd "redis-cli ping" --health-interval 10s --health-timeout 5s --health-retries 5
        ports:
          - 6379:6379
  • Trigger a workflow.
  • Observe that the workflow run logs (see below) show an incremental backoff between checks, rather than checks 5 seconds apart as the timeout option would suggest.

Expected behavior


I expected to see the health check performed at 5-second intervals rather than the shorter 1-4 second intervals shown in the logs.

Runner Version and Platform

  • Latest in production as of May 20, 2020.
  • OS of the machine running the runner: Linux

What's not working?


See below. ✌️

Job Log Output

https://github.com/francisfuzz/molecula/runs/694847523?check_suite_focus=true#step:2:39
logs_2.zip

/usr/bin/docker create --name 19f22998701d46b9948ff54476cadd73_redisalpine_f9b603 --label be76db --network github_network_1977d79f05bb41db9d15cb4ed4c9de5f --network-alias redis -p 6379:6379 --health-cmd "redis-cli ping" --health-interval 10s --health-timeout 5s --health-retries 5 -e GITHUB_ACTIONS=true -e CI=true redis:alpine
84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
/usr/bin/docker start 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
/usr/bin/docker ps --all --filter id=84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c --filter status=running --no-trunc --format "{{.ID}} {{.Status}}"
84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c Up Less than a second (health: starting)
/usr/bin/docker port 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
6379/tcp -> 0.0.0.0:6379
/usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c
starting
redis service is starting, waiting 2 seconds before checking again.
/usr/bin/docker inspect --format="{{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}" 84c14711d4e97bf675dfb0555aff4d62864fd005d18059bc2e5132abdf83641c

Runner and Worker's Diagnostic Logs

logs_2.zip


👋 @actions/service-oncall for triage/prioritization

@francisfuzz added the "bug" label May 21, 2020
@thboop
Collaborator

thboop commented May 27, 2020

Hey @francisfuzz thanks for the report!

There are actually two different things going on here:

  • Docker's built in health check system
  • The runner polling the docker container health via the docker inspect command

The options --health-cmd "redis-cli ping" --health-interval 10s --health-timeout 5s --health-retries 5 apply to the docker container itself. Every 10 seconds, Docker will run redis-cli ping to determine whether the container is healthy.
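(An aside, purely for illustration and not something the runner does: you can watch Docker's own healthcheck running on that 10-second schedule by dumping the health log Docker keeps for the container. A rough Python sketch, with a placeholder container id:)

import json
import subprocess

# Placeholder: use the id printed by the `docker create` line in the job log.
CONTAINER_ID = "<redis container id>"

# Docker records each healthcheck probe (start time, exit code, output) under .State.Health.
result = subprocess.run(
    ["docker", "inspect", "--format", "{{json .State.Health}}", CONTAINER_ID],
    capture_output=True, text=True, check=True,
)
health = json.loads(result.stdout) or {}  # "null" if the container has no healthcheck
for probe in health.get("Log", []):
    # Consecutive Start timestamps should be roughly --health-interval (10s) apart.
    print(probe["Start"], "exit code", probe["ExitCode"])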

The runner uses the docker inspect command to get the status of the container. Once all containers are started and healthy, it continues with the job, completing the set-up steps and moving on to the actual job steps. We use an exponential backoff to drive this polling, and we do not use the docker container healthcheck values. The message you are seeing comes from here.

The mainline scenario for the latter is waiting for all containers to have started before continuing the job. I don't think we actually want to use the healthcheck values for this, as we aren't running a full healthcheck system, just querying the status.
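To make the distinction concrete, here is a rough Python sketch of what that polling loop looks like (the real runner is written in C#; the fixed 2-second sleep stands in for the exponential backoff described in a later comment, and the container id is a placeholder):

import subprocess
import time

# Placeholder id; the runner uses the id returned by `docker create`.
CONTAINER_ID = "<service container id>"

def container_status(container_id: str) -> str:
    # Same docker inspect format string that appears in the job log above.
    result = subprocess.run(
        ["docker", "inspect",
         "--format={{if .Config.Healthcheck}}{{print .State.Health.Status}}{{end}}",
         container_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Keep asking Docker for the current status until it reports "healthy".
# The runner waits between polls with an exponential backoff, not a fixed sleep.
while container_status(CONTAINER_ID) != "healthy":
    print("service is starting, waiting before checking again.")
    time.sleep(2)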

That being said, is there something related to the healthcheck system causing issues with your job that we could help address? Or is this something you noticed and just wanted to make sure wasn't an issue?

@sfertman

@thboop, thanks for the clarification. I originally raised this question/issue with GitHub support. I think this is a miscommunication issue: the docs show an example with a Docker container health check, but the runner is actually doing docker inspect on its own, unrelated clock. It would be nice to have that clarified in the documentation, along with a clear specification of the schedule of this exponential backoff. This matters for optimizing job run time where service containers with long startup times are involved.

@francisfuzz
Author

The mainline scenario for the latter is waiting for all containers to have started before continuing the job. I don't think we actually want to use the healthcheck values for this, as we aren't running a full healthcheck system, just querying the status.

@thboop - Thanks for sharing that rich context here!

That being said, is there something related to the healthcheck system causing issues with your job that we could help address? Or is this something you noticed and just wanted to make sure wasn't an issue?

I opened this issue as a way of understanding this feature better as a follow-up from @sfertman's report (and their additional context in #496 (comment)).

@thboop
Collaborator

thboop commented Jun 26, 2020

Thank you for your patience; I've followed up with our docs team to get the docs updated!

The algorithm for the exponential backoff can be found here and we seed it with the inputs here

At a quick glance:

  • On the initial attempt, it always waits 2 seconds (the minimum).
  • On the second attempt, it waits somewhere between 3.6 and 4.4 seconds, following the formula min + ((2^attempt) - 1) * rand(1.6, 2.4), i.e. 2 + ((2^1) - 1) * ~2 (see the sketch after this list).
    • attempt starts at 0, so the 2nd attempt is index 1.
  • On the third attempt, it waits around 8 seconds (6.8 to 9.2), following the same formula.
  • On the fourth attempt, it waits around 16 seconds (13.2 to 18.8), following the same formula.
  • On the fifth attempt, it waits around 32 seconds (26 to 32), capped at 32 (the maximum).
  • After that, it tries again every 32 seconds.
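For anyone who wants the concrete numbers, here is a minimal Python sketch of that formula (not the runner's actual source; the constant names are made up):

import random

# Assumed constants taken from the description above, not the runner's real names.
MIN_DELAY = 2.0    # seconds
MAX_DELAY = 32.0   # seconds
COEFFICIENT_RANGE = (1.6, 2.4)

def backoff_delay(attempt: int) -> float:
    # attempt starts at 0; the first poll always waits the minimum.
    if attempt == 0:
        return MIN_DELAY
    delay = MIN_DELAY + ((2 ** attempt) - 1) * random.uniform(*COEFFICIENT_RANGE)
    return min(delay, MAX_DELAY)   # capped at the max, so later polls settle at ~32s

for attempt in range(6):
    print(f"attempt {attempt}: wait ~{backoff_delay(attempt):.1f}s")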

We also log the wait times, so you may be able to optimize based on the results you see in real builds!

If you have any more questions please let me know and we can reopen the issue!

@vinayakkulkarni

I am facing a similar issue. I have:

options: >-
          --health-start-period 30s
          --health-interval 30s
          --health-timeout 10s
          --health-retries 5
          --expose 8080
          --publish 8080:8080

Ideally, the first check should happen after the 30s start period, but the Actions runner dies after 2s (the default backoff period).

Screenshot 2022-09-21 at 3 52 04 PM

@mziyut

mziyut commented Oct 14, 2022

  • Then, it will try again every 32 seconds.

I checked the source code and could not confirm that this behavior is implemented.
Does it really work as described...?

@taylorthurlow

Even after reading this thread I was still incredibly confused about all this.

To clarify for anyone else who is in my position: The healthcheck commands being passed to Docker are still doing exactly what they are supposed to in order to determine the health of the container, but Actions has its own entirely separate mechanism for asking Docker what the current status of a given container is, and that is not configurable as far as I can tell.

As a result, you might see extra-long startup wait times if your container's healthcheck interval is particularly high - the container itself could become ready a few seconds after Actions last saw it as not ready, causing Actions to wait up to another 32 seconds before seeing the container as healthy.

My container's healthcheck command is super inexpensive so I just set the interval to 1 second and that avoids any extra wait on startup.
