Bug: Ryuk fails to start due to port binding (colima, timing) #486

Closed
sondr3 opened this issue Mar 20, 2024 · 17 comments · Fixed by #543

@sondr3

sondr3 commented Mar 20, 2024

Describe the bug

I upgraded from 3.5.0 to 4.1.0 and the container itself fails to spawn because the Ryuk container setup fails. I've tried debugging the issue, and it looks like it is trying to bind the port exposed on IPv6 to the port on IPv4 (the container_port variable is correct for IPv4), which for some reason are different ports.

$ docker ps
CONTAINER ID   IMAGE                       COMMAND       CREATED         STATUS         PORTS                                         NAMES
4f1bad20a38c   testcontainers/ryuk:0.5.1   "/bin/ryuk"   7 seconds ago   Up 5 seconds   0.0.0.0:33029->8080/tcp, :::32775->8080/tcp   testcontainers-ryuk-1cf580e2-54c4-496d-a1d7-17f495911219

To Reproduce

Provide a self-contained code snippet that illustrates the bug or unexpected behavior. Ideally, send a Pull Request with a test that illustrates the problem.

>       Reaper._socket.connect((container_host, container_port))
E       ConnectionRefusedError: [Errno 61] Connection refused
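
A hypothetical minimal reproduction (the image and command are only illustrative; any container start in testcontainers >= 4 triggers the Ryuk reaper):

from testcontainers.core.container import DockerContainer

# Starting any container first spins up the Ryuk reaper; with the bug present,
# Reaper._socket.connect(...) raises ConnectionRefusedError before this
# container ever becomes usable.
with DockerContainer("alpine:3.19").with_command("sleep 10") as container:
    print(container.get_container_host_ip())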

Runtime environment

Provide a summary of your runtime environment. Which operating system, Python version, and Docker version are you using? What version of testcontainers-python are you using? You can run the following commands to get the relevant information.

$ uname -a
Darwin jupiter.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:44 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6000 arm64
$ python --version
Python 3.11.7
$ docker info
Client: Docker Engine - Community
 Version:    25.0.4
 Context:    colima
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc.)
    Version:  2.25.0
    Path:     /Users/sondre/.docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 0
  Paused: 0
  Stopped: 4
 Images: 112
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
 runc version: v1.1.9-0-gccaecfc
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-21-generic
 Operating System: Ubuntu 23.10
 OSType: linux
 Architecture: aarch64
 CPUs: 2
 Total Memory: 3.817GiB
 Name: colima
 ID: ac2c6903-b356-409d-9301-b040440d1efd
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
@alexanderankin
Collaborator

I can't reproduce this on my M1 environment, but after what I saw with IPv6 while working on compose, I have no doubt there is potential for an issue here.

@santi
Collaborator

santi commented Apr 4, 2024

Hi @sondr3!

Sorry to hear that you are having problems. I am on an M3 setup myself, but haven't encountered the same problem that you have. Am I reading it right that you are using colima with an x86 VM as your Docker runtime on an M1 / arm64 system?

Do you encounter the same problem if you run with a native arm64 backend without virtualization/colima?

We'll follow up closely on this one, as it is important to us that Ryuk runs smoothly on all architectures.

@sondr3
Author

sondr3 commented Apr 4, 2024

@santi, correct. However, I can't run the image natively; I need to run the MSSQL Docker image for tests at $WORK, and it only has amd64 images available :(

$ colima status
INFO[0000] colima is running using macOS Virtualization.Framework 
INFO[0000] arch: aarch64                                
INFO[0000] runtime: docker                              
INFO[0000] mountType: virtiofs                          
INFO[0000] socket: unix:///Users/sondre/.colima/default/docker.sock 

@santi
Collaborator

santi commented Apr 4, 2024

Ah, try using the mcr.microsoft.com/azure-sql-edge:1.0.7 image instead. It has ARM64 support and the API is compatible with mssql (note: I haven't tried it extensively, only used it as part of testing in this repo).

This doesn't really solve your problem, but worth a try:

import sqlalchemy
from testcontainers.mssql import SqlServerContainer

with SqlServerContainer("mcr.microsoft.com/azure-sql-edge:1.0.7") as mssql:
    engine = sqlalchemy.create_engine(mssql.get_connection_url())
    with engine.begin() as connection:
        result = connection.execute(sqlalchemy.text("select @@VERSION"))

@sondr3
Author

sondr3 commented Apr 4, 2024

Using that image works without emulating amd64, but Ryuk sadly still fails to start. Interestingly, it mostly works when I try to debug and step through, so it may be a timing issue? Not really sure; it works maybe 1 in 5 attempts.

Update: I've run the tests a bunch of times on our Ubuntu 22.04 CI machines and it works fine there, and on my colleague's Windows machine 🙈

@santi
Collaborator

santi commented Apr 4, 2024

Having dug further into this, I strongly believe port bindings are not to blame for this problem. The 0.0.0.0:33029->8080/tcp, :::32775->8080/tcp in your docker ps output indicates that port 33029 on IPv4 and port 32775 on IPv6 are mapped to port 8080 inside your container on their respective IP interfaces. Nothing wrong with that.
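
For illustration, a sketch of how the dual-stack mapping can be inspected with docker-py (it assumes a Ryuk container like the one in the docker ps output above is currently running; the name filter is illustrative):

import docker

client = docker.from_env()
# Assumes a running Ryuk container named "testcontainers-ryuk-<uuid>".
ryuk = client.containers.list(filters={"name": "testcontainers-ryuk"})[0]
# Each IP family gets its own ephemeral host port mapped to 8080/tcp, e.g.
# [{'HostIp': '0.0.0.0', 'HostPort': '33029'}, {'HostIp': '::', 'HostPort': '32775'}]
print(ryuk.attrs["NetworkSettings"]["Ports"]["8080/tcp"])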

If this failure appears randomly in some cases and consistently goes away when using breakpoint(), I agree it looks more like a timing issue. The mysterious thing is that the wait strategy for the Ryuk container is identical to the wait strategy in the Java implementation, which doesn't report the same problem. At the point of the ConnectionRefusedError, are you 100% sure the Ryuk container is running at all? The only case I can think of is that RYUK_RECONNECTION_TIMEOUT is set so low that Ryuk terminates before a socket is connected. Could you try updating to the latest release (4.3.1) and setting the environment variable RYUK_RECONNECTION_TIMEOUT=30s?
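
If it helps, a sketch of setting that variable from Python rather than the shell (this assumes the library reads RYUK_RECONNECTION_TIMEOUT from the environment before the reaper starts, so it must run before any container is spun up; exporting it in the shell before running the tests works just as well):

import os

# Must run before the Ryuk reaper is started (e.g. at the top of conftest.py).
os.environ["RYUK_RECONNECTION_TIMEOUT"] = "30s"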

@bjchambers

I experienced the same problem on a Mac. Downgrading testcontainers (4.3.2 -> 3.7.1) fixed the issue.

@damianoct

The same happened to me. Fixed by downgrading to 3.7.1.
Why is the Ryuk container not running during tests in the 3.7.1 version? It seems testcontainers >= 4 now uses Ryuk during test execution.

@pseidel-kcf

RYUK_RECONNECTION_TIMEOUT=30s doesn't do anything for me on 4.3.3 but setting a breakpoint here and pausing for a split second reliably works.

@alexanderankin alexanderankin changed the title Bug: Ryuk fails to start due to port binding Bug: Ryuk fails to start due to port binding (colima, timing) Apr 12, 2024
@alexanderankin
Collaborator

Sounds like ports become available later on colima, so we'd want to actually check those and not just wait on logs if we want to be compatible with colima's differences from Docker.
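
A rough sketch of that idea (names and timeouts are illustrative, not the actual fix): poll the mapped port until a TCP connect succeeds instead of relying only on the container's log output.

import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 20.0) -> None:
    """Block until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(0.5)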

alexanderankin added a commit that referenced this issue Apr 14, 2024
@alexanderankin
Collaborator

Does this tweak to retry for ~20 seconds help?

pip install git+https://github.com/testcontainers/testcontainers-python.git@issue486_explore_retry

@pseidel-kcf

It doesn't, because an unhandled OSError gets thrown.

Simply handling the OSError doesn't help either. I get these exceptions:

[Errno 61] Connection refused
[Errno 22] Invalid argument
[Errno 22] Invalid argument
...
[Errno 22] Invalid argument

Something like this appears to resolve it, but I don't know enough about this library, Python, or sockets to know if it's the correct approach.

        # Inside the Reaper startup code, replacing the single connect call; needs
        # `from socket import socket`, `from time import sleep`, and
        # `from typing import Optional` at module level.
        last_connection_exception: Optional[OSError] = None
        for _ in range(50):
            try:
                # Recreate the socket on every attempt; reusing the same socket after
                # a failed connect is what produced the "[Errno 22] Invalid argument"
                # errors above.
                Reaper._socket = socket()
                Reaper._socket.connect((container_host, container_port))
                last_connection_exception = None
                break
            except OSError as e:
                last_connection_exception = e
                sleep(0.5)
        if last_connection_exception:
            raise last_connection_exception

@alexanderankin
Collaborator

@pseidel-kcf thanks for testing, I've updated my branch. I think, from the perspective of maintaining this library, the missing insight is into colima. Per the hypothesis that this is a colima timing bug (well, a bug in the sense that it doesn't match the behavior of Docker Engine), this approach could be the one to go with.

@pseidel-kcf

Thanks @alexanderankin. I didn't do a great job explaining, but I found that I needed to recreate the socket in addition to handling the exception type.

@rvem

rvem commented Apr 16, 2024

which are for some reason different ports.

Looks like another instance of moby/moby#42442

@OverkillGuy
Contributor

I had the same ConnectionRefused issue on Linux via Rancher Desktop (colima based), and using the issue486_explore_retry branch fixes it for me.

Before I saw this thread, I investigated by putting a breakpoint() around the socket connection, and even when waiting <0.5s to continue, this fixed the connection refused, same as others above.
So I'm strongly leaning towards a timing issue in colima (behaviour deviating from Docker Engine), especially since the linked branch fixed it for me.

Note that the main branch otherwise fails most tests due to the missing Ryuk connection, and has ever since 4.1 introduced it! Same on work's Intel MacBooks: testcontainers-py >=4.1 via Rancher Desktop is a no-go for us, we had to pin to <=3.7.

I suggest polishing this timing/retry branch and considering a merge, if it proves a good compromise.

@alexanderankin
Collaborator

Alright, I'm going to merge the associated PR - this will close this issue. Please try the next release (4.4.0) when it's released in a couple of minutes,

and re-open or comment (and we'll reopen) this issue if needed.

alexanderankin pushed a commit that referenced this issue Apr 17, 2024
🤖 I have created a release *beep* *boop*
---


## [4.4.0](testcontainers-v4.3.3...testcontainers-v4.4.0) (2024-04-17)


### Features

* **labels:** Add common testcontainers labels ([#519](#519)) ([e04b7ac](e04b7ac))
* **network:** Add network context manager ([#367](#367)) ([11964de](11964de))


### Bug Fixes

* **core:** [#486](#486) for colima delay for port avail for connect ([#543](#543)) ([90bb780](90bb780))
* **core:** add TESTCONTAINERS_HOST_OVERRIDE as alternative to TC_HOST ([#384](#384)) ([8073874](8073874))
* **dependencies:** remove usage of `sqlalchemy` in DB extras. Add default wait timeout for `wait_for_logs` ([#525](#525)) ([fefb9d0](fefb9d0))
* tests for Kafka container running on ARM64 CPU ([#536](#536)) ([29b5179](29b5179))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>