Nginx does not run #1

nlacasse · 2018-04-26T21:45:27Z

Depending on the configuration, you may see nginx fail with the error:

ioctl(FIOASYNC) failed while spawning "worker process" (25: Inappropriate ioctl for device)

Support for FIOASYNC is in progress, but it’s not available yet. For now, add
the line below to /etc/nginx/nginx.conf:

master_process off;
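For anyone scripting this workaround, here is a minimal Python sketch that adds the directive to the text of a config file. The function name `with_master_process_off` is hypothetical, and the sketch assumes the directive belongs at the top level of the file (the nginx "main" context), so it either rewrites an existing `master_process` line or prepends a new one.

```python
import re

DIRECTIVE = "master_process off;"

# Matches an existing, uncommented master_process directive on its own line.
_PATTERN = re.compile(
    r"^[ \t]*master_process[ \t]+(?:on|off)[ \t]*;", re.MULTILINE
)


def with_master_process_off(conf: str) -> str:
    """Return the config text with `master_process off;` in the main context."""
    if _PATTERN.search(conf):
        # Rewrite whatever value is already configured.
        return _PATTERN.sub(DIRECTIVE, conf)
    # Otherwise prepend the directive; the top of the file is the main context.
    return DIRECTIVE + "\n" + conf


if __name__ == "__main__":
    print(with_master_process_off("events {}\n"), end="")
```

The function only transforms a string; reading and writing /etc/nginx/nginx.conf (and reloading nginx) is left to the caller.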


seeekr · 2018-05-18T17:13:50Z

Am I correct in assuming that this will potentially make nginx behave in unexpected ways and thus it is not something one should attempt for any somewhat serious use of nginx + gvisor?

To quote the nginx docs at http://nginx.org/en/docs/ngx_core_module.html#master_process:

Syntax: master_process on | off;
Default: master_process on;
Context: main

Determines whether worker processes are started. This directive is intended for nginx developers.

nlacasse · 2018-06-05T17:00:31Z

I'm not totally sure what the "master_process" directive does, but the docs do state that it should not be used in production.

http://nginx.org/en/docs/faq/daemon_master_process_off.html

We are still working on the FIOASYNC support, at which point this workaround will not be necessary.

tirumaraiselvan · 2018-06-22T12:36:05Z

Is there an example to repro this behaviour? I am not having an issue with nginx:latest.
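One way to check for the underlying problem without nginx is to issue the same ioctl directly. The sketch below is an assumption-laden probe, not the project's own test: it assumes the failing call is `ioctl(FIOASYNC)` on a Unix socket pair (as the "while spawning worker process" error suggests), the helper name `fioasync_supported` is hypothetical, and the fallback constant `0x5452` is the Linux value of `FIOASYNC`.

```python
import errno
import fcntl
import socket
import struct
import termios

# Use the constant exported by termios; fall back to the Linux value (assumption).
FIOASYNC = getattr(termios, "FIOASYNC", 0x5452)


def fioasync_supported() -> bool:
    """Return True if ioctl(FIOASYNC) succeeds on a Unix socket pair."""
    left, right = socket.socketpair()
    try:
        # Turn asynchronous I/O notification on, then off again.
        fcntl.ioctl(left.fileno(), FIOASYNC, struct.pack("i", 1))
        fcntl.ioctl(left.fileno(), FIOASYNC, struct.pack("i", 0))
        return True
    except OSError as exc:
        # ENOTTY is errno 25, "Inappropriate ioctl for device".
        if exc.errno == errno.ENOTTY:
            return False
        raise
    finally:
        left.close()
        right.close()


if __name__ == "__main__":
    print("FIOASYNC supported:", fioasync_supported())
```

Running this inside the sandbox and on the host should show whether the runtime accepts the ioctl; a `False` result corresponds to the errno in the nginx error message.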

nlacasse · 2018-06-22T15:46:32Z

gVisor now supports the FIOASYNC ioctl. I just verified that nginx runs without the "master_process off;" directive.

Updates #1 PiperOrigin-RevId: 201760129


glibc's malloc also uses SYS_TIME. Permit it.

    #0 0x0000000000de6267 in time ()
    #1 0x0000000000db19d8 in get_nprocs ()
    #2 0x0000000000d8a31a in arena_get2.part ()
    #3 0x0000000000d8ab4a in malloc ()
    #4 0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) ()
    #5 0x0000000000d4cd70 in __tsan_go_start ()
    #6 0x00000000004617a3 in racecall ()
    #7 0x00000000010f4ea0 in runtime.findfunctab ()
    #8 0x000000000043f193 in runtime.racegostart ()

Signed-off-by: Dmitry Vyukov <[email protected]>
[[email protected]: updated comments and commit message]
Signed-off-by: Michael Pratt <[email protected]>
Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a
PiperOrigin-RevId: 203042627



glibc's malloc also uses SYS_TIME. Permit it.

    #0 0x0000000000de6267 in time ()
    #1 0x0000000000db19d8 in get_nprocs ()
    #2 0x0000000000d8a31a in arena_get2.part ()
    #3 0x0000000000d8ab4a in malloc ()
    #4 0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) ()
    #5 0x0000000000d4cd70 in __tsan_go_start ()
    #6 0x00000000004617a3 in racecall ()
    #7 0x00000000010f4ea0 in runtime.findfunctab ()
    #8 0x000000000043f193 in runtime.racegostart ()

Signed-off-by: Dmitry Vyukov <[email protected]>
[[email protected]: updated comments and commit message]
Signed-off-by: Michael Pratt <[email protected]>
Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a
PiperOrigin-RevId: 203042627
Upstream-commit: 6144751

Below command under hostinet network will lead to panic:

    $ cat /proc/net/tcp

It's caused by the wrong SizeOfTCPInfo.

    #0 runtime.panicindex()
    #1 encoding/binary.littleEndian.Uint64
    #2 encoding/binary.(*littleEndian).Uint64
    #3 gvisor.dev/gvisor/pkg/binary.unmarshal
    #4 gvisor.dev/gvisor/pkg/binary.unmarshal
    #5 gvisor.dev/gvisor/pkg/binary.Unmarshal
    #6 gvisor.dev/gvisor/pkg/sentry/socket/hostinet.(*socketOperations).State
    #7 gvisor.dev/gvisor/pkg/sentry/fs/proc.(*netTCP).ReadSeqFileData

Correct SizeOfTCPInfo from 104 to 192 to fix it.

Fixes google#640

Signed-off-by: Jianfeng Tan <[email protected]>

catch up


This change adds more information about what needs to be done to implement `/dev/fuse`

FUTURE_COPYBARA_INTEGRATE_REVIEW=#2855 from ridwanmsharif:ridwanmsharif/fuse-doc-edit 5173c96
PiperOrigin-RevId: 314428000

Distributed training isn't working with PyTorch on certain A100 nodes. Adds the missing ioctl `UVM_UNMAP_EXTERNAL` allowing for certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html), fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gvisor**

```bash
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```

2. **Add GPU enabling gvisor options**

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```

Reload configs with `sudo systemctl reload docker`.

3. **Run reproduction NCCL test**

This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye

RUN pip install torch numpy
COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)


def cleanup():
    dist.destroy_process_group()


def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)

        # rank receiving all tensors
        target_rank = world_size - 1

        dist.barrier()

        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")

            print("PASS: NCCL working.")

    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise

    finally:
        cleanup()


def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()

    if args.world_size < 2:
        raise RuntimeError(f"world_size needs to be larger than 1 {args.world_size}")

    main(args.world_size)
EOF

ENTRYPOINT ["python", "repro.py", "4"]
```

Build image with:

```
docker build -f Dockerfile .
```

Then run it with:

```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```

#### Failure (truncated)

```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix

gvisor debug logs show:

```
W0702 20:36:17.577055 445833 uvm.go:148] [ 22: 84] nvproxy: unknown uvm ioctl 66 = 0x42
```

I've implemented that ioctl in this PR. This is the output after the fix.

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734
PiperOrigin-RevId: 649146570
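The debug-log warning quoted in the PR description above (`nvproxy: unknown uvm ioctl 66 = 0x42`) is what identified the missing ioctl. As a rough sketch of how such lines could be collected from a debug log, the Python below counts them per ioctl number. The helper name `unknown_uvm_ioctls` is hypothetical, and the regular expression assumes every warning follows the exact wording of that single quoted line.

```python
import re
from collections import Counter

# Assumed format, taken from the one log line quoted in the PR description.
_UNKNOWN_UVM = re.compile(r"nvproxy: unknown uvm ioctl (\d+) = (0x[0-9a-fA-F]+)")


def unknown_uvm_ioctls(lines):
    """Count 'unknown uvm ioctl' warnings per ioctl number."""
    counts = Counter()
    for line in lines:
        match = _UNKNOWN_UVM.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "W0702 20:36:17.577055 445833 uvm.go:148] [ 22: 84] "
        "nvproxy: unknown uvm ioctl 66 = 0x42",
    ]
    for number, seen in unknown_uvm_ioctls(sample).items():
        print(f"ioctl {number} ({hex(number)}): seen {seen} time(s)")
```

Passing an open log file instead of the sample list works the same way, since the function only iterates over lines.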

nlacasse assigned iangudger Apr 26, 2018

balasu mentioned this issue May 20, 2018

runsc runtime not working with centos 7.5 #55

Closed

huster-hh mentioned this issue Jun 13, 2018

How is the information of procfs and sysfs from? #65

Closed

nlacasse closed this as completed Jun 22, 2018

shentubot pushed a commit that referenced this issue Jun 22, 2018

Remove nginx failure note now that it works

ea70075

Updates #1 PiperOrigin-RevId: 201760129

dvyukov pushed a commit to dvyukov/gvisor that referenced this issue Jun 23, 2018

Remove nginx failure note now that it works

9c0c4fd

Updates google#1 PiperOrigin-RevId: 201760129 Change-Id: Ifd8ce9e0f93c6771083dc9bf8d35a2800c13481a

newmanwang mentioned this issue Jul 24, 2018

bazel build error: unrecognized import path: "golang.org/x/sys" #89

Closed

newmanwang mentioned this issue Sep 22, 2018

Test case failure: kvm_test failure #106

Closed

tonistiigi referenced this issue in tonistiigi/gvisor Jan 30, 2019

Remove nginx failure note now that it works

f49849a

Updates #1 PiperOrigin-RevId: 201760129 Change-Id: Ifd8ce9e0f93c6771083dc9bf8d35a2800c13481a Upstream-commit: 9c0c4fd

prattmic mentioned this issue Jun 5, 2019

Pipes/PipeTest_BlockPartialWriteClosed/namednonblocking fails with overlayfs #318

Closed

tanjianfeng mentioned this issue Aug 2, 2019

fix wrong SizeOfTCPInfo #643

Closed

mcowger mentioned this issue Dec 19, 2019

gvisor prevents AMQP sockets from opening (TCP_SYNCNT) #1441

Closed

copybara-service bot pushed a commit that referenced this issue Apr 27, 2020

Merge pull request #1 from google/master

e896ca5

catch up

copybara-service bot pushed a commit that referenced this issue May 12, 2020

Fix typo in README (#1)

7c3a00a

ridwanmsharif referenced this issue in ridwanmsharif/gvisor Jun 2, 2020

Add some detail to milestone #1

5173c96

This change adds more information about what needs to be done to implement `/dev/fuse`

copybara-service bot mentioned this issue Jun 2, 2020

Add some detail to milestone #1 #2858

Merged

DarcySail mentioned this issue Aug 18, 2020

Gvisor failed to ftruncate on deleted file #3654

Closed

chiraggupta06 mentioned this issue Sep 1, 2020

conainerD with gvisor not working(failed to connect: dial unix: missing address) #3820

Closed

jelischer mentioned this issue Sep 10, 2020

Use of WriteHeaderIncludePacket for ICMP reply packets causes test failures on raw packet receive. #3902

Closed

Kos-M mentioned this issue Feb 11, 2024

OCI runtime create failed: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF: unknown #9996

Closed

This was referenced Feb 22, 2024

Regression in handling of xxx | grep > /dev/null #10046

Closed

Inconsistent inode numbers for mounted files #10047

Open

TommyTran732 mentioned this issue Feb 23, 2024

gVisor stops working on Fedora CoreOS after release-20240122.0 #10062

Closed

p12tic mentioned this issue Feb 24, 2024

Parent process is not notified about exited child stdin being closed #10066

Closed

jcodybaker mentioned this issue Mar 11, 2024

--overlay2=none + supervisord - panic: interface conversion: interface {} is nil, not *gofer.lisafsDentry #10143

Closed

cncal mentioned this issue Apr 18, 2024

run container failed: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF: unknown. #10295

Closed

9Bakabaka mentioned this issue May 14, 2024

Can not run gvisor at rockylinux 9.3 with docker 26.1.2 #10441

Closed

jseba mentioned this issue May 15, 2024

Metrics server flag no longer exists #10455

Closed

sfc-gh-jyin mentioned this issue May 29, 2024

Python program running slower inside Gvisor sandbox with ARM64 #10487

Open

artemislena mentioned this issue Jun 2, 2024

Doesn't work with kernel.yama.ptrace_scope=3 #10495

Closed

q53 mentioned this issue Jul 5, 2024

Illegal instruction when gvisor running with --platform=kvm #10625

Closed

q53 mentioned this issue Jul 8, 2024

runsc --platform=systrap fails with "panic: seccomp failed: invalid argument" #10633

Closed

zpavlinovic mentioned this issue Aug 8, 2024

runsc (in docker): fork/exec /proc/self/exe: read-only file system #10747

Closed

chetan-reddy mentioned this issue Sep 2, 2024

runsc unable to run bash from a guix pack #10849

Open

chetan-reddy mentioned this issue Sep 19, 2024

emacs fails with "Could not open file: /dev/tty" #10925

Closed

apyrgio mentioned this issue Sep 23, 2024

Operation not permitted when mounting /proc to /tmp/proc #10944

Closed

nt mentioned this issue Oct 10, 2024

Port forwarding fails #11019

Closed

This was referenced Oct 11, 2024

'process/user/umask' from the OCI runtime spec not honoured #11022

Closed

gofer mount fails w/ "permission denied" using rootful Podman >= 5.2.0 (git bisected) #11040

Closed

apyrgio mentioned this issue Oct 30, 2024

Nested gVisor does not work with --directfs=false and Yama mode 2 #11091

Closed

markusthoemmes mentioned this issue Oct 31, 2024

ollama runs into segmentation faults in gvisor + nvproxy #11098

Closed

BinaryKhaos mentioned this issue Nov 8, 2024

Add support for O_TMPFILE #11143

Open
