Accidentally dora command is unresponsive and stuck #253

Closed
meua opened this issue Apr 18, 2023 · 16 comments · Fixed by #664
Labels
bug Something isn't working cli CLI

Comments

@meua
Contributor

meua commented Apr 18, 2023

Describe the bug
The dora command occasionally becomes unresponsive and hangs.

To Reproduce
Steps to reproduce the behavior:

  1. In a few cases, the CLI hangs when executing dora up, dora start dataflow.yaml, dora stop, or dora destroy.

Environments (please complete the following information):

@phil-opp
Collaborator

Could you give us more details on how to reproduce this issue?

@haixuanTao
Collaborator

haixuanTao commented Apr 19, 2023

I think I can reproduce the issue with:

dora up
# started dora coordinator
# started dora daemon

dora destroy
# Send destroy command to dora-coordinator

dora up # <--- This hangs

I think it is due to the coordinator waiting for something, which makes it unable to respond to other requests.
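To illustrate the suspected failure mode generically (this is not dora's coordinator code), here is a minimal Python sketch. It assumes a hypothetical `handle_destroy` handler that waits for a daemon acknowledgement which never arrives. Bounding the wait with `asyncio.wait_for` lets the handler return, so later requests can still be served.

```python
import asyncio


async def handle_destroy(daemon_ack: asyncio.Future, timeout: float) -> str:
    """Wait for a (hypothetical) daemon acknowledgement, at most `timeout` seconds."""
    try:
        await asyncio.wait_for(daemon_ack, timeout)
        return "destroyed"
    except asyncio.TimeoutError:
        # Without the timeout this await would never return, and every
        # request queued behind it would appear to hang.
        return "timed out waiting for daemon"


async def main() -> list:
    loop = asyncio.get_running_loop()
    never_acked = loop.create_future()  # simulates a daemon that never replies
    results = [await handle_destroy(never_acked, timeout=0.1)]
    # The handler returned, so a following request can be processed.
    results.append("next request handled")
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Whether a timeout is the right policy for the real coordinator is a separate design question; the sketch only shows why one unbounded wait can make every later command look stuck.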

@phil-opp
Collaborator

Hmm, I tried it multiple times but I couldn't reproduce the issue on the main branch. Which dora version are you using @haixuanTao ?

@haixuanTao
Collaborator

haixuanTao commented Apr 19, 2023

Yep, I will investigate on my end if you cannot reproduce it. I used the main branch.

@phil-opp
Collaborator

Thanks!

@meua
Contributor Author

meua commented Apr 20, 2023

image
I reproduced the problem again; the steps are the same as shown above.

The precondition is that dora start raises an exception, as shown below:
image

@meua
Contributor Author

meua commented Apr 20, 2023

@phil-opp

@meua
Contributor Author

meua commented Apr 20, 2023

image
I reproduced the problem again; the steps are the same as shown above.

The precondition is that dora start raises an exception, as shown below:
image

After this, dora-cli no longer responds.

@haixuanTao
Collaborator

haixuanTao commented Apr 21, 2023

I think it's probably linked to the yolov5 operator being unable to access GitHub from behind the GFW, which leaves its initialisation function stuck. But it's going to be very hard for Philipp to reproduce.

@haixuanTao haixuanTao added cli CLI bug Something isn't working labels Apr 25, 2023
@haixuanTao
Collaborator

haixuanTao commented Apr 25, 2023

I retested this issue; here is the trace output:

(base) ~/D/C/dora ❯❯❯ RUST_LOG=trace dora destroy                                           (base) fix-coordinator-loop ✭
  2023-04-25T08:21:11.484181Z TRACE dora_coordinator::control: Control connection closed
    at binaries/coordinator/src/control.rs:90

  2023-04-25T08:21:11.484197Z TRACE dora_coordinator: Handling event Control(IncomingRequest { request: Destroy, reply_sender: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } })
    at binaries/coordinator/src/lib.rs:142

  2023-04-25T08:21:11.484227Z  INFO dora_coordinator: Received destroy command
    at binaries/coordinator/src/lib.rs:403

  2023-04-25T08:21:11.484359Z  INFO dora_daemon: received destroy command -> exiting
    at binaries/daemon/src/lib.rs:331
    in dora_daemon::run_inner with self.machine_id: 

Send destroy command to dora-coordinator
  2023-04-25T08:21:11.484604Z TRACE dora_coordinator::control: Control connection closed
    at binaries/coordinator/src/control.rs:90

It seems to be due to this TRACE message, "Control connection closed", which happens because there is an ErrorKind::UnexpectedEof.

However, looking at the running processes, the dora daemon has exited.

This is probably linked to an error when the dora daemon sends the coordinator its confirmation that it was successfully destroyed.

@phil-opp
Collaborator

> It seems to be due to this TRACE: Control connection closed which happens because there is an ErrorKind::UnexpectedEof

This is expected, as the CLI closes its control connection to the coordinator when it exits.
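A small Python sketch of this behaviour, using a local socket pair as a stand-in for the CLI-to-coordinator control connection (the names `cli`, `coordinator`, and `read_request` are hypothetical). Once the client side closes, the next read on the server side returns end-of-file. In Rust, `read_exact` reports running into that end-of-file as `ErrorKind::UnexpectedEof`; that the coordinator reads this way is an assumption here.

```python
import socket


def read_request(conn: socket.socket) -> str:
    """Read one message; an empty read means the peer closed the connection."""
    data = conn.recv(1024)
    if data == b"":
        return "connection closed"
    return data.decode()


def demo() -> tuple:
    cli, coordinator = socket.socketpair()
    cli.sendall(b"destroy")
    first = read_request(coordinator)   # the request itself
    cli.close()                         # the CLI exits after sending its command
    second = read_request(coordinator)  # end-of-file, not an application error
    coordinator.close()
    return first, second


if __name__ == "__main__":
    print(demo())
```

The second read is the normal sign that a client has finished, so a log line at that point does not by itself indicate a fault.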

@phil-opp
Collaborator

> The premise is that dora start has an exception, as shown below: image

This seems to be the real issue here. The python operator seems to require GLIBCXX_3.4.29 (required by matplotlib), but it is not found. This error brings down the whole runtime node. I'm not sure why the daemon does not detect this error, but my guess is that it is stuck waiting for the node to finish initialization (for the synchronized start introduced in #236).

So I think there are two things that we need to look into:

  • Why is the mentioned GLIBCXX version required, but not found?
  • Why doesn't the dora daemon detect the operator/node initialization error?
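The second question can be sketched in Python as a wait loop that checks for early process exit as well as readiness. This is an illustration under assumptions, not the daemon's actual logic: `wait_for_init` and `is_ready` are hypothetical names, and the child process merely stands in for a runtime node that crashes while loading an operator.

```python
import subprocess
import sys
import time


def wait_for_init(proc, is_ready, timeout=5.0, poll_interval=0.05):
    """Wait until the node reports ready, exits early, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return "ready"
        code = proc.poll()  # None while the process is still running
        if code is not None:
            return f"exited during init with code {code}"
        time.sleep(poll_interval)
    return "timed out"


def demo():
    # Child that fails immediately, standing in for a node whose operator
    # cannot be loaded (for example because of a missing shared library).
    proc = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"])
    return wait_for_init(proc, is_ready=lambda: False)


if __name__ == "__main__":
    print(demo())
```

A waiter that only checks the readiness signal would block forever in this situation; also checking the exit status turns the hang into a reportable error.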

@haixuanTao
Collaborator

I opened #271 to track this issue: Why doesn't the dora daemon detect the operator/node initialization error?

@phil-opp
Collaborator

Does this issue still happen on the latest version (i.e. with #271 merged)?

@meua
Contributor Author

meua commented Jun 27, 2023

> Does this issue still happen on the latest version (i.e. with #271 merged)?

The situation described above has not happened again, but there is still a case where dora stop cannot stop a dataflow. It occurs when an exception is raised inside an operator that the dataflow depends on, as shown below:

(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora stop 
> Choose dataflow to stop: [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora -V
dora-cli 0.2.3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ 

vi webcam_yolov8.yaml

nodes:
  - id: webcam
    operator:
      python: ../../operators/webcam_op.py
      inputs:
        tick: dora/timer/millis/100
      outputs:
        - image
    env:
      DEVICE_INDEX: 2

  - id: yolov8
    operator: 
      outputs:
        - bbox
      inputs:
        image: webcam/image
      python: ../../operators/yolov8_op.py
    env:
      PYTORCH_DEVICE: "cuda"
#      YOLOV8_PATH: $DORA_DEP_HOME/dependencies/YOLOv8/
#      YOLOV8_WEIGHT_PATH: $DORA_DEP_HOME/dependencies/YOLOv8/weights/yolov8n.pt

  - id: plot
    operator:
      python: ../../operators/plot.py
      inputs:
        image: webcam/image
        obstacles_bbox: yolov8/bbox
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ RUST_LOG=true dora start graphs/tutorials/webcam_yolov8.yaml --attach --hot-reload --name YOLOv8
4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ 

dora logs 4aba7bb7-7966-4839-921d-72c575f7ea33 yolov8

...
─────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     │ Logs from yolov8.
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Ultralytics YOLOv8.0.122 🚀 Python-3.7.16 torch-1.11.0 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12037MiB)
   2 │ YOLOv8n summary (fused): 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
   3 │ ^Mval: Scanning /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/labels/val2017.cache... 0 images, 0 backgrounds, 5000 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]^Mval: Scanning /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/labels/val2017.cache... 0 images, 0 backgrounds, 5000 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
   4 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000139.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000139.jpg'
   5 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000285.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000285.jpg'
   6 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000632.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000632.jpg'
   7 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000724.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000724.jpg'
   8 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000776.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000776.jpg'
   9 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000785.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000785.jpg'
  10 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000802.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000802.jpg'
  11 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000872.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000872.jpg'
  12 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000885.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000885.jpg'
  13 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001000.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001000.jpg'
  14 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001268.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001268.jpg'
  15 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001296.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001296.jpg'
  16 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001353.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001353.jpg'
  17 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001425.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001425.jpg'
  18 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001490.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001490.jpg'
  19 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001503.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001503.jpg'
  20 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001532.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001532.jpg'
  21 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001584.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001584.jpg'
  22 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001675.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001675.jpg'
  23 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001761.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001761.jpg'
...
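For the case described in this comment, where a stop request has no visible effect, one common mitigation is to escalate after a grace period. The Python sketch below illustrates that general technique and is not a description of what dora does: `stop_with_grace` is a hypothetical helper, and the child process ignores SIGTERM to stand in for an operator that does not react to a stop request (POSIX only).

```python
import subprocess
import sys

# Child that ignores SIGTERM, standing in for a stuck operator
# (hypothetical stand-in; relies on POSIX signal semantics).
STUCK_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_grace(proc: subprocess.Popen, grace: float = 0.5) -> str:
    """Ask the process to stop, then kill it if it is still alive after `grace` seconds."""
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace)
        return "stopped gracefully"
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful stop (SIGKILL on POSIX)
        proc.wait()
        return "killed after grace period"


def demo() -> str:
    proc = subprocess.Popen([sys.executable, "-c", STUCK_CHILD], stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child has installed its signal handler
    result = stop_with_grace(proc)
    proc.stdout.close()
    return result


if __name__ == "__main__":
    print(demo())
```

The grace period gives a well-behaved process time to clean up, while the forced kill guarantees that the stop command eventually completes.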

@meua
Contributor Author

meua commented Jun 27, 2023

> Does this issue still happen on the latest version (i.e. with #271 merged)?

The problem still exists in v0.2.3: dora-cli is unresponsive.

(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora destroy
  2023-06-27T09:32:54.865121Z  WARN dora_daemon::node_communication: failed to send event to daemon

Location:
    /home/runner/work/dora/dora/binaries/daemon/src/node_communication/mod.rs:490:26
    at binaries/daemon/src/node_communication/mod.rs:253

  2023-06-27T09:32:54.865152Z  WARN dora_daemon::node_communication: failed to receive reply from daemon

Location:
    /home/runner/work/dora/dora/binaries/daemon/src/node_communication/mod.rs:494:30
    at binaries/daemon/src/node_communication/mod.rs:253

Send destroy command to dora-coordinator
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
   

To Reproduce
Steps to reproduce the behavior:

  1. Start the dora daemon: dora up
  2. Start a new dataflow: dora start graphs/tutorials/webcam_yolov5.yaml --attach

Screenshots or Video
image
WeCom screenshot (企业微信截图_16878598167164)

Environments (please complete the following information):

  • System info: ubuntu 20.04 LTS
  • Dora version: v0.2.3

You need to kill the coordinator and restart it to return to normal.
image


3 participants