Multi-node training hangs #565
Comments
+1, I run into the same problem. |
Same problem here, +1. It happens in an environment I set up myself. |
It is due to a deadlock in huggingface tokenizers. Can you follow the error message and set |
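A minimal sketch of the workaround hinted at above. The variable name is cut off in the comment, so this assumes it is `TOKENIZERS_PARALLELISM`, the environment variable that huggingface/tokenizers reads. It should run at the very top of the training script, before any tokenizer is used and before DataLoader workers are forked.

```python
import os

# Assumption: the truncated comment refers to TOKENIZERS_PARALLELISM,
# the environment variable read by huggingface/tokenizers.
# Set it before tokenizers is used and before any worker process forks,
# so that forked workers inherit the value.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Exporting the same variable in the launch script of every node should have the same effect, since child processes inherit the environment.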
RDMA is normally auto-enabled by NCCL. Can you check which part leads to the hang? |
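One way to check which transport NCCL actually picked is to turn on its debug logging. The sketch below only builds the environment variables; `nccl_debug_env` is a hypothetical helper name, and the variables must be set on every node before the process group is initialized. With `NCCL_DEBUG=INFO`, the log typically mentions `NET/IB` when InfiniBand/RoCE is in use and `NET/Socket` when NCCL falls back to TCP.

```python
import os


def nccl_debug_env(disable_ib: bool = False) -> dict:
    """Return NCCL environment variables for diagnosing the transport.

    Hypothetical helper, not part of this repository.
    """
    env = {
        "NCCL_DEBUG": "INFO",             # print NCCL initialization details
        "NCCL_DEBUG_SUBSYS": "INIT,NET",  # limit output to init and network
    }
    if disable_ib:
        # Force the socket transport to test whether the hang is IB-related.
        env["NCCL_IB_DISABLE"] = "1"
    return env


# Apply before torch.distributed initializes NCCL.
os.environ.update(nccl_debug_env())
```

Running once with `disable_ib=True` and once without can help narrow down whether the hang depends on the RDMA path at all.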
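After investigating, the likely cause is that one of the loaded video files has too many frames, or an image file is too large, so memory blows up. One of the dataloader processes gets killed, and the other nodes waiting on that node appear to hang. |
I suggest changing the code to read only the sampled frames of a video instead of loading all of them. |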
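The suggestion above can be sketched as follows, assuming OpenCV (`cv2`) is used for decoding. Both function names and their parameters are hypothetical and are not taken from this repository. The idea is to compute the frame indices first and then seek to each one, so only `num_frames` frames are ever held in memory.

```python
def sample_frame_indices(total_frames, num_frames, stride=1, start=0):
    """Indices of the frames to decode: num_frames frames, stride apart."""
    last = start + (num_frames - 1) * stride
    if num_frames <= 0 or last >= total_frames:
        raise ValueError("clip does not fit in the video")
    return [start + i * stride for i in range(num_frames)]


def read_sampled_frames(path, num_frames, stride=1, start=0):
    """Decode only the sampled frames instead of the whole video."""
    import cv2  # imported lazily so the index helper has no dependencies

    cap = cv2.VideoCapture(path)
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in sample_frame_indices(total, num_frames, stride, start):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the frame
            ok, frame = cap.read()
            if not ok:
                raise IOError(f"could not decode frame {idx} of {path}")
            frames.append(frame)  # BGR uint8 array of shape (H, W, 3)
        return frames
    finally:
        cap.release()
```

Seeking by frame index can be slow or slightly inexact for some codecs, so it is worth checking a few decoded clips against a full decode before relying on this.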
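Thanks. Is there any documentation describing the dataset sampling logic? This part of the code is not easy to follow, and it is hard to find where to make the change. |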
I think I have solved this issue by rewriting |
Moreover, I found that there are videos with more than 5k frames, which is too much for memory. Maybe some pre-cutting would also help with this issue. |
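As a rough illustration of the pre-cutting idea, the sketch below splits a long video into frame ranges of bounded length. The function name and the 2000-frame cap are assumptions for the example; the actual cutting (for instance with an external tool) is left out.

```python
def split_into_segments(total_frames, max_frames):
    """Return half-open (start, end) frame ranges, each at most max_frames long."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    return [
        (start, min(start + max_frames, total_frames))
        for start in range(0, total_frames, max_frames)
    ]


# A 5000-frame video cut into pieces of at most 2000 frames.
segments = split_into_segments(5000, 2000)
```

Each range can then be written out as its own file or stored as an offset in the dataset metadata, so no single sample requires decoding thousands of frames.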
Could you please share the code of read_video_cv2? I have run into a similar problem. |
Sorry, I still got a deadlock or dataloader issue with my function. |
Thanks, maybe you just need to filter out the very long videos when training the model. |
In fact, I have tried filtering out videos longer than 300 frames. It does help training run longer, but it still gets stuck. |
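For reference, the filtering step discussed here might look like the sketch below. It assumes the dataset metadata is a list of dicts with a `num_frames` field; the field names and the helper are hypothetical, and the 300-frame threshold is the one mentioned in the comment above.

```python
def filter_long_videos(samples, max_frames=300):
    """Keep only samples whose frame count does not exceed max_frames."""
    return [s for s in samples if s["num_frames"] <= max_frames]


# Hypothetical metadata entries for illustration only.
samples = [
    {"path": "a.mp4", "num_frames": 120},
    {"path": "b.mp4", "num_frames": 5200},
    {"path": "c.mp4", "num_frames": 300},
]
kept = filter_long_videos(samples)
```

As the thread notes, this only reduces memory pressure from long clips; it does not rule out other causes of the hang.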
maybe you could try these methods:
Of course, this is just my speculation; I think it is a memory problem. |
Thanks for your insight, but these are all workarounds. Eventually I will need to train on long videos. |
This issue is stale because it has been open for 7 days with no activity. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Have you solved this? I met the same issue with multi-node training. Thanks! |
Hello! What's your progress on solving this? I'm having the same issue. Thanks! |
In multi-node training there is no obvious error message, but nodes lose contact with the master. Also, how do I enable RDMA in version 1.2?
2024-06-30 18:08:30 huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
2024-06-30 18:08:30 To disable this warning, you can either:
2024-06-30 18:08:30 - Avoid using `tokenizers` before the fork if possible
2024-07-01 12:39:44 opensorav12-720p-22x8-12-worker-19:92:173 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer 10-201-8-92.opensorav12-720p-22x8-12-worker-15.default.svc.cluster.local<53744>