
Multi-node training hangs #565

Closed
nebuladream opened this issue Jul 1, 2024 · 20 comments
Labels: bug (Something isn't working), stale

Comments

@nebuladream
In a multi-node experiment, there is no obvious error, but worker nodes lose their connection to the master. Also, how do I enable RDMA in version 1.2?
```
2024-06-30 18:08:30  huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
2024-06-30 18:08:30  To disable this warning, you can either:
2024-06-30 18:08:30    - Avoid using tokenizers before the fork if possible
2024-06-30 18:08:30    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[the tokenizers warning above repeats several times]
2024-07-01 12:39:44  opensorav12-720p-22x8-12-worker-19:92:173 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer 10-201-8-92.opensorav12-720p-22x8-12-worker-15.default.svc.cluster.local<53744>
```
@CacacaLalala
+1, I run into the same problem: the code usually hangs after one to two hundred iterations.

@CIntellifusion
Same problem +1. It happens with the environment I set up myself, but the luchenyun image does not have this problem.

JThh added the bug label on Jul 8, 2024
@JThh (Collaborator) commented Jul 8, 2024

It is due to a deadlock caused by huggingface tokenizers. Can you follow the error message and set `export TOKENIZERS_PARALLELISM=false`?
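For reference, a minimal sketch of setting this from Python; the variable must be set before the tokenizers library is imported (the model name here is only an example, not the repo's actual tokenizer):

```python
import os

# Must be set before transformers/tokenizers is imported, otherwise forked
# DataLoader workers still inherit a tokenizer that already used parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer  # import only after the env var is set

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # example model only
```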

@FrankLeeeee (Contributor)

RDMA is normally auto-enabled by NCCL. Can you check which part leads to the hang?
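For what it's worth, a quick way to see whether NCCL picked the IB/RDMA transport, and where initialization stalls, is to raise NCCL's log level before the process group is created. A minimal sketch, assuming the script is launched with torchrun so rank and world size come from the environment; the env vars are standard NCCL settings:

```python
import os
import torch.distributed as dist

# Each rank will print which transport was selected during init:
# "NET/IB" in the logs means RDMA/InfiniBand is in use,
# "NET/Socket" means NCCL fell back to plain TCP.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"

dist.init_process_group(backend="nccl")
```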

@nebuladream (Author)

After some investigation, the likely cause is that a video with too many frames, or an oversized image, is occasionally loaded and exhausts host memory. One of the dataloader worker processes then gets OOM-killed, the other nodes keep waiting on that rank, and the job appears to hang.
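One way to make this failure mode surface as an error instead of an indefinite hang is to cap how long collectives may block. A minimal sketch using PyTorch's standard process-group timeout; the 30-minute value is only an example, and the exact env var needed may vary by PyTorch version:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Depending on the PyTorch version, async error handling may need to be
# enabled explicitly for the timeout to abort a hung NCCL collective.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# If one rank stalls (e.g., its dataloader worker was OOM-killed), the other
# ranks now raise after 30 minutes instead of blocking forever.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
```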

@nebuladream (Author)

Suggestion: change the code to read only the frames that are actually sampled from each video, instead of loading all of them.
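As an illustration, a minimal sketch of seeking to just the sampled frame indices with OpenCV instead of decoding the whole clip; the function name and sampling scheme are hypothetical, not the repo's actual read_video_cv2:

```python
import cv2
import numpy as np

def read_sampled_frames(path: str, indices: list[int]) -> np.ndarray:
    """Decode only the requested frame indices instead of the whole clip."""
    cap = cv2.VideoCapture(path)
    frames = []
    for idx in sorted(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek; avoids holding every frame
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)
```

Note that CAP_PROP_POS_FRAMES seeking can be slow or imprecise for some codecs; sequentially skipping unwanted frames with cap.grab() is a common alternative.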

@CIntellifusion

> Suggestion: change the code to read only the frames that are actually sampled from each video, instead of loading all of them.

Thanks. Is there any documentation on the dataset sampling logic? I find this part of the code hard to follow, and it is not obvious where to make the change.

@CIntellifusion

I think I have solved this issue by rewriting read_video_cv2.

@CIntellifusion
Moreover, I found videos with more than 5k frames, which is too much for memory. Some pre-cutting might also help with this issue.
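For example, a minimal sketch that scans a clip list and drops anything above a frame budget before training, so oversized videos can be filtered or pre-cut; the threshold and paths are hypothetical:

```python
import cv2

MAX_FRAMES = 300  # example budget; tune to available host memory

def too_long(path: str) -> bool:
    """Read the container's frame count without decoding any frames."""
    cap = cv2.VideoCapture(path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return n_frames > MAX_FRAMES

videos = ["clip_0001.mp4", "clip_0002.mp4"]  # hypothetical paths
videos = [p for p in videos if not too_long(p)]
```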

@leonardodora

> I think I have solved this issue by rewriting read_video_cv2.

Could you please share the code of read_video_cv2? I met a similar problem.

@CIntellifusion

> Could you please share the code of read_video_cv2? I met a similar problem.

Sorry, I still get a deadlock / dataloader issue with my function. Once everything works, I will share it. Or, hopefully, there will be an official PR by then.

@leonardodora

> Sorry, I still get a deadlock / dataloader issue with my function. Once everything works, I will share it. Or, hopefully, there will be an official PR by then.

Thanks. Maybe you just need to filter out the very long videos when training the model.


@CIntellifusion

> Thanks. Maybe you just need to filter out the very long videos when training the model.

In fact, I have tried filtering out videos longer than 300 frames. It does help the run last longer, but it still gets stuck eventually. My machine has 8×H100 with 1.0 TB of host memory; I worry that this is not enough.

@leonardodora commented Jul 18, 2024

> In fact, I have tried filtering out videos longer than 300 frames. It does help the run last longer, but it still gets stuck eventually. My machine has 8×H100 with 1.0 TB of host memory; I worry that this is not enough.

Maybe you could try these methods:

1. Filter out or downsample the high-resolution videos.
2. Decrease the video batch size.
3. Extract the VAE features (and T5 features) offline.

This is just my speculation, but I think it is a memory problem.

@CIntellifusion

Thanks for the insight, but these are all workarounds. In the end I will need to train on long videos.

github-actions bot

This issue is stale because it has been open for 7 days with no activity.

github-actions bot added the stale label on Sep 16, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 25, 2024
@ZeWang95
> +1, I run into the same problem: the code usually hangs after one to two hundred iterations.

Have you solved this? I met the same issue with multi-node training. Thanks!

@ZeWang95
> Thanks for the insight, but these are all workarounds. In the end I will need to train on long videos.

Hello! What's your progress on solving this? I'm having the same issue. Thanks!
