optimize() with num_workers > 1 leads to deletion issues #245
Comments
Some more evidence in another (rarer, flaky) test that uses num_workers=2: https://github.com/Lightning-AI/litdata/actions/runs/10013150667/job/27680138130
Hi, I also ran into a similar issue with …
I also found the issue happens when …
I'm also experiencing this issue.
This error goes away when I set my …
It seems I can bypass this error with …
For the second issue related to …: I think this is just a weird bug. We are using …; I don't think it has anything to do with num_workers. I faced this issue a couple of times, and from what I remember, it only used to fail on macOS. So, I added a couple of …
Same issue on my Ubuntu 16 server with num_workers=16. It doesn't always happen, and one way to work around it is to just rerun the code. Pseudocode:

from PIL import Image
import os
import litdata as ld

def process_patch(input_data):
    # Keep only patches whose center pixel maps to a foreground label.
    img_patch, mask_patch, color2label = input_data
    img_patch = img_patch.convert("RGB")
    mask_patch = mask_patch.convert("RGB")
    w, h = mask_patch.size
    pixel = mask_patch.getpixel((w // 2, h // 2))
    label_text = color2label.get(pixel, "BG")
    if label_text == "BG":
        return None
    label = list(color2label.keys()).index(pixel)
    return (img_patch, label)

# slide_ids, color2label, patch_size, stride_size, patch_dir and
# split_image_into_patches are defined elsewhere.
for slide_id in slide_ids:
    img_path = slide_id + "_HE.jpg"
    mask_path = slide_id + "_mask.jpg"
    img = Image.open(img_path)
    mask = Image.open(mask_path)
    img_patches = split_image_into_patches(img, patch_size, stride_size)
    mask_patches = split_image_into_patches(mask, patch_size, stride_size)
    # Use distinct loop variables to avoid shadowing img/mask above.
    input_data = [(ip, mp, color2label) for ip, mp in zip(img_patches, mask_patches)]
    ld.optimize(
        fn=process_patch,
        inputs=input_data,
        output_dir=os.path.join(patch_dir, slide_id),
        num_workers=min(os.cpu_count(), 16),
        mode="overwrite",
        compression="zstd",
        chunk_bytes="64MB",
    )
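If rerunning by hand gets tedious, a small retry wrapper is one way to automate that workaround. This is only a sketch, assuming the intermittent failure surfaces as a raised exception; optimize_with_retries, the max_retries count, and the backoff are my own additions, not part of the litdata API:

import time
import litdata as ld

def optimize_with_retries(max_retries=3, **optimize_kwargs):
    # Rerun ld.optimize() up to max_retries times, since the failure
    # is intermittent and a plain rerun often succeeds.
    for attempt in range(1, max_retries + 1):
        try:
            ld.optimize(**optimize_kwargs)
            return
        except Exception as exc:
            if attempt == max_retries:
                raise
            print(f"optimize() failed on attempt {attempt}: {exc}; retrying")
            time.sleep(attempt)  # simple linear backoff between reruns

It would be called with the same keyword arguments as the ld.optimize() call in the loop above.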
🐛 Bug
In the LitData tests, we only ever call optimize() with num_workers=1. In PR #237 I found that if optimize is called with more workers, we get what looks like a race condition causing some chunks to be deleted, after which streaming fails. #237 (comment)
This happens in this test: litdata/tests/streaming/test_dataset.py, line 826 at c58b673 (see the ToDo comments).
The test fails with … when setting optimize(num_workers=4). This needs to be investigated. However, it has not been possible to reproduce locally so far (only observed in CI)!
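For reference, a minimal sketch of the failing pattern described above, assuming a trivial optimize fn and a local output directory (the names, sizes, and paths are illustrative, not the actual test code):

import litdata as ld

def fn(index):
    # Trivial sample; the real test writes larger payloads.
    return {"index": index}

ld.optimize(
    fn=fn,
    inputs=list(range(100)),
    output_dir="/tmp/optimized_data",
    chunk_size=10,
    num_workers=4,  # passes with num_workers=1; >1 intermittently loses chunks
)

# Streaming the optimized dataset back is where the failure shows up
# if chunks were deleted during optimize().
dataset = ld.StreamingDataset("/tmp/optimized_data")
for _ in dataset:
    pass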