Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[attempt 2] Eager import pandas in ray data for python >= 3.7 #33103

Merged
merged 3 commits into from
Mar 7, 2023

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Mar 7, 2023

Why are these changes needed?

The original fix was reverted: #32533

Closes #32435

Signed-off-by: Eric Liang <[email protected]>
@jianoaix
Copy link
Contributor

jianoaix commented Mar 7, 2023

There seems a new failure of python/ray/data/tests/test_dataset_image.py.

@ericl
Copy link
Contributor Author

ericl commented Mar 7, 2023

It seems to be broken in master for me.

@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 7, 2023
@c21
Copy link
Contributor

c21 commented Mar 7, 2023

Weird, I didn't see the failure on https://flakey-tests.ray.io/, or on my split PR - #33101 . Could we rebase to latest master to be sure?

@jianoaix
Copy link
Contributor

jianoaix commented Mar 7, 2023

Yeah the one with python/ray/data/tests/test_dataset_image.py looks new:

ray_start_regular_shared = <python.ray.data.tests.test_dataset_image.TestReadImages object at 0x7fc69d1dca90>

    def test_dynamic_block_split(ray_start_regular_shared):
        ctx = ray.data.context.DatasetContext.get_current()
        target_max_block_size = ctx.target_max_block_size
        block_splitting_enabled = ctx.block_splitting_enabled
        # Reduce target max block size to trigger block splitting on small input.
        # Otherwise we have to generate big input files, which is unnecessary.
        ctx.target_max_block_size = 1
        ctx.block_splitting_enabled = True
        try:
            root = "example://image-datasets/simple"
            ds = ray.data.read_images(root, parallelism=1)
            assert ds.num_blocks() == 1
            ds.fully_executed()
            # Verify dynamic block splitting taking effect to generate more blocks.
            assert ds.num_blocks() == 3
    
            # NOTE: Need to wait for 1 second before checking stats, because we report
            # stats to stats actors asynchronously when returning the blocks metadata.
            # TODO(chengsu): clean it up after refactoring lazy block list.
            time.sleep(1)
>           assert "3 blocks executed" in ds.stats()
E           AssertionError: assert '3 blocks executed' in 'Stage 0 read: 1/1 blocks split from parent in 0s\n* Output num rows: 3 min, 3 max, 3 mean, 3 total\n* Output size bytes: 2724 min, 2724 max, 2724 mean, 2724 total\n'
E            +  where 'Stage 0 read: 1/1 blocks split from parent in 0s\n* Output num rows: 3 min, 3 max, 3 mean, 3 total\n* Output size bytes: 2724 min, 2724 max, 2724 mean, 2724 total\n' = <bound method Dataset.stats of Dataset(num_blocks=3, num_rows=3, schema={image: ArrowTensorType(shape=(32, 32, 3), dtype=uint8)})>()
E            +    where <bound method Dataset.stats of Dataset(num_blocks=3, num_rows=3, schema={image: ArrowTensorType(shape=(32, 32, 3), dtype=uint8)})> = Dataset(num_blocks=3, num_rows=3, schema={image: ArrowTensorType(shape=(32, 32, 3), dtype=uint8)}).stats

python/ray/data/tests/test_dataset_image.py:201: AssertionError

@c21
Copy link
Contributor

c21 commented Mar 7, 2023

ah I guess in this case, the eager import causes longer execution time for test, so sleep for 1 second is not enough:

# NOTE: Need to wait for 1 second before checking stats, because we report
# stats to stats actors asynchronously when returning the blocks metadata.
# TODO(chengsu): clean it up after refactoring lazy block list.
time.sleep(1)

@ericl - could you try increasing the sleep time, e.g. 5 seconds?

Signed-off-by: Eric Liang <[email protected]>
@ericl
Copy link
Contributor Author

ericl commented Mar 7, 2023

I changed it to a wait for condition, but I think the failure is not related. I can reliably reproduce on master

@ericl ericl merged commit 1ba4729 into ray-project:master Mar 7, 2023
@c21
Copy link
Contributor

c21 commented Mar 7, 2023

I changed it to a wait for condition, but I think the failure is not related. I can reliably reproduce on master

I guess one possible reason might be due to different laptop environment, so some of us can reproduce locally but not all.

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] [data] Pandas has no attribute 'core' likely due to race with import thread
4 participants