Revert "Revert "[Core] Support Arrow zerocopy serialization in object… #36153

Open
wants to merge 33 commits into base: master

Conversation

Deegue
Contributor

@Deegue Deegue commented Jun 7, 2023

… store (#35110)" (#36000)"

This reverts commit 822904b.

Why are these changes needed?

The previous PR was reverted because test_advanced_9 failed. However, I ran test_advanced_9 locally and it passed.

image

This assert failure looks unrelated to this PR.

import numpy  # noqa: F401
from threadpoolctl import threadpool_info
for pool_info in threadpool_info():
    assert pool_info["num_threads"] == 2

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Deegue
Contributor Author

Deegue commented Jun 7, 2023

It's weird that test_advanced_9 (Medium A-J) failed in CI but passed on my local machine.

@Deegue
Contributor Author

Deegue commented Jun 12, 2023

Gentle ping @ericl @rkooo567 for help, thanks~

@ericl
Contributor

ericl commented Jun 12, 2023

Not really sure, but perhaps try disabling parts of your PR to see what is causing the issue? It may be worth adding more debug logs for where OMP_NUM_THREADS is being set as well.
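
One way to add the suggested debug logging is to route environment writes through a small helper that records the old value, the new value, and the call site. This is a minimal sketch using only the standard library; `set_env_with_trace` and the `omp_debug` logger name are hypothetical and are not part of Ray.

```python
import logging
import os
import traceback

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("omp_debug")


def set_env_with_trace(name, value, environ=os.environ):
    """Set an environment variable and log who set it.

    Returns the previous value, or None if the variable was unset.
    """
    old = environ.get(name)
    # With limit=2 the list holds [caller frame, this frame].
    caller = traceback.extract_stack(limit=2)[0]
    logger.warning(
        "%s: %r -> %r (set at %s:%d in %s)",
        name, old, str(value), caller.filename, caller.lineno, caller.name,
    )
    environ[name] = str(value)
    return old
```

Calling `set_env_with_trace("OMP_NUM_THREADS", 2)` wherever the variable is assigned would show in the worker logs which code path set it and in what order.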

@rickyyx rickyyx self-assigned this Jun 27, 2023
@@ -420,7 +420,7 @@ def test_omp_threads_set_third_party(ray_start_cluster, monkeypatch):
     cluster.add_node(num_cpus=4)
     ray.init(address=cluster.address)
 
-    @ray.remote(num_cpus=2)
+    @ray.remote(num_cpus=1)
Contributor

I guess this is why the test broke?

We use a task's num_cpus as its OMP_NUM_THREADS, so why was this changed?

Contributor

Another possible reason that it might break is if any new import introduced in the PR eagerly imports numpy.
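
A quick way to test this hypothesis is to import the suspect module in a fresh interpreter and check whether the other module ended up in `sys.modules`. The sketch below uses only the standard library; the helper name `module_loaded_after_import` is hypothetical.

```python
import subprocess
import sys


def module_loaded_after_import(imported: str, target: str) -> bool:
    """Return True if `target` is in sys.modules after importing `imported`.

    The import runs in a child interpreter so that modules already
    loaded in the current process do not affect the answer.
    """
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({imported!r})\n"
        f"print({target!r} in sys.modules)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last line in case the imported module prints on import.
    return result.stdout.strip().splitlines()[-1] == "True"
```

For this PR, the relevant check would be `module_loaded_after_import("pyarrow", "numpy")`, run in the same environment as the failing CI job.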

Contributor Author

Thanks for the review. This change was only for testing, and it does not affect the result; I verified that earlier.

This PR imports pyarrow, but I cannot see how that relates to the failure. Do you know where the 4 in the failed assertion comes from?

Contributor

likely the 4 comes from the num_cpus passed into init.

Contributor

Looks like it's fixed on the current HEAD?

Contributor Author

Yes. It was a bit tricky: the failure went away once I moved import pyarrow into functions.
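
The pattern of deferring an import into a function body can be sketched as follows. The standard-library `statistics` module stands in for a heavy dependency such as pyarrow, and `mean_with_lazy_import` is a hypothetical name; this is an illustration of the pattern, not the code from this PR.

```python
import importlib

# Stand-in for a heavy optional dependency (an assumption for this sketch).
HEAVY_MODULE = "statistics"


def mean_with_lazy_import(values):
    """Compute a mean, importing the heavy module only when called."""
    # The import happens on the first call, not when this file is imported,
    # so merely importing this file does not load the dependency.
    heavy = importlib.import_module(HEAVY_MODULE)
    return heavy.mean(values)
```

With this layout, anything the dependency loads at import time stays out of the process until the function is first called.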

Contributor Author

Some other tests failed, and I am not sure whether these are related.

Contributor

you could ignore the metric heads test - these should be fixed in master btw.

The arrow test failure looks relevant?

Contributor Author

@Deegue Deegue Jul 10, 2023

Got it. I just found a WorkerCrashedError in the Arrow 6 test, but there seem to be no details.

@rickyyx rickyyx added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Jul 6, 2023
@Deegue
Contributor Author

Deegue commented Jul 12, 2023

> Could we also tag other reviews who have context on the original PR for review? @Deegue

> And let me know if there's anything I can help or I don't think I am the best person to approve it.

OK, here is the original PR #35110, and a gentle ping to @ericl for review, since the previously failing test now passes.

Thanks again @rickyyx. Could you help find why the Arrow 6 test fails? I'm not sure whether it is caused by import pyarrow, since there is no stack trace.

@Deegue
Contributor Author

Deegue commented Jul 12, 2023

> can you let me know what were the major changes from the PR that was merged?

Thanks for the comment. The major change is using Arrow serialization for object store reads and writes, which is zero-copy and improves performance.
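
To illustrate what zero-copy means here in general terms, the sketch below contrasts a view over a buffer with a copy of it, using Python's built-in `memoryview`. It does not use Arrow or the Ray object store; the `bytearray` is only a stand-in for a shared-memory region, and both function names are hypothetical.

```python
def read_zero_copy(buffer: bytearray, start: int, stop: int) -> memoryview:
    """Return a view over buffer[start:stop] without copying any bytes."""
    return memoryview(buffer)[start:stop]


def read_with_copy(buffer: bytearray, start: int, stop: int) -> bytes:
    """Return an independent copy of buffer[start:stop]."""
    return bytes(buffer[start:stop])


# Stand-in for a shared-memory region (an assumption for this sketch).
store = bytearray(16)
view = read_zero_copy(store, 4, 12)
copy = read_with_copy(store, 4, 12)

# A later write to the underlying buffer is visible through the view,
# because the view shares memory with it, but not through the copy.
store[4] = 0xFF
```

The view costs the same regardless of payload size, while the copy scales with the number of bytes read, which is the source of the performance benefit claimed above.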

@rickyyx
Contributor

rickyyx commented Jul 12, 2023

I got this stacktrace from one of the workers in the failed test for arrow 6

*** SIGSEGV received at time=1689148296 on cpu 1 ***
PC: @     0x7fdda5605b0c  (unknown)  arrow::(anonymous namespace)::NullArrayFactory::CreateChild()
    @     0x7fdde91d81d5        208  absl::lts_20220623::WriteFailureInfo()
    @     0x7fdde91d7f18         64  absl::lts_20220623::AbslFailureSignalHandler()
    @     0x7fddea56c420  (unknown)  (unknown)
    @     0x7fddea2a45f3        160  (unknown)
    @     0x7fdda56064a7        208  arrow::VisitTypeInline<>()
    @ ... and at least 1 more frames
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361: *** SIGSEGV received at time=1689148296 on cpu 1 ***
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361: PC: @     0x7fdda5605b0c  (unknown)  arrow::(anonymous namespace)::NullArrayFactory::CreateChild()
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361:     @     0x7fdde91d81d5        208  absl::lts_20220623::WriteFailureInfo()
[2023-07-12 07:51:36,710 E 19106 19106] logging.cc:361:     @     0x7fdde91d7f31         64  absl::lts_20220623::AbslFailureSignalHandler()
[2023-07-12 07:51:36,710 E 19106 19106] logging.cc:361:     @     0x7fddea56c420  (unknown)  (unknown)
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361:     @     0x7fddea2a45f3        160  (unknown)
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361:     @     0x7fdda56064a7        208  arrow::VisitTypeInline<>()
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361:     @ ... and at least 1 more frames
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/opt/miniconda/lib/python3.8/site-packages/pyarrow/compute.py", line 625 in take
  File "/ray/python/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 41 in take_table
  File "/ray/python/ray/data/_internal/arrow_block.py", line 320 in take
  File "/ray/python/ray/data/_internal/arrow_block.py", line 220 in random_shuffle
  File "/ray/python/ray/data/_internal/shuffle_and_partition.py", line 88 in reduce
  File "/ray/python/ray/_private/worker.py", line 777 in main_loop
  File "/ray/python/ray/_private/workers/default_worker.py", line 262 in <module>


@kira-lin
Contributor

@rickyyx I'm helping to fix the CI. The Arrow-related tests now pass. The other failures seem unrelated?

@kira-lin
Contributor

Hi @ericl @rkooo567 , please review again, thanks

@ericl
Contributor

ericl commented Jul 31, 2023

Core tests look good now. Can you rebase with master to rerun the Java tests? It looks like

java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/util/JacksonFeature

is a potential dependency issue.

@Deegue Deegue requested a review from SongGuyang as a code owner August 8, 2023 03:23
@kira-lin
Contributor

Hi @jovany-wang,
Do you have any idea about this NoClassDefFoundError? We only added a few Arrow dependencies, so I don't see how this error is related to this PR. Thanks for your help.

@stale

stale bot commented Sep 16, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 16, 2023
@rkooo567 rkooo567 removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 3, 2023

stale bot commented Dec 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 15, 2023
@stale stale bot removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Dec 25, 2023
5 participants