Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update arrow version to fix plasma bugs #4127

Merged
merged 5 commits into from
Mar 8, 2019

Conversation

guoyuhong
Copy link
Contributor

What do these changes do?

This PR updates arrow version to b22848952f09d6f9487feaff80ee358ca41b1562 which contains apache/arrow#3656 and apache/arrow#3682 and this PR will fix 2 TODOs from Plasma Java Client.

Related issue number

N/A

Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robertnishihara
Copy link
Collaborator

@guoyuhong Did this work for you? @pcmoritz is also updating the Arrow version in #3898 and it's gotten a lot more complex because the Arrow build system has changed.

@guoyuhong
Copy link
Contributor Author

@pcmoritz I rebased the original ref branch arrow-one-mmap-file to the new base of b22848952f09d6f9487feaff80ee358ca41b1562 and resolve the conflicts and push it to arrow-one-mmap-file-on-b228489. Is the processes right?

@guoyuhong
Copy link
Contributor Author

@robertnishihara His PR seems not update the Arrow version. Do you mean that we hold the PR until that PR gets merged?

@guoyuhong
Copy link
Contributor Author

@robertnishihara That PR seems to use Arrow directly from OSS...

@robertnishihara
Copy link
Collaborator

@guoyuhong I believe it uses Arrow from an S3 bucket that we host. We have to separately build Arrow using https://github.com/ray-project/arrow-build I think, which uploads to the S3 bucket and then we can get it from the bucket. This is done to address issues @pcmoritz ran into when trying to get the Bazel build to work with Arrow.

@guoyuhong
Copy link
Contributor Author

@robertnishihara When will @pcmoritz 's PR get merged? This PR contains 2 fixes along with the Plasma Java Client, so this PR cannot be closed.

@robertnishihara
Copy link
Collaborator

@guoyuhong I'm hoping to merge @pcmoritz's PR in the morning (in about 8 hours). It's basically ready, but there were a couple small changes that needed to be made.

@guoyuhong
Copy link
Contributor Author

@robertnishihara Thanks for the information. Let's hold this PR until then.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12224/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

@robertnishihara Considering that the Bazel PR is not merged, I think I can send a PR to update https://github.com/ray-project/arrow-build 's Arrow version to 54618466ca04f4427dd8cfc3ce31b14ec54e8999 which is the same as this PR and generate the new wheel files. Then, @pcmoritz 's PR will compatible with this PR using the new wheel files.

@pcmoritz
Copy link
Contributor

@guoyuhong Thanks for looking into this, see my comment in ray-project/arrow-build#1

I have now a plan how we can go forward on #4129 and make sure the latest arrow features are available and the clashes with tensorflow are removed. It involves building an arrow wheel that uses ubuntu 14.04 as a base (the same as tensorflow) and I'll prioritize this first (it will allow us to remove CMake) and can help you with this PR after.

@guoyuhong
Copy link
Contributor Author

@pcmoritz #4182 had been merged. How about this PR? Since Java building still goes the cmake way, we need to update is correspondingly. It is better to keep the version of cmake and bazel version the same.

@robertnishihara
Copy link
Collaborator

@guoyuhong, this PR updates the version of Arrow used by cmake to the commit b22848952f09d6f9487feaff80ee358ca41b1562 which is identical to a commit in the Arrow repository. That means that it does not have the patch ray-project/arrow#3, which was needed to make Ray and TensorFlow compatible.

The tests are passing in this PR probably because when they ran, we were using conda install tensorflow in the Travis tests. However, the current master uses pip install tensorflow in the Travis tests and so I think it may fail (I rebased the PR to check).

If we merge this, we need to use a commit that includes the patch ray-project/arrow#3.

@guoyuhong
Copy link
Contributor Author

@robertnishihara Sorry for the expression. What I mean is that this PR will update to the version that ray-project/arrow#3 rebased on
b22848952f09d6f9487feaff80ee358ca41b1562. And the commit is 54618466ca04f4427dd8cfc3ce31b14ec54e8999.

@guoyuhong
Copy link
Contributor Author

I will do a rebase for this PR to current master.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12408/
Test FAILed.

@robertnishihara
Copy link
Collaborator

Oh, I apologize, I meant to check the commit from the PR but I guess I checked the one from the commit message. My mistake!

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12410/
Test FAILed.

@pcmoritz
Copy link
Contributor

pcmoritz commented Mar 1, 2019

Great @guoyuhong thanks for the PR! If it passes on the rebase, let's include it in the point release we are about to finish! Let's however hold off merging big PRs that touch the backend before the release candidate is drafted (besides #4154).

@guoyuhong
Copy link
Contributor Author

There is build error in Arrow CMake and it is fixed by apache/arrow#3737 . I need to rebase the commit to this commit.

@robertnishihara
Copy link
Collaborator

The std::invalid_argument might be a new error.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12439/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

This PR may help to resolve this problem. apache/arrow#3758

@guoyuhong
Copy link
Contributor Author

I changed the commit to ray-project/arrow@68299c5 of branch arrow-one-mmap-file-on-c0a2e73 which is based on the cmake fix of apache/arrow#3758 .

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12457/
Test FAILed.

@guoyuhong
Copy link
Contributor Author

guoyuhong commented Mar 2, 2019

@ericl Since apache/arrow#3423 and the fix apache/arrow#3758 have been merged, the arrow serialization is different. There is a checked-in file https://github.com/ray-project/ray/blob/master/python/ray/rllib/test/data/cartpole_small/output-2019-02-03_20-27-20_worker-0_0.json which is used in Jenkins test but throws exception during deserialization when I try to update arrow version. All the Travis tests passed. Could you help to update that? CC: @pcmoritz
The deserialization code tries to parse ndarray to an int8_t value. Maybe the content of the json file should be updated.

@guoyuhong
Copy link
Contributor Author

@ericl Ping... Could you help to take a look?

@ericl
Copy link
Contributor

ericl commented Mar 6, 2019

I see, feel free to disable the test. I can fix it in a follow-up PR.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12597/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12606/
Test FAILed.

@@ -353,8 +355,10 @@ docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
/ray/python/ray/rllib/tests/run_silent.sh examples/cartpole_lstm.py --stop=200 --use-prev-action-reward

docker run --rm --shm-size=${SHM_SIZE} --memory=${MEMORY_SIZE} $DOCKER_SHA \
/ray/python/ray/rllib/tests/run_silent.sh examples/custom_loss.py --iters=2
# TODO(ericl): reenable the test after fix the arrow serialization error.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl I disabled 3 cases in this script and left a TODO comment for you. Now the Jenkins test has no such failures.

@guoyuhong
Copy link
Contributor Author

Hi all, I have rebased the PR and disable the Jenkins cases that caused by the Arrow serialization change which @ericl promised to fix later. Please help to review.

Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robertnishihara
Copy link
Collaborator

Looks fine to me assuming tests pass. There is some merge conflict.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12643/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12660/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12687/
Test FAILed.

@guoyuhong guoyuhong merged commit d5fb7b7 into ray-project:master Mar 8, 2019
@guoyuhong guoyuhong deleted the updateArrow branch March 8, 2019 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants