Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Add MongoDB as a data source #28550

Merged
merged 66 commits into from
Nov 14, 2022
Merged

Conversation

jjyao
Copy link
Collaborator

@jjyao jjyao commented Sep 15, 2022

Signed-off-by: jianoaix [email protected]
Signed-off-by: Jiajun Yao [email protected]

Why are these changes needed?

MongoDB is the 5th most popular database (and the most popular non-SQL db) per https://db-engines.com/en/ranking.
We have user who's using MongoDB for batch prediction -- and because of no connector to MongoDB, they have to build batch prediction on Ray Core directly, which is much more demanding (than e.g. build on Datasets).

Related issue number

Closes #28874

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

shawnpanda and others added 7 commits September 15, 2022 15:21
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
@jjyao
Copy link
Collaborator Author

jjyao commented Sep 19, 2022

cc @shawnpanda @jianoaix

Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
@jianoaix jianoaix changed the title [Dataset] Added mongodb as a data source [Datasets] Add MongoDB as a data source Sep 29, 2022
@jianoaix jianoaix marked this pull request as ready for review September 29, 2022 20:30
@jianoaix
Copy link
Contributor

jianoaix commented Oct 7, 2022

Thank you @krfricke for helping me understand CI, please take a look if that's a proper way to handle this case, thanks!

Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
.buildkite/pipeline.ml.yml Outdated Show resolved Hide resolved
Signed-off-by: jianoaix <[email protected]>
@jianoaix
Copy link
Contributor

jianoaix commented Oct 8, 2022

FYI, this PR is now just blocked by pyarrow 9.0 upgrade (#29161), which is needed for features we implemented in this PR that are based on pymongoarrow.

Technically we can use pymongoarrow 0.2.0 (released in Jan 6, 2022) which only required pyarrow 6.0, but that version had very small set of features (e.g. cannot even support string type, cannot write to Mongo etc), so we will have to trim down this PR significantly into something not really useful.

Btw, the failing CI test of test_mongo_dataset is also due to pyarrow version.

@jianoaix jianoaix added the do-not-merge Do not merge this PR! label Oct 8, 2022
@stale
Copy link

stale bot commented Nov 7, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 7, 2022
@jjyao jjyao removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 7, 2022
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Signed-off-by: jianoaix <[email protected]>
@jianoaix jianoaix removed the do-not-merge Do not merge this PR! label Nov 10, 2022
@jianoaix
Copy link
Contributor

The tests passing. This is ready to review/merge. Thanks.

Copy link
Collaborator Author

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved! (Github doesn't allow me to approve my own PR).

schema: Optional["pymongoarrow.api.Schema"] = None,
parallelism: int = -1,
ray_remote_args: Dict[str, Any] = None,
**mongo_args,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we have name conflicts in general, what if mongo_args also has a parallelism arg? Should we accept mongo_args: Dict[str, Any]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pymongoarrow will fail it since it doesn' have this arg.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can unpack it when we pass to pymongoarrow.

@richardliaw richardliaw merged commit 9837ee2 into ray-project:master Nov 14, 2022
@c21
Copy link
Contributor

c21 commented Nov 14, 2022

@jianoaix - as followup for this PR, do you want to add the user documentation for this MongoDB data source? Example for read_tfrecords in https://github.com/ray-project/ray/pull/28430/files .

@jjyao jjyao deleted the jjyao/mongo branch November 14, 2022 19:37
@jianoaix
Copy link
Contributor

Yep, already plan to add.

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Co-authored-by: Shawn Pan <[email protected]>
Co-authored-by: jianoaix <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Support MongoDB as a datasource for Dataset
8 participants