[Datasets] Add MongoDB as a data source #28550
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: jianoaix <[email protected]>
Thank you @krfricke for helping me understand CI. Please take a look at whether this is a proper way to handle this case, thanks!
FYI, this PR is now blocked only by the pyarrow 9.0 upgrade (#29161), which is needed for the features in this PR that are based on pymongoarrow. Technically we could use pymongoarrow 0.2.0 (released on Jan 6, 2022), which only requires pyarrow 6.0, but that version has a very small set of features (e.g. it does not even support the string type and cannot write to Mongo), so we would have to trim this PR down significantly into something not really useful. Btw, the failing CI test test_mongo_dataset is also due to the pyarrow version.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
The tests are passing. This is ready to review/merge. Thanks.
Approved! (GitHub doesn't allow me to approve my own PR.)
schema: Optional["pymongoarrow.api.Schema"] = None,
parallelism: int = -1,
ray_remote_args: Dict[str, Any] = None,
**mongo_args,
How do we handle name conflicts in general? What if mongo_args also has a parallelism arg? Should we accept mongo_args: Dict[str, Any] instead?
pymongoarrow will fail in that case, since it doesn't have this arg.
We can unpack it when we pass to pymongoarrow.
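To make the trade-off in this thread concrete, here is a minimal sketch of the two signatures being discussed. It does not use pymongoarrow: `fake_find_arrow_all`, `read_with_kwargs`, and `read_with_dict` are hypothetical names invented for illustration, and the set of accepted backend keywords is an assumption. It only shows how Python binds a conflicting `parallelism` name in each style.

```python
from typing import Any, Dict, Optional


def fake_find_arrow_all(collection: str, query: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the backend read call; it only echoes its inputs."""
    allowed = {"projection", "limit"}  # assumed keyword set, for illustration only
    unknown = set(kwargs) - allowed
    if unknown:
        raise TypeError(f"unexpected keyword argument(s): {sorted(unknown)}")
    return {"collection": collection, "query": query, **kwargs}


def read_with_kwargs(collection: str, parallelism: int = -1, **mongo_args: Any) -> Dict[str, Any]:
    # A caller's `parallelism=` always binds to the named parameter above,
    # so it can never end up inside mongo_args.
    return fake_find_arrow_all(collection, {}, **mongo_args)


def read_with_dict(
    collection: str,
    parallelism: int = -1,
    mongo_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # The dict is unpacked only at the point where the backend is called,
    # so a stray key is rejected by the backend rather than by this wrapper.
    return fake_find_arrow_all(collection, {}, **(mongo_args or {}))


# Both styles forward legitimate backend options the same way.
print(read_with_kwargs("c", parallelism=4, limit=10))
print(read_with_dict("c", parallelism=4, mongo_args={"limit": 10}))
```

With `**mongo_args`, a conflicting name is resolved by Python in favor of the wrapper's own parameter. With an explicit dict, a `"parallelism"` key would reach the stand-in backend and raise `TypeError` there, which matches the behavior described in the replies above.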
@jianoaix - as a follow-up to this PR, do you want to add the user documentation for this MongoDB data source? Example for
Yep, I already plan to add it.
Co-authored-by: Shawn Pan <[email protected]> Co-authored-by: jianoaix <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: jianoaix [email protected]
Signed-off-by: Jiajun Yao [email protected]
Why are these changes needed?
MongoDB is the 5th most popular database (and the most popular non-SQL db) per https://db-engines.com/en/ranking.
We have a user who is using MongoDB for batch prediction. Because there is no connector to MongoDB, they have to build batch prediction on Ray Core directly, which is much more demanding than building on Datasets.
Related issue number
Closes #28874
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.