Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MongoToS3Operator failed when running with a single query (not aggregate pipeline) #15680

Conversation

amatellanes
Copy link
Contributor

Remove invalid allowDiskUse argument when calling find. This method does not expect that argument according to official documentation here.

closes #15679

@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels May 5, 2021
@amatellanes amatellanes force-pushed the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch from 992fd32 to fc14342 Compare May 5, 2021 17:36
@uranusjr
Copy link
Member

uranusjr commented May 5, 2021

Makes sense, allowDiskUse is an aggregate-only option.

@@ -117,7 +117,6 @@ def execute(self, context) -> bool:
mongo_collection=self.mongo_collection,
query=cast(dict, self.mongo_query),
mongo_db=self.mongo_db,
allowDiskUse=self.allow_disk_use,
Copy link
Contributor

@xinbinhuang xinbinhuang May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this is an interesting one. The .find method seems to supports the argument but in snake case i.e. allow_disk_use.
I'm not familiar with Mongo. Can you double-check if allow_disk_use is supported in both pymongo and MongoDB itself?

Copy link
Contributor

@xinbinhuang xinbinhuang May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's added into the .find method in MongoDB version 4.4 and pymongo 3.11

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allow_disk_use argument in .find() maps to MongoDB’s cursor.allowDiskUse, while .aggregate()’s allowDiskUse corresponds to allowDiskUse in the aggregation pipeline. I’m honestly not familiar with cursor.allowDiskUse (in fact I didn’t know it existed until today), but from the documentation the two are quite different.

I think whether we should set find(allow_disk_use=True) depends on what we want MongoToS3Operator.allow_disk_use to mean. The docstring says

allow_disk_use: in the case you are retrieving a lot of data, you may have to use the disk to save it instead of saving all in the RAM

which seems to indicate it probably makes sense to set find(allow_disk_use=True) from it. But then the question becomes how we can pass it only to MongoDB (not pymongo!) 4.4+ (released in July 2020) because it would crash on earlier versions.

Copy link
Contributor

@xinbinhuang xinbinhuang May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think whether we should set find(allow_disk_use=True) depends on what we want MongoToS3Operator.allow_disk_use to mean.

Though the wording seems quite different, they seem to achieve the same thing: use temporary files to store data when it exceeds certain memory limits (both 100 MB RAM) - one is for aggregation stage while the other is for blocking sort operation. (internally, I guess they are pretty similar) But again, I'm not familiar with Mongo enough to be confident in what I said. It will be nice if someone already using Mongo can verify this.

But then the question becomes how we can pass it only to MongoDB (not pymongo!) 4.4+ (released in July 2020) because it would crash on earlier versions.

I think a note in the docstring saying that it requires MongoDB 4.4+ for .find and a link to the doc should be enough. It's neither sensible for us to control the version of an external system nor that we can. IMO, the users should know and control what MongoDB version they are using.

Copy link
Contributor

@xinbinhuang xinbinhuang May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our side ("client"), we can do something like:

  1. Not ideal, but this works:
find_method = partial(MongoHook(self.mongo_conn_id).find, 
                mongo_collection=self.mongo_collection,
                query=cast(dict, self.mongo_query),
                mongo_db=self.mongo_db)

if self.allow_disk_use:
    results = find_method(allow_disk_use=self.allow_disk_use)
else:
    results = find_method()
  1. Or inspect the version of pymongo, then decide or what to do with the allow_disk_use parameter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amatellanes I'm ok with just removes it for now, and we can think about how to add it back in another PR in the future when someone requests it. Or we can handle the branching logic in this PR so that allow_disk_use can play with single query as well.

If you wanna go with the former, please update the docstring to reflect the fact that allow_disk_use doesn't apply to single query (.find).

Copy link
Member

@uranusjr uranusjr May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a note in the docstring saying that it requires MongoDB 4.4+ for .find and a link to the doc should be enough. It's neither sensible for us to control the version of an external system nor that we can. IMO, the users should know and control what MongoDB version they are using.

The issue is aggregate(allowDiskUse=True) does work prior to MongoDB 4.4, and users running older MongoDB setups may still want to use that. If we always also pass the argument to find(), they will have to create two MongoHook instances, one with allow_disk_use (for aggregate()) and one without (for find()), which doesn’t feel like good interface design to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why two MongoHook instances? I think our discussion only applies to the .find, and the .aggregate will stay as it's.

Copy link
Member

@uranusjr uranusjr May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I misread and thought allow_disk_use were a property on MongoHook (it’s on MongoToS3Operator instead). Nevermind, I think your proposed solution above would work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xinbinhuang I've just updated the docstring as you suggested.

@amatellanes amatellanes force-pushed the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch from fc14342 to 1385387 Compare May 9, 2021 13:27
@amatellanes amatellanes force-pushed the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch from 1385387 to e91d698 Compare May 9, 2021 14:08
@xinbinhuang xinbinhuang merged commit dab10d9 into apache:master May 10, 2021
@xinbinhuang
Copy link
Contributor

Thank you for the contribution!

@amatellanes amatellanes deleted the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch May 10, 2021 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MongoToS3Operator failed when running with a single query (not aggregate pipeline)
3 participants