MongoToS3Operator failed when running with a single query (not aggregate pipeline) #15680

amatellanes · 2021-05-05T17:14:15Z

Remove invalid allowDiskUse argument when calling find. This method does not expect that argument according to official documentation here.

closes #15679

uranusjr · 2021-05-05T17:39:15Z

Makes sense, allowDiskUse is an aggregate-only option.

xinbinhuang · 2021-05-06T02:46:34Z

airflow/providers/amazon/aws/transfers/mongo_to_s3.py

@@ -117,7 +117,6 @@ def execute(self, context) -> bool:
                mongo_collection=self.mongo_collection,
                query=cast(dict, self.mongo_query),
                mongo_db=self.mongo_db,
-                allowDiskUse=self.allow_disk_use,


hmm, this is an interesting one. The .find method seems to supports the argument but in snake case i.e. allow_disk_use.
I'm not familiar with Mongo. Can you double-check if allow_disk_use is supported in both pymongo and MongoDB itself?

It's added into the .find method in MongoDB version 4.4 and pymongo 3.11

The allow_disk_use argument in .find() maps to MongoDB’s cursor.allowDiskUse, while .aggregate()’s allowDiskUse corresponds to allowDiskUse in the aggregation pipeline. I’m honestly not familiar with cursor.allowDiskUse (in fact I didn’t know it existed until today), but from the documentation the two are quite different.

I think whether we should set find(allow_disk_use=True) depends on what we want MongoToS3Operator.allow_disk_use to mean. The docstring says

allow_disk_use: in the case you are retrieving a lot of data, you may have to use the disk to save it instead of saving all in the RAM

which seems to indicate it probably makes sense to set find(allow_disk_use=True) from it. But then the question becomes how we can pass it only to MongoDB (not pymongo!) 4.4+ (released in July 2020) because it would crash on earlier versions.

I think whether we should set find(allow_disk_use=True) depends on what we want MongoToS3Operator.allow_disk_use to mean.

Though the wording seems quite different, they seem to achieve the same thing: use temporary files to store data when it exceeds certain memory limits (both 100 MB RAM) - one is for aggregation stage while the other is for blocking sort operation. (internally, I guess they are pretty similar) But again, I'm not familiar with Mongo enough to be confident in what I said. It will be nice if someone already using Mongo can verify this.

But then the question becomes how we can pass it only to MongoDB (not pymongo!) 4.4+ (released in July 2020) because it would crash on earlier versions.

I think a note in the docstring saying that it requires MongoDB 4.4+ for .find and a link to the doc should be enough. It's neither sensible for us to control the version of an external system nor that we can. IMO, the users should know and control what MongoDB version they are using.

From our side ("client"), we can do something like:

Not ideal, but this works:

find_method = partial(MongoHook(self.mongo_conn_id).find, mongo_collection=self.mongo_collection, query=cast(dict, self.mongo_query), mongo_db=self.mongo_db) if self.allow_disk_use: results = find_method(allow_disk_use=self.allow_disk_use) else: results = find_method()

Or inspect the version of pymongo, then decide or what to do with the allow_disk_use parameter

@amatellanes I'm ok with just removes it for now, and we can think about how to add it back in another PR in the future when someone requests it. Or we can handle the branching logic in this PR so that allow_disk_use can play with single query as well.

If you wanna go with the former, please update the docstring to reflect the fact that allow_disk_use doesn't apply to single query (.find).

I think a note in the docstring saying that it requires MongoDB 4.4+ for .find and a link to the doc should be enough. It's neither sensible for us to control the version of an external system nor that we can. IMO, the users should know and control what MongoDB version they are using.

The issue is aggregate(allowDiskUse=True) does work prior to MongoDB 4.4, and users running older MongoDB setups may still want to use that. If we always also pass the argument to find(), they will have to create two MongoHook instances, one with allow_disk_use (for aggregate()) and one without (for find()), which doesn’t feel like good interface design to me.

Why two MongoHook instances? I think our discussion only applies to the .find, and the .aggregate will stay as it's.

Ah, I misread and thought allow_disk_use were a property on MongoHook (it’s on MongoToS3Operator instead). Nevermind, I think your proposed solution above would work.

@xinbinhuang I've just updated the docstring as you suggested.

airflow/providers/amazon/aws/transfers/mongo_to_s3.py

Co-authored-by: Xinbin Huang <[email protected]>

xinbinhuang · 2021-05-10T14:13:35Z

Thank you for the contribution!

boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels May 5, 2021

amatellanes force-pushed the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch from 992fd32 to fc14342 Compare May 5, 2021 17:36

xinbinhuang reviewed May 6, 2021

View reviewed changes

uranusjr mentioned this pull request May 6, 2021

MongoToS3Operator failed when running with a single query (not aggregate pipeline) #15679

Closed

amatellanes force-pushed the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch from fc14342 to 1385387 Compare May 9, 2021 13:27

Remove allowDiskUse argument from mongo hook find

e91d698

amatellanes force-pushed the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch from 1385387 to e91d698 Compare May 9, 2021 14:08

xinbinhuang reviewed May 9, 2021

View reviewed changes

airflow/providers/amazon/aws/transfers/mongo_to_s3.py Outdated Show resolved Hide resolved

amatellanes and others added 2 commits May 10, 2021 08:14

Update airflow/providers/amazon/aws/transfers/mongo_to_s3.py

5f0ad56

Co-authored-by: Xinbin Huang <[email protected]>

Update mongo_query docstring type

f6f4bca

xinbinhuang approved these changes May 10, 2021

View reviewed changes

xinbinhuang merged commit dab10d9 into apache:master May 10, 2021

amatellanes deleted the mongotos3operator-failed-when-running-with-a-single-query-not-aggregate-pipeline branch May 10, 2021 14:25

potiuk mentioned this pull request Jun 15, 2021

Status of testing Providers that were prepared on June 18, 2021 #16456

Closed

73 tasks

eladkal mentioned this pull request Jun 16, 2021

Invalid airflow.providers.amazon.aws.transfers.mongo_to_s3.MongoToS3Operator execute function signature #16479

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MongoToS3Operator failed when running with a single query (not aggregate pipeline) #15680

MongoToS3Operator failed when running with a single query (not aggregate pipeline) #15680

amatellanes commented May 5, 2021

uranusjr commented May 5, 2021

xinbinhuang May 6, 2021 •

edited

Loading

xinbinhuang May 6, 2021 •

edited

Loading

uranusjr May 6, 2021

xinbinhuang May 6, 2021 •

edited

Loading

xinbinhuang May 6, 2021 •

edited

Loading

xinbinhuang May 6, 2021

uranusjr May 6, 2021 •

edited

Loading

xinbinhuang May 6, 2021

uranusjr May 6, 2021 •

edited

Loading

amatellanes May 9, 2021

xinbinhuang commented May 10, 2021

MongoToS3Operator failed when running with a single query (not aggregate pipeline) #15680

MongoToS3Operator failed when running with a single query (not aggregate pipeline) #15680

Conversation

amatellanes commented May 5, 2021

uranusjr commented May 5, 2021

xinbinhuang May 6, 2021 • edited Loading

Choose a reason for hiding this comment

xinbinhuang May 6, 2021 • edited Loading

Choose a reason for hiding this comment

uranusjr May 6, 2021

Choose a reason for hiding this comment

xinbinhuang May 6, 2021 • edited Loading

Choose a reason for hiding this comment

xinbinhuang May 6, 2021 • edited Loading

Choose a reason for hiding this comment

xinbinhuang May 6, 2021

Choose a reason for hiding this comment

uranusjr May 6, 2021 • edited Loading

Choose a reason for hiding this comment

xinbinhuang May 6, 2021

Choose a reason for hiding this comment

uranusjr May 6, 2021 • edited Loading

Choose a reason for hiding this comment

amatellanes May 9, 2021

Choose a reason for hiding this comment

xinbinhuang commented May 10, 2021

xinbinhuang May 6, 2021 •

edited

Loading

xinbinhuang May 6, 2021 •

edited

Loading

xinbinhuang May 6, 2021 •

edited

Loading

xinbinhuang May 6, 2021 •

edited

Loading

uranusjr May 6, 2021 •

edited

Loading

uranusjr May 6, 2021 •

edited

Loading