Reduce memory usage in S3Hook #35449

Closed
1 task done
Taragolis opened this issue Nov 4, 2023 · 5 comments · Fixed by #37886
Labels
area:providers kind:meta High-level information important to the community provider:amazon-aws AWS/Amazon - related issues

Comments

@Taragolis
Contributor


Original stack trace from Slack:

Error:
  File "/usr/local/airflow/plugins/plugins/others/data_source_monitor.py", line 53, in retrieve_data
    get_time_query = s3_hook.read_key(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 514, in read_key
    obj = self.get_key(key, bucket_name)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 493, in get_key
    s3_resource = self.get_session().resource(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/boto3/session.py", line 446, in resource
    client = self.client(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/session.py", line 976, in create_client
    client = client_creator.create_client(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 116, in create_client
    endpoints_ruleset_data = self._load_service_endpoints_ruleset(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 220, in _load_service_endpoints_ruleset
    return self._loader.load_service_model(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 142, in _wrapper
    data = func(self, *args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 406, in load_service_model
    known_services = self.list_available_services(type_name)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 142, in _wrapper
    data = func(self, *args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/loaders.py", line 311, in list_available_services
    api_versions = os.listdir(full_dirname)
OSError: [Errno 12] Cannot allocate memory: '/usr/local/airflow/.local/lib/python3.10/site-packages/botocore/data/efs'

The reason for this error is simple: for some operations, S3Hook creates a resource (the high-level client) in addition to S3.Client, and this resource is created again every time one of those S3Hook methods is called. Each call therefore needs additional memory. For example, running S3Hook.download_file in a loop might trigger this error.

As usual, there are at least two solutions:
Option 1: Use caching in the internal methods of S3Hook.
Option 2: Get rid of resource usage in S3Hook and replace it with S3.Client methods. This might be the better solution:

  • Resources do not seem to be actively maintained in boto3
  • Creating a new resource object requires about 30-40 MB of memory, whereas everything (and even more) can be done with S3.Client
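
A minimal sketch of the idea behind Option 1, assuming the expensive object can safely be reused for the lifetime of the hook. `ExpensiveResource` and `CachingHook` are hypothetical stand-ins invented for this illustration; they are not the S3Hook implementation, and no boto3 call is made.

```python
from functools import cached_property


class ExpensiveResource:
    """Hypothetical stand-in for a boto3 resource; counts how often it is built."""

    instances_created = 0

    def __init__(self) -> None:
        ExpensiveResource.instances_created += 1


class CachingHook:
    """Hypothetical hook that builds its resource once and then reuses it."""

    @cached_property
    def resource(self) -> ExpensiveResource:
        # Runs only on first access; later accesses return the stored object.
        return ExpensiveResource()

    def download_file(self, key: str) -> str:
        _ = self.resource  # reuse the cached object instead of rebuilding it
        return f"downloaded {key}"


hook = CachingHook()
for i in range(100):
    hook.download_file(f"data/{i}.csv")

# Number of resource objects built across 100 calls on the same hook.
print(ExpensiveResource.instances_created)
```

Caching only helps when the same hook instance is reused across calls. A loop that constructs a new hook on each iteration would still rebuild the resource every time.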

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@Taragolis Taragolis added provider:amazon-aws AWS/Amazon - related issues area:providers kind:meta High-level information important to the community labels Nov 4, 2023
@ferruzzi
Contributor

I'd love to see someone take/make the time to do a full rewrite on the S3Hook. It's so convoluted and nothing like any of the other hooks. Knowing what we know now about how to create a hook, I bet we could greatly simplify it.

@ellisms
Contributor

ellisms commented Jan 10, 2024

@ferruzzi I'd like to take a stab at this one if it's available.

@ferruzzi
Contributor

@ellisms - That would be great. You can look at the newer hooks and see lots of examples of how this could be reworked to use boto3.client instead of boto3.resource, and they should really help clean up a LOT of this spaghetti code. Let me know if you need any help.
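
As a rough illustration of the client-only style, the sketch below reads an object's text through `get_object`, which is a standard S3 client operation. The helper name `read_key_with_client` and the `FakeS3Client` class are hypothetical and exist only so the example runs without AWS credentials; this is not the code from the eventual fix.

```python
import io


def read_key_with_client(client, bucket_name: str, key: str) -> str:
    """Read an object's body as UTF-8 text using only client calls."""
    response = client.get_object(Bucket=bucket_name, Key=key)
    # The response carries the payload as a file-like object under "Body".
    return response["Body"].read().decode("utf-8")


class FakeS3Client:
    """Hypothetical in-memory stand-in that mimics the shape of get_object."""

    def __init__(self, objects: dict) -> None:
        self._objects = objects

    def get_object(self, Bucket: str, Key: str) -> dict:
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


fake = FakeS3Client({("my-bucket", "query.sql"): b"SELECT 1"})
print(read_key_with_client(fake, "my-bucket", "query.sql"))
```

With a real boto3 S3 client in place of the fake, the helper body would stay the same, and no resource object would need to be created.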

@ellisms
Contributor

ellisms commented Feb 14, 2024

@ferruzzi Can you assign this one to me? Finally have some time to start looking at it.

@ellisms
Contributor

ellisms commented Feb 14, 2024

I started digging into this, and converting boto3.resource to boto3.client could introduce a breaking change. For example, get_key returns a resource.Object (and object.load is nothing more than a wrapper around client.head_object), whereas client.head_object returns a dictionary. It's easy enough to modify the hook and operator, but it would break for anyone who calls hook.get_key() directly. I wanted to throw this out there for discussion. @Taragolis I thought maybe you'd have some thoughts. Are hook methods considered part of the public API?
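
One way to picture the incompatibility, and a possible way to soften it, is a thin adapter that exposes the `head_object` dictionary through attribute names resembling those on a resource Object. This is only a sketch under that assumption: `KeyInfo`, `get_key_info`, and `FakeS3Client` are hypothetical names, and the adapter covers just a few fields rather than the full Object interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class KeyInfo:
    """Hypothetical attribute-style view over a head_object response."""

    bucket_name: str
    key: str
    content_length: int
    e_tag: str
    last_modified: datetime

    @classmethod
    def from_head_object(cls, bucket_name: str, key: str, response: dict[str, Any]) -> "KeyInfo":
        # head_object returns a plain dict with CamelCase keys.
        return cls(
            bucket_name=bucket_name,
            key=key,
            content_length=response["ContentLength"],
            e_tag=response["ETag"],
            last_modified=response["LastModified"],
        )


def get_key_info(client, bucket_name: str, key: str) -> KeyInfo:
    """Fetch object metadata with a client call and wrap it for attribute access."""
    response = client.head_object(Bucket=bucket_name, Key=key)
    return KeyInfo.from_head_object(bucket_name, key, response)


class FakeS3Client:
    """Hypothetical stand-in that returns a canned head_object response."""

    def head_object(self, Bucket: str, Key: str) -> dict[str, Any]:
        return {
            "ContentLength": 8,
            "ETag": '"abc123"',
            "LastModified": datetime(2024, 2, 14, tzinfo=timezone.utc),
        }


info = get_key_info(FakeS3Client(), "my-bucket", "query.sql")
print(info.content_length, info.e_tag)
```

Callers that only read a few metadata attributes could keep working against such a wrapper, but anything relying on Object methods such as `get()` or `delete()` would still break, so this does not settle the public API question.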
