Slow S3 Download for multiple files #174
Supply the region regardless. S3 is odd in that all buckets in us-east-1 are "global", but other regions will tell you to change region via an odd status code (handled in the background) if the region is wrong. I ran the following code to test downloading a 65 MB file three times synchronously, and then in parallel with `gather`:

```python
import time
import asyncio

import boto3
import aioboto3
# import uvloop
# uvloop.install()


async def get_file(client):
    resp = await client.get_object(
        Bucket='test-terry-pipeline',
        Key='sample_files.zip'
    )
    # Drain the streaming body in 1 MiB chunks until exhausted
    data = await resp['Body'].read(1048576)
    while data:
        data = await resp['Body'].read(1048576)


def get_file_sync(client):
    resp = client.get_object(
        Bucket='test-terry-pipeline',
        Key='sample_files.zip'
    )
    data = resp['Body'].read(1048576)
    while data:
        data = resp['Body'].read(1048576)


async def main():
    client = aioboto3.client('s3', region_name='eu-west-1',
                             aws_access_key_id='*',
                             aws_secret_access_key='*')
    client_sync = boto3.client('s3', region_name='eu-west-1',
                               aws_access_key_id='*',
                               aws_secret_access_key='*')

    print('Starting sync, downloading 65.8MB file 3 times with sync get_object')
    start = time.perf_counter()
    get_file_sync(client_sync)
    get_file_sync(client_sync)
    get_file_sync(client_sync)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end - start, 3)))

    print('Starting sync, downloading 65.8MB file 3 times with async get_object')
    start = time.perf_counter()
    await get_file(client)
    await get_file(client)
    await get_file(client)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end - start, 3)))

    print('Starting async, downloading 65.8MB file 3 times with async get_object')
    coros = [get_file(client), get_file(client), get_file(client)]
    start = time.perf_counter()
    await asyncio.gather(*coros)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end - start, 3)))

    await client.close()


asyncio.get_event_loop().run_until_complete(main())
```

I get the following results (bear in mind this is going from London to Ireland).
As you can see, fetching the files in parallel does not cut the time to a third: there is some slight overhead, which becomes more apparent as the files get smaller. Reading the file over the network is what takes longest; small files are read quickly, so the fixed per-request overhead stands out more. This is the same code as above, downloading three 602 KB files.
I have a feeling you're instantiating the resource/client every time; you don't want to do that, as it's relatively expensive.
I will try your suggestion of instantiating the client once. This is actually inside a WebSocket handler: the server-side socket is persistent and handles a number of calls, effectively in a loop. I will try not reinitialising the client on each call and see whether that improves performance.
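For illustration, a minimal sketch of that pattern, creating the client once when the connection opens and reusing it for every message. The handler shape and message fields here are hypothetical, not the actual app code, and it assumes the same `aioboto3.client` interface used in the benchmark above:

```python
import aioboto3

async def on_connect():
    # Create the client once, when the WebSocket connection opens...
    return aioboto3.client('s3', region_name='eu-west-1')

async def handle_message(client, message):
    # ...and reuse it for every incoming call; constructing a fresh
    # client per message is relatively expensive.
    resp = await client.get_object(Bucket=message['bucket'],
                                   Key=message['key'])
    return await resp['Body'].read()
```

The key point is that client construction happens outside the per-message path.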
Description
We have a Django app that needs to download 3 files for each id. There are N ids.
Doing it synchronously takes ~50s for 23 ids, so about 2 seconds per id.
Doing this asynchronously with aioboto3 takes the same length of time.
What I Did
I am following the example code from @Burrhank's answer in this issue: #137
The tasks are of the form sketched below. When a jobId is provided, we make three calls to S3 to get three different files that all live in the same bucket/folder. The tasks getRequest, getResponse and getImage are all the same task except for the file retrieved; the maximum file size is ~500 KB.
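A minimal sketch of that task shape, using the same aioboto3 client interface as the benchmark above; the helper names and key layout are illustrative guesses, not the actual app code:

```python
import asyncio

async def fetch(client, bucket, key):
    resp = await client.get_object(Bucket=bucket, Key=key)
    return await resp['Body'].read()

async def fetch_job(client, bucket, job_id):
    # getRequest / getResponse / getImage differ only in the key,
    # so the three ~500 KB downloads can run concurrently.
    keys = ['{0}/request'.format(job_id),
            '{0}/response'.format(job_id),
            '{0}/image'.format(job_id)]
    return await asyncio.gather(*(fetch(client, bucket, k) for k in keys))
```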
Any advice on what could be going wrong here? Async time === Sync time. :(
Update:
I have tried using the client instead of the resource, as per the documentation, but this approach only saves me around 5 s. What am I missing/doing wrong?
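One thing worth checking (a guess, since the surrounding task-management code isn't shown here): whether the per-id tasks are still being awaited one after another. Gathering across the ids as well as across the three files per id would look like this, reusing the hypothetical `fetch_job` sketch above:

```python
async def fetch_all(client, bucket, job_ids):
    # Launch every id's three downloads at once instead of awaiting
    # each id sequentially; total time then tracks the slowest id.
    return await asyncio.gather(*(fetch_job(client, bucket, jid)
                                  for jid in job_ids))
```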
Does the region need to be specified, or does aioboto3 know that S3 buckets are global?
I have also specified a region in the S3 client, but this does not affect performance.