Slow S3 Download for multiple files #174

Closed
Neuroforge opened this issue Jul 3, 2019 · 2 comments
Neuroforge commented Jul 3, 2019

  • Async AWS SDK for Python version: 6.4.1
  • Python version: 3.6
  • Operating System: MacOS

Description

We have a Django app that needs to download 3 files for each id. There are N ids.

Doing it synchronously takes ~50s for 23 ids, so about 2 seconds per id.

Doing this asynchronously with aioboto3 takes the same length of time.

What I Did

I am following the example code from @Burrhank's answer in issue #137.

The tasks are of the form:

async def getRequest(jobKey):
    async with aioboto3.resource(
        "s3",
        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
    ) as s3:
        s3ObjectName = "{}_request.json".format(jobKey)
        obj = s3.Object(bucket, "{}{}".format(s3FolderKey, s3ObjectName))
        obj = await obj.get()
        data = await obj["Body"].read()
        return json.loads(data.decode("utf-8"))

The tasks are managed as follows. When a jobId is provided, we make 3 calls to S3 to get 3 different files that all live in the same bucket/folder. The tasks getRequest, getResponse and getImage have the same shape, but each retrieves a different file; the maximum file size is ~500 KB.

requestTask = self.loop.create_task(getRequest(jobId))
responseTask = self.loop.create_task(getResponse(jobId))
imageTask = self.loop.create_task(getImage(jobId))
self.loop.run_until_complete(
    asyncio.wait([requestTask, responseTask, imageTask])
)
response = {}
response["request"] = requestTask.result()
response["response"] = responseTask.result()
response["image"] = imageTask.result()

Any advice on what could be going wrong here? Async time === Sync time. :(

Update:

I have tried with the client instead of the resource, as per the documentation. This approach only saves me around 5s. What am I missing/doing wrong?

async def getRequest(jobKey):
    async with aioboto3.client(
        "s3",
        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
    ) as s3:
        s3ObjectName = "{}_request.json".format(jobKey)
        obj = await s3.get_object(
            Bucket=bucket, Key="{}{}".format(s3FolderKey, s3ObjectName)
        )
        data = await obj["Body"].read()
        return json.loads(data.decode("utf-8"))

Does the region need to be specified, or does aioboto3 know that S3 buckets are global? I have also tried specifying a region in the S3 client, but this does not affect performance.

@terricain
Owner

Supply the region regardless. S3 is odd: all buckets look "global" through us-east-1, but if the region is wrong, the other regions will tell you to change region via an odd status code (handled in the background).

I ran the following code to test downloading a 65MB file 3 times synchronously, then sequentially with the async client, and finally in parallel with gather:

import time
import asyncio
import boto3
import aioboto3
# import uvloop
# uvloop.install()

async def get_file(client):
    resp = await client.get_object(
        Bucket='test-terry-pipeline',
        Key='sample_files.zip'
    )
    # Drain the body in 1MiB chunks until the stream is exhausted
    data = await resp['Body'].read(1048576)
    while data:
        data = await resp['Body'].read(1048576)

def get_file_sync(client):
    resp = client.get_object(
        Bucket='test-terry-pipeline',
        Key='sample_files.zip'
    )
    data = resp['Body'].read(1048576)
    while data:
        data = resp['Body'].read(1048576)

async def main():
    client = aioboto3.client('s3', region_name='eu-west-1',
                             aws_access_key_id='*',
                             aws_secret_access_key='*')
    client_sync = boto3.client('s3', region_name='eu-west-1',
                               aws_access_key_id='*',
                               aws_secret_access_key='*')

    print('Starting sync, downloading 65.8MB file 3 times with sync get_object')
    start = time.perf_counter()
    get_file_sync(client_sync)
    get_file_sync(client_sync)
    get_file_sync(client_sync)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end-start, 3)))

    print('Starting sync, downloading 65.8MB file 3 times with async get_object')
    start = time.perf_counter()
    await get_file(client)
    await get_file(client)
    await get_file(client)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end-start, 3)))

    print('Starting async, downloading 65.8MB file 3 times with async get_object')
    coros = [get_file(client), get_file(client), get_file(client)]
    start = time.perf_counter()
    await asyncio.gather(*coros)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end-start, 3)))

    await client.close()

asyncio.get_event_loop().run_until_complete(main())

I get the following results (bear in mind this is going from London to Ireland)

Starting sync, downloading 65.8MB file 3 times with sync get_object
Finished, took: 5.19s
Starting sync, downloading 65.8MB file 3 times with async get_object
Finished, took: 5.489s
Starting async, downloading 65.8MB file 3 times with async get_object
Finished, took: 3.133s

As you can see, downloading the file in parallel does not take a third of the time: 3.133s rather than the ~1.8s a perfect three-way overlap of 5.489s would give. There is some overhead, and it becomes more apparent as the file gets smaller. Reading the file over the network is what takes the longest; when the files are small they are read quickly, so the overhead dominates.

This is the same code as above, downloading a 602KB file 3 times:

Starting sync, downloading 602KB file 3 times with sync get_object
Finished, took: 0.309s
Starting sync, downloading 602KB file 3 times with async get_object
Finished, took: 0.216s
Starting async, downloading 602KB file 3 times with async get_object
Finished, took: 0.136s

I have a feeling you're instantiating the resource/client every time; you don't want to do that, as it's relatively expensive.
Are you getting N sets of 3 files in a loop? If so, create the client once and reuse it for every download, as in the sketch below.
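
As a minimal sketch (the region, bucket, and key layout here are assumptions, not your real values), this is the pattern I mean:

import asyncio
import aioboto3

async def fetch(s3, bucket, key):
    # Reuse the shared client; only the key differs per request
    resp = await s3.get_object(Bucket=bucket, Key=key)
    return await resp['Body'].read()

async def fetch_all(job_ids, bucket, folder):
    # Create the client once for the whole batch, not once per request
    s3 = aioboto3.client('s3', region_name='eu-west-1')
    try:
        coros = [
            fetch(s3, bucket, '{}{}_{}'.format(folder, job_id, suffix))
            for job_id in job_ids
            for suffix in ('request.json', 'response.json', 'image.png')
        ]
        return await asyncio.gather(*coros)
    finally:
        await s3.close()

The client setup (credential resolution, endpoint lookup, connection pool) is then paid once instead of 3N times.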

@Neuroforge
Author

I will try your suggestion of instantiating the client once.

This is actually inside a WebSocket handler. The server-side socket is persistent and handles a number of calls, so it is effectively in a loop.

I will try not reinitialising the client on each message and see if that improves performance; a rough sketch of the plan is below.
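
Something like this (the class and names are hypothetical, and I'm assuming the handler has async connect/disconnect hooks): the client is created when the socket opens and reused for every message.

import asyncio
import aioboto3

class JobSocket:
    # Hypothetical long-lived WebSocket handler: the S3 client lives as
    # long as the connection instead of being rebuilt per message.
    async def connect(self):
        self.s3 = aioboto3.client('s3', region_name='eu-west-1')

    async def _fetch(self, bucket, key):
        resp = await self.s3.get_object(Bucket=bucket, Key=key)
        return await resp['Body'].read()

    async def handle_job(self, job_id, bucket, folder):
        # All three files for one job are fetched concurrently
        keys = ['{}{}_{}'.format(folder, job_id, s)
                for s in ('request.json', 'response.json', 'image.png')]
        bodies = await asyncio.gather(*(self._fetch(bucket, k) for k in keys))
        return dict(zip(('request', 'response', 'image'), bodies))

    async def disconnect(self):
        await self.s3.close()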
