Slow S3 Download for multiple files #174

Closed
Neuroforge opened this issue Jul 3, 2019 · 2 comments
Neuroforge commented Jul 3, 2019

  • Async AWS SDK for Python version: 6.4.1
  • Python version: 3.6
  • Operating System: MacOS

Description

We have a Django app that needs to download 3 files for each id. There are N ids.

Doing it synchronously takes ~50s for 23 ids, so about 2 seconds per id.

Doing this asynchronously with aioboto3 takes the same length of time.

What I Did

I am following the example code from @Burrhank's answer in issue #137.

The tasks are of the form:

async def getRequest(jobKey):
    async with aioboto3.resource(
        "s3",
        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
    ) as s3:
        s3ObjectName = "{}_request.json".format(jobKey)
        obj = s3.Object(bucket, "{}{}".format(s3FolderKey, s3ObjectName))
        obj = await obj.get()
        data = await obj["Body"].read()
        return json.loads(data.decode("utf-8"))

The tasks are managed as follows. When a jobId is provided, we make 3 calls to S3 to get 3 different files that all live in the same bucket/folder. The tasks getRequest, getResponse and getImage have the same shape, but each retrieves a different file; the maximum file size is ~500 KB.

requestTask = self.loop.create_task(getRequest(jobId))
responseTask = self.loop.create_task(getResponse(jobId))
imageTask = self.loop.create_task(getImage(jobId))
self.loop.run_until_complete(
    asyncio.wait([requestTask, responseTask, imageTask])
)
response = {}
response["request"] = requestTask.result()
response["response"] = responseTask.result()
response["image"] = imageTask.result()

Any advice on what could be going wrong here? Async time === Sync time. :(

Update:

I have tried with the client instead of the resource, as per the documentation. This approach only saves me around 5s. What am I missing/doing wrong?

async def getRequest(jobKey):
    async with aioboto3.client(
        "s3",
        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
    ) as s3:
        s3ObjectName = "{}_request.json".format(jobKey)
        obj = await s3.get_object(
            Bucket=bucket, Key="{}{}".format(s3FolderKey, s3ObjectName)
        )
        data = await obj["Body"].read()
        return json.loads(data.decode("utf-8"))

Does the region need to be specified, or does aioboto3 know that S3 buckets are global? I have also tried specifying a region in the S3 client, but this does not affect performance.

@terricain
Owner

Supply the region regardless. S3 is odd: all buckets look "global" through us-east-1, but if the region is wrong, the other regions will tell you to change region via an odd status code (handled in the background).

I ran the following code to test downloading a 65MB file 3 times synchronously, then sequentially with the async client, and finally in parallel with gather:

import time
import asyncio
import boto3
import aioboto3
# import uvloop
# uvloop.install()

async def get_file(client):
    resp = await client.get_object(
        Bucket='test-terry-pipeline',
        Key='sample_files.zip'
    )
    # Drain the body in 1MiB chunks until the stream is exhausted
    data = await resp['Body'].read(1048576)
    while data:
        data = await resp['Body'].read(1048576)

def get_file_sync(client):
    resp = client.get_object(
        Bucket='test-terry-pipeline',
        Key='sample_files.zip'
    )
    data = resp['Body'].read(1048576)
    while data:
        data = resp['Body'].read(1048576)

async def main():
    client = aioboto3.client('s3', region_name='eu-west-1',
                             aws_access_key_id='*',
                             aws_secret_access_key='*')
    client_sync = boto3.client('s3', region_name='eu-west-1',
                               aws_access_key_id='*',
                               aws_secret_access_key='*')

    print('Starting sync, downloading 65.8MB file 3 times with sync get_object')
    start = time.perf_counter()
    get_file_sync(client_sync)
    get_file_sync(client_sync)
    get_file_sync(client_sync)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end-start, 3)))

    print('Starting sync, downloading 65.8MB file 3 times with async get_object')
    start = time.perf_counter()
    await get_file(client)
    await get_file(client)
    await get_file(client)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end-start, 3)))

    print('Starting async, downloading 65.8MB file 3 times with async get_object')
    coros = [get_file(client), get_file(client), get_file(client)]
    start = time.perf_counter()
    await asyncio.gather(*coros)
    end = time.perf_counter()
    print('Finished, took: {0}s'.format(round(end-start, 3)))

    await client.close()

asyncio.get_event_loop().run_until_complete(main())

I get the following results (bear in mind this is going from London to Ireland)

Starting sync, downloading 65.8MB file 3 times with sync get_object
Finished, took: 5.19s
Starting sync, downloading 65.8MB file 3 times with async get_object
Finished, took: 5.489s
Starting async, downloading 65.8MB file 3 times with async get_object
Finished, took: 3.133s

As you can see, downloading the file in parallel does not take a third of the time: 3.133s rather than the ~1.8s a perfect three-way overlap of 5.489s would give. There is some overhead, and it becomes more apparent as the file gets smaller. Reading the file over the network is what takes the longest; when the files are small they are read quickly, so the overhead dominates.

This is the same code as above, downloading a 602KB file 3 times:

Starting sync, downloading 602KB file 3 times with sync get_object
Finished, took: 0.309s
Starting sync, downloading 602KB file 3 times with async get_object
Finished, took: 0.216s
Starting async, downloading 602KB file 3 times with async get_object
Finished, took: 0.136s

I have a feeling you're instantiating the resource/client every time; you don't want to do that, as it's relatively expensive.
Are you getting N sets of 3 files in a loop? If so, create the client once and reuse it for every download, as in the sketch below.
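
As a minimal sketch (the region, bucket, and key layout here are assumptions, not your real values), this is the pattern I mean:

import asyncio
import aioboto3

async def fetch(s3, bucket, key):
    # Reuse the shared client; only the key differs per request
    resp = await s3.get_object(Bucket=bucket, Key=key)
    return await resp['Body'].read()

async def fetch_all(job_ids, bucket, folder):
    # Create the client once for the whole batch, not once per request
    s3 = aioboto3.client('s3', region_name='eu-west-1')
    try:
        coros = [
            fetch(s3, bucket, '{}{}_{}'.format(folder, job_id, suffix))
            for job_id in job_ids
            for suffix in ('request.json', 'response.json', 'image.png')
        ]
        return await asyncio.gather(*coros)
    finally:
        await s3.close()

The client setup (credential resolution, endpoint lookup, connection pool) is then paid once instead of 3N times.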

@Neuroforge
Author

I will try your suggestion of instantiating the client once.

This is actually inside a WebSocket handler. The server-side socket is persistent and handles a number of calls, so it is effectively in a loop.

I will try not reinitialising the client on each message and see if that improves performance; a rough sketch of the plan is below.
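
Something like this (the class and names are hypothetical, and I'm assuming the handler has async connect/disconnect hooks): the client is created when the socket opens and reused for every message.

import asyncio
import aioboto3

class JobSocket:
    # Hypothetical long-lived WebSocket handler: the S3 client lives as
    # long as the connection instead of being rebuilt per message.
    async def connect(self):
        self.s3 = aioboto3.client('s3', region_name='eu-west-1')

    async def _fetch(self, bucket, key):
        resp = await self.s3.get_object(Bucket=bucket, Key=key)
        return await resp['Body'].read()

    async def handle_job(self, job_id, bucket, folder):
        # All three files for one job are fetched concurrently
        keys = ['{}{}_{}'.format(folder, job_id, s)
                for s in ('request.json', 'response.json', 'image.png')]
        bodies = await asyncio.gather(*(self._fetch(bucket, k) for k in keys))
        return dict(zip(('request', 'response', 'image'), bodies))

    async def disconnect(self):
        await self.s3.close()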
