
Feature request: make it possible to keep docker container warm #239

Closed
jandockx opened this issue Dec 22, 2017 · 81 comments

@jandockx

I understand from other issues that a new docker container is started for each request. This makes some experiments and automated tests impractical. SAM Local is much too slow in any context where more than one request has to be handled.

I suspect that hot reloading depends on this feature.

I think it would be a good idea, while this project evolves further, to make it possible to choose to forego hot reloading and instead keep the docker container warm.

Something like

sam local start-api -p <PORT> --profile <AWS PROFILE> --keep-it-warm

This would broaden the applicability of sam local enormously.

Thank you for considering this suggestion. This looks like an awesome project.

@aldegoeij

+1 Python container takes too long to start for simple debugging...

@zippadd

zippadd commented Jan 5, 2018

+1. This currently makes local automated testing painful at best.

Thanks for the continued work on this project!

@dannymcpherson

Have there been any eyes on this? The benefit would be so huge.

@cagoi

cagoi commented Apr 19, 2018

+1

@hobotroid

+1

@daveykane

+1

@adrians5j

+1

@CRogers

CRogers commented Jun 16, 2018

+1, even a simple hello world java8 lambda takes 3-4 seconds for each request!

@CRogers

CRogers commented Jun 18, 2018

My sketch proposal to make warm containers work and maintain all the existing nice hot reload/memory usage etc functionality around them:

Currently, the container is simply run with the handler as its argument and the event passed in via an environment variable. The container's logs are then piped to the console stdout/stderr, and the tool just records how much memory is used.

Instead, we can start the container with bash as the entrypoint and -c "sleep infinity" as the argument, so it effectively runs nothing and keeps the container alive. We record the container id in an (expiring) dict so we can reuse it again. When we want to run the lambda, we run docker exec with the previously used lambda entrypoint and the correct environment. Since we run one lambda per container, we can still record memory usage. If we key the running containers by the version of the lambda code we're running, we can ensure hot reload still works. As always with caches, the invalidation would be the interesting part: you probably want to kill out-of-date containers, and kill all containers when the tool exits.
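
A minimal sketch of that idea, assuming the docker Python SDK (docker-py); the runtime entrypoint, event variable, and TTL are illustrative placeholders, not the actual SAM CLI implementation:

import time

import docker

client = docker.from_env()
warm_containers = {}  # code_version -> (container, created_at), evicted when stale

RUNTIME_ENTRYPOINT = ["/var/runtime/bootstrap"]  # illustrative; the real value comes from the runtime image
TTL_SECONDS = 300

def get_warm_container(image, code_version, code_dir):
    entry = warm_containers.get(code_version)
    if entry and time.time() - entry[1] < TTL_SECONDS:
        return entry[0]  # reuse the already-running container
    # Start an idle container that just sleeps, so it stays alive between invokes.
    container = client.containers.run(
        image,
        entrypoint=["/bin/sh", "-c"],
        command=["sleep infinity"],
        volumes={code_dir: {"bind": "/var/task", "mode": "ro"}},
        detach=True,
    )
    warm_containers[code_version] = (container, time.time())
    return container

def invoke(container, handler, event_json):
    # docker exec re-runs only the runtime entrypoint inside the live container.
    exit_code, output = container.exec_run(
        RUNTIME_ENTRYPOINT + [handler],
        environment={"AWS_LAMBDA_EVENT_BODY": event_json},
    )
    return exit_code, output

Keying warm_containers on the code version is what keeps hot reload working in this sketch: a code change produces a new key, the stale container gets evicted, and the next invoke creates a fresh one.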

@monofonik

+1

@luisvsm

luisvsm commented Aug 15, 2018

+1 Very interested in this feature

@luketn

luketn commented Aug 24, 2018

+1 Yes please!

@nodeit

nodeit commented Sep 6, 2018

+1, throwing my hat in the ring on this too

@jfuss
Contributor

jfuss commented Sep 6, 2018

As a note: Please use the reaction feature on the top comment. We do look at issues sorted by thumbs up (as well as other reactions). Commenting +1 does no good there and adds noise to the issue.

@scoates

scoates commented Sep 6, 2018

@jfuss I agree (and had done this). Any feedback from your team would be helpful here, though. The closest thing we had to knowing if this is on your radar (before your comment) was duplicate issue consolidation and labeling.

@ejoncas

ejoncas commented Sep 24, 2018

+1, this would be very beneficial for people using java + spring boot.

@thoratou

thoratou commented Oct 6, 2018

+1, around 1s for golang case

@kevanpng

I did an experiment with container reuse. This is just with a lambda in Python; I'm developing on Ubuntu 16.04. In summary, spinning up the docker container only takes an extra second, so it is not worth building the container reuse feature. Link to my code: https://github.com/kevanpng/aws-sam-local .

For a fixed query, both my colleague and I see a 4s invocation time on sam local; his is a Windows machine. With the profile flag and container reuse, it goes down to 2.5s on my Ubuntu machine.

My colleague is running on a Mac, and when he tried the same query with lambda reuse and the profile flag, it still took 11-14 seconds to run.

Maybe it could be that docker is slow on mac?

@ghost

ghost commented Oct 11, 2018

1 second makes a world of difference when you're building an API and expect to serve more than one request.

I think it's well worth the feature.

@sanathkr
Contributor

@kevanpng Hey, I was looking through your code to understand what exactly you did. So basically, you create the container once with a fixed name, run the function, and on the next invocation you look for the container with the same name and simply container.exec_run instead of creating it from scratch again. Is my summary correct?

I am super surprised Docker container creation makes this big of a difference. We can certainly look deeper into this if it is becoming a usability blocker.
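
For reference, the fixed-name reuse pattern described above looks roughly like this (docker-py; the name handling and exec command are illustrative, not kevanpng's actual code):

import docker
from docker.errors import NotFound

client = docker.from_env()

def run_in_reused_container(image, name, cmd):
    try:
        container = client.containers.get(name)  # reuse the container if it already exists
    except NotFound:
        # First invoke: start a container that just sleeps so it stays around.
        container = client.containers.run(
            image, entrypoint=["/bin/sh", "-c"], command=["sleep infinity"], name=name, detach=True
        )
    exit_code, output = container.exec_run(cmd)  # warm invokes skip container create/start entirely
    return exit_code, output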

@scoates

scoates commented Oct 11, 2018

@sanathkr. Thanks for looking at this. FWIW, it's a huge usability blocker for me:

~/src/faculty/buildshot$ time curl -s http://127.0.0.1:3000/ >/dev/null # SAM container via Docker

real	0m6.891s
user	0m0.012s
sys	0m0.021s
~/src/faculty/buildshot$ time curl -s http://127.0.0.1:5000/ >/dev/null # regular python app via flask dev/debug server (slow)

real	0m0.039s
user	0m0.012s
sys	0m0.019s

And the Instancing.. is quick. It's Docker (and the way Docker is used here) that's slow. The (slow) werkzeug-based dev server is ~175x faster than waiting around for Docker. And this is for every request, not just startup. (And yes, this is from my Mac.)

@sanathkr
Contributor

@scoates Thanks for the comparison. It's not apples-to-apples to compare vanilla Flask to a Docker-based app, but a 6-second duration with SAM CLI is definitely not what I would expect.

  • Did you have the Docker image already downloaded?
  • Also, can you start SAM CLI with the --skip-pull-image flag? This will prevent the CLI from asking Docker for the latest image version on every invoke. Do share your numbers again with this flag set.

Thinking ahead:
I think we need to add more instrumentation to the SAM CLI codebase in order to understand which parts contribute to the high latency. It would be cool if we could run the instrumented code in a Travis build with every PR so we can assess the performance impact of new code changes. We also need to run this on a variety of platforms to understand the real difference between Mac/Ubuntu.
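
One lightweight way to get that kind of instrumentation, sketched here as a hypothetical timing decorator rather than anything that exists in the SAM CLI codebase today:

import functools
import logging
import time

LOG = logging.getLogger(__name__)

def timed(step_name):
    # Log how long a single step of the invoke path takes.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                LOG.info("%s took %.3f s", step_name, time.perf_counter() - start)
        return wrapper
    return decorator

# Usage (function names are illustrative):
# @timed("pull_image")
# def pull_image(...): ...

Wrapping each stage (image pull check, container create, container start, function run) this way would make per-step numbers like the ones in the next comment reproducible in CI.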

@sanathkr
Contributor

sanathkr commented Oct 11, 2018

I did some more profiling by crudely commenting out parts of the codebase. Also, these were not run multiple times, so the numbers are ballpark estimates. I ran sam init and then ran sam local start-api on the simple HelloWorld Lambda function created by the init template.

Platform: MacOSX
Docker version: 18.06.0

WARNING: Very crude measurements.

Total execution time (sam local start-api): 2.67 seconds
Skip pull images (sam local start-api --skip-pull-image): 1.45 seconds
Create container, run it, and return immediately without waiting for the function to terminate: 1.05 seconds
Create container, don't run it: 0.2 seconds
SAM CLI code overhead (don't create container at all): 0.045 seconds

Based on the above numbers, I arrived at a rough estimate for each step of the invoke path by assuming:

Total execution = SAM CLI overhead + Docker Image pull + Create container + Run Container + Run function

Then, here is how much each step took:

SAM CLI Overhead: 0.045 seconds
Docker Image Pull Check: 1.3 seconds
Create Container: 0.15 seconds
Run container: 0.85 seconds
Run function: 0.45 seconds

The most interesting part is the Create vs Run container duration. Run is about 5x Create, so we are better off optimizing the Run duration.

If we were to do a warm start, we would save some fraction of the 0.85 seconds it takes to run the container. We would need to keep the runtime process up and running inside the container and re-run just the function in place; otherwise we aren't going to save much.
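
Spelling out the arithmetic behind those per-step estimates (a small worked example using the measurements above; the quoted figures are rounded):

total           = 2.67   # sam local start-api
skip_pull_total = 1.45   # --skip-pull-image
no_wait_total   = 1.05   # create + run, returning without waiting for the function
create_only     = 0.20   # create container, don't run it
overhead        = 0.045  # no container at all

image_pull_check = total - skip_pull_total          # ~1.22 s, quoted as ~1.3 s
run_function     = skip_pull_total - no_wait_total  # ~0.40 s, quoted as ~0.45 s
run_container    = no_wait_total - create_only      # ~0.85 s
create_container = create_only - overhead           # ~0.155 s, quoted as ~0.15 s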

@scoates

scoates commented Oct 17, 2018

Hi. Sorry for the late reply. I was traveling last week and forgot to get to this when I returned.

I agree absolutely that apigw and flask aren't apples-to-apples, and crude measurements are definitely where we're at right now.

With --skip-pull-image, I still get request starts in the 5+ second range. Entirely possible there's slow stuff in my code (though it's small, so I'm not sure where that would come from; it really does seem like docker). Here are the relevant bits of a request (on a warm start; this is several requests into sam local start-api --skip-pull-image):

[ 0.00] 2018-10-16 20:18:44 Starting new HTTP connection (1): 169.254.169.254
[ 1.01] 2018-10-16 20:18:45 Requested to skip pulling images ...
[ 0.00]
[ 0.00] 2018-10-16 20:18:45 Mounting /Users/sean/src/faculty/buildshot/buildshot/build as /var/task:ro inside runtime container
[!5.32] START RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a Version: $LATEST
[ 0.00] Instancing..
[ 0.00] [DEBUG]	2018-10-17T00:18:50.714Z	13e564e9-1160-4c0e-b1e2-b31bbadd899a	Zappa Event: {'body': None, 'httpMethod': 'GET', 'resource': '/', 'queryStringParameters': None, 'requestContext': {'httpMethod': 'GET', 'requestId': 'c6af9ac6-7b61-11e6-9a41-93e8deadbeef', 'path': '/', 'extendedRequestId': None, 'resourceId': '123456', 'apiId': '1234567890', 'stage': 'prod', 'resourcePath': '/', 'identity': {'accountId': None, 'apiKey': None, 'userArn': None, 'cognitoAuthenticationProvider': None, 'cognitoIdentityPoolId': None, 'userAgent': 'Custom User Agent String', 'caller': None, 'cognitoAuthenticationType': None, 'sourceIp': '127.0.0.1', 'user': None}, 'accountId': '123456789012'}, 'headers': {'X-Forwarded-Port': '3000', 'Host': 'localhost:3000', 'X-Forwarded-Proto': 'http', 'Accept': '*/*', 'User-Agent': 'curl/7.54.0'}, 'stageVariables': None, 'path': '/', 'pathParameters': None, 'isBase64Encoded': True}
[ 0.00]
[ 0.00] [INFO]	2018-10-17T00:18:50.731Z	13e564e9-1160-4c0e-b1e2-b31bbadd899a	127.0.0.1 - - [17/Oct/2018:00:18:50 +0000] "GET / HTTP/1.1" 200 15 "" "curl/7.54.0" 0/16.916
[ 0.00]
[ 0.00] END RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a
[ 0.00] REPORT RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a Duration: 4684 ms Billed Duration: 4700 ms Memory Size: 128 MB Max Memory Used: 42 MB
[ 0.58] 2018-10-16 20:18:51 127.0.0.1 - - [16/Oct/2018 20:18:51] "GET / HTTP/1.1" 200 -

The [ 0.xx] prefix is returned by a util I have that shows elapsed time between stdout lines. Here's the important part, I think:

[!5.32] START RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a Version: $LATEST
[ 0.00] Instancing..

I acknowledge that Instancing.. might just not be output until it's complete, so that by itself isn't a valid measurement point. Just wanted to pass on that I'm seeing 5s of lag in my requests.

I'm not sure how to measure much deeper than that.

More info:

$ docker --version
Docker version 18.06.1-ce, build e68fc7
$ uname -a
Darwin sarcosm.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64 i386 MacBookPro11,4 Darwin
$ sam --version
SAM CLI, version 0.5.0

I also agree that if I can get this down to sub-1s request times, it's probably usable. 5s+ is painful, still, though.

(Edit: adding in case anyone looking for Zappa info stumbles on this. I'm using an experimental fork of the Zappa handler runtime. This doesn't really apply to Zappa-actual. At least not right now.)

@OFranke

OFranke commented Apr 9, 2020

If sam is using the same docker image under the hood, would it theoretically be possible to just pass the DOCKER_LAMBDA_STAY_OPEN=1 variable via sam's environments.json?
Right now I've observed that for some reason I cannot add arbitrary variables to environments.json, only ones I had already defined in the template.yaml.
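
For reference, the --env-vars override file that sam local accepts is keyed by the function's logical ID, and it only overrides variables already declared under Environment in the template, which matches the behaviour described above. A hypothetical example for the function below:

{
  "SrvApigraphqlapi8D508D37": {
    "DB_NAME": "postgres",
    "DOCKER_LAMBDA_STAY_OPEN": "1"
  }
}

Variables that are not present in the template's Environment block are ignored.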

When I hardcode the environment variable in my template.yaml like that:

SrvApigraphqlapi8D508D37:
    Type: AWS::Lambda::Function
    Properties:
      Code: SrvApigraphqlapi8D508D37
      Handler: base.handler
      Role:
        Fn::GetAtt:
        - SrvApigraphqlapiServiceRoleFD44AE9E
        - Arn
      Runtime: nodejs12.x
      Environment:
        Variables:
          DB_HOST:
            Fn::GetAtt:
            - SrvDatabasecdkgraphilelambdaexampledbD17C7F0B
            - Endpoint.Address
          DB_PORT:
            Fn::GetAtt:
            - SrvDatabasecdkgraphilelambdaexampledbD17C7F0B
            - Endpoint.Port
          DB_NAME: postgres
          DB_USERNAME: postgres
          DB_PASSWORD: postgres
          AWS_STAGE: prod
          DOCKER_LAMBDA_STAY_OPEN: 1

The whole thing crashes, giving me this error message:

Lambda API listening on port 9001...
Function 'SrvApigraphqlapi8D508D37' timed out after 20 seconds
<class 'samcli.local.apigw.local_apigw_service.LambdaResponseParseException'>

@flache

flache commented Apr 15, 2020

Are there any updates or is there a timeline on this? This is the single biggest blocker for us (and I imagine for many others) to doing more with AWS Lambda, because it makes it almost impossible to develop and test things locally. Even with --skip-pull-image, a delay of ~5 seconds for each request makes it just unusable. There is also the problem of global context not being preserved between invocations.

I understand that features must be prioritized, but I am having a hard time understanding why the fact that nothing running on Lambda can really be tested locally is not a high-priority issue. Or am I missing something?

@literakl

literakl commented Apr 15, 2020 via email

@jfuss
Contributor

jfuss commented Apr 15, 2020

Update: The team is working on other priorities at the moment. We know the time it takes to invoke locally is a pain point for many, and we have plans to address it in the future. We do not have an ETA as of now.

@OFranke

OFranke commented Apr 25, 2020

@flache
I've moved away from sam as it seems to not play so well with cdk at the moment, see #1911. I worked around it by having an app that I run in docker locally but let cdk deploy. I just use two different application entry points, which are not so different at all:

// lambda entry
import { Response, Request } from 'express';

const awsServerlessExpress = require('aws-serverless-express');
const express = require('express');

const app = express();
const handler = (req: Request, res: Response): void => {
  try {
    app(
      req,
      res,
      (err: { status: number; statusCode: number; message: string }) => {
        if (err) {
          if (!res.headersSent) {
            res.statusCode = err.status || err.statusCode || 500;
            res.setHeader('Content-Type', 'application/json');
          }
          res.end(JSON.stringify({ errors: [{ message: `${err.message}` }] }));
          return;
        }
        if (!res.finished) {
          if (!res.headersSent) {
            res.statusCode = 404;
          }
          res.end(`'${req.url}' not found`);
        }
      },
    );
  } catch (err) {
    res.end(JSON.stringify({ errors: [{ message: `${err.message}` }] }));
  }
};

const server = awsServerlessExpress.createServer(handler, undefined);
exports.handler = (event: unknown, context: unknown): unknown =>
  awsServerlessExpress.proxy(server, event, context);

// docker entry
import express from 'express';

const main = async () => {
  const app = express();

  app.listen(5000, '0.0.0.0');
};

try {
  void main();
} catch (e) {
  console.error('Fatal error occurred starting server!');
  console.error(e);
  process.exit(101);
}

I have built a whole graphql service like that and have been running it on AWS for a few weeks now. Seems to be fine.

@elthrasher

For those who are very comfortable with Docker and docker-compose, I created a proxy image that works with the underlying SAM (lambci) images and can bring your lambda function into existing docker-compose workflows as a long-lived function. https://github.com/elthrasher/http-lambda-invoker

@literakl

I have personally switched from AWS Lambda to NodeJS+Express+nodemon and my productivity and happiness got a boost.

@duartemendes

Spent the last week writing a CLI tool to help with this issue; just 2 days ago I published the first version.

It's available on npm for download and installation. It passes both the DOCKER_LAMBDA_STAY_OPEN and DOCKER_LAMBDA_WATCH environment variables to the underlying containers, mitigating cold starts after the first invocation and watching for code changes.

I think the tool is easy to use (takes one command to run your api locally) but it's in a very early stage. It works very well for my APIs but I'm pretty sure I didn't take all use cases into consideration. So, give it a go, report any issues you find and please leave some feedback.

@S-Cardenas

@duartemendes that tool is amazing! Congratulations and let me know if you need any help.

Does your tool currently support layers?

@duartemendes

Thanks @S-Cardenas. It doesn't, but it's something I'm happy to take a look at 👍

@kingferiol

This is really a roadblock with this technology for us. Too painful.

It is not sustainable to wait 10 seconds per request during development. Without any action on this, I think we will have to reconsider our approach to this technology.

@jfuss
Contributor

jfuss commented May 20, 2020

Update: We have prioritized some work that will help with the slow request time and provide a better warm invoke experience. I do not have timelines or ETAs to share at this point, but I wanted to communicate that we are starting to look at what we can do in this space.

@ianballard

@jfuss any updates?

@guichafy

I'm very excited to see this feature.

@awsjeffg added the stage/pm-review label (Waiting for review by our Product Manager, please don't work on this yet) Aug 12, 2020
@leonardobork

@jfuss any news?

@S-Cardenas

Ditto. Would be great if this was officially released. Currently using https://github.com/elthrasher/http-lambda-invoker as a substitute.

@moelasmar mentioned this issue Nov 17, 2020
@awsjeffg added the stage/in-progress label (A fix is being worked on) and removed the stage/pm-review label Nov 19, 2020
@OGoodness

🤞 Let's hope we can see this soon

@S-Cardenas

Seems like it's getting very close to being approved and merged. Would love to get a notification when/if it does.

@millsy

millsy commented Dec 16, 2020

Fingers crossed this is soon added

@kaarejoergensen

This feature has been added to the newest release (https://github.com/aws/aws-sam-cli/releases/tag/v1.14.0) 🎉

@mndeveci
Contributor

(As @kaarejoergensen mentioned 😄) Happy to share that this has been released with v1.14, resolving the issue.
