Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gateway behaves poorly when the first updateCompositionConfig call throws #335

Closed
glasser opened this issue Jul 31, 2020 · 7 comments · Fixed by apollographql/apollo-server#4981
Assignees

Comments

@glasser
Copy link
Member

glasser commented Jul 31, 2020

If the initial call to updateCompositionConfig in ApolloGateway.load() throws (eg, network error fetching managed federation data), the system behaves poorly.

We never set up the federation service polling loop, so the situation will never recover. apollographql/apollo-server#4277 was supposed to fix this but didn't.

If we fixed that issue in the most straightforward way, we would end up with strange behavior where the schema does load into ApolloServer, but various things we expect to happen after the schema loads don't happen:

  • The "Gateway successfully loaded schema" message logged by ApolloGateway.load() never happens
  • More importantly, serverWillStart plugins never get called!

One answer would be that we should just "hard fail" if the first updateCompositionConfig call failed. If ApolloServer's API was different such that there was an async function you were required to call and await at the top level before starting your server, that would make sense, and maybe we'll get there in AC3. Instead, AC2 just has a synchronous constructor, plus some mechanisms to make later operations wait on this.schemaDerivedData, plus the protected willStart function invoked by the http server integration libraries to trigger serverWillStart.

The way that willStart gets used is a little strange. What willStart does is:

  • If we're using a gateway, block until the initial call to ApolloGateway.load() completes. If that call rejected, return immediately (ignoring the error)
  • Call and await all serverWillStart plugins, throwing any errors from them.

The http server integration libraries all call willStart when you apply middleware to their framework. What they actually do with the promise returned from willStart is inconsistent:

  • Some of them (eg hapi) immediately await it. That means any error from a plugin will cause the applyMiddleware to throw.
  • Some of them (eg express) do the equivalent of awaiting it at the beginning of every HTTP request. That means that any error from a plugin will cause any HTTP request (whether GraphQL or something else like CORS or fetching playground) to fail. Note though that comments make it clear that we do NOT intend errors from ApolloGateway.load() to cause these non-GraphQL requests to fail, though they should at least wait until that load call is done.
  • Some of them (eg Azure functions) only await the promise while handling GraphQL requests. So other things like CORS or playground don't even wait for load() to finish and errors from plugins don't affect these other requests.

My proposal is that we should run serverWillStart plugins once both of these things have happened:

  • HTTP server integration library has called willStart
  • Loading the schema has successfully occurred (possibly not the first poll)

If this happens during the willStart call (because the schema was already loaded when we called willStart, or because the inial load completes successfully during willStart), then it'll work just like now, and any errors from plugins will be thrown by the willStart promise.

If on the other hand, the initial load fails but a later polled load succeeds, we should call serverWillStart plugins at that point. If those plugins throw, we should make actual GraphQL operations throw via ApolloServer.graphQLServerOptions throwing, just like that function currently throws if there is no schema (where it awaits this.schemaDerivedData).

Note that it's probably worth fixing apollographql/apollo-server#4428 in conjunction with this.

@glasser
Copy link
Member Author

glasser commented Sep 23, 2020

Also see apollographql/apollo-server#4588. Perhaps we should actually call serverWillStart once both of these things has happened:

  • HTTP server integration library has called willStart OR something has called graphQLServerOptions (eg because they're using calling the old express middleware directly instead of using AS.getMiddleware/applyMiddleware?)
  • Loading the schema has successfully occurred (possibly not the first poll)

@glasser
Copy link
Member Author

glasser commented Sep 23, 2020

Eh never mind on the last comment — the old middleware is not actually part of the 2.x API.

@abernix abernix transferred this issue from apollographql/apollo-server Jan 15, 2021
@glasser
Copy link
Member Author

glasser commented Jan 27, 2021

(Another alternative is that we fix this in a major version bump to Apollo Server and we introduce an async ApolloServer.start() in that new version.)

@glasser
Copy link
Member Author

glasser commented Feb 3, 2021

We should also strongly consider making the built-in health check fail when the schema is not loaded.

@glasser
Copy link
Member Author

glasser commented Feb 4, 2021

I think the best fix will be to introduce async ApolloServer.start(), which will be optional in AS2 and can become required in AS3. We can continue to have the behavior where things start kinda randomly if you don't call it, but if you want control over when your server starts (including loading the schema) and the ability to handle the error, you can call this.

@glasser glasser self-assigned this Feb 11, 2021
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 3, 2021
Full description to come, but tl;dr:

Previously server startup worked like this:

- AS constructor runs
- If no gateway, calculate schema and schema derived data immediately
- If gateway, kick off gateway.load from the end of the constructor, and if it
  async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but not await)
  protected willStart, which is an async function that first waits for the
  gateway to load the schema if necessary and then runs serverWillStart plugin
  functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that promise.
  And also if there's no schema fail with an error.

Now server startup works like this:
- There's an explicit state machine situation inside AS
- AS constructor initializes state with schema directly if no gateway
- If there is a gateway the constructor DOES NOT KICK OFF gateway.load
- You can now call `await server.start()` yourself, which will first await
  gateway.load if necessary, and then await all serverWillStart calls
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling graphqlServerOptions
- graphqlServerOptions now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

Overall changes:
- If you're using `apollo-server`, startup errors will cause `listen` to reject,
  nothing else necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself after the constructor, which will let you detect
  startup errors.
- But if you don't do that, the server will still start up with similar
  properties. gateway.load won't start until the integration's `ensureStarting`
  or graphqlServerOptions' `ensuresStarted` though. Errors will be logged.
- Also if you don't call `server.start()`, the full stack trace of any startup
  error will be logged on *every* failed graphql request instead of just a short
  message suggesting there's more in logs (but it's only at the beginning of
  your logs)

Yes this is the tl;dr version that I jotted off at the end of my work day off
the top of my head without going back and skimming through the PR for details :)

Fixes #4921. Fixes apollographql/federation#335.
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 3, 2021
Full description to come, but tl;dr:

Previously server startup worked like this:

- AS constructor runs
- If no gateway, calculate schema and schema derived data immediately
- If gateway, kick off gateway.load from the end of the constructor, and if it
  async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but not await)
  protected willStart, which is an async function that first waits for the
  gateway to load the schema if necessary and then runs serverWillStart plugin
  functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that promise.
  And also if there's no schema fail with an error.

Now server startup works like this:
- There's an explicit state machine situation inside AS
- AS constructor initializes state with schema directly if no gateway
- If there is a gateway the constructor DOES NOT KICK OFF gateway.load
- You can now call `await server.start()` yourself, which will first await
  gateway.load if necessary, and then await all serverWillStart calls
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling graphqlServerOptions
- graphqlServerOptions now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

Overall changes:
- If you're using `apollo-server`, startup errors will cause `listen` to reject,
  nothing else necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself after the constructor, which will let you detect
  startup errors.
- But if you don't do that, the server will still start up with similar
  properties. gateway.load won't start until the integration's `ensureStarting`
  or graphqlServerOptions' `ensuresStarted` though. Errors will be logged.
- Also if you don't call `server.start()`, the full stack trace of any startup
  error will be logged on *every* failed graphql request instead of just a short
  message suggesting there's more in logs (but it's only at the beginning of
  your logs)

Yes this is the tl;dr version that I jotted off at the end of my work day off
the top of my head without going back and skimming through the PR for details :)

Fixes #4921. Fixes apollographql/federation#335.
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 4, 2021
Full description to come, but tl;dr:

Previously server startup worked like this:

- AS constructor runs
- If no gateway, calculate schema and schema derived data immediately
- If gateway, kick off gateway.load from the end of the constructor, and if it
  async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but not await)
  protected willStart, which is an async function that first waits for the
  gateway to load the schema if necessary and then runs serverWillStart plugin
  functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that promise.
  And also if there's no schema fail with an error.

Now server startup works like this:
- There's an explicit state machine situation inside AS
- AS constructor initializes state with schema directly if no gateway
- If there is a gateway the constructor DOES NOT KICK OFF gateway.load
- You can now call `await server.start()` yourself, which will first await
  gateway.load if necessary, and then await all serverWillStart calls
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling graphqlServerOptions
- graphqlServerOptions now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

Overall changes:
- If you're using `apollo-server`, startup errors will cause `listen` to reject,
  nothing else necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself after the constructor, which will let you detect
  startup errors.
- But if you don't do that, the server will still start up with similar
  properties. gateway.load won't start until the integration's `ensureStarting`
  or graphqlServerOptions' `ensuresStarted` though. Errors will be logged.
- Also if you don't call `server.start()`, the full stack trace of any startup
  error will be logged on *every* failed graphql request instead of just a short
  message suggesting there's more in logs (but it's only at the beginning of
  your logs)

Yes this is the tl;dr version that I jotted off at the end of my work day off
the top of my head without going back and skimming through the PR for details :)

Fixes #4921. Fixes apollographql/federation#335.
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 4, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
-`new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 4, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 16, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 16, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 16, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 18, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 19, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 22, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it won't work before that method is gone. Consider whether
  you can eliminate that call by just calling `start`, or perhaps call
  `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 (other than
for `apollo-server`).

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:
- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to
  start. This involves verifying that you can actually do top-level await in
  the contexts that matter. (eg if it turns out that you really can't call await
  before you assign a handler in Lambda, that's interesting and may require some
  other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK
glasser added a commit to apollographql/apollo-server that referenced this issue Mar 22, 2021
Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it
    async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await)
  the protected `willStart` function, which is an async function that first
  waits for the gateway to load the schema if necessary and then runs
  serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise.
  And also, if there's no schema, fail with an error.

Now server startup works like this:
- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like
    before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await
  `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()`
  will just transparently do this for you; explicit `start()` is just for
  integrations!
- Serverless frameworks also call it automatically for you in the background
  (kicked off by the constructor) because their startup has to be
  synchronous; if it fails then future requests will all fail (and log) as before.
- The integration places that used to call willStart now call
  `server.ensureStarting()` instead which will kick off server.start in the
  background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally
  right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the
  server if necessary and throw if it threw.

The overall change to user experience:
- If you're using `apollo-server`, startup errors will cause `listen` to reject;
  no code changes are necessary.
- If you're using a serverless integration, the behavior will be relatively similar,
  except that the startup error will be logged on all requests instead of just
  the first one.
- If you're using an integration you are encouraged to call `await
  server.start()` yourself immediately after the constructor, which will let
  you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When
  you try to execute your first GraphQL request, `start` will happen if it
  hasn't already. Also an integration call like `server.applyMiddleware` will
  initiate a background `start`. If startup fails, the startup error will be
  logged on *every* failed graphql request, not just the first time like
  happened before.
- If you have your own ApolloServer subclass that calls the protected
  `willStart` method, it will still work (the method isn't deleted) but you
  should rewrite it to either `await this.start()` or `this.ensureStarting()` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor
release. We are likely to make `start()` required in Apollo Server 3 for
non-serverless integrations.

Also:
- Previously we used the deprecated `ApolloServer.schema` field to determine
  whether to install ApolloServerPluginInlineTrace, which we want to have active
  by default for federated schemas only. If you're using a gateway, this field
  isn't actually set at the time that ensurePluginInstantiation reads it.
  That's basically OK because we don't want to turn on the plugin automatically
  in the gateway, but in the interest of avoiding use of the deprecated field, I
  refactored it so that `ApolloServerPluginInlineTrace` is installed by default
  (ie, if you don't install your own version or install
  `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and
  then (if it's installed automatically) it decides whether or not to be active
  by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema
  is federated, instead of in `ensurePluginInstantiation`. (This does mean that
  if you're not using the new `start()` or `apollo-server`, that failure won't
  make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and
  instead to very carefully select specific blocks of the file to format them
  several times per minute. Apparently I screwed up once and ran it once on
  `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier
  changes" to "actual changes" in that file is low enough that I'd rather just
  leave the changes in this PR rather than spending time carefully reverting
  them. (It's one of the files I work on the most and being able to keep it
  prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the
  `@josephg/resolvable` npm package, which does basically the same thing.
  Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to
  lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead
    of starting with an actual ioredis implementation and mocking out some
    internals
  - clean up Jest fake timers when a certain test is done
  I'm not super certain exactly which of these changes fixed the hangs but it
  does seem better this way. (Specifically I think the fake timer fix, which I
  did last, is what actually fixed it, but the other changes made it easier for
  me to reason about what was going on.) Can factor out into another PR if
  helpful.

Fixes #4921. Fixes apollographql/federation#335.

Co-authored-by: Stephen Barlow <[email protected]>
@glasser
Copy link
Member Author

glasser commented Mar 22, 2021

I have an alpha out (of Apollo Server, not Gateway) that resolves this issue.

First, install the alpha in your app by installing v2.22.0-alpha.0 of whatever Apollo Server packages your app directly depends on. This might look something like

If you're using the apollo-server package (and calling listen()), that should be enough.

Otherwise, if you're using (say) apollo-server-express, you'll want to insert a call to await server.start() between server = new ApolloServer and server.applyMiddleware. This does mean you'll want to be setting up your server in a context where you can await (an async function, the top level in a .mjs file, etc). This call is technically optional but you don't get the improved error handling if you don't do it!

If you're using a serverless package like apollo-server-lambda, this doesn't fully improve the situation, though I'd still be interested in hearing about how it works for you.

Please provide feedback at #5051.

More details at https://www.apollographql.com/docs/apollo-server/api/apollo-server/#start (though oops, the docs shouldn't be up until the API is in a non-alpha version, so if this doesn't get released pretty soon I might roll that back).

@glasser
Copy link
Member Author

glasser commented Mar 26, 2021

This is released as Apollo Server v2.22.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant