Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apollo engine reporting blocks data and error propagation in '@apollo/gateway' managed federation deployment #3052

Closed
heysailor opened this issue Jul 17, 2019 · 10 comments · Fixed by #3056

Comments

@heysailor
Copy link

heysailor commented Jul 17, 2019

Using versions:
@apollo/gateway 0.7.1
apollo-server-express 2.7.0
node 10.15.3

Setup:
Apollo managed federation gateway server. Local development on Mac, deployment to zeit/now v2.

Expected:
Apollo server produces a synthesised federated schema, passing/reformatting errors as produced by child services.

Actual:
If a graphql error is produced by a child service, the error message below is produced.

"errors": [
    {
      "message": "500: Internal Server Error",
      "extensions": {
        "code": "INTERNAL_SERVER_ERROR",
        "response": {
          "url": "https://engage.xxx.xxx/graphql",
          "status": 500,
          "statusText": "Internal Server Error",
          "body": {
            "errors": [
              {
                "message": "addProtobufError called after stopTiming!",
                "extensions": {
                  "code": "INTERNAL_SERVER_ERROR"
                }
              }
            ]
          }
        }
      }
    }
  ],
  "data": {
    "user": {
      "name": "Nick McIntosh",
      "stars": null
    }
  },
...
]

The query generating this error spans two services

  • users.xxx.xxx which is providing the name field fine
  • engage.xxx.xxx which is responsible for the stars parameter, which produces a graphql error when not authenticated.

Searching the Apollo Server codebase, the error message originates from the addProtobufError method of EngineReportingTreeBuilder from apollo-engine-reporting/src/treeBuilder.ts. addProtobufError appears to be called when an error occurs in a child service; but I cannot see why it is after stopTiming has been called - perhaps someone familiar with the package may be able to see the issue.

@jacob-ebey
Copy link

Here is a bare bones example. I'm excited to push this further after the Apollo Day in Seattle (I'm the one who asked about errors lolol).

https://github.com/jacob-ebey/apollo-error-3052

@heysailor
Copy link
Author

Hey awesome @jacob-ebey - apologies for no reproduction, I was using an ancient errors handling setup and was making sure it wasn't just due to that. Throwing standard ApolloError instances still causes this error on the gateway.

In my example above, the error url field suggests engage.xxx.xxx is generating the problem. However, engage is playing nice and passing a well formatted Apollo error.

@jacob-ebey
Copy link

It's a strange repro case because if you use the gateway and specify the services manually the errors are relayed properly. It's only when the gateway is retrieving it's list of services from the managed apollo platform.

@heysailor
Copy link
Author

heysailor commented Jul 18, 2019

Yes, I think the error passing is only interuptted by the Apollo engine reporting mechanism. When you don't provide a key for the managed federation, no reporting occurs. The Apollo engine reporting is stumbling when it reports errors to the engine API, and its own error is preventing propagation of both the resolved data and the error to the user.

I've encountered another error that oddly prevents more than one successful resolved query - after which it throws with format called before end of execution? and blocks the resolver. This error orginates in line 77 of 'apollo-engine-reporting/src/federatedExtension.ts'.

I think a production Apollo engine reporting solution will need to gracefully handle errors within itself without blocking graphql resolvers.

@heysailor heysailor changed the title 'addProtobufError called after stopTiming' error in @apollo/gateway managed federation deployment Apollo engine reporting blocks data and error propagation in '@apollo/gateway' managed federation deployment Jul 18, 2019
@zionts
Copy link
Contributor

zionts commented Jul 18, 2019

@heysailor I'm taking a look into this now, and following the reproduction case supplied by @jacob-ebey , I actually found that the same error of addProtobufError called after stopTiming! was coming even when not in managed federation mode. I'm going to investigate this issue further and figure out what's causing this illegal state exception.

I'd also love to investigate the issue that you reported in a comment @heysailor, where the resolver is blocked(!!!) by the federatedExtension. That is certainly not the intended behavior, and a reproduction would help further investigate there, too!

@zionts
Copy link
Contributor

zionts commented Jul 18, 2019

I opened up a PR that I believe fixes the problem. I'll check on getting an alpha build out of apollo-engine-reporting so that you can verify that it fixes the repro! :)

@heysailor
Copy link
Author

Cheers @zionts - I see you've solved the second error. Looking forward to the PR merging, federation works a treat otherwise.

glasser added a commit that referenced this issue Jul 26, 2019
The extension stack executionDidEnd method gets called before didEncounterErrors
for GraphQL errors returned from execution (although confusingly the plugin
executionDidEnd method gets called after), which caused the assertion that
nothing gets added to the EngineReportingTreeBuilder after stopTiming to
fail. Fix by moving stopTiming to the last possible moment: format().

Actually test error reporting, including both kinds of rewriting.

Add a comment noting that backend parse and validation errors don't get
reported.

Fixes #3052.
glasser added a commit that referenced this issue Jul 26, 2019
The extension stack executionDidEnd method gets called before didEncounterErrors
for GraphQL errors returned from execution (although confusingly the plugin
executionDidEnd method gets called after), which caused the assertion that
nothing gets added to the EngineReportingTreeBuilder after stopTiming to
fail. Fix by moving stopTiming to the last possible moment: format().

Actually test error reporting, including both kinds of rewriting.

Add a comment noting that backend parse and validation errors don't get
reported.

Fixes #3052.
@glasser
Copy link
Member

glasser commented Jul 26, 2019

I've rewritten @zionts' PR (with tests this time) to fix the blatant brokenness of error reporting from federated backends.

I don't really understand what you've written here so I'm not sure if this fix addresses it as well:

I've encountered another error that oddly prevents more than one successful resolved query - after which it throws with format called before end of execution? and blocks the resolver. This error orginates in line 77 of 'apollo-engine-reporting/src/federatedExtension.ts'.

glasser added a commit that referenced this issue Jul 26, 2019
The extension stack executionDidEnd method gets called before didEncounterErrors
for GraphQL errors returned from execution (although confusingly the plugin
executionDidEnd method gets called after), which caused the assertion that
nothing gets added to the EngineReportingTreeBuilder after stopTiming to
fail. Fix by moving stopTiming to the last possible moment: format().

Actually test error reporting, including both kinds of rewriting.

Add a comment noting that backend parse and validation errors don't get
reported.

Fixes #3052.
glasser added a commit that referenced this issue Jul 26, 2019
The extension stack executionDidEnd method gets called before didEncounterErrors
for GraphQL errors returned from execution (although confusingly the plugin
executionDidEnd method gets called after), which caused the assertion that
nothing gets added to the EngineReportingTreeBuilder after stopTiming to
fail. Fix by moving stopTiming to the last possible moment: format().

Actually test error reporting, including both kinds of rewriting.

Add a comment noting that backend parse and validation errors don't get
reported.

Fixes #3052.
glasser pushed a commit that referenced this issue Jul 26, 2019
The extension stack executionDidEnd method gets called before didEncounterErrors
for GraphQL errors returned from execution (although confusingly the plugin
executionDidEnd method gets called after), which caused the assertion that
nothing gets added to the EngineReportingTreeBuilder after stopTiming to
fail. Fix by moving stopTiming to the last possible moment: format().

Actually test error reporting, including both kinds of rewriting.

Add a comment noting that backend parse and validation errors don't get
reported.

Fixes #3052.
@glasser
Copy link
Member

glasser commented Jul 26, 2019

This should be fixed in [email protected], which brings in [email protected]. Please let me know if this doesn't address all of the issues you ran into!

@heysailor
Copy link
Author

Thanks @glasser - should fix it all up - the second error I referred to looked like it was coming from a code issue in apollo-engine-reporting, as was the original error I referred to. Will report back, if further problems, appreciate all the work 👍🏻👍🏻

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants