Federation: Undefined behaviour and INTERNAL_SERVER_ERROR when returning external types that do not exist #376

nihalgonsalves · 2020-03-04T12:08:34Z

If I return a Federated representation (i.e. the fields defined in @key) of another service, the behaviour is undefined when the other service returns null or throws an error in __resolveReference.

Reproduction Repo: https://github.com/nihalgonsalves/apollo-server-issue3859 - this includes the schema I will use as an example here. You can try out queries with different selection sets and using the IDs should-exist, should-not-exist and should-throw, where service B returns something, null, or throws an error respectively in its __resolveReference resolver.

Service A defines this schema:

type Query {
  serviceAQueryReturningBType(ids: [ID!]!): [ServiceBType!]!
  serviceAQueryReturningNullableBType(ids: [ID!]!): [ServiceBType]!
}

extend type ServiceBType @key(fields: "id") {
  id: ID! @external
  someFieldFromB: String @external
  extendedByServiceA: String! @requires(fields: "someFieldFromB")
}

Note that it @requires a field from Service B to extend a type, but also returns this type from its own query.

Service B defines this type:

type ServiceBType @key(fields: "id") {
  id: ID!
  someFieldFromB: String
}

Now, when Service B doesn't need to be called or only has to be called once, this fails gracefully:

serviceAQueryReturningBType(ids: ["should-not-exist"]) {
  id
  someFieldFromB # returns as null, after __resolveReference returns null
}

However, if another service (in this case, when querying extendedByServiceA) requires a field from Service B, the null is sent to Service A, causing an internal server error: Field "someFieldFromB" was not found in response:

serviceAQueryReturningBType(ids: ["should-not-exist"]) {
  id
  someFieldFromB # still returns null
  extendedByServiceA # requires someFieldFromB
}

{
  "errors": [
      {
      "message": "Field \"someFieldFromB\" was not found in response.",
      "extensions": {
          "code": "INTERNAL_SERVER_ERROR",
          "exception": {
          "stacktrace": [
              "Error: Field \"someFieldFromB\" was not found in response.",
              "    at executeSelectionSet (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:227:27)",
              "    at executeSelectionSet (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:246:51)",
              "    at /<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:133:40",
              "    at Array.forEach (<anonymous>)",
              "    at /<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:132:22",
              "    at Generator.next (<anonymous>)",
              "    at /<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:8:71",
              "    at new Promise (<anonymous>)",
              "    at __awaiter (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:4:12)",
              "    at executeFetch (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:103:12)"
          ]
          }
      }
      }
  ],
  "data": null
}

Expected Behaviour:

A query with a nullable return type should fail gracefully, with an error but returning data. Even if you use a nullable return type (e.g. try the queries with serviceAQueryReturningNullableBType in the bug reproduction repo), you still get an internal server error
When the return type is nullable, this error shouldn't fail the entire query: if you try the nullable query with ids: ["should-exist", "should-not-exist"], both objects come back as null due to the internal server error, even though one of the objects can be resolved on its own.
When non-null, a more descriptive error should be returned, and the gateway should not pass on undefined to the next service (the error is thrown from here: https://github.com/apollographql/apollo-server/blob/master/packages/apollo-gateway/src/executeQueryPlan.ts#L399-L406). In general it should not throw an internal server error as even non-affected queries in the request return no data when this happens.

I believe that there should also be defined behaviour/documentation about exactly what happens when __resolveReference throws an error or returns null.

Versions:

@apollo/federation and @apollo/gateway version 0.13.2
apollo-server version 2.10.

The text was updated successfully, but these errors were encountered:

trevor-scheer · 2020-05-05T01:46:22Z

Hey @nihalgonsalves, thanks for reporting this and apologies on the delay. I agree the expected behavior should certainly be documented and well-defined by test cases.

I'm having a bit of an internal struggle deciding what I think the expected behavior should be and I'd love to hear your thoughts. Here's the direction my mind has gone while thinking about the implications introduced in this issue and apollographql/apollo-server#3914:

Should the gateway be tolerant to data inconsistencies between services? In its current form, I think we can say there's a strict contract via the @key relationship. I might argue that relaxing that contract and allowing the gateway to carry on "as best as it can" opens the door for unpredictable behavior. You may get some of the data, but it's an unexpectedly partial result. In this case, even though a field may be nullable, it's being nulled out due to a data inconsistency / breach of contract, not because the field was explicitly null.

Unfortunately I feel like there's no "just graphql" equivalent to analogize with since the joins via __resolveReference are the gateway's magic.

nihalgonsalves · 2020-05-06T08:51:37Z

@trevor-scheer Yeah, I agree that there isn't an easy solution to this.

There are common two examples (not necessarily exhaustive) where this is problematic and could cause situations that are hard to work-around:

Deletions of an entity on the source service which for whatever reason do not get propagated to another service. This should generally be avoided while designing systems, but it can happen.
Permissions. This was the case we ran into: the service returned a @key which the source service refused to fulfil for the current user context.

Both of these can be worked around with some effort, and perhaps improving the documentation around the contract and making this explicit would prevent users from getting to such a situation, but they can cause nasty production bugs that cannot be immediately resolved.

Regarding the general question about the gateway being tolerant to data inconsistencies: when another error occurs, it currently is tolerant. Normally, if a nullable field comes from another service and that service fails to resolve (for whatever reason - permissions, HTTP failure, service down, DB down, etc), the gateway returns partial data and also the error (even though it's null not because it was explicitly set to null. Within the limitation of the current GraphQL primitives, I think the null + error is acceptable). It's only this special case that kills the entire query.

In general, especially when dealing with eventually consistent systems, I believe that the gateway should be as fault-tolerant as possible while also reporting errors. It otherwise requires coordination between systems. I feel it should be up to the application to decide - through a combination of nullability in schema design and the error policy on the client (i.e. a view that is critical can choose to discard any partial data in Apollo Client).

P.S.: The current situation is that the entire query fails with an internal server error. I think that the minimum we could agree on, regardless of whether it's decided that it is a fault-tolerant error or query-failing error, is that the error should be well-defined and descriptive.

mduesterhoeft · 2020-05-06T12:27:39Z

@trevor-scheer I agree to what Nihal said about the gateway being tolerant in other cases. So I think the solution I proposed here apollographql/apollo-server#3914 aligns nicely with what the gateway is doing in other cases

I might argue that relaxing that contract and allowing the gateway to carry on "as best as it can" opens the door for unpredictable behavior. You may get some of the data, but it's an unexpectedly partial result. In this case, even though a field may be nullable

I think at the moment what you get is an unexpectedly partial result. The gateway stops resolving data when it cannot resolve an expected entity and returns the data it has retrieved up to this point.

See the example I described here. apollographql/apollo-server#3914 (comment)

Notice how the name is not resolved also for 0987654321 which would be possible to resolve. Also If I query these two books in a different order, the result becomes different. The gateway would resolve everything until it processes the book unknown to the book service. Every following product would fail to resolve further.

trevor-scheer · 2020-05-13T20:20:06Z

Thank you both for the input! I did some more thinking after I posted my response and came to the conclusion that my initial argument about data consistency across services is too strict and unrealistic.

The scenario I played through my head was as simple as this:
Suppose user and review services, and a user can have reviews like so:

{
  user {
    reviews {
      content
    }
  }
}

It's silly to expect the review service needs to know about a user's creation (and maintain some kind of empty state for said user).

Furthermore, I agree with what you're saying @mduesterhoeft about what's currently happening, and that this is an edge case for @requires.

I'll take another look at the associated PR!

Kartikkumargujarati · 2021-07-15T23:10:24Z

@nihalgonsalves This is the exact same issue we are facing right now. Were you able to find a stop-gap solution for this?

@mduesterhoeft Thanks for the PR that addresses this issue. I can see that the PR was closed and is put "below the line". I am gonna be selfish here and ask if we please re-open that PR again @trevor-scheer @abernix 😄?

epitaphmike · 2021-08-02T13:38:15Z

I am also running into this issue. It seems strange that an entity from an external service would still try to resolve its fields if the returned is value is null. I am not sure if I am completely missing something @trevor-scheer @abernix

For example: Using the example in the docs for resolving in federation

Product Service

type Product @key(fields: "upc") @key(fields: "sku") {
  upc: String!
  sku: String!
  price: String!
}

Review Service

type Review {
  product: Product
}

# This is a "stub" of the Product entity (see below)
extend type Product @key(fields: "upc") {
  upc: String! @external
}

{
  Review: {
    product(review) {
      return { __typename: "Product", upc: review.upc };
    }
  }
}

{
  Product: {
    __resolveReference(reference) {
      return fetchProductByUPC(reference.upc);
    }
  }
}

{
  reviews {
    product {
      name
      price
    }
  }
}

In the above query, if the product is null it will throw an error saying Cannot return null for non-nullable field Product.price.

Why would the resolving of the fields continue if the base entity Product was null.

If you short circuit on the __resolveReference with null, or even if it naturally returns null if no record exists it still tries to resolve the name and price fields for the entity.

{
  Product: {
    __resolveReference(reference) {
      return null;
    }
  }
}

pcmanus · 2021-12-14T10:28:05Z

I agree we should fix this and want to try getting a fix committed. Now, while I agree apollographql/apollo-server#3914 is targeting the right part of the code, I do think it should be slightly modified. Anyway, as said PR couldn't be re-open per-se (since it is against the wrong repository), I took the liberty to create a new PR, #1305 (there is one for main too at #1306), which reuse the commit from that earlier PR but amend it with the changes I had in mind, the details on which I explain in the PR description. @mduesterhoeft: I'd particularly love to have your opinion since I essentially amended your commit.

@requires

Some times, when a @requires is just after a key, we can simply collect the required field before taking the key. Other times, we encounter a @require in a type T of a subgraph A, and have to jump to subgraph B to get the required field. In that case, once we've collected the required fields, we need to "resume" query on T in A, and that means using a key there. The code dealing with that post-require "return key" later case was incorrect, essentially using a key on subgraph B instead of one on subgraph A. As a consequence, subgraphs that shouldn't have been allowed to compose (because they were missing the needed key on A) were allowed to composed, and the code later failed during query planning. This commit fixes this issue, and adds tests for that case. One of the test introduced _fails_ as of this commit, but that is due to the problem describe on apollographql#376 and the fix will be in a followup commit.

Fixes apollographql#376 Co-authored-by: Sylvain Lebresne <[email protected]>

johnciprian · 2022-08-03T20:05:29Z

@epitaphmike I'm having the exact same problem as you had. Did you ever find a solution to this?

benweatherman · 2022-08-03T20:43:26Z

howdy @johnciprian! Are you able to use @apollo/[email protected] or @apollo/[email protected]? There were changes to both major versions to address this issue (#1305 and #1306). If there are still errors with those versions, would you mind opening up a new issue with as much info as you can about how to reproduce it?

johnciprian · 2022-08-05T19:29:55Z

@benweatherman I just tried it with v2 and it works as expected. Thank you!

abernix transferred this issue from apollographql/apollo-server Jan 15, 2021

pcmanus added the 🍐 error-handling label Dec 14, 2021

This was linked to pull requests Dec 14, 2021

Ensure error resolving an entity don't impact resolvable ones (for version-0.x branch) #1305

Merged

Ensure error resolving an entity don't impact resolvable ones (for main branch) #1306

Merged

pcmanus mentioned this issue Dec 14, 2021

Ensure error resolving an entity don't impact resolvable ones (for main branch) #1306

Merged

pcmanus pushed a commit to pcmanus/federation that referenced this issue Jan 3, 2022

Handle unresolved elements gracefully

fa3b1c2

Fixes apollographql#376 Co-authored-by: Sylvain Lebresne <[email protected]>

pcmanus pushed a commit to pcmanus/federation that referenced this issue Jan 3, 2022

Add tests for issue apollographql#376 and related behaviour

13a8091

pcmanus pushed a commit to pcmanus/federation that referenced this issue Jan 3, 2022

Handle unresolved elements gracefully

78a20b2

Fixes apollographql#376 Co-authored-by: Sylvain Lebresne <[email protected]>

pcmanus closed this as completed in #1306 Jan 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Federation: Undefined behaviour and INTERNAL_SERVER_ERROR when returning external types that do not exist #376

Federation: Undefined behaviour and INTERNAL_SERVER_ERROR when returning external types that do not exist #376

nihalgonsalves commented Mar 4, 2020

trevor-scheer commented May 5, 2020

nihalgonsalves commented May 6, 2020 •

edited

Loading

mduesterhoeft commented May 6, 2020

trevor-scheer commented May 13, 2020 •

edited

Loading

Kartikkumargujarati commented Jul 15, 2021

epitaphmike commented Aug 2, 2021 •

edited

Loading

pcmanus commented Dec 14, 2021

johnciprian commented Aug 3, 2022

benweatherman commented Aug 3, 2022

johnciprian commented Aug 5, 2022

Federation: Undefined behaviour and INTERNAL_SERVER_ERROR when returning external types that do not exist #376

Federation: Undefined behaviour and INTERNAL_SERVER_ERROR when returning external types that do not exist #376

Comments

nihalgonsalves commented Mar 4, 2020

trevor-scheer commented May 5, 2020

nihalgonsalves commented May 6, 2020 • edited Loading

mduesterhoeft commented May 6, 2020

trevor-scheer commented May 13, 2020 • edited Loading

Kartikkumargujarati commented Jul 15, 2021

epitaphmike commented Aug 2, 2021 • edited Loading

pcmanus commented Dec 14, 2021

johnciprian commented Aug 3, 2022

benweatherman commented Aug 3, 2022

johnciprian commented Aug 5, 2022

nihalgonsalves commented May 6, 2020 •

edited

Loading

trevor-scheer commented May 13, 2020 •

edited

Loading

epitaphmike commented Aug 2, 2021 •

edited

Loading