Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Federation: Undefined behaviour and INTERNAL_SERVER_ERROR when returning external types that do not exist #376

Closed
nihalgonsalves opened this issue Mar 4, 2020 · 10 comments · Fixed by #1305 or #1306

Comments

@nihalgonsalves
Copy link

If I return a Federated representation (i.e. the fields defined in @key) of another service, the behaviour is undefined when the other service returns null or throws an error in __resolveReference.


Reproduction Repo: https://github.com/nihalgonsalves/apollo-server-issue3859 - this includes the schema I will use as an example here. You can try out queries with different selection sets and using the IDs should-exist, should-not-exist and should-throw, where service B returns something, null, or throws an error respectively in its __resolveReference resolver.


  • Service A defines this schema:

    type Query {
      serviceAQueryReturningBType(ids: [ID!]!): [ServiceBType!]!
      serviceAQueryReturningNullableBType(ids: [ID!]!): [ServiceBType]!
    }
    
    extend type ServiceBType @key(fields: "id") {
      id: ID! @external
      someFieldFromB: String @external
      extendedByServiceA: String! @requires(fields: "someFieldFromB")
    }

    Note that it @requires a field from Service B to extend a type, but also returns this type from its own query.

  • Service B defines this type:

    type ServiceBType @key(fields: "id") {
      id: ID!
      someFieldFromB: String
    }
  • Now, when Service B doesn't need to be called or only has to be called once, this fails gracefully:

    serviceAQueryReturningBType(ids: ["should-not-exist"]) {
      id
      someFieldFromB # returns as null, after __resolveReference returns null
    }
  • However, if another service (in this case, when querying extendedByServiceA) requires a field from Service B, the null is sent to Service A, causing an internal server error: Field "someFieldFromB" was not found in response:

    serviceAQueryReturningBType(ids: ["should-not-exist"]) {
      id
      someFieldFromB # still returns null
      extendedByServiceA # requires someFieldFromB
    }
    {
      "errors": [
          {
          "message": "Field \"someFieldFromB\" was not found in response.",
          "extensions": {
              "code": "INTERNAL_SERVER_ERROR",
              "exception": {
              "stacktrace": [
                  "Error: Field \"someFieldFromB\" was not found in response.",
                  "    at executeSelectionSet (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:227:27)",
                  "    at executeSelectionSet (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:246:51)",
                  "    at /<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:133:40",
                  "    at Array.forEach (<anonymous>)",
                  "    at /<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:132:22",
                  "    at Generator.next (<anonymous>)",
                  "    at /<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:8:71",
                  "    at new Promise (<anonymous>)",
                  "    at __awaiter (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:4:12)",
                  "    at executeFetch (/<redacted>/node_modules/@apollo/gateway/dist/executeQueryPlan.js:103:12)"
              ]
              }
          }
          }
      ],
      "data": null
    }

Expected Behaviour:

  • A query with a nullable return type should fail gracefully, with an error but returning data. Even if you use a nullable return type (e.g. try the queries with serviceAQueryReturningNullableBType in the bug reproduction repo), you still get an internal server error
  • When the return type is nullable, this error shouldn't fail the entire query: if you try the nullable query with ids: ["should-exist", "should-not-exist"], both objects come back as null due to the internal server error, even though one of the objects can be resolved on its own.
  • When non-null, a more descriptive error should be returned, and the gateway should not pass on undefined to the next service (the error is thrown from here: https://github.com/apollographql/apollo-server/blob/master/packages/apollo-gateway/src/executeQueryPlan.ts#L399-L406). In general it should not throw an internal server error as even non-affected queries in the request return no data when this happens.

I believe that there should also be defined behaviour/documentation about exactly what happens when __resolveReference throws an error or returns null.

Versions:

  • @apollo/federation and @apollo/gateway version 0.13.2
  • apollo-server version 2.10.
@trevor-scheer
Copy link
Member

Hey @nihalgonsalves, thanks for reporting this and apologies on the delay. I agree the expected behavior should certainly be documented and well-defined by test cases.

I'm having a bit of an internal struggle deciding what I think the expected behavior should be and I'd love to hear your thoughts. Here's the direction my mind has gone while thinking about the implications introduced in this issue and apollographql/apollo-server#3914:

Should the gateway be tolerant to data inconsistencies between services? In its current form, I think we can say there's a strict contract via the @key relationship. I might argue that relaxing that contract and allowing the gateway to carry on "as best as it can" opens the door for unpredictable behavior. You may get some of the data, but it's an unexpectedly partial result. In this case, even though a field may be nullable, it's being nulled out due to a data inconsistency / breach of contract, not because the field was explicitly null.

Unfortunately I feel like there's no "just graphql" equivalent to analogize with since the joins via __resolveReference are the gateway's magic.

@nihalgonsalves
Copy link
Author

nihalgonsalves commented May 6, 2020

@trevor-scheer Yeah, I agree that there isn't an easy solution to this.

There are common two examples (not necessarily exhaustive) where this is problematic and could cause situations that are hard to work-around:

  1. Deletions of an entity on the source service which for whatever reason do not get propagated to another service. This should generally be avoided while designing systems, but it can happen.

  2. Permissions. This was the case we ran into: the service returned a @key which the source service refused to fulfil for the current user context.

Both of these can be worked around with some effort, and perhaps improving the documentation around the contract and making this explicit would prevent users from getting to such a situation, but they can cause nasty production bugs that cannot be immediately resolved.

Regarding the general question about the gateway being tolerant to data inconsistencies: when another error occurs, it currently is tolerant. Normally, if a nullable field comes from another service and that service fails to resolve (for whatever reason - permissions, HTTP failure, service down, DB down, etc), the gateway returns partial data and also the error (even though it's null not because it was explicitly set to null. Within the limitation of the current GraphQL primitives, I think the null + error is acceptable). It's only this special case that kills the entire query.

In general, especially when dealing with eventually consistent systems, I believe that the gateway should be as fault-tolerant as possible while also reporting errors. It otherwise requires coordination between systems. I feel it should be up to the application to decide - through a combination of nullability in schema design and the error policy on the client (i.e. a view that is critical can choose to discard any partial data in Apollo Client).

P.S.: The current situation is that the entire query fails with an internal server error. I think that the minimum we could agree on, regardless of whether it's decided that it is a fault-tolerant error or query-failing error, is that the error should be well-defined and descriptive.

@mduesterhoeft
Copy link
Contributor

@trevor-scheer I agree to what Nihal said about the gateway being tolerant in other cases. So I think the solution I proposed here apollographql/apollo-server#3914 aligns nicely with what the gateway is doing in other cases

I might argue that relaxing that contract and allowing the gateway to carry on "as best as it can" opens the door for unpredictable behavior. You may get some of the data, but it's an unexpectedly partial result. In this case, even though a field may be nullable

I think at the moment what you get is an unexpectedly partial result. The gateway stops resolving data when it cannot resolve an expected entity and returns the data it has retrieved up to this point.

See the example I described here. apollographql/apollo-server#3914 (comment)

Notice how the name is not resolved also for 0987654321 which would be possible to resolve. Also If I query these two books in a different order, the result becomes different. The gateway would resolve everything until it processes the book unknown to the book service. Every following product would fail to resolve further.

@trevor-scheer
Copy link
Member

trevor-scheer commented May 13, 2020

Thank you both for the input! I did some more thinking after I posted my response and came to the conclusion that my initial argument about data consistency across services is too strict and unrealistic.

The scenario I played through my head was as simple as this:
Suppose user and review services, and a user can have reviews like so:

{
  user {
    reviews {
      content
    }
  }
}

It's silly to expect the review service needs to know about a user's creation (and maintain some kind of empty state for said user).

Furthermore, I agree with what you're saying @mduesterhoeft about what's currently happening, and that this is an edge case for @requires.

I'll take another look at the associated PR!

@abernix abernix transferred this issue from apollographql/apollo-server Jan 15, 2021
@Kartikkumargujarati
Copy link

@nihalgonsalves This is the exact same issue we are facing right now. Were you able to find a stop-gap solution for this?

@mduesterhoeft Thanks for the PR that addresses this issue. I can see that the PR was closed and is put "below the line". I am gonna be selfish here and ask if we please re-open that PR again @trevor-scheer @abernix 😄?

@epitaphmike
Copy link

epitaphmike commented Aug 2, 2021

I am also running into this issue. It seems strange that an entity from an external service would still try to resolve its fields if the returned is value is null. I am not sure if I am completely missing something @trevor-scheer @abernix

For example: Using the example in the docs for resolving in federation

Product Service

type Product @key(fields: "upc") @key(fields: "sku") {
  upc: String!
  sku: String!
  price: String!
}

Review Service

type Review {
  product: Product
}

# This is a "stub" of the Product entity (see below)
extend type Product @key(fields: "upc") {
  upc: String! @external
}
{
  Review: {
    product(review) {
      return { __typename: "Product", upc: review.upc };
    }
  }
}
{
  Product: {
    __resolveReference(reference) {
      return fetchProductByUPC(reference.upc);
    }
  }
}
{
  reviews {
    product {
      name
      price
    }
  }
}

In the above query, if the product is null it will throw an error saying Cannot return null for non-nullable field Product.price.

Why would the resolving of the fields continue if the base entity Product was null.

If you short circuit on the __resolveReference with null, or even if it naturally returns null if no record exists it still tries to resolve the name and price fields for the entity.

{
  Product: {
    __resolveReference(reference) {
      return null;
    }
  }
}

@pcmanus
Copy link
Contributor

pcmanus commented Dec 14, 2021

I agree we should fix this and want to try getting a fix committed. Now, while I agree apollographql/apollo-server#3914 is targeting the right part of the code, I do think it should be slightly modified. Anyway, as said PR couldn't be re-open per-se (since it is against the wrong repository), I took the liberty to create a new PR, #1305 (there is one for main too at #1306), which reuse the commit from that earlier PR but amend it with the changes I had in mind, the details on which I explain in the PR description. @mduesterhoeft: I'd particularly love to have your opinion since I essentially amended your commit.

pcmanus pushed a commit to pcmanus/federation that referenced this issue Jan 3, 2022
Some times, when a @requires is just after a key, we can simply collect
the required field before taking the key. Other times, we encounter
a @require in a type T of a subgraph A, and have to jump to subgraph B
to get the required field. In that case, once we've collected the
required fields, we need to "resume" query on T in A, and that means
using a key there. The code dealing with that post-require "return key"
later case was incorrect, essentially using a key on subgraph B instead
of one on subgraph A. As a consequence, subgraphs that shouldn't have
been allowed to compose (because they were missing the needed key on A)
were allowed to composed, and the code later failed during query
planning.

This commit fixes this issue, and adds tests for that case. One of the
test introduced _fails_ as of this commit, but that is due to the
problem describe on apollographql#376 and the fix will be in a followup commit.
pcmanus pushed a commit to pcmanus/federation that referenced this issue Jan 3, 2022
pcmanus pushed a commit to pcmanus/federation that referenced this issue Jan 3, 2022
pcmanus pushed a commit to pcmanus/federation that referenced this issue Jan 3, 2022
@johnciprian
Copy link

@epitaphmike I'm having the exact same problem as you had. Did you ever find a solution to this?

@benweatherman
Copy link
Contributor

howdy @johnciprian! Are you able to use @apollo/[email protected] or @apollo/[email protected]? There were changes to both major versions to address this issue (#1305 and #1306). If there are still errors with those versions, would you mind opening up a new issue with as much info as you can about how to reproduce it?

@johnciprian
Copy link

@benweatherman I just tried it with v2 and it works as expected. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment