Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record response errors as events in Request Observations #859

Closed
stefaniasm opened this issue Nov 21, 2023 · 10 comments
Closed

Record response errors as events in Request Observations #859

stefaniasm opened this issue Nov 21, 2023 · 10 comments
Assignees
Labels
type: enhancement A general enhancement
Milestone

Comments

@stefaniasm
Copy link

Project details:
Spring Boot project, having:

  • org.springframework.boot' version '3.1.5'
  • spring-boot-starter-graphql:3.0.1
  • spring-boot-starter-actuator
  • spring-boot-starter-aop
  • no custom handlers, filters or error handling

According to this doc: https://docs.spring.io/spring-graphql/reference/observability.html#observability.server.request, by default, in case of data fetching errors, the graphql.error.type should be created and have an appropriate value.
But in case of Spring GraphQL, from what I know, read and debugged, errors (any GraphQl Error, for instance a ValidationError) are resolved and returned in "error" field in response, there is no Throwable exception thrown.

Having all this info, I noticed the following check in DefaultDataFetcherObservationConvention, when errorType is saved:

	protected KeyValue errorType(DataFetcherObservationContext context) {
		if (context.getError() != null) {
			return KeyValue.of(DataFetcherLowCardinalityKeyNames.ERROR_TYPE, context.getError().getClass().getSimpleName());
		} return ERROR_TYPE_NONE;
	}

context.getError() refers to a Throwable error, which is missing in the case described above. So the key is not created in case of a ValidationError, even if I expect it to be there. I am mentioning that for a successful request, I can see the key with default value NONE created, so the configurations should be ok.

Is this the intended behaviour?
If yes..then in what case I will have the DataFetcherLowCardinalityKeyNames.ERROR_TYPE populated with a different value than the default one?
If yes, does that mean that in order to collect informations about GraphQL errors I should do something custom, like a filter or a handler?

Thank you in advance!

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Nov 21, 2023
@bclozel bclozel self-assigned this Nov 21, 2023
@bclozel
Copy link
Member

bclozel commented Nov 22, 2023

Hello @stefaniasm, thanks for reaching out!

First I would like us to be on the same page, so let's use the following example to discuss.

type Mutation {
    createBook(bookInput: BookInput): Book
}

input BookInput {
    name: String
    authorId: ID
}

type Book {
    id: ID
    name: String
    author: Author
    missing: Boolean
}

type Author {
    id: ID
    firstName: String
    lastName: String
}

We'll use this mutation operation for different cases. The implementation looks like:

@Controller
public class BookController {

    @MutationMapping
    public Book createBook(@Argument @Validated BookInput bookInput) {
        // throw new IllegalStateException("this is a custom error");
        return new Book(42L, "Spring for GraphQL book", new Author(12L, "Spring", "team"));
    }
}

public class BookInput {

    @NotNull
    @Size(max=5)
    private String name;

    @NotNull
    private Long authorId;

    //...
}

Exception thrown by the data fetcher itself

If I uncomment the exception throwing in the controller itself and send the following operation:

mutation {
  createBook(bookInput: {name:"test", authorId: 12}) {
    id 
  }
}

I am getting the graphql.error.type=IllegalStateException in the "graphql.datafetcher" observation.

Server validation exception

If I comment again the exception throwing in the controller and this time send a document that will fail the validation because the name is too long:

mutation {
  createBook(bookInput: {name:"booknameistoolong", authorId: 12}) {
    id 
  }
}

I am getting the graphql.error.type=ConstraintViolationException in the "graphql.datafetcher" observation.

Invalid document

This time, let's send an invalid document that will not pass the GraphQL validation:

mutation {
  createBook(invalid: {name:"test", authorId: 12}) {
    id 
  }
}

The response is the following:

{
  "errors": [
    {
      "message": "Validation error (UnknownArgument@[createBook]) : Unknown field argument 'invalid'",
      "locations": [
        {
          "line": 2,
          "column": 14
        }
      ],
      "extensions": {
        "classification": "ValidationError"
      }
    }
  ]
}

In this case, the data fetcher is never invoked and there is no "graphql.datafetcher" observation but there is a "graphql.request" observation that states that graphql.outcome -> REQUEST_ERROR.

  1. does this match what you're seeing?
  2. the collected KeyValues for the "graphql.request" might not be enough in your case. We could collect more (including from the "errors" map), but we need to take into account that there can be multiple errors and that the KeyValues we collect need to be "low cardinality" (a low, finite number of possible values otherwise this will overwhelm metric systems). We can contribute more to high cardinality KeyValues for traces

Thanks!

@bclozel bclozel added the status: waiting-for-feedback We need additional information before we can continue label Nov 22, 2023
@stefaniasm
Copy link
Author

stefaniasm commented Nov 22, 2023

Hello @bclozel !
First, thank you for your time and for the great response! Very helpful!
Yes, the section "Invalid document" is exactly the case I was referring to, thanks for the clarifications.

So best approach to collect info for GraphQl error traces is to do a custom handler/filter in order to collect High cardinality Keys from the "errors" map, right? But in this case, we won't have metrics. Did I understand correctly?

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Nov 22, 2023
@bclozel
Copy link
Member

bclozel commented Nov 22, 2023

Are you collecting this data in metrics and/or traces?
I am wondering if the KeyValues we are providing in metrics and traces for "graphql.request" in case of errors are actually enough.

Custom solution

You can implement your own ExecutionRequestObservationConvention (actually, extend DefaultExecutionRequestObservationConvention) and add the KeyValues you would like to see in metrics (low cardinality) and traces (low and high cardinality) by overriding the relevant methods. To configure it all, you can register your custom convention against the ObservationRegostry (which is a bean in a typical Spring Boot application) as a global convention, or override the GraphQlObservationInstrumentation bean and instantiate it with your custom conventions.

Enhancement request

Here, we could also simply consider this as an enhancement issue and add useful bits to the "graphql.request" observation.
I've had a look at the OTel semantic conventions and they don't show anything here for this case in GraphQL
. The implementation does something different and buffers exception types and messages in a event span. I'm not sure how this is rendered in a trace visualization.

Maybe other instrumentations are doing things differently.

If you can share what kind of information you would like to see in metrics and traces, I could suggest a custom implementation and directly implement the changes in Spring for GraphQL.

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Nov 22, 2023
@stefaniasm
Copy link
Author

stefaniasm commented Nov 23, 2023

I have to collect them in metrics and traces.
"I am wondering if the KeyValues we are providing in metrics and traces for "graphql.request" in case of errors are actually enough." -> for what we are trying to do is not enough.

"If you can share what kind of information you would like to see in metrics and traces" -> on short, I would like to have the error message from "error" map somewhere in order to be interpreted as an error tag.

We want to be able to understand the big part of the issue directly from traces and also to have some sort of statistics about the number of issues occurred(so we also need metrics), to be aware of the message, of the fields included in the error, the type of the error (everything is in the graphQL error.message from what I noticed)

If I create a custom metric named "errors" where I collect all messages from GraphQL error map, from what I tested, this metric is seen as a 'normal' metric by actuator and probably by data collectors as well. I am not sure it's possible to force it to be seen as an error tag. By error tag I mean the highlighted part (response when calling actuator/metrics/graphql.request):
{"name":"graphql.request","baseUnit":"seconds","measurements":[{"statistic":"COUNT","value":2.0},{"statistic":"TOTAL_TIME","value":0.015097041},{"statistic":"MAX","value":0.014270875}],"availableTags":[{"tag":"graphql.outcome","values":["REQUEST_ERROR"]},{"tag":"graphql.operation","values":["query"]},{"tag":"error","values":["none"]}]}

Thank you again!

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Nov 23, 2023
@stefaniasm
Copy link
Author

stefaniasm commented Nov 29, 2023

Hello @bclozel !

I also noticed that in the scenario described in the 'Invalid document' section, the metric added in @observed(name="metricName") is not identified/created anymore, as in a successful case. It only creates the default 'graphql.request' metric. I am testing this using the actuator/metrics/{requiredMetricName} endpoint. I think that is also because the data.fetcher metrics are not created in that case..
Could you please let me know if this behavior is intentional?
Is there any way of creating custom error metrics in this case?

@bclozel
Copy link
Member

bclozel commented Nov 29, 2023

Hey @stefaniasm

I believe @Observed support is provided by Micrometer directly. It tracks calls to the annotated method. In our case, ValidationError happen during the parse+validation phases of the request, so before the request is actually processed. This means that the data fetcher method is never called, so it is expected that your @Observed is not triggered.

In the meantime, I have discussed this matter with the Observability team and I will provide an update here once I have tested various options.

@stefaniasm
Copy link
Author

Thank you! All clear now.

@bclozel
Copy link
Member

bclozel commented Dec 5, 2023

Sorry for the radio silence. I've been discussing this with the Observability team, which knows much more than I do about the various backends, formats and their limitations. Our goal here is to keep using the Observation API and not tie ourselves to a particular platform. As explained in this comment, the main problem here is that there can be multiple errors for a single GraphQL request.

We can't call observation.error(Throwable) on the observation itself, since there is no exception available; we would need to call it multiple times as well and each call would override the previous value. We could collect those errors in an "errors" KeyValue for the observation, but:

  • we can't guarantee the low cardinality aspect since there can be multiple with many combinations
  • joining multiple errors with a separator, like "validation,syntax,aborted" as high cardinality would not really help either in dashboards

Instead, I've explored adding events on the GraphQL Request observation with observation.event(Event.of(String, String)). Such events will be added as annotations on spans in the tracing world, and will automatically create additional counters in the metrics world.

Here is an example of that with a local experiment.

When sending an invalid GraphQL request:

Screenshot 2023-12-05 at 16-41-03 GraphiQL

This creates annotations on the graphql request span using the "contextual name" of the event:

Screenshot 2023-12-05 at 16-38-29 Zipkin

This also creates counters automatically under the graphql.request.* namespace, like:

$ http http://localhost:8080/actuator/metrics/graphql.request

{"name":"graphql.request","baseUnit":"seconds","measurements":[{"statistic":"COUNT","value":2.0},{"statistic":"TOTAL_TIME","value":0.041338706},{"statistic":"MAX","value":0.0}],"availableTags":[{"tag":"graphql.outcome","values":["REQUEST_ERROR"]},{"tag":"graphql.operation","values":["query"]},{"tag":"error","values":["none"]}]}

$ http http://localhost:8080/actuator/metrics/graphql.request.ValidationError

{"name":"graphql.request.ValidationError","measurements":[{"statistic":"COUNT","value":4.0}],"availableTags":[{"tag":"graphql.outcome","values":["INTERNAL_ERROR"]},{"tag":"graphql.operation","values":["query"]}]}

This adds more information to the observation and improves the situation, but we still need to decide:

  1. should we release this in the 1.2.x line or should we wait for 1.3.x?
  2. the contextual name (name of the span annotation) can be quite long, right now we're taking the graphQLError.getMessage(). Is there a way to improve this?
  3. the metric name is derived from the graphQLError.getErrorType() which is in theory bounded by the ErrorType enum. Is this good enough or are we likely to create numerous counters that will clutter the metrics?
  4. Observation events are recorded at once when the execution result is returned, this means that all events happen at the same time on the timeline and this can be hard to read

@stefaniasm
Copy link
Author

Thanks for keeping me up to date.
The improvements sound promising based on your description.

As a suggestion for point 2, maybe you could use parts of graphQLError.getMessage(), for example, using 'authorById' or 'heyThere' as contextual name might suffice, although this depends on the specific requirements of your project and the big picture.

@bclozel bclozel added type: enhancement A general enhancement and removed status: feedback-provided Feedback has been provided status: waiting-for-triage An issue we've not yet triaged labels Dec 8, 2023
@bclozel bclozel added this to the 1.3.0-M1 milestone Dec 8, 2023
@bclozel
Copy link
Member

bclozel commented Dec 8, 2023

I think we will improve a bit the proposal (especially around the contextual name of span annotations) but go with it. We've discussed it this morning and thought that this is too much of a change to introduce in the current 1.2.x line as a maintenance version, so we're targeting 1.3.0 next May.

@bclozel bclozel changed the title DefaultDataFetcherObservationConvention-missing DataFetcherLowCardinalityKeyNames.ERROR_TYPE for GraphQL errors Record GraphQL response errors as events in GraphQL Request Observations Feb 13, 2024
@bclozel bclozel changed the title Record GraphQL response errors as events in GraphQL Request Observations Record response errors as events in Request Observations Feb 13, 2024
bclozel added a commit to bclozel/spring-graphql that referenced this issue Feb 14, 2024
Prior to this commit, GraphQL Request Observations would not record
errors as Observation errors, because with GraphQL errors can partially
affect the response and there can be multiple. Instead, an invalid
request (for example) would lead to a `"graphql.outcome", "REQUEST_ERROR"`
low cardinality KeyValue. In this case, developers would not know what
type of error occured nor if there were multiple.

This commit records all errors listed in the GraphQL as
Observation.Event on the request Observation. Such events are usually
handled by the tracing handler and are recorded as span annotations for
traces. Other `ObservationHandler` annotations can leverage events in a
different fashion.

Closes spring-projectsgh-859
bclozel added a commit that referenced this issue Feb 14, 2024
Prior to this commit, GraphQL Request Observations would not record
errors as Observation errors, because with GraphQL errors can partially
affect the response and there can be multiple. Instead, an invalid
request (for example) would lead to a `"graphql.outcome", "REQUEST_ERROR"`
low cardinality KeyValue. In this case, developers would not know what
type of error occured nor if there were multiple.

This commit records all errors listed in the GraphQL as
Observation.Event on the request Observation. Such events are usually
handled by the tracing handler and are recorded as span annotations for
traces. Other `ObservationHandler` annotations can leverage events in a
different fashion.

Closes gh-859
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

3 participants