Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add federated tracing instrumentation #25

Merged
merged 17 commits into from
Sep 27, 2019

Conversation

lennyburdette
Copy link
Contributor

@lennyburdette lennyburdette commented Aug 29, 2019

Federated GraphQL services should include timing and error
information as a Base64-encoded protocol buffer message in
the "extensions.ftv1" field.

This change includes a tracer that uses the graphql-java
instrumentation API to record field timings and info and attach it
on the execution context.

I used the Apollo Server typescript code as reference:

https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/federatedExtension.ts
https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/treeBuilder.ts

As well as my ruby implementation:

https://github.com/lennyburdette/apollo-federation-ruby/blob/federated-tracing/lib/apollo-federation/tracing/tracer.rb

Federated tracing documentation: https://www.apollographql.com/docs/apollo-server/federation/metrics/

Fixes #24

@lennyburdette
Copy link
Contributor Author

lennyburdette commented Aug 30, 2019

After a really fun day of learning maven and fighting with a broken CI system, I have evidence that this change works with Apollo tracing!

Screen Shot 2019-08-30 at 1 51 39 PM

}

/**
* Wrapper for Node information with mutable lists for children and errors.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to include this? I believe that the builder has accessors for adding children and errors.

@@ -0,0 +1,488 @@
syntax = "proto3";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to include a comment along the lines of

This file is copied over from the `apollo-engine-reporting` package, which, in turn, derives its information from Apollo's cloud infrastructure. It is expected that as the protobuf definition changes within the underlying system we are reporting to, these libraries will need to be updated in sync.

FederatedTracingState state = parameters.getInstrumentationState();

Map<Object, Object> extensions = executionResult.getExtensions();
Map<Object, Object> tracingMap = new LinkedHashMap<>(extensions == null ? Collections.emptyMap() : extensions);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a confusing name for something that is actually a copy of the extensions that are already present within the execution result and has nothing to do with tracing until we add in our own extension below. Maybe something like extensionsCopy or currentExtensionsMap instead?


Reports.Trace trace = state.toProto();

byte[] bytes = trace.toByteArray();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nbd: can we inline this?

byte[] bytes = trace.toByteArray();
tracingMap.put(KEY, Base64.getEncoder().encodeToString(bytes));

if (debuggingEnabled) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this feature, though I think it might be more useful as a log than as a separate extensions key 🤷‍♂ curious on your thoughts!

Node node = state.addNode(executionStepInfo, startNanos, endNanos);

if (throwable != null) {
node.addError(throwable);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feat: It might be useful to allow implementors to decide how they want to translate thrown errors into GraphQL errors that services report to the gateway, but we can definitely do that when we actually have a feature request and not during this round of code review 😉

Copy link
Contributor Author

@lennyburdette lennyburdette Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I left that feature out of the ruby implementation too. One thing I might do today is make it easier to add configuration options for tracing, which would give us a place to add a error translation hook in the future.

import static graphql.schema.GraphQLTypeUtil.simplePrint;

public class FederatedTracingInstrumentation extends SimpleInstrumentation {
private static final String KEY = "ftv1";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nbd: EXTENSION_KEY?

startRequestNanos = System.nanoTime();

// Create node map with the root node
nodeMap = new LinkedHashMap<>();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming suggestion: nodesByPath?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it seems like this state could likely be largely simplified to be built using the tree from TraceNode.newBuilder. I think that this object also means that building the node tree is a more arduous task at the end, and it would likely be a quick performance win (and code cleanliness win) to build the object directly rather than through a wrapper.

}

/**
* Adds node to nodeMap and recursively ensures that all parent node exist.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: nodes

List<Object> pathParts = path.toList();
ExecutionPath parentPath = ExecutionPath.fromList(pathParts.subList(0, pathParts.size() - 1));

if (!nodeMap.containsKey(parentPath)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, since I'm only giving this a brief look atm: how can we get into a state like this? Would this be caused from a bug within the instrumentation pipeline around field execution in GraphQL Java?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the backwards construction of node parents is to support indexes. The instrumentation hooks don't fire for widgets.0, just for widgets.0.id. I agree this isn't super clear though. I'll see what happens to this part of the code when I try dropping the wrapper class.

addNode(parentPath, new Node(parentPath));
}

nodeMap.get(parentPath).addChild(node);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh hm. I didn't realize we were indexing the fields by path and creating their children. I suppose that's relatively low memory overhead because we're just dealing with pointers underneath, but I haven't gotten to the point yet that necessitates indexing by field or justifies the value in it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the same strategy that I found in this fork that adds non-federated apollo tracing to the ruby gem. Pretty much just rewrote it in Java here!

You're probably right though, and removing the additional object created for each node would be a performance win. I'll take a look and come back to you if I can't figure out a way to do it!

Copy link

@zionts zionts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though there are a few naming suggestions and ideas, I think the functionality here looks good. I think that we should investigate if there is a way we can replace the wrapper around the Trace Node interface with simply calling into builders directly (since underlying fields of builders take builders as well) in order to both reduce code complexity and reduce the logical complexity of building up the protobuf object from our indirect mapping.

@lennyburdette
Copy link
Contributor Author

@zionts You were right, no need for a wrapper class at all!

The secret sauce was discovering the builder.addChildBuilder() API, which allows me to add children and errors to each node without building them. I didn't know that protos would build recursively ... very cool!

}
}

public static class Options {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

if (!nodesByPath.containsKey(parentPath)) {
Reports.Trace.Node.Builder missingParent = getParent(parentPath).addChildBuilder();

// Missing parents are always list items, so we need to add the `index` field to them
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this true? Is this relying on the way that GraphQL-Java executes? And why isn't it true for list items?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it's based on the assumption that execution goes from top to bottom.

Note though that GraphQL types can be nested lists – is that what you're asking about in the second sentence?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, and I think that's true, but I'm questioning whether that's a safe assumption to bake in and if not, how we might avoid it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an instrumentation hook called beginFieldListComplete. It seemed promising because there's a getCurrentListIndex() as part of the arguments to the hook. But it doesn't work: it's called once per list, not once per list item. (Calling code here).

I think this is the best bet. I didn't consider nested lists, so I added a test for them and it seems to work fine!

If y'all can figure out another way to create the indexed nodes, I'm all ears. Maybe I'll post something on the graphql-java spectrum to see if one of the maintainers have any ideas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got a response that indicates that there's no existing hook to instrument each item in a list.

I think this makes sense: we just want to instrument the execution of resolvers, and there aren't list item resolvers. This is especially true for lists of scalar values, as Brad mentions—firing an instrumentation hook for each item would be a waste. (I added a test to ensure that lists of scalars are recorded as you'd expect.)

I added some more comments to hopefully make it clearer. Let me know if you have any other suggestions!

Copy link

@zionts zionts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through this, I think this is a great addition, and I love your attention to detail! There are a few ideas around further optimizations and features, but I think we can tackle them in follow-ups / add them in issues for later on. @glasser , if you want to have a final look before this merges in to catch anything I may have missed, this looks all good to me 👍

Copy link
Member

@glasser glasser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a number of changes directly to your branch. You can review them as separate commits. Let me know what you think.

A few outstanding requests:

  • Can you capture parse and validation errors too?

  • Can you add something to the README?

  • Do we want to figure out a way to make this respect apollo-federation-include-trace? The idea behind this is twofold: first, you might want to be able to run some queries directly against your backend without needing traces for some reason. Secondly, it means that federated tracing is on by default if you're running a federated backend, which doesn't seem as relevant for Java. I think it's still a good idea to implement it, though. Here's what I'm thinking — how about you have an interface like HTTPHeaders with a String getHeader(String) on it. If the context object that the user passes to HTTP execution implements that interface, then the extension can use it to check the header first. This needs to be documented in the README too.

@@ -62,6 +74,23 @@
<groupId>org.codehaus.mojo</groupId>
<artifactId>cobertura-maven-plugin</artifactId>
</plugin>

<plugin>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a Maven expert. Maybe @pcarrier 's eyes on at least the Maven part of the PR could help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neithr am I, but this seems to match https://www.xolstice.org/protobuf-maven-plugin/usage.html so LGTM.

}
}

if (result instanceof DataFetcherResult && ((DataFetcherResult) result).hasErrors()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting a warning about an unchecked cast here. It can be fixed by declaring theResult as a wildcard type (DataFetcherResult<?>) instead of a "raw" type — when you declare theResult as DataFetcherResult it actually de-genericizes every method on it, so getErrors() returns List instead of List<GraphQLError>. Fixing on the branch.

objects.add(new Object());
objects.add(new Object());
return objects;
}).dataFetcher("listOfLists", env -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listOfLists and listOfScalars aren't in the schema at tracing.graphql? The test fails because of this. (Our problem for not having CI set up on this repo...) Fixed on the branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, forgot to check in that file. Thanks for catching!

// Widget.foo works normally, Widget.bar always throws an error
builder.dataFetcher("foo", env -> "hello world")
.dataFetcher("bar", env -> {
ExceptionWhileDataFetching whoops = new ExceptionWhileDataFetching(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this can just be throw new GraphQLException("whoops") and let graphql-java wrap it in ExceptionWhileDataFetching. That does remove Exception while fetching data (/widgets[1]/baz) : from the reported message, but that's a good thing: the error is already placed on an appropriate place on the node tree and does not need to repeat the node's location in the message. (Does your Ruby plugin do this?)

... except that also loses the source location, because of how the data is passed to this instrumentation. I'll fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 passing in the field location from the instrumentation around the resolver is way better than depending on a specific kind of error!


Reports.Trace trace = state.toProto();

extensionsCopy.put(EXTENSION_KEY, Base64.getEncoder().encodeToString(trace.toByteArray()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I documented what base64 alphabet to use. The apollo-server implementation use's Node's Buffer.toString('base64') which uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. (The decoder also supports the URL and filename safe alphabet which uses -_ instead of +/.) Java's encoder uses the same alphabet. So all's good.

You might want to change your Ruby implementation to use Base64.strict_encode64 instead of Base64.encode64. This leaves out newlines in the middle which aren't particularly helpful when encoded in a JSON string, and causes Node to have to use a slightly slower decode implementation.

}

errors.forEach(error -> {
Reports.Trace.Error.Builder builder = node.addErrorBuilder().setMessage(error.getMessage());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JS implementation also takes other kinds of errors and puts them on the root node — most notably parse and validation errors, which the instrumentation can instrument. Do you want to implement that too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, this was pretty easy. I'm not entirely sure if I'm grabbing locations off the throwable in the best way (in convertErrors) ... got any other ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good enough to me

@glasser
Copy link
Member

glasser commented Sep 26, 2019

Would you like to do the README and http headers change or should I?

Here's what I think would make a good interface for the header thing:

interface HTTPRequestHeaders {
  @Nullable String getHTTPRequestHeader(String caseInsensitiveHeaderName);
}

@lennyburdette
Copy link
Contributor Author

@glasser feel free to jump on it! I’m focused on pitching an investment in Apollo to the higher-ups and probably won’t get back to this for a week. 😄

@glasser glasser force-pushed the federated-tracing branch 2 times, most recently from 710ae87 to 7a62ab0 Compare September 27, 2019 18:15
@glasser
Copy link
Member

glasser commented Sep 27, 2019

Enabled CI testing of forked PRs, which would have caught some of the earlier issues!

Lenny Burdette and others added 17 commits September 27, 2019 16:20
Federated GraphQL services should include timing and error
information as a Base64-encoded protocol buffer message in
the "extensions.ftv1" field.

This change includes a tracer that uses the graphql-java
instrumentation API to record field timings and info and attach it
on the execution context.

I used the Apollo Server typescript code as reference:

https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/federatedExtension.ts
https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/treeBuilder.ts

As well as my ruby implementation:

https://github.com/lennyburdette/apollo-federation-ruby/blob/federated-tracing/lib/apollo-federation/tracing/tracer.rb

Federated tracing documentation: https://www.apollographql.com/docs/apollo-server/federation/metrics/
Avoids a warning about unchecked cast
Looks like ExceptionWhileDataFetching is something automatically added by
GraphQL-Java, but not provided to the instrumentation. Instead of only working
if the fetcher somehow throws that, find the source location another way.
Explicitly support the concept of having multiple indexes in a row.

Note that the old code's use of getLevel() was not correct: that undocumented
method means "how many named fields are in the entire path", which isn't helpful
here.

The test change is because the case in question was actually being tested, but
the test was wrong.
@glasser glasser merged commit 305bf2c into apollographql:master Sep 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Federated tracing support
4 participants