feat: add federated tracing instrumentation #25

lennyburdette · 2019-08-29T05:26:33Z

Federated GraphQL services should include timing and error
information as a Base64-encoded protocol buffer message in
the "extensions.ftv1" field.

This change includes a tracer that uses the graphql-java
instrumentation API to record field timings and related information
and attach them to the execution context.

I used the Apollo Server TypeScript code as a reference:

https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/federatedExtension.ts
https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/treeBuilder.ts

As well as my Ruby implementation:

https://github.com/lennyburdette/apollo-federation-ruby/blob/federated-tracing/lib/apollo-federation/tracing/tracer.rb

Federated tracing documentation: https://www.apollographql.com/docs/apollo-server/federation/metrics/

Fixes #24

lennyburdette · 2019-08-30T20:56:29Z

After a really fun day of learning Maven and fighting with a broken CI system, I have evidence that this change works with Apollo tracing!

zionts · 2019-09-05T22:24:30Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+    }
+
+    /**
+     * Wrapper for Node information with mutable lists for children and errors.


Do we need to include this? I believe that the builder has accessors for adding children and errors.

zionts · 2019-09-05T22:26:16Z

graphql-java-support/src/main/proto/reports.proto

@@ -0,0 +1,488 @@
+syntax = "proto3";


I think it would be useful to include a comment along the lines of

This file is copied over from the `apollo-engine-reporting` package, which, in turn, derives its information from Apollo's cloud infrastructure. It is expected that as the protobuf definition changes within the underlying system we are reporting to, these libraries will need to be updated in sync.

zionts · 2019-09-05T22:28:30Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+        FederatedTracingState state = parameters.getInstrumentationState();
+
+        Map<Object, Object> extensions = executionResult.getExtensions();
+        Map<Object, Object> tracingMap = new LinkedHashMap<>(extensions == null ? Collections.emptyMap() : extensions);


This is a confusing name for something that is actually a copy of the extensions that are already present within the execution result and has nothing to do with tracing until we add in our own extension below. Maybe something like extensionsCopy or currentExtensionsMap instead?
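For illustration, here is a minimal sketch of the copy-then-add pattern this comment is about, using only the JDK. The class name `ExtensionsCopySketch`, the method `withTrace`, and the placeholder bytes are hypothetical; in the PR the bytes come from the serialized `Reports.Trace` protobuf message.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

final class ExtensionsCopySketch {
    static final String EXTENSION_KEY = "ftv1";

    // Returns a new map holding any existing extensions plus the Base64-encoded trace.
    // The caller's map is never modified, and a null map is treated as empty.
    static Map<Object, Object> withTrace(Map<?, ?> extensions, byte[] traceBytes) {
        Map<Object, Object> extensionsCopy = new LinkedHashMap<>();
        if (extensions != null) {
            extensionsCopy.putAll(extensions);
        }
        extensionsCopy.put(EXTENSION_KEY, Base64.getEncoder().encodeToString(traceBytes));
        return extensionsCopy;
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a serialized trace (assumption for the demo).
        byte[] fakeTrace = "trace".getBytes(StandardCharsets.UTF_8);
        System.out.println(withTrace(null, fakeTrace));
        System.out.println(withTrace(Map.of("other", 1), fakeTrace));
    }
}
```

Naming the variable after what it holds at the point of declaration (a copy of the existing extensions) keeps the tracing-specific step confined to the single `put` call.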

zionts · 2019-09-05T22:28:48Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+
+        Reports.Trace trace = state.toProto();
+
+        byte[] bytes = trace.toByteArray();


nbd: can we inline this?

zionts · 2019-09-05T22:29:25Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+        byte[] bytes = trace.toByteArray();
+        tracingMap.put(KEY, Base64.getEncoder().encodeToString(bytes));
+
+        if (debuggingEnabled) {


I like this feature, though I think it might be more useful as a log than as a separate extensions key 🤷‍♂ curious on your thoughts!

zionts · 2019-09-05T22:30:59Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+            Node node = state.addNode(executionStepInfo, startNanos, endNanos);
+
+            if (throwable != null) {
+                node.addError(throwable);


feat: It might be useful to allow implementors to decide how they want to translate thrown errors into GraphQL errors that services report to the gateway, but we can definitely do that when we actually have a feature request and not during this round of code review 😉

Yeah, I left that feature out of the Ruby implementation too. One thing I might do today is make it easier to add configuration options for tracing, which would give us a place to add an error translation hook in the future.

zionts · 2019-09-05T22:33:33Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+import static graphql.schema.GraphQLTypeUtil.simplePrint;
+
+public class FederatedTracingInstrumentation extends SimpleInstrumentation {
+    private static final String KEY = "ftv1";


nbd: EXTENSION_KEY?

zionts · 2019-09-05T22:37:13Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+            startRequestNanos = System.nanoTime();
+
+            // Create node map with the root node
+            nodeMap = new LinkedHashMap<>();


naming suggestion: nodesByPath?

Also it seems like this state could likely be largely simplified to be built using the tree from TraceNode.newBuilder. I think that this object also means that building the node tree is a more arduous task at the end, and it would likely be a quick performance win (and code cleanliness win) to build the object directly rather than through a wrapper.

zionts · 2019-09-05T22:40:20Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+        }
+
+        /**
+         * Adds node to nodeMap and recursively ensures that all parent node exist.


typo: nodes

zionts · 2019-09-05T22:41:25Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+            List<Object> pathParts = path.toList();
+            ExecutionPath parentPath = ExecutionPath.fromList(pathParts.subList(0, pathParts.size() - 1));
+
+            if (!nodeMap.containsKey(parentPath)) {


Out of curiosity, since I'm only giving this a brief look atm: how can we get into a state like this? Would this be caused from a bug within the instrumentation pipeline around field execution in GraphQL Java?

The reason for the backwards construction of node parents is to support indexes. The instrumentation hooks don't fire for widgets.0, just for widgets.0.id. I agree this isn't super clear though. I'll see what happens to this part of the code when I try dropping the wrapper class.

zionts · 2019-09-05T22:42:31Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+                addNode(parentPath, new Node(parentPath));
+            }
+
+            nodeMap.get(parentPath).addChild(node);


oh hm. I didn't realize we were indexing the fields by path and creating their children. I suppose that's relatively low memory overhead because we're just dealing with pointers underneath, but I haven't gotten to the point yet that necessitates indexing by field or justifies the value in it.

I'm using the same strategy that I found in this fork that adds non-federated Apollo tracing to the Ruby gem. Pretty much just rewrote it in Java here!

You're probably right though, and removing the additional object created for each node would be a performance win. I'll take a look and come back to you if I can't figure out a way to do it!

zionts

Though there are a few naming suggestions and ideas, I think the functionality here looks good. I think that we should investigate if there is a way we can replace the wrapper around the Trace Node interface with simply calling into builders directly (since underlying fields of builders take builders as well) in order to both reduce code complexity and reduce the logical complexity of building up the protobuf object from our indirect mapping.

lennyburdette · 2019-09-06T01:17:33Z

@zionts You were right, no need for a wrapper class at all!

The secret sauce was discovering the builder.addChildBuilder() API, which allows me to add children and errors to each node without building them. I didn't know that protos would build recursively ... very cool!

zionts · 2019-09-10T00:42:36Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+        }
+    }
+
+    public static class Options {


zionts · 2019-09-10T00:47:45Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+            if (!nodesByPath.containsKey(parentPath)) {
+                Reports.Trace.Node.Builder missingParent = getParent(parentPath).addChildBuilder();
+
+                // Missing parents are always list items, so we need to add the `index` field to them


Why is this true? Is this relying on the way that GraphQL-Java executes? And why isn't it true for list items?

I assume it's based on the assumption that execution goes from top to bottom.

Note though that GraphQL types can be nested lists – is that what you're asking about in the second sentence?

that makes sense, and I think that's true, but I'm questioning whether that's a safe assumption to bake in and if not, how we might avoid it

There is an instrumentation hook called beginFieldListComplete. It seemed promising because there's a getCurrentListIndex() as part of the arguments to the hook. But it doesn't work: it's called once per list, not once per list item. (Calling code here).

I think this is the best bet. I didn't consider nested lists, so I added a test for them and it seems to work fine!

If y'all can figure out another way to create the indexed nodes, I'm all ears. Maybe I'll post something on the graphql-java spectrum to see if one of the maintainers has any ideas.

https://spectrum.chat/graphql-java/general/instrumentation-hook-for-list-items~534fa292-73b0-4cce-8832-5bbda66dcfb3

Got a response that indicates that there's no existing hook to instrument each item in a list.

I think this makes sense: we just want to instrument the execution of resolvers, and there aren't list item resolvers. This is especially true for lists of scalar values, as Brad mentions—firing an instrumentation hook for each item would be a waste. (I added a test to ensure that lists of scalars are recorded as you'd expect.)

I added some more comments to hopefully make it clearer. Let me know if you have any other suggestions!
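To make the strategy discussed in this thread concrete, here is a rough, JDK-only sketch. It assumes paths are plain lists of `String` field names and `Integer` list indexes, and it uses a hypothetical `Node` class in place of the protobuf `Reports.Trace.Node.Builder`. It is not the code in this PR, only an illustration of creating missing list-item parents on demand, including nested lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PathTreeSketch {
    // Hypothetical stand-in for the protobuf node builder.
    static final class Node {
        final String responseName; // set for named fields, null otherwise
        final Integer index;       // set for list items, null otherwise
        final List<Node> children = new ArrayList<>();

        Node(String responseName, Integer index) {
            this.responseName = responseName;
            this.index = index;
        }
    }

    final Node root = new Node(null, null);
    private final Map<List<Object>, Node> nodesByPath = new LinkedHashMap<>();

    PathTreeSketch() {
        nodesByPath.put(List.of(), root); // the empty path maps to the root node
    }

    // Called once per resolved field, mimicking a field-level instrumentation hook.
    Node addField(List<Object> path) {
        Node node = new Node((String) path.get(path.size() - 1), null);
        getParent(path).children.add(node);
        nodesByPath.put(List.copyOf(path), node);
        return node;
    }

    // Finds the parent of `path`, creating any missing ancestors along the way.
    private Node getParent(List<Object> path) {
        List<Object> parentPath = List.copyOf(path.subList(0, path.size() - 1));
        Node parent = nodesByPath.get(parentPath);
        if (parent == null) {
            // Assumption: fields are reported top to bottom, so a path with no node
            // can only be a list item, and its last segment is the list index.
            Integer index = (Integer) parentPath.get(parentPath.size() - 1);
            parent = new Node(null, index);
            getParent(parentPath).children.add(parent); // recursion covers nested lists
            nodesByPath.put(parentPath, parent);
        }
        return parent;
    }
}
```

With this sketch, reporting `["matrix"]` followed by `["matrix", 0, 1, "id"]` yields two index nodes in a row under `matrix`, even though no hook ever fired for `matrix.0` or `matrix.0.1`. If the top-to-bottom assumption were violated, the `Integer` cast would fail rather than silently build a wrong tree.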

zionts

Looking through this, I think this is a great addition, and I love your attention to detail! There are a few ideas around further optimizations and features, but I think we can tackle them in follow-ups / add them in issues for later on. @glasser , if you want to have a final look before this merges in to catch anything I may have missed, this looks all good to me 👍

glasser

I made a number of changes directly to your branch. You can review them as separate commits. Let me know what you think.

A few outstanding requests:

Can you capture parse and validation errors too?
Can you add something to the README?
Do we want to figure out a way to make this respect apollo-federation-include-trace? The idea behind this is twofold: first, you might want to be able to run some queries directly against your backend without needing traces for some reason. Secondly, it means that federated tracing is on by default if you're running a federated backend, which doesn't seem as relevant for Java. I think it's still a good idea to implement it, though. Here's what I'm thinking — how about you have an interface like HTTPHeaders with a String getHeader(String) on it. If the context object that the user passes to HTTP execution implements that interface, then the extension can use it to check the header first. This needs to be documented in the README too.

glasser · 2019-09-25T06:13:03Z

graphql-java-support/pom.xml

@@ -62,6 +74,23 @@
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>cobertura-maven-plugin</artifactId>
            </plugin>
+
+            <plugin>


Not a Maven expert. Maybe @pcarrier 's eyes on at least the Maven part of the PR could help?

Neither am I, but this seems to match https://www.xolstice.org/protobuf-maven-plugin/usage.html so LGTM.

glasser · 2019-09-25T06:30:08Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+            }
+        }
+
+        if (result instanceof DataFetcherResult && ((DataFetcherResult) result).hasErrors()) {


I'm getting a warning about an unchecked cast here. It can be fixed by declaring theResult as a wildcard type (DataFetcherResult<?>) instead of a "raw" type — when you declare theResult as DataFetcherResult it actually de-genericizes every method on it, so getErrors() returns List instead of List<GraphQLError>. Fixing on the branch.
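As a small illustration of the raw-type versus wildcard point, the sketch below uses a hypothetical generic class `FetchResult<T>` in place of graphql-java's `DataFetcherResult<T>`, with `String` errors instead of `GraphQLError`, so that it compiles with the JDK alone.

```java
import java.util.List;

final class WildcardCastSketch {
    // Hypothetical stand-in for a generic result type carrying data and errors.
    static final class FetchResult<T> {
        private final T data;
        private final List<String> errors;

        FetchResult(T data, List<String> errors) {
            this.data = data;
            this.errors = errors;
        }

        T getData() { return data; }
        boolean hasErrors() { return !errors.isEmpty(); }
        List<String> getErrors() { return errors; }
    }

    static List<String> errorsOf(Object result) {
        if (result instanceof FetchResult) {
            // Casting to the raw type FetchResult would erase generics on every method,
            // so getErrors() would return a raw List and this return would be unchecked.
            // The wildcard cast keeps getErrors() typed as List<String>.
            FetchResult<?> typed = (FetchResult<?>) result;
            if (typed.hasErrors()) {
                return typed.getErrors();
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(errorsOf(new FetchResult<>("data", List.of("boom"))));
        System.out.println(errorsOf("not a FetchResult"));
    }
}
```

The wildcard says "some type argument I do not care about", which is all the error-handling path needs.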

glasser · 2019-09-25T06:35:57Z

.../test/java/com/apollographql/federation/graphqljava/FederatedTracingInstrumentationTest.java

+                            objects.add(new Object());
+                            objects.add(new Object());
+                            return objects;
+                        }).dataFetcher("listOfLists", env -> {


listOfLists and listOfScalars aren't in the schema at tracing.graphql? The test fails because of this. (Our problem for not having CI set up on this repo...) Fixed on the branch.

Whoops, forgot to check in that file. Thanks for catching!

glasser · 2019-09-25T06:37:36Z

.../test/java/com/apollographql/federation/graphqljava/FederatedTracingInstrumentationTest.java

+                        // Widget.foo works normally, Widget.bar always throws an error
+                        builder.dataFetcher("foo", env -> "hello world")
+                                .dataFetcher("bar", env -> {
+                                    ExceptionWhileDataFetching whoops = new ExceptionWhileDataFetching(


Pretty sure this can just be throw new GraphQLException("whoops") and let graphql-java wrap it in ExceptionWhileDataFetching. That does remove Exception while fetching data (/widgets[1]/baz) : from the reported message, but that's a good thing: the error is already placed on an appropriate place on the node tree and does not need to repeat the node's location in the message. (Does your Ruby plugin do this?)

... except that also loses the source location, because of how the data is passed to this instrumentation. I'll fix.

👍 passing in the field location from the instrumentation around the resolver is way better than depending on a specific kind of error!

glasser · 2019-09-25T07:24:00Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+
+        Reports.Trace trace = state.toProto();
+
+        extensionsCopy.put(EXTENSION_KEY, Base64.getEncoder().encodeToString(trace.toByteArray()));


Not sure if I documented what base64 alphabet to use. The apollo-server implementation uses Node's Buffer.toString('base64') which uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. (The decoder also supports the URL and filename safe alphabet which uses -_ instead of +/.) Java's encoder uses the same alphabet. So all's good.

You might want to change your Ruby implementation to use Base64.strict_encode64 instead of Base64.encode64. This leaves out newlines in the middle which aren't particularly helpful when encoded in a JSON string, and causes Node to have to use a slightly slower decode implementation.
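The alphabet and line-break differences can be checked quickly with the JDK's `java.util.Base64`. The class and method names below are hypothetical, and the MIME encoder is shown only as a rough Java analogue of an encoder that inserts line breaks.

```java
import java.util.Base64;

final class Base64AlphabetSketch {
    // Standard alphabet (+ and /), no line breaks: what the instrumentation uses.
    static String standard(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // URL and filename safe alphabet (- and _ instead of + and /).
    static String urlSafe(byte[] bytes) {
        return Base64.getUrlEncoder().encodeToString(bytes);
    }

    // MIME variant: standard alphabet, but output is split into lines with \r\n.
    static String mime(byte[] bytes) {
        return Base64.getMimeEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xfb, (byte) 0xff};
        System.out.println(standard(bytes)); // prints +/8=
        System.out.println(urlSafe(bytes));  // prints -_8=
        // Long inputs show the line-break difference between the encoders.
        System.out.println(mime(new byte[100]).contains("\r\n"));   // prints true
        System.out.println(standard(new byte[100]).contains("\n")); // prints false
    }
}
```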

glasser · 2019-09-25T09:26:04Z

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

+            }
+
+            errors.forEach(error -> {
+                Reports.Trace.Error.Builder builder = node.addErrorBuilder().setMessage(error.getMessage());


The JS implementation also takes other kinds of errors and puts them on the root node — most notably parse and validation errors, which the instrumentation can instrument. Do you want to implement that too?

Great idea, this was pretty easy. I'm not entirely sure if I'm grabbing locations off the throwable in the best way (in convertErrors) ... got any other ideas?

Looks good enough to me

glasser · 2019-09-26T22:45:25Z

Would you like to do the README and http headers change or should I?

Here's what I think would make a good interface for the header thing:

interface HTTPRequestHeaders {
  @Nullable String getHTTPRequestHeader(String caseInsensitiveHeaderName);
}
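A rough sketch of how the instrumentation might consult such an interface follows. Several details are assumptions rather than anything settled in this thread: the `shouldTrace` helper, the `MapContext` test context, tracing unconditionally when the context does not implement the interface, and the expected header value `ftv1`. The `@Nullable` annotation is replaced by a comment so the sketch needs only the JDK.

```java
import java.util.Map;
import java.util.TreeMap;

final class IncludeTraceSketch {
    interface HTTPRequestHeaders {
        // May return null when the header is absent.
        String getHTTPRequestHeader(String caseInsensitiveHeaderName);
    }

    static final String HEADER_NAME = "apollo-federation-include-trace";
    static final String HEADER_VALUE = "ftv1"; // assumed value requesting a trace

    static boolean shouldTrace(Object context) {
        if (!(context instanceof HTTPRequestHeaders)) {
            // Assumption: without header access there is nothing to check, so always trace.
            return true;
        }
        String value = ((HTTPRequestHeaders) context).getHTTPRequestHeader(HEADER_NAME);
        return HEADER_VALUE.equals(value);
    }

    // Hypothetical context object backed by a case-insensitive map.
    static final class MapContext implements HTTPRequestHeaders {
        private final Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

        MapContext(Map<String, String> headers) {
            this.headers.putAll(headers);
        }

        @Override
        public String getHTTPRequestHeader(String caseInsensitiveHeaderName) {
            return headers.get(caseInsensitiveHeaderName);
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldTrace(new Object()));
        System.out.println(shouldTrace(new MapContext(Map.of("Apollo-Federation-Include-Trace", "ftv1"))));
        System.out.println(shouldTrace(new MapContext(Map.of())));
    }
}
```

Keeping the interface to a single lookup method means users can adapt whatever request object their HTTP layer already provides.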

lennyburdette · 2019-09-27T01:02:09Z

@glasser feel free to jump on it! I’m focused on pitching an investment in Apollo to the higher-ups and probably won’t get back to this for a week. 😄

glasser · 2019-09-27T18:16:32Z

Enabled CI testing of forked PRs, which would have caught some of the earlier issues!

Explicitly support the concept of having multiple indexes in a row. Note that the old code's use of getLevel() was not correct: that undocumented method means "how many named fields are in the entire path", which isn't helpful here. The test change is because the case in question was actually being tested, but the test was wrong.

zionts reviewed Sep 5, 2019

zionts reviewed Sep 10, 2019

...n/java/com/apollographql/federation/graphqljava/tracing/FederatedTracingInstrumentation.java

}

}

public static class Options {

zionts Sep 10, 2019

nice

zionts reviewed Sep 10, 2019

lennyburdette force-pushed the federated-tracing branch from 5b7e6c5 to 5880924 Compare September 12, 2019 15:45

zionts approved these changes Sep 12, 2019

lennyburdette force-pushed the federated-tracing branch from d917db7 to 29d8bc2 Compare September 13, 2019 04:00

glasser force-pushed the federated-tracing branch from 9dc41f6 to 155b49c Compare September 25, 2019 08:33

glasser requested changes Sep 25, 2019

lennyburdette mentioned this pull request Sep 25, 2019

Tracing improvements Gusto/apollo-federation-ruby#29

Open

2 tasks

glasser force-pushed the federated-tracing branch 2 times, most recently from 710ae87 to 7a62ab0 Compare September 27, 2019 18:15

glasser mentioned this pull request Sep 27, 2019

More non spring examples + proper docs + bunch of questions #17

Closed

glasser force-pushed the federated-tracing branch 2 times, most recently from 3df34cd to cd3248e Compare September 27, 2019 22:47

glasser force-pushed the federated-tracing branch from cd3248e to e764b2a Compare September 27, 2019 23:15

Lenny Burdette and others added 17 commits September 27, 2019 16:20

Use wildcard type instead of raw type

8a2afe5

Avoids a warning about unchecked cast

Remove unnecessary public from private class

c38658a

Add schema fields used in test

ca61630

Properly handle ordinary errors thrown from fetchers

6ee9983

Looks like ExceptionWhileDataFetching is something automatically added by GraphQL-Java, but not provided to the instrumentation. Instead of only working if the fetcher somehow throws that, find the source location another way.

Use ExecutionResultImpl.Builder to simplify code

f86afba

"test" two Options methods so they are used outside the package

a59ee3c

Factor out instantToTimestamp

2d84590

Add a bunch of @NotNull annotations

2bc9d9c

fixup: attach parse and validation errors to the tracing root node

bcac8ae

Implement HTTP header sensitivity

37d8dd3

README

becdcb7

Revert getParentNode back to recursive (but keep the bug fix)

f2db2d5

add missing javadoc

da12d9c

Support GraphQL-Java v12

7f34c6b

wrap readme

e5084f1

glasser force-pushed the federated-tracing branch from e764b2a to e5084f1 Compare September 27, 2019 23:21

glasser merged commit 305bf2c into apollographql:master Sep 27, 2019

tk26 mentioned this pull request Nov 17, 2019

Federated tracing ExpediaGroup/graphql-kotlin#477

Closed

