-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add federated tracing instrumentation #25
feat: add federated tracing instrumentation #25
Conversation
} | ||
|
||
/** | ||
* Wrapper for Node information with mutable lists for children and errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to include this? I believe that the builder has accessors for adding children and errors.
@@ -0,0 +1,488 @@ | |||
syntax = "proto3"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful to include a comment along the lines of
This file is copied over from the `apollo-engine-reporting` package, which, in turn, derives its information from Apollo's cloud infrastructure. It is expected that as the protobuf definition changes within the underlying system we are reporting to, these libraries will need to be updated in sync.
FederatedTracingState state = parameters.getInstrumentationState(); | ||
|
||
Map<Object, Object> extensions = executionResult.getExtensions(); | ||
Map<Object, Object> tracingMap = new LinkedHashMap<>(extensions == null ? Collections.emptyMap() : extensions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a confusing name for something that is actually a copy of the extensions that are already present within the execution result and has nothing to do with tracing until we add in our own extension below. Maybe something like extensionsCopy
or currentExtensionsMap
instead?
|
||
Reports.Trace trace = state.toProto(); | ||
|
||
byte[] bytes = trace.toByteArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nbd: can we inline this?
byte[] bytes = trace.toByteArray(); | ||
tracingMap.put(KEY, Base64.getEncoder().encodeToString(bytes)); | ||
|
||
if (debuggingEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this feature, though I think it might be more useful as a log than as a separate extensions key 🤷♂ curious on your thoughts!
Node node = state.addNode(executionStepInfo, startNanos, endNanos); | ||
|
||
if (throwable != null) { | ||
node.addError(throwable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feat: It might be useful to allow implementors to decide how they want to translate thrown errors into GraphQL errors that services report to the gateway, but we can definitely do that when we actually have a feature request and not during this round of code review 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I left that feature out of the ruby implementation too. One thing I might do today is make it easier to add configuration options for tracing, which would give us a place to add a error translation hook in the future.
import static graphql.schema.GraphQLTypeUtil.simplePrint; | ||
|
||
public class FederatedTracingInstrumentation extends SimpleInstrumentation { | ||
private static final String KEY = "ftv1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nbd: EXTENSION_KEY
?
startRequestNanos = System.nanoTime(); | ||
|
||
// Create node map with the root node | ||
nodeMap = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming suggestion: nodesByPath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it seems like this state could likely be largely simplified to be built using the tree from TraceNode.newBuilder
. I think that this object also means that building the node tree is a more arduous task at the end, and it would likely be a quick performance win (and code cleanliness win) to build the object directly rather than through a wrapper.
} | ||
|
||
/** | ||
* Adds node to nodeMap and recursively ensures that all parent node exist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: nodes
List<Object> pathParts = path.toList(); | ||
ExecutionPath parentPath = ExecutionPath.fromList(pathParts.subList(0, pathParts.size() - 1)); | ||
|
||
if (!nodeMap.containsKey(parentPath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, since I'm only giving this a brief look atm: how can we get into a state like this? Would this be caused from a bug within the instrumentation pipeline around field execution in GraphQL Java?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for the backwards construction of node parents is to support indexes. The instrumentation hooks don't fire for widgets.0
, just for widgets.0.id
. I agree this isn't super clear though. I'll see what happens to this part of the code when I try dropping the wrapper class.
addNode(parentPath, new Node(parentPath)); | ||
} | ||
|
||
nodeMap.get(parentPath).addChild(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh hm. I didn't realize we were indexing the fields by path and creating their children. I suppose that's relatively low memory overhead because we're just dealing with pointers underneath, but I haven't gotten to the point yet that necessitates indexing by field or justifies the value in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using the same strategy that I found in this fork that adds non-federated apollo tracing to the ruby gem. Pretty much just rewrote it in Java here!
You're probably right though, and removing the additional object created for each node would be a performance win. I'll take a look and come back to you if I can't figure out a way to do it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though there are a few naming suggestions and ideas, I think the functionality here looks good. I think that we should investigate if there is a way we can replace the wrapper around the Trace Node interface with simply calling into builders directly (since underlying fields of builders take builders as well) in order to both reduce code complexity and reduce the logical complexity of building up the protobuf object from our indirect mapping.
@zionts You were right, no need for a wrapper class at all! The secret sauce was discovering the |
} | ||
} | ||
|
||
public static class Options { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
if (!nodesByPath.containsKey(parentPath)) { | ||
Reports.Trace.Node.Builder missingParent = getParent(parentPath).addChildBuilder(); | ||
|
||
// Missing parents are always list items, so we need to add the `index` field to them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this true? Is this relying on the way that GraphQL-Java executes? And why isn't it true for list items?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume it's based on the assumption that execution goes from top to bottom.
Note though that GraphQL types can be nested lists – is that what you're asking about in the second sentence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense, and I think that's true, but I'm questioning whether that's a safe assumption to bake in and if not, how we might avoid it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an instrumentation hook called beginFieldListComplete. It seemed promising because there's a getCurrentListIndex()
as part of the arguments to the hook. But it doesn't work: it's called once per list, not once per list item. (Calling code here).
I think this is the best bet. I didn't consider nested lists, so I added a test for them and it seems to work fine!
If y'all can figure out another way to create the indexed nodes, I'm all ears. Maybe I'll post something on the graphql-java spectrum to see if one of the maintainers have any ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got a response that indicates that there's no existing hook to instrument each item in a list.
I think this makes sense: we just want to instrument the execution of resolvers, and there aren't list item resolvers. This is especially true for lists of scalar values, as Brad mentions—firing an instrumentation hook for each item would be a waste. (I added a test to ensure that lists of scalars are recorded as you'd expect.)
I added some more comments to hopefully make it clearer. Let me know if you have any other suggestions!
5b7e6c5
to
5880924
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through this, I think this is a great addition, and I love your attention to detail! There are a few ideas around further optimizations and features, but I think we can tackle them in follow-ups / add them in issues for later on. @glasser , if you want to have a final look before this merges in to catch anything I may have missed, this looks all good to me 👍
d917db7
to
29d8bc2
Compare
9dc41f6
to
155b49c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a number of changes directly to your branch. You can review them as separate commits. Let me know what you think.
A few outstanding requests:
-
Can you capture parse and validation errors too?
-
Can you add something to the README?
-
Do we want to figure out a way to make this respect apollo-federation-include-trace? The idea behind this is twofold: first, you might want to be able to run some queries directly against your backend without needing traces for some reason. Secondly, it means that federated tracing is on by default if you're running a federated backend, which doesn't seem as relevant for Java. I think it's still a good idea to implement it, though. Here's what I'm thinking — how about you have an interface like HTTPHeaders with a
String getHeader(String)
on it. If thecontext
object that the user passes to HTTP execution implements that interface, then the extension can use it to check the header first. This needs to be documented in the README too.
@@ -62,6 +74,23 @@ | |||
<groupId>org.codehaus.mojo</groupId> | |||
<artifactId>cobertura-maven-plugin</artifactId> | |||
</plugin> | |||
|
|||
<plugin> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a Maven expert. Maybe @pcarrier 's eyes on at least the Maven part of the PR could help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neithr am I, but this seems to match https://www.xolstice.org/protobuf-maven-plugin/usage.html so LGTM.
} | ||
} | ||
|
||
if (result instanceof DataFetcherResult && ((DataFetcherResult) result).hasErrors()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting a warning about an unchecked cast here. It can be fixed by declaring theResult as a wildcard type (DataFetcherResult<?>
) instead of a "raw" type — when you declare theResult as DataFetcherResult
it actually de-genericizes every method on it, so getErrors()
returns List
instead of List<GraphQLError>
. Fixing on the branch.
objects.add(new Object()); | ||
objects.add(new Object()); | ||
return objects; | ||
}).dataFetcher("listOfLists", env -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
listOfLists and listOfScalars aren't in the schema at tracing.graphql? The test fails because of this. (Our problem for not having CI set up on this repo...) Fixed on the branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, forgot to check in that file. Thanks for catching!
// Widget.foo works normally, Widget.bar always throws an error | ||
builder.dataFetcher("foo", env -> "hello world") | ||
.dataFetcher("bar", env -> { | ||
ExceptionWhileDataFetching whoops = new ExceptionWhileDataFetching( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this can just be throw new GraphQLException("whoops")
and let graphql-java wrap it in ExceptionWhileDataFetching. That does remove Exception while fetching data (/widgets[1]/baz) :
from the reported message, but that's a good thing: the error is already placed on an appropriate place on the node tree and does not need to repeat the node's location in the message. (Does your Ruby plugin do this?)
... except that also loses the source location, because of how the data is passed to this instrumentation. I'll fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 passing in the field location from the instrumentation around the resolver is way better than depending on a specific kind of error!
|
||
Reports.Trace trace = state.toProto(); | ||
|
||
extensionsCopy.put(EXTENSION_KEY, Base64.getEncoder().encodeToString(trace.toByteArray())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I documented what base64 alphabet to use. The apollo-server implementation use's Node's Buffer.toString('base64')
which uses ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
. (The decoder also supports the URL and filename safe alphabet which uses -_
instead of +/
.) Java's encoder uses the same alphabet. So all's good.
You might want to change your Ruby implementation to use Base64.strict_encode64
instead of Base64.encode64
. This leaves out newlines in the middle which aren't particularly helpful when encoded in a JSON string, and causes Node to have to use a slightly slower decode implementation.
} | ||
|
||
errors.forEach(error -> { | ||
Reports.Trace.Error.Builder builder = node.addErrorBuilder().setMessage(error.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JS implementation also takes other kinds of errors and puts them on the root node — most notably parse and validation errors, which the instrumentation can instrument. Do you want to implement that too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, this was pretty easy. I'm not entirely sure if I'm grabbing locations off the throwable in the best way (in convertErrors
) ... got any other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good enough to me
Would you like to do the README and http headers change or should I? Here's what I think would make a good interface for the header thing:
|
@glasser feel free to jump on it! I’m focused on pitching an investment in Apollo to the higher-ups and probably won’t get back to this for a week. 😄 |
710ae87
to
7a62ab0
Compare
Enabled CI testing of forked PRs, which would have caught some of the earlier issues! |
3df34cd
to
cd3248e
Compare
cd3248e
to
e764b2a
Compare
Federated GraphQL services should include timing and error information as a Base64-encoded protocol buffer message in the "extensions.ftv1" field. This change includes a tracer that uses the graphql-java instrumentation API to record field timings and info and attach it on the execution context. I used the Apollo Server typescript code as reference: https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/federatedExtension.ts https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/treeBuilder.ts As well as my ruby implementation: https://github.com/lennyburdette/apollo-federation-ruby/blob/federated-tracing/lib/apollo-federation/tracing/tracer.rb Federated tracing documentation: https://www.apollographql.com/docs/apollo-server/federation/metrics/
Avoids a warning about unchecked cast
Looks like ExceptionWhileDataFetching is something automatically added by GraphQL-Java, but not provided to the instrumentation. Instead of only working if the fetcher somehow throws that, find the source location another way.
Explicitly support the concept of having multiple indexes in a row. Note that the old code's use of getLevel() was not correct: that undocumented method means "how many named fields are in the entire path", which isn't helpful here. The test change is because the case in question was actually being tested, but the test was wrong.
e764b2a
to
e5084f1
Compare
Federated GraphQL services should include timing and error
information as a Base64-encoded protocol buffer message in
the "extensions.ftv1" field.
This change includes a tracer that uses the graphql-java
instrumentation API to record field timings and info and attach it
on the execution context.
I used the Apollo Server typescript code as reference:
https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/federatedExtension.ts
https://github.com/apollographql/apollo-server/blob/master/packages/apollo-engine-reporting/src/treeBuilder.ts
As well as my ruby implementation:
https://github.com/lennyburdette/apollo-federation-ruby/blob/federated-tracing/lib/apollo-federation/tracing/tracer.rb
Federated tracing documentation: https://www.apollographql.com/docs/apollo-server/federation/metrics/
Fixes #24