usage reporting plugin: add fieldLevelInstrumentation option #5963
Conversation
Might rename this to `fieldLevelInstrumentation` to make it clear it's about "are we paying attention to fields" and not "are we building a trace", because that is more confusable with "are we sending a trace". Might change the number thing to just counting, for each operation, how many un-instrumented requests there were (see https://github.com/mdg-private/monorepo/pull/7411).
OK, I did some of the above but have to go afk. TODO:
This option allows you to choose whether Apollo Server should calculate detailed per-field statistics for a particular request. It is only called for executable operations: operations which parse and validate properly and which do not have an unknown operation name. It is not called if an [`includeRequest`](#includerequest) hook is provided and returns false.
Returning false means that Apollo Server will only track overall properties of the operation (such as which GraphQL operation is executing and how long the entire operation takes to execute), not anything about field-by-field execution.
This reads a bit funny but I don't have a good suggestion yet
Note that returning true here does *not* mean that the captured trace must be *transmitted* to Apollo Studio's servers in the form of a trace; it may still be aggregated locally into statistics. But a captured trace will still contribute to the "field executions" statistic and timing hints even if it is aggregated locally.
Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator. (If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.)
Is this an aside?
Suggested change:

Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator.

> If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.
Going to change the actual semantics here.
This function is called for each received request. It takes a [`GraphQLRequestContext`](https://github.com/apollographql/apollo-server/blob/main/packages/apollo-server-types/src/index.ts#L115-L150) object and must return a `Promise<Boolean>` that indicates whether to include the request. It's called either after the operation is successfully resolved (via [the `didResolveOperation` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#didresolveoperation)), or when sending the final error response if the operation was not successfully resolved (via [the `willSendResponse` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#willsendresponse)).
(If you don't want any usage reporting at all, don't use this option: instead, either avoid specifying an Apollo API key or explicitly [disable the plugin](#disabling-the-plugin).)
Not sure this needs parens at all
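As background for the docs being reviewed above, here's a minimal sketch of how the option might be wired up, assuming Apollo Server 3, the `ApolloServerPluginUsageReporting` export from `apollo-server-core`, and an Apollo API key configured in the environment. The operation name and the hook body are purely illustrative, since the option's semantics were still in flux during this review.

```typescript
import { ApolloServer, gql } from 'apollo-server';
import { ApolloServerPluginUsageReporting } from 'apollo-server-core';

const typeDefs = gql`
  type Query {
    hello: String
  }
`;
const resolvers = { Query: { hello: () => 'world' } };

const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [
    ApolloServerPluginUsageReporting({
      // Skip per-field instrumentation for a hypothetical high-volume operation,
      // while still reporting operation-level stats for it.
      fieldLevelInstrumentation: async (requestContext) =>
        requestContext.request.operationName !== 'HighVolumeQuery',
    }),
  ],
});

server.listen().then(({ url }) => console.log(`Server ready at ${url}`));
```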
package.json
@@ -116,6 +117,7 @@
"koa": "2.13.4",
"koa-router": "10.1.1",
"lerna": "4.0.0",
"lodash.sumby": "^4.6.0",
The ^ here will result in a renovate PR to remove it
}: {
  pluginOptions?: ApolloServerPluginUsageReportingOptions<any>;
  expectReport?: boolean;
  query?: string;
  operationName?: string | null;
  // null means no expectation because it's not deterministic
  schemaShouldBeInstrumented?: boolean | null;
Can this ever be undefined? Looks like it's defaulted above.

Suggested change:
- schemaShouldBeInstrumented?: boolean | null;
+ schemaShouldBeInstrumented: boolean | null;
The `?` here is more like "key can be left out" than "can also be undefined" (though I think in TS these days it kinda means both and they are trying to migrate away from that). Applying your change means that the calls that don't specify schemaShouldBeInstrumented fail to build. TS properly determines that schemaShouldBeInstrumented is `boolean | null` (not undefined) inside the function, so I think I'm doing it correctly.
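For readers less familiar with this corner of TypeScript, a tiny sketch of the behaviour described above (illustrative only, not the actual test helper):

```typescript
// With a destructuring default, callers may leave the key out entirely, but
// inside the function the value is narrowed to `boolean | null`: the default
// only replaces `undefined`, never `null`.
function check({
  schemaShouldBeInstrumented = true,
}: {
  schemaShouldBeInstrumented?: boolean | null;
}) {
  return schemaShouldBeInstrumented; // type: boolean | null
}

check({});                                   // OK: key omitted, defaults to true
check({ schemaShouldBeInstrumented: null }); // OK: null means "no expectation"
```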
const expected = fraction * total;
// If it strays from the expected amount of 35 by too far we fail.
expect(actualFieldLevelInstrumentation).toBeGreaterThanOrEqual(
  0.6 * expected,
);
expect(actualFieldLevelInstrumentation).toBeLessThanOrEqual(
  1.4 * expected,
);
Smells like a flaky test, can't we mock `Math.random()` to get deterministic results in our tests?
Interesting point! https://stackoverflow.com/questions/41570273/how-to-test-a-function-that-output-is-random-using-jest or https://www.npmjs.com/package/jest-mock-random might do the trick here.
Might also remove the test — going to be changing how this is implemented anyway.
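For what it's worth, a mock along these lines could look like the following minimal Jest sketch (an illustration of the suggestion using plain `jest.spyOn` rather than jest-mock-random; not the test this PR ended up with):

```typescript
// Script Math.random so the sampling behaviour under test is deterministic.
let randomSpy: jest.SpyInstance;

beforeEach(() => {
  const scripted = [0.005, 0.5, 0.009, 0.99]; // pretend "random" draws
  let i = 0;
  randomSpy = jest
    .spyOn(global.Math, 'random')
    .mockImplementation(() => scripted[i++ % scripted.length]);
});

afterEach(() => {
  randomSpy.mockRestore();
});

test('instruments exactly the scripted fraction of operations', () => {
  // With a 0.01 threshold, only the 0.005 and 0.009 draws fall below it,
  // so exactly 2 of the 4 simulated operations are instrumented.
  const instrumented = [1, 2, 3, 4].filter(() => Math.random() < 0.01);
  expect(instrumented).toHaveLength(2);
});
```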
Mocked Math.random.
Force-pushed from e466e8d to 762a029.
The usage reporting plugin already has an `includeRequest` option which allows you to tell the plugin to completely ignore an operation. One reason you may have wanted to use this is to avoid the overhead of capturing a full field-by-field execution trace. But in this case, `includeRequest: false` is overkill: it removes the operation from aspects of Studio such as the Operations page and schema checks, which don't require full execution traces to operate.

This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after `includeRequest` returns true. If you return false from this hook, you won't incur the overhead of capturing a detailed trace (either directly in this process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing operations" column on the Fields page) will still reflect the existence of this operation. As of the end of 2021, the features that this operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and an "estimated execution count" for each field (for each operation and client). The former is literally how many times we saw the field get executed (only during operations where `fieldLevelInstrumentation` returned truthy). If the hook only returns true or false, the latter is the same, but you may also return a number between 0 and 1 which represents the sample rate for that operation; the "estimated execution count" will be incremented by the reciprocal of that number instead of by 1. So for example, with `fieldLevelInstrumentation: () => Math.random() < 0.01 ? 0.01 : false`, Apollo Server will instrument 1% of operations, and the "estimated execution count" will be 100 times more than the observed execution count. (You can imagine more sophisticated implementations of `fieldLevelInstrumentation` which sample more common operations more aggressively than rare operations.) If you pass a number for `fieldLevelInstrumentation`, it is equivalent to passing a function of the form in the above example; that is, the previous example behaves identically to `fieldLevelInstrumentation: 0.01`.

The `latency_count` sent with field stats (which powers per-field timing hints) is now always scaled by the given factor (ie, there's no separate "observed histogram").

Note that the semantics of the `requestContext.metrics.captureTraces` field change with this PR. Previously its value matched the value returned by the `includeRequest` hook; now it matches the truthiness of the value returned by the `fieldLevelInstrumentation` hook. This field determines whether Apollo Gateway includes the `apollo-federation-include-trace` header with outgoing requests, so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without any need for per-field instrumentation, the request pipeline now only instruments the schema if it ever sees a `willResolveField` callback. This *almost* means that if you're running a monolithic server with `fieldLevelInstrumentation` always returning false (or usage reporting disabled), or if you're running a subgraph whose gateway has `fieldLevelInstrumentation: false` (and thus never receives a request with the `apollo-federation-include-trace` header), then execution won't have the small performance impact of instrumentation. In practice you need to also disable the cache control plugin to get this speedup, as it is installed by default and uses `willResolveField` to implement dynamic cache control. If this optimization proves to be important we can provide a mode of the cache control plugin that doesn't allow for dynamic cache control. (Alternatively we may eventually find a way to support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
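To make the observed-versus-estimated arithmetic above concrete, here's a small sketch (my own illustration using the probability semantics described at this point in the review, not code from the PR):

```typescript
// The hook returns the sample rate; each observed field execution then bumps
// the estimated count by the reciprocal of that rate instead of by 1.
const sampleRate = 0.01;          // value returned by fieldLevelInstrumentation
const observedExecutions = 42;    // field executions seen while instrumented
const estimatedExecutions = observedExecutions * (1 / sampleRate); // 4200

console.log({ observedExecutions, estimatedExecutions });
```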
Force-pushed from 542b468 to fa29840.
OK, this is ready for another round of review!
(Note that the PR description has been updated.)
Returning false (or 0) here for some or all operations can improve your server's performance, as the overhead of calculating complete traces is not always negligible. This is especially the case if this server is an Apollo Gateway, as captured traces are transmitted from the subgraph to the Gateway in-band inside the actual GraphQL response.
Returning a positive number means that Apollo Server will track each field execution and send Apollo Studio statistics on how many times each field was executed and what the per-field performance was. If the number returned is less than 1, Apollo Server will also send a scaled "estimate" count for each field, equal to the number of observed field executions divided by the number returned by the hook.
Replaced with weight?
@@ -105,6 +105,38 @@ The only properties of the reported error you can modify are its `message` and i
<tr>
<td>

###### `fieldLevelInstrumentation`

`async Function` or `number`
or `false`, no?
Hmm, I don't think so. A number is a shorthand for `() => Math.random() <= n ? 1/n : 0`. `false` would just be shorthand for `() => false`. I think writing the latter yourself is easy enough.
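A quick sketch of the shorthand described in this comment (an illustration based on the comment itself, not the plugin's literal source):

```typescript
// Passing a number n behaves like a hook that instruments roughly that fraction
// of operations and, when it does, reports a weight of 1/n (otherwise 0).
const numberAsHook = (n: number) => () => (Math.random() <= n ? 1 / n : 0);

// So fieldLevelInstrumentation: 0.01 is roughly:
const hook = numberAsHook(0.01); // ~1% of operations instrumented, each with weight 100
```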
I've updated this so that the hook returns a weight rather than a probability (ie, the reciprocal of what it was previously returning) and changed a name in the protocol from scale factor to weight. I am going to merge this to release-3.6.0 so I can release another alpha from the branch, but I'd still appreciate real review from @trevor-scheer or @pcarrier before this is fully released, if either of you is available this week.
LGTM (post-merge)
In #5963 we added `!!metrics.captureTraces` to the condition for considering sending an operation as a trace to Studio. This makes sense for operations which start to execute (don't send something as a trace if we didn't record a trace for it!) but if the reason `captureTraces` is unset is because we never successfully resolved an operation, it still can be interesting to send a trace, for the sake of the Errors page. So we allow sending traces for unexecutable operations (though it still goes through our standard sampling algorithm). In order to make sure the recently added Trace.fieldExecutionWeight is set to the default value of 1 in this case, we set it when initially creating the Trace rather than in didResolveOperation. Fixes #6193.
Apollo Server 2.2.1 introduced an experimental feature named __resolveObject. It was like an extra-powerful version of __resolveReference (which did not yet exist) which also allowed you to return references within a single subgraph and they'd be magically converted into full objects. It was never documented and was described multiple times as experimental. Its implementation is entirely inside the Apollo Server schema instrumentation, which is largely designed for enabling willResolveField plugins (eg, usage reporting). In fact, as the developer of the main feature in AS 3.6.0 (apollographql/apollo-server#5963) I added an optimization to not call schema instrumentation at all if there are no willResolveField plugin hooks... which means that this feature is broken in that case! We are likely to remove this feature in AS4. To get ready for that, stop using it in tests here. This involves changing the existing ones to `__resolveReference` and also adding some direct "object resolution" to resolvers that return objects within the "owning" service. This is an improvement anyway, so that our fixtures use the documented `__resolveReference` feature instead of a different obscure feature.
The usage reporting plugin already has an `includeRequest` option which allows you to tell the plugin to completely ignore an operation. One reason you may have wanted to use this is to avoid the overhead of capturing a full field-by-field execution trace. But in this case, `includeRequest: false` is overkill: it removes the operation from aspects of Studio such as the Operations page and schema checks, which don't require full execution traces to operate.

This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after `includeRequest` returns true. If you return false from this hook, you won't incur the overhead of capturing a detailed trace (either directly in this process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing operations" column on the Fields page) will still reflect the existence of this operation. As of the end of 2021, the features that this operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and an "estimated execution count" for each field (for each operation and client). The former is literally how many times we saw the field get executed (only during operations where `fieldLevelInstrumentation` returned truthy). If the hook only returns true or false, the latter is the same, but you may also return a number (typically either 0 or at least 1) which represents a weight for that operation; the "estimated execution count" will be incremented by that number instead of by 1.

So for example, with `fieldLevelInstrumentation: () => Math.random() < 0.01 ? 100 : 0`, Apollo Server will instrument 1% of operations, and the "estimated execution count" will be 100 times more than the observed execution count. (You can imagine more sophisticated implementations of `fieldLevelInstrumentation` which sample more common operations more aggressively than rare operations.)

If you pass a number for `fieldLevelInstrumentation`, it is equivalent to passing a function of the form in the above example; that is, the previous example behaves identically to `fieldLevelInstrumentation: 0.01`.

The `latency_count` sent with field stats (which powers per-field timing hints) is now always scaled by the given weight (ie, there's no separate "observed histogram").

Note that the semantics of the `requestContext.metrics.captureTraces` field change with this PR. Previously its value matched the value returned by the `includeRequest` hook; now it matches the truthiness of the value returned by the `fieldLevelInstrumentation` hook. This field determines whether Apollo Gateway includes the `apollo-federation-include-trace` header with outgoing requests, so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without any need for per-field instrumentation, the request pipeline now only instruments the schema if it ever sees a `willResolveField` callback. This *almost* means that if you're running a monolithic server with `fieldLevelInstrumentation` always returning false (or usage reporting disabled), or if you're running a subgraph whose gateway has `fieldLevelInstrumentation: false` (and thus never receives a request with the `apollo-federation-include-trace` header), then execution won't have the small performance impact of instrumentation. In practice you need to also disable the cache control plugin to get this speedup, as it is installed by default and uses `willResolveField` to implement dynamic cache control. If this optimization proves to be important we can provide a mode of the cache control plugin that doesn't allow for dynamic cache control. (Alternatively we may eventually find a way to support the instrumentation of GraphQL execution with lower overhead.)
Part of #5708.
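As a closing illustration of the weight semantics described above, here's a hedged sketch of what a hand-written equivalent of `fieldLevelInstrumentation: 0.01` might look like, assuming the `ApolloServerPluginUsageReporting` export from `apollo-server-core` (illustrative only, not code from this PR):

```typescript
import { ApolloServerPluginUsageReporting } from 'apollo-server-core';

// Instrument roughly 1% of operations; each instrumented operation reports a
// weight of 100, so estimated field execution counts are scaled back up 100x.
const usageReportingPlugin = ApolloServerPluginUsageReporting({
  fieldLevelInstrumentation: async () => (Math.random() < 0.01 ? 100 : 0),
});
```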