Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usage reporting plugin: add fieldLevelInstrumentation option #5963

Merged
merged 3 commits into from
Dec 28, 2021

Conversation

glasser
Copy link
Member

@glasser glasser commented Dec 22, 2021

The usage reporting plugin already has an includeRequest option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
includeRequest: false is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, fieldLevelInstrumentation, which is invoked after
includeRequest returns true. If you return false from this operation, you won't
incur the overhead of capturing a detailed trace (either directly in this
process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

  • The "field executions" column on the Fields page
  • The per-field timing hints shown in Explorer and in vscode-graphql
  • The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and "estimated
execution count" for each field (for each operation and client). The former is
literally how many times we saw the field get executed (only during operations
where fieldLevelInstrumentation returned truthy). If the hook only returns
true or false, the latter is the same, but you may also return a number (typically
either 0 or at least 1) which represents a weight for that operation; the "estimated
execution count" will be incremented by that number instead of by 1.

So for example, with:

fieldLevelInstrumentation: () => Math.random() < 0.01 ? 1/0.01 : false

Apollo Server will instrument 1% of operations, and the "estimated execution
count" will be 100 times more than the observed execution count. (You can
imagine more sophisticated implementations of fieldLevelInstrumentation which
sample more common operations more aggressively than rare operations.)

If you pass a number for fieldLevelInstrumentation, it is equivalent to
passing a function of the form in the above example; that is, the previous
example behaves identically to fieldLevelInstrumentation: 0.01.

The latency_count sent with field stats (which powers per-field timing hints)
is now always scaled by the given fact (ie, there's no separate "observed
histogram").

Note that the semantics of the requestContext.metrics.captureTraces field
changes with this PR. Previously its value matched the value returned by the
includeRequest hook; now it matches the truthiness of the value returned by
the fieldLevelInstrumentation hook. This field determines whether Apollo
Gateway includes the apollo-federation-include-trace header with outgoing
requests so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without
any need for per-field instrumentation, the request pipeline now only
instruments the schema if it ever sees a willResolveField callback. This
almost means that if you're running a monolithic server with
fieldLevelInstrumentation always returning false (or usage reporting disabled),
or if you're running a subgraph whose gateway has
fieldLevelInstrumentation:false (and thus never receives a request with the
apollo-federation-include-trace header), then execution won't have the small
performance impact of instrumentation. In practice you need to also disable the
cache control plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this optimization proves
to be important we can provide a mode of the cache control plugin that doesn't
allow for dynamic cache control. (Alternatively we may eventually find a way to
support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.

@glasser
Copy link
Member Author

glasser commented Dec 22, 2021

might rename this to fieldLevelInstrumentation to make it clear it's about "are we paying attention to fields" and not "are we building a trace" because that is more confusable with "are we sending a trace"

Might change the number thing to just counting for each operation how many un-instrumented requests there were (see https://github.com/mdg-private/monorepo/pull/7411)

@glasser glasser changed the title usage reporting plugin: add captureTraces option usage reporting plugin: add fieldLevelInstrumentation option Dec 22, 2021
@glasser
Copy link
Member Author

glasser commented Dec 22, 2021

OK, I did some of the above but have to go afk

todo:

  • fix build of stats.test.ts
  • remove number version of fieldLevelInstrumentation (or at least change how we describe it)
  • double check how error reporting works with the new stuff


This option allows you to choose if Apollo Server should calculate detailed per-field statistics for a particular request. It is only called for executable operations: operations which parse and validate properly and which do not have an unknown operation name. It is not called if an [`includeRequest`](#includerequest) hook is provided and returns false.

Returning false means that Apollo Server will only pay attention to overall properties of the operation, like what GraphQL operation is executing and how long the entire operation takes to execute, and not anything about field-by-field execution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads a bit funny but I don't have a good suggestion yet

docs/source/api/plugin/usage-reporting.md Outdated Show resolved Hide resolved

Note that returning true here does *not* mean that the captured trace must be *transmitted* to Apollo Studio's servers in the form of a trace; it may still be aggregated locally to statistics. But a captured trace still will contribute to the "field executions" statistic and timing hints even if it is aggregated locally.

Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator. (If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an aside?

Suggested change
Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator. (If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.)
Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator.
> If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to change the actual semantics here.

This function is called for each received request. It takes a [`GraphQLRequestContext`](https://github.com/apollographql/apollo-server/blob/main/packages/apollo-server-types/src/index.ts#L115-L150) object and must return a `Promise<Boolean>` that indicates whether to include the request. It's called either after the operation is successfully resolved (via [the `didResolveOperation` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#didresolveoperation)), or when sending the final error response if the operation was not successfully resolved (via [the `willSendResponse` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#willsendresponse)).

(If you don't want any usage reporting at all, don't use this option: instead, either avoid specifying an Apollo API key or explicitly [disable the plugin](#disabling-the-plugin).)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this needs parens at all

package.json Outdated
@@ -116,6 +117,7 @@
"koa": "2.13.4",
"koa-router": "10.1.1",
"lerna": "4.0.0",
"lodash.sumby": "^4.6.0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ^ here will result in a renovate PR to remove it

}: {
pluginOptions?: ApolloServerPluginUsageReportingOptions<any>;
expectReport?: boolean;
query?: string;
operationName?: string | null;
// null means no expectation because it's not deterministic
schemaShouldBeInstrumented?: boolean | null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever be undefined? Looks like it's defaulted above

Suggested change
schemaShouldBeInstrumented?: boolean | null;
schemaShouldBeInstrumented: boolean | null;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ? here is more like "key can be left out" than "can also be undefined" (though I think in TS these days it kinda means both and they are trying to migrate away from that). Applying your change means that the calls that don't specify schemaShouldBeInstrumented fail to build. TS properly determines that schemaShouldBeInstrumented is boolean | null (not undefined) inside the function so I think I'm doing it correctly.

Comment on lines 361 to 368
const expected = fraction * total;
// If it strays from the expected amount of 35 by too far we fail.
expect(actualFieldLevelInstrumentation).toBeGreaterThanOrEqual(
0.6 * expected,
);
expect(actualFieldLevelInstrumentation).toBeLessThanOrEqual(
1.4 * expected,
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smells like a flaky test, can't we mock Math.random() to get deterministic results in our tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point! https://stackoverflow.com/questions/41570273/how-to-test-a-function-that-output-is-random-using-jest or https://www.npmjs.com/package/jest-mock-random might do the trick here.

Might also remove the test — going to be changing how this is implemented anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mocked Math.random.

@glasser glasser force-pushed the glasser/capture-traces branch 2 times, most recently from e466e8d to 762a029 Compare December 23, 2021 23:28
The usage reporting plugin already has an `includeRequest` option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
`includeRequest: false` is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after
includeRequest returns true.  If you return false from this operation, you won't
incur the overhead of capturing a detailed trace (either directly in this
process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and "estimated
execution count" for each field (for each operation and client). The former is
literally how many times we saw the field get executed (only during operations
where `fieldLevelInstrumentation` returned truthy). If the hook only returns
true or false, the latter is the same, but you may also return a number between
0 and 1 which represents the sample rate for that operation; the "estimated
execution count" will be incremented by the reciprocal of that number instead of
by 1.

So for example, with:

    fieldLevelInstrumentation: () => Math.random() < 0.01 ? 0.01 : false

Apollo Server will instrument 1% of operations, and the "estimated execution
count" will be 100 times more than the observed execution count. (You can
imagine more sophisticated implementations of `fieldLevelInstrumentation` which
sample more common operations more aggressively than rare operations.)

If you pass a number for `fieldLevelInstrumentation`, it is equivalent to
passing a function of the form in the above example; that is, the previous
example behaves identically to `fieldLevelInstrumentation: 0.01`.

The `latency_count` sent with field stats (which powers per-field timing hints)
is now always scaled by the given fact (ie, there's no separate "observed
histogram").

Note that the semantics of the `requestContext.metrics.captureTraces` field
changes with this PR. Previously its value matched the value returned by the
`includeRequest` hook; now it matches the truthiness of the value returned by
the `fieldLevelInstrumentation` hook. This field determines whether Apollo
Gateway includes the `apollo-federation-include-trace` header with outgoing
requests so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without
any need for per-field instrumentation, the request pipeline now only
instruments the schema if it ever sees a willResolveField callback. This
*almost* means that if you're running a monolithic server with
fieldLevelInstrumentation always returning false (or usage reporting disabled),
or if you're running a subgraph whose gateway has
fieldLevelInstrumentation:false (and thus never receives a request with the
`apollo-federation-include-trace` header), then execution won't have the small
performance impact of instrumentation. In practice you need to also disable the
cache control plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this optimization proves
to be important we can provide a mode of the cache control plugin that doesn't
allow for dynamic cache control. (Alternatively we may eventually find a way to
support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
@glasser glasser force-pushed the glasser/capture-traces branch 2 times, most recently from 542b468 to fa29840 Compare December 23, 2021 23:33
@glasser
Copy link
Member Author

glasser commented Dec 23, 2021

OK, this is ready for another round of review!

@glasser
Copy link
Member Author

glasser commented Dec 23, 2021

(Note that the PR description has been updated.)


Returning false (or 0) here for some or all operations can improve your server's performance, as the overhead of calculating complete traces is not always negligible. This is especially the case if this server is an Apollo Gateway, as captured traces are transmitted from the subgraph to the Gateway in-band inside the actual GraphQL response.

Returning a positive number means that Apollo Server will track each field execution and send Apollo Studio statistics on how many times each field was executed and what the per-field performance was. If the number returned is less than 1, Apollo Server will also send a scaled "estimate" count for each field, equal to the number of observed field executions divided by the number returned by the hook.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with weight?

@@ -105,6 +105,38 @@ The only properties of the reported error you can modify are its `message` and i
<tr>
<td>

###### `fieldLevelInstrumentation`

`async Function` or `number`
Copy link
Contributor

@pcarrier pcarrier Dec 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or false, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't think so.

A number is a short-hand for () => Math.random() <= n ? 1/n : 0. false would just be shorthand for () => false. I think writing the latter yourself is easy enough.

@glasser
Copy link
Member Author

glasser commented Dec 28, 2021

I've updated this so that the hook returns a weight rather than a probability (ie, the reciprocal of what it was previously returning) and changed a name in the protocol from scale factor to weight.

I am going to merge this to release-3.6.0 so I can release another alpha from the branch, but I'd still appreciate real review from @trevor-scheer or @pcarrier before this is fully released, if either of you is available this week.

@glasser glasser merged commit 294a8ed into release-3.6.0 Dec 28, 2021
@glasser glasser deleted the glasser/capture-traces branch December 28, 2021 02:51
Copy link
Member

@trevor-scheer trevor-scheer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (post-merge)

glasser added a commit that referenced this pull request Mar 10, 2022
In #5963 we added `!!metrics.captureTraces` to the condition for
considering sending an operation as a trace to Studio. This makes sense
for operations which start to execute (don't send something as a trace
if we didn't record a trace for it!) but if the reason `captureTraces`
is unset is because we never successfully resolved an operation, it
still can be interesting to send a trace, for the sake of the Errors
page.  So we allow sending traces for unexecutable operations (though it
still goes through our standard sampling algorithm).

In order to make sure the recently added Trace.fieldExecutionWeight is
set to the default value of 1 in this case, we set it when initially
creating the Trace rather than in didResolveOperation.

Fixes #6193.
glasser added a commit that referenced this pull request Mar 10, 2022
In #5963 we added `!!metrics.captureTraces` to the condition for
considering sending an operation as a trace to Studio. This makes sense
for operations which start to execute (don't send something as a trace
if we didn't record a trace for it!) but if the reason `captureTraces`
is unset is because we never successfully resolved an operation, it
still can be interesting to send a trace, for the sake of the Errors
page.  So we allow sending traces for unexecutable operations (though it
still goes through our standard sampling algorithm).

In order to make sure the recently added Trace.fieldExecutionWeight is
set to the default value of 1 in this case, we set it when initially
creating the Trace rather than in didResolveOperation.

Fixes #6193.
glasser added a commit that referenced this pull request Mar 10, 2022
)

In #5963 we added `!!metrics.captureTraces` to the condition for
considering sending an operation as a trace to Studio. This makes sense
for operations which start to execute (don't send something as a trace
if we didn't record a trace for it!) but if the reason `captureTraces`
is unset is because we never successfully resolved an operation, it
still can be interesting to send a trace, for the sake of the Errors
page.  So we allow sending traces for unexecutable operations (though it
still goes through our standard sampling algorithm).

In order to make sure the recently added Trace.fieldExecutionWeight is
set to the default value of 1 in this case, we set it when initially
creating the Trace rather than in didResolveOperation.

Fixes #6193.
glasser added a commit that referenced this pull request Mar 11, 2022
)

In #5963 we added `!!metrics.captureTraces` to the condition for
considering sending an operation as a trace to Studio. This makes sense
for operations which start to execute (don't send something as a trace
if we didn't record a trace for it!) but if the reason `captureTraces`
is unset is because we never successfully resolved an operation, it
still can be interesting to send a trace, for the sake of the Errors
page.  So we allow sending traces for unexecutable operations (though it
still goes through our standard sampling algorithm).

In order to make sure the recently added Trace.fieldExecutionWeight is
set to the default value of 1 in this case, we set it when initially
creating the Trace rather than in didResolveOperation.

Fixes #6193.
glasser added a commit to apollographql/federation that referenced this pull request Mar 29, 2022
Apollo Server 2.2.1 introduced an experimental feature named
__resolveObject. It was like an extra-powerful version of
__resolveReference (which did not yet exist) which also allowed you to
return references within a single subgraph and they'd be magically
converted into full objects. It was never documented and was described
multiple times as experimental.

Its implementation is entirely inside the Apollo Server schema
instrumentation, which is largely designed for enabling willResolveField
plugins (eg, usage reporting). In fact, as the developer of the main
feature in AS 3.6.0
(apollographql/apollo-server#5963) I added an
optimization to not call schema instrumentation at all if there are no
willResolveField plugin hooks... which means that this feature is broken
in that case!

We are likely to remove this feature in AS4. To get ready for that, stop
using it in tests here. This involves changing the existing ones to
`__resolveReference` and also adding some direct "object resolution" to
resolvers that return objects within the "owning" service.

This is an improvement anyway, so that our fixtures use the documented
`__resolveReference` feature instead of a different obscure feature.
glasser added a commit to apollographql/federation that referenced this pull request Mar 30, 2022
Apollo Server 2.2.1 introduced an experimental feature named
__resolveObject. It was like an extra-powerful version of
__resolveReference (which did not yet exist) which also allowed you to
return references within a single subgraph and they'd be magically
converted into full objects. It was never documented and was described
multiple times as experimental.

Its implementation is entirely inside the Apollo Server schema
instrumentation, which is largely designed for enabling willResolveField
plugins (eg, usage reporting). In fact, as the developer of the main
feature in AS 3.6.0
(apollographql/apollo-server#5963) I added an
optimization to not call schema instrumentation at all if there are no
willResolveField plugin hooks... which means that this feature is broken
in that case!

We are likely to remove this feature in AS4. To get ready for that, stop
using it in tests here. This involves changing the existing ones to
`__resolveReference` and also adding some direct "object resolution" to
resolvers that return objects within the "owning" service.

This is an improvement anyway, so that our fixtures use the documented
`__resolveReference` feature instead of a different obscure feature.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants