usage reporting plugin: add fieldLevelInstrumentation option #5963

glasser · 2021-12-22T00:30:47Z

The usage reporting plugin already has an includeRequest option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
includeRequest: false is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, fieldLevelInstrumentation, which is invoked after
includeRequest returns true. If you return false from this operation, you won't
incur the overhead of capturing a detailed trace (either directly in this
process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

The "field executions" column on the Fields page
The per-field timing hints shown in Explorer and in vscode-graphql
The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and "estimated
execution count" for each field (for each operation and client). The former is
literally how many times we saw the field get executed (only during operations
where fieldLevelInstrumentation returned truthy). If the hook only returns
true or false, the latter is the same, but you may also return a number (typically
either 0 or at least 1) which represents a weight for that operation; the "estimated
execution count" will be incremented by that number instead of by 1.

So for example, with:

fieldLevelInstrumentation: () => Math.random() < 0.01 ? 1/0.01 : false

Apollo Server will instrument 1% of operations, and the "estimated execution
count" will be 100 times more than the observed execution count. (You can
imagine more sophisticated implementations of fieldLevelInstrumentation which
sample more common operations more aggressively than rare operations.)

If you pass a number for fieldLevelInstrumentation, it is equivalent to
passing a function of the form in the above example; that is, the previous
example behaves identically to fieldLevelInstrumentation: 0.01.

The latency_count sent with field stats (which powers per-field timing hints)
is now always scaled by the given fact (ie, there's no separate "observed
histogram").

Note that the semantics of the requestContext.metrics.captureTraces field
changes with this PR. Previously its value matched the value returned by the
includeRequest hook; now it matches the truthiness of the value returned by
the fieldLevelInstrumentation hook. This field determines whether Apollo
Gateway includes the apollo-federation-include-trace header with outgoing
requests so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without
any need for per-field instrumentation, the request pipeline now only
instruments the schema if it ever sees a willResolveField callback. This
almost means that if you're running a monolithic server with
fieldLevelInstrumentation always returning false (or usage reporting disabled),
or if you're running a subgraph whose gateway has
fieldLevelInstrumentation:false (and thus never receives a request with the
apollo-federation-include-trace header), then execution won't have the small
performance impact of instrumentation. In practice you need to also disable the
cache control plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this optimization proves
to be important we can provide a mode of the cache control plugin that doesn't
allow for dynamic cache control. (Alternatively we may eventually find a way to
support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.

glasser · 2021-12-22T19:02:00Z

might rename this to fieldLevelInstrumentation to make it clear it's about "are we paying attention to fields" and not "are we building a trace" because that is more confusable with "are we sending a trace"

Might change the number thing to just counting for each operation how many un-instrumented requests there were (see https://github.com/mdg-private/monorepo/pull/7411)

glasser · 2021-12-22T19:44:26Z

OK, I did some of the above but have to go afk

todo:

fix build of stats.test.ts
remove number version of fieldLevelInstrumentation (or at least change how we describe it)
double check how error reporting works with the new stuff

trevor-scheer · 2021-12-22T20:45:25Z

docs/source/api/plugin/usage-reporting.md

+
+This option allows you to choose if Apollo Server should calculate detailed per-field statistics for a particular request. It is only called for executable operations: operations which parse and validate properly and which do not have an unknown operation name. It is not called if an [`includeRequest`](#includerequest) hook is provided and returns false.
+
+Returning false means that Apollo Server will only pay attention to overall properties of the operation, like what GraphQL operation is executing and how long the entire operation takes to execute, and not anything about field-by-field execution.


This reads a bit funny but I don't have a good suggestion yet

docs/source/api/plugin/usage-reporting.md

trevor-scheer · 2021-12-22T20:54:47Z

docs/source/api/plugin/usage-reporting.md

+
+Note that returning true here does *not* mean that the captured trace must be *transmitted* to Apollo Studio's servers in the form of a trace; it may still be aggregated locally to statistics. But a captured trace still will contribute to the "field executions" statistic and timing hints even if it is aggregated locally.
+
+Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator. (If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.)


Is this an aside?

Suggested change

Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator. (If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.)

Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator.

> If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.

Going to change the actual semantics here.

trevor-scheer · 2021-12-22T20:56:04Z

docs/source/api/plugin/usage-reporting.md

 This function is called for each received request. It takes a [`GraphQLRequestContext`](https://github.com/apollographql/apollo-server/blob/main/packages/apollo-server-types/src/index.ts#L115-L150) object and must return a `Promise<Boolean>` that indicates whether to include the request. It's called either after the operation is successfully resolved (via [the `didResolveOperation` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#didresolveoperation)), or when sending the final error response if the operation was not successfully resolved (via [the `willSendResponse` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#willsendresponse)).

+(If you don't want any usage reporting at all, don't use this option: instead, either avoid specifying an Apollo API key or explicitly [disable the plugin](#disabling-the-plugin).)


Not sure this needs parens at all

trevor-scheer · 2021-12-22T20:57:12Z

package.json

@@ -116,6 +117,7 @@
    "koa": "2.13.4",
    "koa-router": "10.1.1",
    "lerna": "4.0.0",
+    "lodash.sumby": "^4.6.0",


The ^ here will result in a renovate PR to remove it

trevor-scheer · 2021-12-22T21:11:06Z

packages/apollo-server-core/src/plugin/usageReporting/__tests__/plugin.test.ts

  }: {
    pluginOptions?: ApolloServerPluginUsageReportingOptions<any>;
    expectReport?: boolean;
    query?: string;
    operationName?: string | null;
+    // null means no expectation because it's not deterministic
+    schemaShouldBeInstrumented?: boolean | null;


Can this ever be undefined? Looks like it's defaulted above

Suggested change

schemaShouldBeInstrumented?: boolean | null;

schemaShouldBeInstrumented: boolean | null;

The ? here is more like "key can be left out" than "can also be undefined" (though I think in TS these days it kinda means both and they are trying to migrate away from that). Applying your change means that the calls that don't specify schemaShouldBeInstrumented fail to build. TS properly determines that schemaShouldBeInstrumented is boolean | null (not undefined) inside the function so I think I'm doing it correctly.

trevor-scheer · 2021-12-22T21:19:36Z

packages/apollo-server-core/src/plugin/usageReporting/__tests__/plugin.test.ts

+      const expected = fraction * total;
+      // If it strays from the expected amount of 35 by too far we fail.
+      expect(actualFieldLevelInstrumentation).toBeGreaterThanOrEqual(
+        0.6 * expected,
+      );
+      expect(actualFieldLevelInstrumentation).toBeLessThanOrEqual(
+        1.4 * expected,
+      );


Smells like a flaky test, can't we mock Math.random() to get deterministic results in our tests?

Interesting point! https://stackoverflow.com/questions/41570273/how-to-test-a-function-that-output-is-random-using-jest or https://www.npmjs.com/package/jest-mock-random might do the trick here.

Might also remove the test — going to be changing how this is implemented anyway.

Mocked Math.random.

The usage reporting plugin already has an `includeRequest` option which allows you to tell the plugin to completely ignore an operation. One reason you may have wanted to use this is to avoid the overhead of capturing a full field-by-field execution trace. But in this case, `includeRequest: false` is overkill: it removes the operation from aspects of Studio such as the Operations page and schema checks which don't require full execution traces to operate. This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after includeRequest returns true. If you return false from this operation, you won't incur the overhead of capturing a detailed trace (either directly in this process, or in subgraphs if this process is a Gateway). Most of Studio's features (including the newly-added "referencing operations" column on the Fields page) will still reflect the existence of this operation. As of the end of 2021, the features that this operation will not contribute to are: - The "field executions" column on the Fields page - The per-field timing hints shown in Explorer and in vscode-graphql - The trace view on the Operations page Apollo Server now sends both an "observed execution count" and "estimated execution count" for each field (for each operation and client). The former is literally how many times we saw the field get executed (only during operations where `fieldLevelInstrumentation` returned truthy). If the hook only returns true or false, the latter is the same, but you may also return a number between 0 and 1 which represents the sample rate for that operation; the "estimated execution count" will be incremented by the reciprocal of that number instead of by 1. So for example, with: fieldLevelInstrumentation: () => Math.random() < 0.01 ? 0.01 : false Apollo Server will instrument 1% of operations, and the "estimated execution count" will be 100 times more than the observed execution count. (You can imagine more sophisticated implementations of `fieldLevelInstrumentation` which sample more common operations more aggressively than rare operations.) If you pass a number for `fieldLevelInstrumentation`, it is equivalent to passing a function of the form in the above example; that is, the previous example behaves identically to `fieldLevelInstrumentation: 0.01`. The `latency_count` sent with field stats (which powers per-field timing hints) is now always scaled by the given fact (ie, there's no separate "observed histogram"). Note that the semantics of the `requestContext.metrics.captureTraces` field changes with this PR. Previously its value matched the value returned by the `includeRequest` hook; now it matches the truthiness of the value returned by the `fieldLevelInstrumentation` hook. This field determines whether Apollo Gateway includes the `apollo-federation-include-trace` header with outgoing requests so this is an appropriate change. Since this may provide a common use case for running a GraphQL server without any need for per-field instrumentation, the request pipeline now only instruments the schema if it ever sees a willResolveField callback. This *almost* means that if you're running a monolithic server with fieldLevelInstrumentation always returning false (or usage reporting disabled), or if you're running a subgraph whose gateway has fieldLevelInstrumentation:false (and thus never receives a request with the `apollo-federation-include-trace` header), then execution won't have the small performance impact of instrumentation. In practice you need to also disable the cache control plugin to get this speedup, as it is installed by default and uses willResolveField to implement dynamic cache control. If this optimization proves to be important we can provide a mode of the cache control plugin that doesn't allow for dynamic cache control. (Alternatively we may eventually find a way to support the instrumentation of GraphQL execution with lower overhead.) Part of #5708.

glasser · 2021-12-23T23:34:10Z

OK, this is ready for another round of review!

glasser · 2021-12-23T23:36:06Z

(Note that the PR description has been updated.)

pcarrier · 2021-12-25T04:49:04Z

docs/source/api/plugin/usage-reporting.md

+
+Returning false (or 0) here for some or all operations can improve your server's performance, as the overhead of calculating complete traces is not always negligible. This is especially the case if this server is an Apollo Gateway, as captured traces are transmitted from the subgraph to the Gateway in-band inside the actual GraphQL response.
+
+Returning a positive number means that Apollo Server will track each field execution and send Apollo Studio statistics on how many times each field was executed and what the per-field performance was. If the number returned is less than 1, Apollo Server will also send a scaled "estimate" count for each field, equal to the number of observed field executions divided by the number returned by the hook.


Replaced with weight?

pcarrier · 2021-12-25T04:49:13Z

docs/source/api/plugin/usage-reporting.md

@@ -105,6 +105,38 @@ The only properties of the reported error you can modify are its `message` and i
 <tr>
 <td>

+###### `fieldLevelInstrumentation`
+
+`async Function` or `number`


or false, no?

Hmm, I don't think so.

A number is a short-hand for () => Math.random() <= n ? 1/n : 0. false would just be shorthand for () => false. I think writing the latter yourself is easy enough.

…ight

glasser · 2021-12-28T02:46:26Z

I've updated this so that the hook returns a weight rather than a probability (ie, the reciprocal of what it was previously returning) and changed a name in the protocol from scale factor to weight.

I am going to merge this to release-3.6.0 so I can release another alpha from the branch, but I'd still appreciate real review from @trevor-scheer or @pcarrier before this is fully released, if either of you is available this week.

trevor-scheer

LGTM (post-merge)

In #5963 we added `!!metrics.captureTraces` to the condition for considering sending an operation as a trace to Studio. This makes sense for operations which start to execute (don't send something as a trace if we didn't record a trace for it!) but if the reason `captureTraces` is unset is because we never successfully resolved an operation, it still can be interesting to send a trace, for the sake of the Errors page. So we allow sending traces for unexecutable operations (though it still goes through our standard sampling algorithm). In order to make sure the recently added Trace.fieldExecutionWeight is set to the default value of 1 in this case, we set it when initially creating the Trace rather than in didResolveOperation. Fixes #6193.

) In #5963 we added `!!metrics.captureTraces` to the condition for considering sending an operation as a trace to Studio. This makes sense for operations which start to execute (don't send something as a trace if we didn't record a trace for it!) but if the reason `captureTraces` is unset is because we never successfully resolved an operation, it still can be interesting to send a trace, for the sake of the Errors page. So we allow sending traces for unexecutable operations (though it still goes through our standard sampling algorithm). In order to make sure the recently added Trace.fieldExecutionWeight is set to the default value of 1 in this case, we set it when initially creating the Trace rather than in didResolveOperation. Fixes #6193.

Apollo Server 2.2.1 introduced an experimental feature named __resolveObject. It was like an extra-powerful version of __resolveReference (which did not yet exist) which also allowed you to return references within a single subgraph and they'd be magically converted into full objects. It was never documented and was described multiple times as experimental. Its implementation is entirely inside the Apollo Server schema instrumentation, which is largely designed for enabling willResolveField plugins (eg, usage reporting). In fact, as the developer of the main feature in AS 3.6.0 (apollographql/apollo-server#5963) I added an optimization to not call schema instrumentation at all if there are no willResolveField plugin hooks... which means that this feature is broken in that case! We are likely to remove this feature in AS4. To get ready for that, stop using it in tests here. This involves changing the existing ones to `__resolveReference` and also adding some direct "object resolution" to resolvers that return objects within the "owning" service. This is an improvement anyway, so that our fixtures use the documented `__resolveReference` feature instead of a different obscure feature.

glasser requested a review from trevor-scheer December 22, 2021 00:30

glasser changed the title ~~usage reporting plugin: add captureTraces option~~ usage reporting plugin: add fieldLevelInstrumentation option Dec 22, 2021

trevor-scheer reviewed Dec 22, 2021

View reviewed changes

glasser force-pushed the glasser/capture-traces branch 2 times, most recently from e466e8d to 762a029 Compare December 23, 2021 23:28

glasser force-pushed the glasser/capture-traces branch 2 times, most recently from 542b468 to fa29840 Compare December 23, 2021 23:33

pcarrier reviewed Dec 25, 2021

View reviewed changes

glasser added 2 commits December 27, 2021 12:41

update protobuf again

2cd135c

Rename scale factor and change fieldLevelInstrumentation to return we…

c38df25

…ight

glasser merged commit 294a8ed into release-3.6.0 Dec 28, 2021

glasser deleted the glasser/capture-traces branch December 28, 2021 02:51

trevor-scheer reviewed Jan 3, 2022

View reviewed changes

glasser mentioned this pull request Mar 10, 2022

Usage reporting: operations that can't execute still can make interesting traces #6193

Closed

glasser mentioned this pull request Mar 10, 2022

usage reporting: allow sending traces for unexecutable operations #6194

Merged

glasser mentioned this pull request Mar 29, 2022

tests: stop using undocumented __resolveObject AS feature apollographql/federation#1658

Merged

github-actions bot locked as resolved and limited conversation to collaborators Apr 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

usage reporting plugin: add fieldLevelInstrumentation option #5963

usage reporting plugin: add fieldLevelInstrumentation option #5963

glasser commented Dec 22, 2021 •

edited

Loading

glasser commented Dec 22, 2021

glasser commented Dec 22, 2021

trevor-scheer Dec 22, 2021

trevor-scheer Dec 22, 2021

glasser Dec 23, 2021

trevor-scheer Dec 22, 2021

trevor-scheer Dec 22, 2021

trevor-scheer Dec 22, 2021

glasser Dec 23, 2021

trevor-scheer Dec 22, 2021

glasser Dec 23, 2021

glasser Dec 23, 2021

glasser commented Dec 23, 2021

glasser commented Dec 23, 2021

pcarrier Dec 25, 2021

pcarrier Dec 25, 2021 •

edited

Loading

glasser Dec 28, 2021

glasser commented Dec 28, 2021

trevor-scheer left a comment


		This option allows you to choose if Apollo Server should calculate detailed per-field statistics for a particular request. It is only called for executable operations: operations which parse and validate properly and which do not have an unknown operation name. It is not called if an [`includeRequest`](#includerequest) hook is provided and returns false.

		Returning false means that Apollo Server will only pay attention to overall properties of the operation, like what GraphQL operation is executing and how long the entire operation takes to execute, and not anything about field-by-field execution.


		Note that returning true here does not mean that the captured trace must be transmitted to Apollo Studio's servers in the form of a trace; it may still be aggregated locally to statistics. But a captured trace still will contribute to the "field executions" statistic and timing hints even if it is aggregated locally.

		Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator. (If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.)

	Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator. (If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.)
	Passing a number between 0 and 1 is equivalent to passing a function that returns true that fraction of the time; for example, passing 0.01 will capture traces for approximately one out of every hundred operations based on a random number generator.
	> If you were planning to write a function that does exactly the equivalent of passing a number here, we recommend passing a number; we may later improve this feature so that Studio can optionally show an estimated total of field executions based on using this number as a scaling factor.

		This function is called for each received request. It takes a [`GraphQLRequestContext`](https://github.com/apollographql/apollo-server/blob/main/packages/apollo-server-types/src/index.ts#L115-L150) object and must return a `Promise<Boolean>` that indicates whether to include the request. It's called either after the operation is successfully resolved (via [the `didResolveOperation` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#didresolveoperation)), or when sending the final error response if the operation was not successfully resolved (via [the `willSendResponse` event](https://www.apollographql.com/docs/apollo-server/integrations/plugins/#willsendresponse)).

		(If you don't want any usage reporting at all, don't use this option: instead, either avoid specifying an Apollo API key or explicitly [disable the plugin](#disabling-the-plugin).)

	schemaShouldBeInstrumented?: boolean \| null;
	schemaShouldBeInstrumented: boolean \| null;


		Returning false (or 0) here for some or all operations can improve your server's performance, as the overhead of calculating complete traces is not always negligible. This is especially the case if this server is an Apollo Gateway, as captured traces are transmitted from the subgraph to the Gateway in-band inside the actual GraphQL response.

		Returning a positive number means that Apollo Server will track each field execution and send Apollo Studio statistics on how many times each field was executed and what the per-field performance was. If the number returned is less than 1, Apollo Server will also send a scaled "estimate" count for each field, equal to the number of observed field executions divided by the number returned by the hook.

usage reporting plugin: add fieldLevelInstrumentation option #5963

usage reporting plugin: add fieldLevelInstrumentation option #5963

Conversation

glasser commented Dec 22, 2021 • edited Loading

glasser commented Dec 22, 2021

glasser commented Dec 22, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

glasser commented Dec 23, 2021

glasser commented Dec 23, 2021

Choose a reason for hiding this comment

pcarrier Dec 25, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

glasser commented Dec 28, 2021

trevor-scheer left a comment

Choose a reason for hiding this comment

glasser commented Dec 22, 2021 •

edited

Loading

pcarrier Dec 25, 2021 •

edited

Loading