
feat: add profiler for request execution details for write api connection worker #2555

Merged
5 commits merged into googleapis:main on Jul 15, 2024

Conversation

GaoleMeng
Contributor

Add a profiler to generate periodic reports.

In the next PR, we will start adding the points of interest at which we want to examine performance.
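
For reference, a minimal sketch of how a caller could wrap one profiled step once the instrumentation is wired in. The startOperation/endOperation names come up later in this review, but their exact signatures are not shown here, so the constructor and argument order below are assumptions:

// Hedged sketch only: the start/end pattern around a single profiled step.
// RequestProfiler's real constructor and method signatures may differ.
RequestProfiler profiler = new RequestProfiler();
String requestUniqueId = "request_1";
profiler.startOperation(RequestProfiler.OperationName.JSON_TO_PROTO_CONVERSION, requestUniqueId);
try {
  // ... convert the user's JSON rows to protos here ...
} finally {
  profiler.endOperation(RequestProfiler.OperationName.JSON_TO_PROTO_CONVERSION, requestUniqueId);
}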

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> ☕️

If you write sample code, please follow the samples format.

@GaoleMeng GaoleMeng requested a review from a team as a code owner July 10, 2024 21:00
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. labels Jul 10, 2024
@GaoleMeng GaoleMeng requested a review from yirutang July 10, 2024 21:00
@GaoleMeng GaoleMeng force-pushed the main branch 6 times, most recently from b4e467d to ef12805 on July 11, 2024 04:43
@GaoleMeng GaoleMeng changed the title from "Add profiler for request execution details. The usage of the new API" to "feat: add profiler for request execution details for write api connection worker" on Jul 11, 2024

@agrawal-siddharth agrawal-siddharth left a comment

So if I understand correctly, once this profiler is available, subsequent commits will then start using it in various places within the codebase. There is definitely overlap with the metrics being generated for OpenTelemetry. Perhaps eventually we could automatically profile those metrics with this tool.


private static final Logger log = Logger.getLogger(RequestProfiler.class.getName());

// Control per how many requests we log one time for a dropped operation.

Contributor:

I'm not clear on what is meant by "dropped". Does this mean a request was rejected?

Contributor Author:

When we cache too many requests we would drop them; added a comment.

* </pre>
*/
public class RequestProfiler {
enum OperationName {

Contributor:

Why does the name of the operation need to be pre-defined or known in advance to the profiler? Is this just to ensure consistency in how it is used?

Contributor Author:

Yeah, consistency, so that we don't have to deal with random names.

Contributor:

Let's ensure there is consistency between the metric names identified here, and those currently used and future ones to be added for OpenTelemetry metrics. The latter are defined in this document: http://go/writeapi-telemetry. See the names used under "List of instruments".

If there is a need for profiling metrics there is likely a need for the same OpenTelemetry metric.

Contributor Author:

I think the current overlap is the backend latency; let's change it to NETWORK_RESPONSE_LATENCY.

// Control per how many requests we log one time for a dropped operation.
// An operation can be dropped if we are caching too many requests (by default 100000) in memory. If we are at that
// state, any new operation for new requests would be dropped.
private static final int LOG_PER_DROPPED_OPERATION = 50;

Contributor:

OK. Now I understand this constant better. It really just helps with reducing the volume of log messages generated around dropped operations. Maybe it would be more clear if you defined this constant after defining the MAX_CACHED_REQUEST. Perhaps a better name would be DROPPED_OPERATION_LOG_FREQUENCY.

Contributor Author:

I think let's just remove this variable and the related logic. The upper cache bound is really large; I would say safeguarding against such an extreme case does not have much value.

// Json to proto conversion time.
JSON_TO_PROTO_CONVERSION("json_to_proto_conversion"),
// Time spent to fetch the table schema when user didn't provide it.
SCHEMA_FECTCHING("schema_fetching"),

Contributor:

typo here, should be SCHEMA_FETCHING

Contributor Author:

Done

@GaoleMeng GaoleMeng force-pushed the main branch 2 times, most recently from 082c376 to a8f8d5a on July 11, 2024 23:44

@agrawal-siddharth agrawal-siddharth left a comment

I'm wondering if tracking every single request will result in excessive memory consumption. To deal with this, you may need to:

(a) use histogram buckets to break down individual values into aggregate values, or

(b) dynamically keep track of only the K longest latency requests per type.

* </pre>
*/
public class RequestProfiler {
enum OperationName {

Contributor:

Let's ensure there is consistency between the metric names identified here, and those currently used and future ones to be added for OpenTelemetry metrics. The latter are defined in this document: http://go/writeapi-telemetry. See the names used under "List of instruments".

If there is a need for profiling metrics there is likely a need for the same OpenTelemetry metric.

@GaoleMeng GaoleMeng force-pushed the main branch 3 times, most recently from daac318 to f0cbac8 on July 12, 2024 23:00
@GaoleMeng GaoleMeng self-assigned this Jul 12, 2024

@GaoleMeng GaoleMeng left a comment

Discussed offline that it's not easy to come up with a clean solution that would keep only the top K requests at all times to save memory. Since it's a debugging-only class and each flush releases the memory of the past requests, we think it's fine to continue with the current solution.
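
A minimal sketch of the flush behavior this relies on (field and method names below are assumptions for illustration, not the exact code in this PR): logging the accumulated report and then clearing the per-request map is what releases the memory of past requests between reports.

// Hedged sketch, assumed names: periodic flush inside RequestProfiler.
void flushAndLogReport() {
  StringBuilder report = new StringBuilder("Request execution details report:\n");
  for (Map.Entry<String, PerRequestProfiler> entry : idToPerRequestProfiler.entrySet()) {
    report.append(entry.getValue().formatReport()).append('\n');
  }
  // Clearing the map bounds memory: only requests recorded since the last flush are held.
  idToPerRequestProfiler.clear();
  log.info(report.toString());
}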

+ "seen before, this is possible when "
+ "we are recording too much ongoing requests. So far we has dropped %s operations.",
requestUniqueId, droppedOperationCount));
droppedOperationCount.incrementAndGet();

Contributor:

Should this be incremented here as well? What if the calling code always calls startOperation() and endOperation()?

Contributor Author:

Done
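
A minimal sketch of the resolution (idToPerRequestProfiler and PerRequestProfiler are assumed names for illustration, not the exact code in this PR): endOperation() for a request id that was never started, for example because the start was dropped, now counts as a dropped operation in addition to logging.

// Hedged sketch, assumed names: endOperation() inside RequestProfiler.
void endOperation(OperationName operationName, String requestUniqueId) {
  PerRequestProfiler perRequest = idToPerRequestProfiler.get(requestUniqueId);
  if (perRequest == null) {
    // Count drops on the end path too, not only when startOperation() drops a request.
    droppedOperationCount.incrementAndGet();
    log.warning(
        String.format(
            "Request %s is not seen before, this is possible when we are recording too "
                + "many ongoing requests. So far we have dropped %s operations.",
            requestUniqueId, droppedOperationCount));
    return;
  }
  perRequest.endOperation(operationName, System.currentTimeMillis());
}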

* -----------------------------
* Request uuid: request_1 with total time 1000 milliseconds
* Operation name json_to_proto_conversion starts at: 1720566109971, ends at: 1720566109971, total time: 200 milliseconds
* Operation name backend_latency starts at: 1720566109971, ends at: 1720566109971, total time: 800 milliseconds

Contributor:

I am wondering if we can name this better; this is not pure backend_latency, it is backend_latency + network_latency?

And one problem is that we still cannot join this with full backend performance stats.

Contributor Author:

The current way we can join is through timestamps, unless this request id can be passed to the backend.

Let's call it response_latency to make it clear.
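
For reference, a sketch of how the operation-name enum might read after the renames discussed in this review (only entries visible in the review are listed; the final set and names in the merged code may differ):

// Hedged sketch: operation names seen in this review, with backend_latency
// renamed to response_latency as agreed above.
enum OperationName {
  // Json to proto conversion time.
  JSON_TO_PROTO_CONVERSION("json_to_proto_conversion"),
  // Time spent to fetch the table schema when the user didn't provide it.
  SCHEMA_FETCHING("schema_fetching"),
  // Time spent within the wait queue before the request gets picked up.
  WAIT_QUEUE("wait_queue"),
  // Time between sending the append request and receiving the response
  // (network plus backend processing).
  RESPONSE_LATENCY("response_latency");

  private final String operationName;

  OperationName(String operationName) {
    this.operationName = operationName;
  }
}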

* ...
* </pre>
*/
public class RequestProfiler {

Contributor:

This should have package visibility? Maintaining a public class is a lot more difficult.

Contributor Author:

Done

Contributor:

I don't see that it has been changed? It is still a public class.

// Time spent to fetch the table schema when user didn't provide it.
SCHEMA_FECTCHING("schema_fetching"),
// Time spent within wait queue before it gets picked up.
WAIT_QUEUE("wait_queue"),

Contributor:

I am wondering if additional stats, such as the wait queue length, can be attached.

Contributor Author:

We could, but that requires a really hard interface change; I want to shrink the scope to time-related profiling for now.

@GaoleMeng GaoleMeng merged commit 5691bd5 into googleapis:main Jul 15, 2024
19 checks passed
yifatgortler pushed a commit to yifatgortler/java-bigquerystorage that referenced this pull request on Jul 25, 2024: feat: add profiler for request execution details for write api connection worker (googleapis#2555)

* Add profiler for request execution details. The usage of the new API
will be added in the next PR

* Add profiler for request execution details. The usage of the new API
will be added in the next PR