Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: break workload info from trial endpoint into a new endpoint [DET-6729] #3635

Merged
merged 1 commit into from
May 4, 2022

Conversation

dzhu
Copy link
Contributor

@dzhu dzhu commented Feb 18, 2022

Description

When a trial has run many workloads, the response to the trial details
endpoint for it can become very large and unwieldy. Since we don't
always need the full set of workloads, we move those into a new endpoint
and just have some useful workload summary information in the original
one.

Test Plan

  • run a no-op experiment, check that the total checkpoint size comes through
  • run a no-op experiment, check the workloads endpoint with various ordering/offset/limit options
  • add an E2E test (or extend an existing one)

Commentary

First two bullet points from the ticket done, third still to do.

The total checkpoint size is represented in the protobuf as uint64, since a trial with large or many checkpoints could easily overflow 32 bits. That ends up being translated into JSON as a string because JavaScript can't natively represent all 64-bit integers. The only alternative appears to be switching the type to float, which would feel weird to me.

@dzhu dzhu requested a review from hkang1 February 18, 2022 21:11
@dzhu dzhu requested a review from ioga as a code owner February 18, 2022 21:11
@cla-bot cla-bot bot added the cla-signed label Feb 18, 2022
@netlify
Copy link

netlify bot commented Feb 18, 2022

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit c5da812
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6272ba9dde8af80008d5187e

@ioga
Copy link
Contributor

ioga commented Feb 24, 2022

cc @stoksc because I have less context on this

@ioga ioga requested a review from stoksc February 24, 2022 19:26
rpc GetTrialWorkloads(GetTrialWorkloadsRequest)
returns (GetTrialWorkloadsResponse) {
option (google.api.http) = {
get: "/api/v1/trial/{trial_id}/workloads"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post discussion in backend eng, should we just make the the /metrics endpoint?

@dzhu dzhu force-pushed the trial-workloads-api branch 2 times, most recently from f2e3ac9 to dd39c93 Compare March 1, 2022 23:28
@hkang1
Copy link
Contributor

hkang1 commented Mar 2, 2022

@dzhu this step probably comes at the end, but should the changes generated by the following get checked in also?

  make -C proto build
  make -C bindings build
  make -C webui/react build-bindings

@@ -84,6 +84,8 @@ message Trial {
// The wall clock time is all active time of the cluster for the trial,
// inclusive of everything (restarts, initiailization, etc), in seconds.
double wall_clock_time = 12;
// The sum of sizes of all resources in all checkpoints for the trial.
uint64 total_checkpoint_size = 13;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I see this defined here as uint64 but the endpoint returns this as a string for some reason:
Screen Shot 2022-03-01 at 9 11 55 PM

@ioga ioga assigned dzhu and unassigned ioga Mar 14, 2022
@codecov-commenter
Copy link

codecov-commenter commented May 3, 2022

Codecov Report

Merging #3635 (ce1fe9b) into master (02114cf) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3635   +/-   ##
=======================================
  Coverage   24.62%   24.62%           
=======================================
  Files         256      256           
  Lines        9982     9982           
  Branches     2818     2818           
=======================================
  Hits         2458     2458           
  Misses       7507     7507           
  Partials       17       17           

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dzhu dzhu force-pushed the trial-workloads-api branch 4 times, most recently from d80a7c1 to ce1fe9b Compare May 4, 2022 17:32
When a trial has run many workloads, the response to the trial details
endpoint for it can become very large and unwieldy. Since we don't
always need the full set of workloads, we move those into a new endpoint
and just have some useful workload summary information in the original
one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants