Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Connect trial UI to workloads API; pass sort/filter to API #4407

Merged
merged 15 commits into from
Aug 2, 2022
Merged

feat: Connect trial UI to workloads API; pass sort/filter to API #4407

merged 15 commits into from
Aug 2, 2022

Conversation

mapmeld
Copy link
Contributor

@mapmeld mapmeld commented Jun 22, 2022

Description

Connects the "Workloads" table on the single-trial page to the Workloads API. This code is part of a plan to remove the response of all workloads and metrics from the Trial API (#4300). We're splitting that PR up for easier review.

We already have a paginated workloads API (#3635). For the sortable table to connect to the API, this PR adds a sortKey field (default is batch #, this adds sorting by a metric) and filter field (a dropdown near the table selects All / Checkpoints / Validation / Checkpoints or Validation).

This commit includes the patch from #4406. Done.

Remaining issue with same-name metrics:

  • coalesce(t.metrics, v.metrics) is odd, can we confirm that no one workload includes both training and validation metrics?
  • When we sort by a metric ("loss"), we sort after the coalesce(t.metrics, v.metrics) line distinguishing training and validation metrics.

Exploring the issue:

Screen Shot 2022-06-22 at 8 12 34 AM

In this test, training-loss and validation-loss both show a sorted-column icon. The training and validation losses are displayed on the correct columns. The training workload 100 and validation workload 100 were returned as separate workloads by the API, but got combined into one table row by workloadsToSteps, and sorted as the highest value for that metric within that page.

latest-master also combines the 100 workloads, and sorts the row as the highest value of all workloads (i.e. DESC shows it first, ASC shows it last)

Screen Shot 2022-06-22 at 8 49 07 AM

Checklist

  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@mapmeld mapmeld requested a review from hkang1 June 22, 2022 13:53
@mapmeld mapmeld requested a review from ioga as a code owner June 22, 2022 13:53
@cla-bot cla-bot bot added the cla-signed label Jun 22, 2022
@netlify
Copy link

netlify bot commented Jun 22, 2022

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit c2e4879
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/62e95d232831fd0008419749
😎 Deploy Preview https://deploy-preview-4407--determined-ui.netlify.app/
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@mapmeld
Copy link
Contributor Author

mapmeld commented Jun 22, 2022

update: I think I have a good resolution on table rows and sorting when there are both training-loss and validation-loss. There would be an issue if typically we have workloads with the same batch number and we do want them merged, but I don't think that's the case?

@mapmeld mapmeld requested a review from trentwatt June 28, 2022 16:14
@trentwatt trentwatt self-assigned this Jun 28, 2022
Copy link
Contributor

@trentwatt trentwatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, main caveat is just that sorting by loss ASC results in displaying null first. may have been the case previously to but might as well address now.

@@ -396,7 +396,7 @@ FROM trials t
WHERE t.id=$2
AND s.state = 'COMPLETED'
AND total_batches >= $3
AND total_batches <= $4
AND ($4 <= 0 OR total_batches <= $4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: don't see a need for it, but it seems like just for reasons of consistency, end_batches = 0 should be a valid filter. is there a reason we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For infinite batches we could use -1, does that seem like a better option?

Copy link
Contributor Author

@mapmeld mapmeld Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now noticed these:

if endBatches <= 0 {
      endBatches = math.MaxInt32
}

So I don't need to change this query at all / in the first place. Leaving this file unchanged.

@@ -423,7 +423,7 @@ JOIN validations v ON t.id = v.trial_id
WHERE t.id=$2
AND v.state = 'COMPLETED'
AND v.total_batches >= $3
AND v.total_batches <= $4
AND ($4 <= 0 OR v.total_batches <= $4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question

master/internal/api_trials.go Outdated Show resolved Hide resolved
master/internal/api_trials.go Outdated Show resolved Hide resolved
proto/src/determined/api/v1/trial.proto Outdated Show resolved Hide resolved
rowClassName={defaultRowClassName({ clickable: false })}
rowKey="batchNum"
rowKey="key"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😃

@@ -58,7 +64,7 @@ page_info AS (
)
SELECT (
SELECT jsonb_agg(w) FROM (SELECT validation, training, checkpoint FROM workloads
ORDER BY total_batches %s, end_time %s
ORDER BY %s %s NULLS LAST, total_batches %s, end_time %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think maybe there could be an extra NULLS LAST in here somewhere? sorting by loss asc results in nulls first

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have access to the same experiment, but as validation_loss would be desc here, I'm thinking that you're sorting by another column? And that column variable exists on both training and validation? I was thinking that it was going OK to sort these together, but I forgot how likely it is that only one of the columns would be displayed in the UI. So I probably want to revisit this and make it possible to treat these differently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on 2nd thought, more likely it is doing the correct sort in the API but sorting nulls weirdly in the UI... I have an example locally where I will work on that issue too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ got the UI table's nulls last working

Copy link
Contributor

@hkang1 hkang1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick peak, looks like @trentwatt got a lot of coverage already and it's looking good! Added some minor things here and there.

proto/src/determined/api/v1/trial.proto Show resolved Hide resolved
proto/src/determined/api/v1/trial.proto Outdated Show resolved Hide resolved
proto/src/determined/api/v1/trial.proto Show resolved Hide resolved
webui/react/src/services/decoder.ts Outdated Show resolved Hide resolved
@trentwatt
Copy link
Contributor

update: I think I have a good resolution on table rows and sorting when there are both training-loss and validation-loss. There would be an issue if typically we have workloads with the same batch number and we do want them merged, but I don't think that's the case?

what do you think we want to do here @hkang1 ? merge or not? if no merge, we end up with a single batch having separate rows for training metrics, validation metrics, and checkpoints, at least under certain circumstances.

would require significantly changing API structure/query logic though?

image

@trentwatt trentwatt removed their assignment Jul 11, 2022
@hkang1
Copy link
Contributor

hkang1 commented Aug 2, 2022

update: I think I have a good resolution on table rows and sorting when there are both training-loss and validation-loss. There would be an issue if typically we have workloads with the same batch number and we do want them merged, but I don't think that's the case?

what do you think we want to do here @hkang1 ? merge or not? if no merge, we end up with a single batch having separate rows for training metrics, validation metrics, and checkpoints, at least under certain circumstances.

would require significantly changing API structure/query logic though?

image

What's the level of effort to preserve the merged behavior?

We are not quite there yet but the goal is eventually to have a dedicated checkpoints section for each experiment, along with a dedicated data download capability. This workloads table is really for the user to understand what data is getting used to draw the chart and to find if there are problems with training (e.g. looking for exploding values of NaN, Inf or -Inf).

Basically saying if it's easy to preserve the merged behavior, let's do it, otherwise the separate rows are fine as it is a temporary state. The benefits of being able to sort and paginate metrics and checkpoints more efficiently is much needed for performance issues we've had around this that compared to losing merged behavior is acceptable IMO.

@mapmeld mapmeld merged commit 972954d into determined-ai:master Aug 2, 2022
@mapmeld mapmeld deleted the mvp_downsample_v3 branch August 2, 2022 21:02
@dannysauer dannysauer modified the milestones: 0.0.102, 0.19.1 Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants