-
Notifications
You must be signed in to change notification settings - Fork 356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Connect trial UI to workloads API; pass sort/filter to API #4407
Conversation
✅ Deploy Preview for determined-ui ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
update: I think I have a good resolution on table rows and sorting when there are both training-loss and validation-loss. There would be an issue if typically we have workloads with the same batch number and we do want them merged, but I don't think that's the case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, main caveat is just that sorting by loss ASC results in displaying null first. may have been the case previously to but might as well address now.
@@ -396,7 +396,7 @@ FROM trials t | |||
WHERE t.id=$2 | |||
AND s.state = 'COMPLETED' | |||
AND total_batches >= $3 | |||
AND total_batches <= $4 | |||
AND ($4 <= 0 OR total_batches <= $4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: don't see a need for it, but it seems like just for reasons of consistency, end_batches = 0 should be a valid filter. is there a reason we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For infinite batches we could use -1, does that seem like a better option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now noticed these:
if endBatches <= 0 {
endBatches = math.MaxInt32
}
So I don't need to change this query at all / in the first place. Leaving this file unchanged.
@@ -423,7 +423,7 @@ JOIN validations v ON t.id = v.trial_id | |||
WHERE t.id=$2 | |||
AND v.state = 'COMPLETED' | |||
AND v.total_batches >= $3 | |||
AND v.total_batches <= $4 | |||
AND ($4 <= 0 OR v.total_batches <= $4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question
rowClassName={defaultRowClassName({ clickable: false })} | ||
rowKey="batchNum" | ||
rowKey="key" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😃
@@ -58,7 +64,7 @@ page_info AS ( | |||
) | |||
SELECT ( | |||
SELECT jsonb_agg(w) FROM (SELECT validation, training, checkpoint FROM workloads | |||
ORDER BY total_batches %s, end_time %s | |||
ORDER BY %s %s NULLS LAST, total_batches %s, end_time %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have access to the same experiment, but as validation_loss would be desc here, I'm thinking that you're sorting by another column? And that column variable exists on both training and validation? I was thinking that it was going OK to sort these together, but I forgot how likely it is that only one of the columns would be displayed in the UI. So I probably want to revisit this and make it possible to treat these differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on 2nd thought, more likely it is doing the correct sort in the API but sorting nulls weirdly in the UI... I have an example locally where I will work on that issue too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ got the UI table's nulls last working
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick peak, looks like @trentwatt got a lot of coverage already and it's looking good! Added some minor things here and there.
what do you think we want to do here @hkang1 ? merge or not? if no merge, we end up with a single batch having separate rows for training metrics, validation metrics, and checkpoints, at least under certain circumstances. would require significantly changing API structure/query logic though? |
What's the level of effort to preserve the merged behavior? We are not quite there yet but the goal is eventually to have a dedicated checkpoints section for each experiment, along with a dedicated data download capability. This workloads table is really for the user to understand what data is getting used to draw the chart and to find if there are problems with training (e.g. looking for exploding values of NaN, Inf or -Inf). Basically saying if it's easy to preserve the merged behavior, let's do it, otherwise the separate rows are fine as it is a temporary state. The benefits of being able to sort and paginate metrics and checkpoints more efficiently is much needed for performance issues we've had around this that compared to losing merged behavior is acceptable IMO. |
Description
Connects the "Workloads" table on the single-trial page to the Workloads API. This code is part of a plan to remove the response of all workloads and metrics from the Trial API (#4300). We're splitting that PR up for easier review.
We already have a paginated workloads API (#3635). For the sortable table to connect to the API, this PR adds a sortKey field (default is batch #, this adds sorting by a metric) and filter field (a dropdown near the table selects All / Checkpoints / Validation / Checkpoints or Validation).
This commit includes the patch from #4406.Done.Remaining issue with same-name metrics:
coalesce(t.metrics, v.metrics)
is odd, can we confirm that no one workload includes both training and validation metrics?coalesce(t.metrics, v.metrics)
line distinguishing training and validation metrics.Exploring the issue:
In this test, training-loss and validation-loss both show a sorted-column icon. The training and validation losses are displayed on the correct columns. The training workload 100 and validation workload 100 were returned as separate workloads by the API, but got combined into one table row by
workloadsToSteps
, and sorted as the highest value for that metric within that page.latest-master also combines the 100 workloads, and sorts the row as the highest value of all workloads (i.e. DESC shows it first, ASC shows it last)
Checklist
docs/release-notes/
.