
[no_early_kickoff][core][dashboard] Feature flag task logs recording (#34056) #34101

Merged
7 commits merged into releases/2.4.0 on Apr 10, 2023

Conversation

rickyyx (Contributor) commented Apr 5, 2023

This should address a few issues introduced by the original task log recording feature:

[core] perf regression: 1_1_actor_calls_concurrent #33924
[core] perf regression: 1_1_actor_calls_async #33949
[Tests] Fix two skipped Windows tests for test_task_event_2.py #33738

The root cause of the regressions: with #32943, we record log file offsets before and after executing a task, which calls tell() on the file descriptor object for each worker.

The cost of that shows up when tasks execute concurrently on a single worker.
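
To make the regression mechanism concrete, here is a minimal Python sketch of eager offset capture around task execution. It is not Ray's actual implementation; `run_task_with_log_offsets`, `TaskLogInfo`, and the file handles are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class TaskLogInfo:
    """Byte offsets of a task's output within the worker's log files (illustrative)."""
    stdout_start: int = -1
    stdout_end: int = -1
    stderr_start: int = -1
    stderr_end: int = -1


def run_task_with_log_offsets(task_fn, stdout_file, stderr_file):
    """Eagerly record log file offsets before and after running a task.

    Each tell() call is cheap in isolation, but it sits on the hot path of
    every task invocation, so the cost becomes visible when many small tasks
    run concurrently on a single worker.
    """
    info = TaskLogInfo()
    info.stdout_start = stdout_file.tell()
    info.stderr_start = stderr_file.tell()
    try:
        result = task_fn()
    finally:
        # Record the end offsets even if the task raised.
        info.stdout_end = stdout_file.tell()
        info.stderr_end = stderr_file.tell()
    return result, info
```

The extra per-invocation tell() calls are the overhead that the 1_1_actor_calls_concurrent and 1_1_actor_calls_async microbenchmarks amplify.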

I am turning this off by default for this release since the subsequent PRs are not merged yet. We will need to resolve the regression when we turn the feature back on as the subsequent PRs are merged. One idea is to make this "find the offset" procedure async: instead of recording offsets eagerly around task execution, locate the exact task ID's log offsets lazily, at the time the task logs are queried (see the sketch below).
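
Below is a rough sketch of the two knobs discussed above, with hypothetical names throughout: an environment-variable feature flag that keeps eager recording off by default (`RAY_enable_record_task_log_offsets` is an illustrative name, not the real config option), and a lazy lookup that only scans for a task's log span when the logs are queried (the `:task_start:`/`:task_end:` markers are likewise invented for the example).

```python
import os

# Hypothetical flag name; the actual Ray config option may differ.
RECORD_TASK_LOG_OFFSETS = (
    os.environ.get("RAY_enable_record_task_log_offsets", "0") == "1"
)


def maybe_record_offsets(stdout_file, stderr_file):
    """Only pay the per-task tell() cost when the feature flag is enabled."""
    if not RECORD_TASK_LOG_OFFSETS:
        return None
    return stdout_file.tell(), stderr_file.tell()


def find_task_log_offsets(log_path, task_id):
    """Lazy alternative: locate a task's log span only at query time.

    Assumes (hypothetically) that the worker writes structured marker lines
    around a task's output; real log formats differ.
    """
    start, end = None, None
    offset = 0
    with open(log_path, "rb") as f:
        for line in f:
            if f":task_start:{task_id}".encode() in line:
                start = offset
            elif f":task_end:{task_id}".encode() in line:
                end = offset + len(line)
            offset += len(line)
    return start, end
```

Either way roughly the same scanning work happens; the lazy version just moves it off the per-task hot path and onto the much rarer query path, which is the direction suggested above.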

Why are these changes needed?

Related issue number

Closes #33924
Closes #33949

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rickyyx (Contributor, Author) commented Apr 5, 2023

Approval here: #34056 (review)

@clarng clarng self-requested a review April 5, 2023 21:58
clarng (Contributor) commented Apr 5, 2023

fix dco?

rickyyx (Contributor, Author) commented Apr 6, 2023

Fixed.

clarng (Contributor) commented Apr 6, 2023

The workflow test failure looks new

2.4.0 doesn't have it: https://buildkite.com/ray-project/oss-ci-build-branch/builds/3201

clarng added the release-blocker (P0 Issue that blocks the release) and v2.4.0-pick labels Apr 6, 2023
rickyyx (Contributor, Author) commented Apr 6, 2023

oh which job is that? It might be flaky.

clarng (Contributor) commented Apr 6, 2023

rickyyx (Contributor, Author) commented Apr 6, 2023

The test looks unrelated (workflow logging is completely different from this I believe). Let me retry first.

clarng (Contributor) commented Apr 7, 2023

Hmm looks like it is still failing

rickyyx (Contributor, Author) commented Apr 7, 2023

Hmm, that's surprising. Let me look into it more closely.

rickyyx changed the title from "[core][dashboard] Feature flag task logs recording (#34056)" to "[no_early_kickoff][core][dashboard] Feature flag task logs recording (#34056)" Apr 8, 2023
rickyyx added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Apr 8, 2023
rickyyx (Contributor, Author) commented Apr 9, 2023

OK - looks like [no_early_kickoff] solves the issue. It must be some incompatibility with the Docker image prebuilt on the release branch.

rickyyx removed the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Apr 9, 2023
rickyyx (Contributor, Author) commented Apr 10, 2023

This should be ready to merge: cc @clarng

@clarng clarng merged commit 9fcc9aa into ray-project:releases/2.4.0 Apr 10, 2023
fishbone pushed a commit that referenced this pull request Apr 10, 2023
Signed-off-by: rickyyx <[email protected]>

We have seen the workflow test fail on PRs with totally unrelated content because of the reuse of the cached Docker image.
- #34101

It seems the workflow test does have a dependency on the wheels being built.
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023