[no_early_kickoff][core][dashboard] Feature flag task logs recording (#34056) #34101
Conversation
This should address a few issues introduced by the original task log recording feature:

- [core] perf regression: 1_1_actor_calls_concurrent ray-project#33924
- [core] perf regression: 1_1_actor_calls_async ray-project#33949
- [Tests] Fix two skipped Windows tests for test_task_event_2.py ray-project#33738

The root cause of the regressions: with ray-project#32943, we record log file offsets before and after executing a task, which calls tell() on each worker's log file object. The cost of that shows up when tasks execute concurrently on a single worker.

I am turning this off by default for this release since the subsequent PRs are not merged yet. We will need to resolve the regression before turning the feature back on once those PRs are merged. One idea is to make this "find the offset" procedure asynchronous, e.g. locate the exact task ID's log offsets lazily at query time, when the task logs are actually requested.
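For illustration, here is a minimal, hypothetical Python sketch (not Ray's actual implementation) of the mechanism the description refers to: reading the worker's log file offset with tell() immediately before and after each task, gated behind a flag that defaults to off. The flag name RECORD_TASK_LOG_OFFSETS and the WorkerLogRecorder class are illustrative assumptions, not real Ray names.

```python
import os
import threading

# Illustrative flag only; the real Ray feature flag has a different name and is
# read through Ray's internal config system rather than a raw environment variable.
RECORD_TASK_LOG_OFFSETS = os.environ.get("RECORD_TASK_LOG_OFFSETS", "0") == "1"


class WorkerLogRecorder:
    """Hypothetical per-worker helper that brackets task execution with
    log-file offset reads, mimicking the mechanism described above.
    Assumes the worker's stdout/stderr is redirected into log_path."""

    def __init__(self, log_path):
        # One shared file object per worker process.
        self._log_file = open(log_path, "a")
        self._lock = threading.Lock()

    def run_task(self, task_id, fn, *args, **kwargs):
        start = end = None
        if RECORD_TASK_LOG_OFFSETS:
            # tell() is called once before and once after every task; with many
            # tasks running concurrently on the same worker, this per-task
            # bookkeeping (and the locking around the shared file object)
            # becomes visible in throughput benchmarks.
            with self._lock:
                start = self._log_file.tell()
        result = fn(*args, **kwargs)
        if RECORD_TASK_LOG_OFFSETS:
            with self._lock:
                end = self._log_file.tell()
        # (start, end) would be attached to the task event for task_id so the
        # dashboard can later slice this task's output out of the worker log.
        return result, (start, end)
```

The alternative floated in the description (making the offset lookup asynchronous and doing it only at query time, when the task logs are actually requested) would move this bookkeeping off the task execution path, at the cost of extra work per log query.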
Approval here: #34056 (review)
Fix DCO?
Fixed.
The workflow test failure looks new; 2.4.0 doesn't have it: https://buildkite.com/ray-project/oss-ci-build-branch/builds/3201
Oh, which job is that? It might be flaky.
https://buildkite.com/ray-project/oss-ci-build-pr/builds/17527#01875261-4ea5-41df-98ba-6e5a80845dfd
The test looks unrelated (workflow logging is completely different from this, I believe). Let me retry first.
Hmm, looks like it is still failing.
Hmm, that's surprising. Let me look into it more closely.
OK - looks like [no_early_kickoff] solves the issue. Must be some incompatibility with the prebuilt docker image on the release branch.
This should be ready to merge: cc @clarng
We have seen the workflow test fail on PRs with totally unrelated content because of re-use of the cached docker image (#34101). It seems the workflow test does have a dependency on the built wheels. Signed-off-by: rickyyx <[email protected]>
Why are these changes needed?
Related issue number
Closes #33924
Closes #33949
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.