
[no_early_kickoff][core][dashboard] Feature flag task logs recording (#34056) #34101

Merged
7 commits merged into releases/2.4.0 on Apr 10, 2023

Conversation

rickyyx (Contributor) commented Apr 5, 2023

This should address a few issues introduced by the original task log recording feature:

[core] perf regression: 1_1_actor_calls_concurrent #33924
[core] perf regression: 1_1_actor_calls_async #33949
[Tests] Fix two skipped Windows tests for test_task_event_2.py #33738

The root cause of the regressions: with #32943, we record log file offsets before and after executing a task, which calls tell() on the file descriptor object for each worker.

The cost of that shows up when tasks execute concurrently on a single worker.
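
To make the regression mechanism concrete, here is a minimal Python sketch of eager offset capture around task execution. It is not Ray's actual implementation; `run_task_with_log_offsets`, `TaskLogInfo`, and the file handles are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class TaskLogInfo:
    """Byte offsets of a task's output within the worker's log files (illustrative)."""
    stdout_start: int = -1
    stdout_end: int = -1
    stderr_start: int = -1
    stderr_end: int = -1


def run_task_with_log_offsets(task_fn, stdout_file, stderr_file):
    """Eagerly record log file offsets before and after running a task.

    Each tell() call is cheap in isolation, but it sits on the hot path of
    every task invocation, so the cost becomes visible when many small tasks
    run concurrently on a single worker.
    """
    info = TaskLogInfo()
    info.stdout_start = stdout_file.tell()
    info.stderr_start = stderr_file.tell()
    try:
        result = task_fn()
    finally:
        # Record the end offsets even if the task raised.
        info.stdout_end = stdout_file.tell()
        info.stderr_end = stderr_file.tell()
    return result, info
```

The extra per-invocation tell() calls are the overhead that the 1_1_actor_calls_concurrent and 1_1_actor_calls_async microbenchmarks amplify.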

I am turning this off by default for this release since the subsequent PRs are not merged yet. We will need to resolve the regression when we turn the feature back on as the subsequent PRs are merged. One idea is to make this "find the offset" procedure async: instead of recording offsets eagerly around task execution, locate the exact task ID's log offsets lazily, at the time the task logs are queried (see the sketch below).
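
Below is a rough sketch of the two knobs discussed above, with hypothetical names throughout: an environment-variable feature flag that keeps eager recording off by default (`RAY_enable_record_task_log_offsets` is an illustrative name, not the real config option), and a lazy lookup that only scans for a task's log span when the logs are queried (the `:task_start:`/`:task_end:` markers are likewise invented for the example).

```python
import os

# Hypothetical flag name; the actual Ray config option may differ.
RECORD_TASK_LOG_OFFSETS = (
    os.environ.get("RAY_enable_record_task_log_offsets", "0") == "1"
)


def maybe_record_offsets(stdout_file, stderr_file):
    """Only pay the per-task tell() cost when the feature flag is enabled."""
    if not RECORD_TASK_LOG_OFFSETS:
        return None
    return stdout_file.tell(), stderr_file.tell()


def find_task_log_offsets(log_path, task_id):
    """Lazy alternative: locate a task's log span only at query time.

    Assumes (hypothetically) that the worker writes structured marker lines
    around a task's output; real log formats differ.
    """
    start, end = None, None
    offset = 0
    with open(log_path, "rb") as f:
        for line in f:
            if f":task_start:{task_id}".encode() in line:
                start = offset
            elif f":task_end:{task_id}".encode() in line:
                end = offset + len(line)
            offset += len(line)
    return start, end
```

Either way roughly the same scanning work happens; the lazy version just moves it off the per-task hot path and onto the much rarer query path, which is the direction suggested above.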

Why are these changes needed?

Related issue number

Closes #33924
Closes #33949

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rickyyx (Contributor, Author) commented Apr 5, 2023

Approval here: #34056 (review)

@clarng clarng self-requested a review April 5, 2023 21:58
clarng (Contributor) commented Apr 5, 2023

fix dco?

rickyyx (Contributor, Author) commented Apr 6, 2023

Fixed.

clarng (Contributor) commented Apr 6, 2023

The workflow test failure looks new

2.4.0 doesn't have it: https://buildkite.com/ray-project/oss-ci-build-branch/builds/3201

clarng added the release-blocker (P0 Issue that blocks the release) and v2.4.0-pick labels Apr 6, 2023
rickyyx (Contributor, Author) commented Apr 6, 2023

oh which job is that? It might be flaky.

clarng (Contributor) commented Apr 6, 2023

rickyyx (Contributor, Author) commented Apr 6, 2023

The test looks unrelated (workflow logging is completely different from this I believe). Let me retry first.

clarng (Contributor) commented Apr 7, 2023

Hmm looks like it is still failing

rickyyx (Contributor, Author) commented Apr 7, 2023

Hmm, that's surprising. Let me look into it more closely.

rickyyx changed the title from "[core][dashboard] Feature flag task logs recording (#34056)" to "[no_early_kickoff][core][dashboard] Feature flag task logs recording (#34056)" Apr 8, 2023
rickyyx added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Apr 8, 2023
rickyyx (Contributor, Author) commented Apr 9, 2023

OK - looks like [no_early_kickoff] solves the issue. It must be some incompatibility with the Docker image prebuilt on the release branch.

rickyyx removed the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Apr 9, 2023
rickyyx (Contributor, Author) commented Apr 10, 2023

This should be ready to merge: cc @clarng

@clarng clarng merged commit 9fcc9aa into ray-project:releases/2.4.0 Apr 10, 2023
fishbone pushed a commit that referenced this pull request Apr 10, 2023
Signed-off-by: rickyyx <[email protected]>

We have seen the workflow test fail on PRs with totally unrelated content because of the reuse of the cached Docker image.
- #34101

It seems the workflow test does have a dependency on the wheels being built.
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023