[CI][Green-Ray][4] Compute and store unique crash pattern from logs #34200
Merged
Commits
104 commits
d5b1257
Re-add glue logic for JobManager
can-anyscale cc6db98
[CI] clean up things
can-anyscale 768466b
Add back job run type. They are still used in CLI so not deprecate th…
can-anyscale 57f499a
[CI] clean up things
can-anyscale 12feefe
Rebase
can-anyscale 1ccc73e
[CI] clean up things
can-anyscale 0e9f29c
Add functions that are free from execeptions
can-anyscale 708b80f
Undo changes to fetch_results
can-anyscale 6baff80
Auto-retry for infrastructure errors
can-anyscale 9103124
Exit buildkite job using buildkite return code
can-anyscale 88317be
Handle everything through result exceptions
can-anyscale 62b530e
Throw and retry on purpose
can-anyscale 4742040
Fix things
can-anyscale dbe354c
Need to use value of enum
can-anyscale f2a69ef
out of testing mode
can-anyscale fea4e57
Name consistency
can-anyscale 92a82c0
Fix lints
can-anyscale 88ef3ef
Fix unit tests
can-anyscale a9d5638
Raise an error for testing
can-anyscale 41b770c
Undo debugging code
can-anyscale f900f10
Move retry logic to sh file
can-anyscale 5b1b4e2
Rebase
can-anyscale 5563637
for testing
can-anyscale a472538
More refactoring
can-anyscale 5d4cc62
Undo more changes
can-anyscale e0b1e9e
For testing
can-anyscale 5160e02
Remove debugging info
can-anyscale a95c675
Fix tests
can-anyscale e0249d2
Exit buildkite job using buildkite return code
can-anyscale 229f9f1
Fix lints
can-anyscale 2952e2d
Rebase
can-anyscale a764f76
Rebase
can-anyscale e5b433f
Rebase
can-anyscale dec0ff6
debugging
can-anyscale a753fcf
fix sh
can-anyscale fdef99f
Fix sh again
can-anyscale bb5a57d
Remove debugging information
can-anyscale 03f0848
Rebase
can-anyscale 72fc32a
Rebase
can-anyscale eda6a88
Auto-retry for infrastructure errors
can-anyscale 2568b02
Exit buildkite job using buildkite return code
can-anyscale 12ce609
Handle everything through result exceptions
can-anyscale e07f3e0
Throw and retry on purpose
can-anyscale ee21c0b
Fix things
can-anyscale 86bffd9
Name consistency
can-anyscale 0b2ab0f
Fix lints
can-anyscale 3320f85
Fix unit tests
can-anyscale 972b885
Move retry logic to sh file
can-anyscale c316ee4
Rebase
can-anyscale 55676c5
More refactoring
can-anyscale f322ae3
Undo more changes
can-anyscale afd9967
Fix tests
can-anyscale 3c9a156
Exit buildkite job using buildkite return code
can-anyscale 32d7f5b
Rebase
can-anyscale ed6542e
Only failed-fast job can have transient error
can-anyscale 7384a79
Set ray log to stderr
can-anyscale 0df9263
Get ray logs
can-anyscale 630132b
job response
can-anyscale 6311d28
correct update last job result
can-anyscale 8b70ae7
Fix get log group
can-anyscale 1625377
Best attempt to get ray error logs on infra failures
can-anyscale 6a91523
Rebase
can-anyscale 7237ef6
Undo changes to job_manager
can-anyscale d101138
Use api to download rather than stream ray log files
can-anyscale 0a69db6
Fix lints
can-anyscale 957fde4
Rebase
can-anyscale 04f84b6
Rebase
can-anyscale b940df3
Rebase
can-anyscale 66eaa38
Rebase
can-anyscale 3eac9ad
Rebase
can-anyscale 443b9f7
Auto-retry for infrastructure errors
can-anyscale 5d04eef
Fix unit tests
can-anyscale 2aba338
Undo more changes
can-anyscale 22dee5f
Fix lints
can-anyscale 64a1c9e
Rebase
can-anyscale ef09f03
fix sh
can-anyscale d88b478
Fix sh again
can-anyscale e234c1d
Exit buildkite job using buildkite return code
can-anyscale 2b478c5
Name consistency
can-anyscale ee54250
Move retry logic to sh file
can-anyscale 70954cc
More refactoring
can-anyscale c07d123
Undo more changes
can-anyscale bdbfb31
Exit buildkite job using buildkite return code
can-anyscale d4904dc
Log aggegration
can-anyscale 927b90d
Compute unique crash pattern and store to databrick
can-anyscale f06b8e5
Rebase
can-anyscale c347538
Test
can-anyscale 2516036
Rebase
can-anyscale a2a08eb
Remove debugging info
can-anyscale ccd1530
Lint
can-anyscale dc18cc3
@aslonnie's comments
can-anyscale 317705f
Rebase
can-anyscale bed4b1d
Rebase
can-anyscale 2e4d5b9
Add comments for why we need to look across many ray logs for error p…
can-anyscale 8f95d5c
Rebase
can-anyscale f5e2a5a
Rebase
can-anyscale c11e4f6
Rebase
can-anyscale e7076ea
@aslonnie's comments
can-anyscale f5043cc
Rebase
can-anyscale 179d48b
Fix lints
can-anyscale 485a199
Simply check that output is none
can-anyscale 1c14992
@krfricke's comments
can-anyscale 25bda89
Add new files
can-anyscale bdc8246
Fix lints
can-anyscale
@@ -0,0 +1,100 @@
import re
from typing import List


TRACEBACK_PATTERN = "Traceback (most recent call last)"


class LogAggregator:
    def __init__(self, log: str):
        self.log = log

Comment on lines +7 to +9
Just a nit, but it doesn't look like we actually need a class here (we could just have functions instead) - but fine with me

    def compute_crash_pattern(self) -> str:
        stack_trace = LogAggregator._compute_stack_trace(self.log.splitlines())
        # truncate short enough to store in databases, but long enough to keep the
        # pattern unique
        return LogAggregator._compute_signature(stack_trace)[:4000]

    @staticmethod
    def _compute_signature(stack_trace: List[str]) -> str:
        """
        Compute a signature pattern from the stack trace by removing factors such
        as date, time, temp directory, line numbers, etc. This helps to aggregate
        similar logs into the same bug patterns.
        """
        massaged_trace = []
        for line in stack_trace:
            line = re.sub(r"\d", "", line.strip())
            if line == "Traceback (most recent call last):":
                continue
            file_line = re.search(r'File "(.*)", (.*)', line)
            if file_line:
                # append the file's base name and caller information; the result
                # string is not meant to be meaningful to humans, we just need
                # something that uniquely represents the stack trace
                line = f'{file_line.group(1).split("/")[-1]}{file_line.group(2)}'
            massaged_trace.append(line)
        return "".join(massaged_trace)

    @staticmethod
    def _compute_stack_trace(logs: List[str]) -> List[str]:
        """
        Extract the stack trace pattern from the logs. A stack trace pattern often
        matches the following:
        ERROR ...
        Traceback (most recent call last):
            File "...", line ..., in ...
            ...
        Exception: exception error
        """
        error_stacktrace = []
        stacktrace = []
        i = 0
        while i < len(logs):
            stack = []
            trace = error_stacktrace
            # Search for lines that are either
            #     ... ERROR ...
            # or
            #     ... ERROR ...
            #     Traceback (most recent call last):
            if "ERROR" in logs[i]:
                stack.append(logs[i])
                next = i + 1
                if i + 1 < len(logs) and TRACEBACK_PATTERN in logs[i + 1]:
                    stack.append(logs[i + 1])
                    next = i + 2
            # Or if the line with ERROR does not exist, just search for the line with
            # Traceback (most recent call last):
            elif TRACEBACK_PATTERN in logs[i]:
                stack.append(logs[i])
                trace = stacktrace
                next = i + 1
            # Or else, skip this line and continue
            else:
                i = i + 1
                continue
            # If the line that contains ERROR, Traceback, etc. is found, scan the logs
            # until the line no longer has indentation. This is because a stack trace
            # is always indented, and stops when the line is no longer indented
            while next < len(logs):
                if logs[next].startswith((" ", "\t")):
                    stack.append(logs[next])
                    next = next + 1
                else:
                    break
            # Finished capturing the entire stack trace; the first unindented line
            # is the exception message
            if next < len(logs):
                stack.append(logs[next])
            if stack:
                trace.append(stack)
            i = next + 1

        # Favor stack trace that contains the ERROR keyword
        if error_stacktrace:
            return error_stacktrace[-1]

        # Otherwise any stack trace is fine
        if stacktrace:
            return stacktrace[-1]

        return []
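A minimal standalone sketch of the indentation scan in `_compute_stack_trace`, written as a plain function in the spirit of the review nit about not needing a class. The name `last_traceback` is hypothetical, and this simplified variant deliberately drops the ERROR-line preference, so it is an illustration rather than the merged implementation.

```python
from typing import List

TRACEBACK_PATTERN = "Traceback (most recent call last)"


def last_traceback(logs: List[str]) -> List[str]:
    """Return the last traceback block found in logs, or [] if there is none.

    Hypothetical, simplified variant: only the Traceback header starts a
    block; lines containing ERROR get no special treatment here.
    """
    found: List[str] = []
    i = 0
    while i < len(logs):
        if TRACEBACK_PATTERN not in logs[i]:
            i += 1
            continue
        block = [logs[i]]
        j = i + 1
        # Stack frames are indented; keep consuming while indentation lasts.
        while j < len(logs) and logs[j].startswith((" ", "\t")):
            block.append(logs[j])
            j += 1
        # The first unindented line after the frames is the exception message.
        if j < len(logs):
            block.append(logs[j])
        found = block  # later blocks replace earlier ones
        i = j + 1
    return found


sample_logs = [
    "haha",
    "Traceback (most recent call last):",
    ' File "/tmp/something", line 584, in run_release_test',
    " raise pipeline_exception",
    "ray_release.exception.JobNoLogsError: Could not obtain logs for the job.",
    "hehe",
]
extracted = last_traceback(sample_logs)
```

Here `extracted` holds the four lines from the Traceback header through the exception message; the surrounding "haha" and "hehe" lines are dropped.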
@@ -0,0 +1,68 @@
from ray_release.log_aggregator import LogAggregator


def test_compute_stack_pattern():
    assert (
        LogAggregator(
            "\n".join(
                [
                    "haha",
                    "Traceback (most recent call last):",
                    ' File "/tmp/something", line 584',
                    "Exception: yaya45",
                    "hehe",
                ]
            )
        ).compute_crash_pattern()
        == "somethingline Exception: yaya"
    )


def test_compute_signature():
    assert (
        LogAggregator._compute_signature(
            [
                "Traceback (most recent call last):",
                ' File "/tmp/something", line 584',
                "Exception: yaya45",
            ]
        )
        == "somethingline Exception: yaya"
    )


def test_compute_stack_trace():
    trace = [
        "Traceback (most recent call last):",
        ' File "/tmp/something", line 584, in run_release_test',
        " raise pipeline_exception",
        "ray_release.exception.JobNoLogsError: Could not obtain logs for the job.",
    ]
    error_trace = [
        "[2023-01-01] ERROR: something is wrong",
        "Traceback (most recent call last):",
        ' File "/tmp/something", line 584, in run_release_test',
        " raise pipeline_exception",
        "ray_release.exception.JobStartupTimeout: Cluster did not start.",
    ]
    # An ERROR line followed directly by stack frames, without a Traceback header
    error_trace_short = [
        "[2023-01-01] ERROR: something is wrong",
        ' File "/tmp/something", line 584, in run_release_test',
        " raise pipeline_exception",
        "ray_release.exception.JobStartupTimeout: Cluster did not start.",
    ]
    assert LogAggregator._compute_stack_trace(["haha"] + trace + ["hehe"]) == trace
    assert (
        LogAggregator._compute_stack_trace(["haha"] + error_trace + ["hehe"])
        == error_trace
    )
    assert (
        LogAggregator._compute_stack_trace(["haha"] + error_trace_short + ["hehe"])
        == error_trace_short
    )
    assert (
        LogAggregator._compute_stack_trace(
            ["haha"] + trace + ["w00t"] + error_trace + ["hehe"]
        )
        == error_trace
    )
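To illustrate why the signature strips digits and directory names, here is a hedged sketch that groups several runs by signature with `collections.Counter`. `signature` is a hypothetical standalone copy of the normalization in `_compute_signature`, and the sample traces are made up for illustration; nothing here reflects how the PR actually stores the patterns.

```python
import re
from collections import Counter
from typing import List


def signature(trace: List[str]) -> str:
    """Hypothetical standalone copy of the normalization shown in the diff."""
    parts = []
    for line in trace:
        # Drop digits so line numbers, dates and run ids do not matter.
        line = re.sub(r"\d", "", line.strip())
        if line == "Traceback (most recent call last):":
            continue
        frame = re.search(r'File "(.*)", (.*)', line)
        if frame:
            # Keep only the file's base name plus the caller information.
            line = frame.group(1).split("/")[-1] + frame.group(2)
        parts.append(line)
    # Same truncation length as compute_crash_pattern in the diff.
    return "".join(parts)[:4000]


# Made-up traces: the first two differ only in temp directory, line number
# and a numeric detail in the message; the third raises a different error.
runs = [
    [
        "Traceback (most recent call last):",
        ' File "/tmp/run_1/job.py", line 10, in main',
        "ValueError: bad shard 3",
    ],
    [
        "Traceback (most recent call last):",
        ' File "/tmp/run_22/job.py", line 98, in main',
        "ValueError: bad shard 7",
    ],
    [
        "Traceback (most recent call last):",
        ' File "/tmp/run_1/job.py", line 10, in main',
        "KeyError: 'node'",
    ],
]

counts = Counter(signature(run) for run in runs)
```

The first two runs collapse into one pattern, so `counts` has two keys, one of them with a count of 2.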
Needs to be this long to capture the stack trace