[CI][Green-Ray][4] Compute and store unique crash pattern from logs #34200

Merged
merged 104 commits into from
Apr 21, 2023
104 commits
d5b1257
Re-add glue logic for JobManager
can-anyscale Mar 30, 2023
cc6db98
[CI] clean up things
can-anyscale Mar 30, 2023
768466b
Add back job run type. They are still used in CLI so not deprecate th…
can-anyscale Mar 31, 2023
57f499a
[CI] clean up things
can-anyscale Mar 30, 2023
12feefe
Rebase
can-anyscale Apr 5, 2023
1ccc73e
[CI] clean up things
can-anyscale Mar 30, 2023
0e9f29c
Add functions that are free from exceptions
can-anyscale Apr 3, 2023
708b80f
Undo changes to fetch_results
can-anyscale Apr 10, 2023
6baff80
Auto-retry for infrastructure errors
can-anyscale Apr 4, 2023
9103124
Exit buildkite job using buildkite return code
can-anyscale Apr 4, 2023
88317be
Handle everything through result exceptions
can-anyscale Apr 4, 2023
62b530e
Throw and retry on purpose
can-anyscale Apr 4, 2023
4742040
Fix things
can-anyscale Apr 4, 2023
dbe354c
Need to use value of enum
can-anyscale Apr 4, 2023
f2a69ef
out of testing mode
can-anyscale Apr 5, 2023
fea4e57
Name consistency
can-anyscale Apr 5, 2023
92a82c0
Fix lints
can-anyscale Apr 5, 2023
88ef3ef
Fix unit tests
can-anyscale Apr 5, 2023
a9d5638
Raise an error for testing
can-anyscale Apr 6, 2023
41b770c
Undo debugging code
can-anyscale Apr 6, 2023
f900f10
Move retry logic to sh file
can-anyscale Apr 13, 2023
5b1b4e2
Rebase
can-anyscale Apr 13, 2023
5563637
for testing
can-anyscale Apr 13, 2023
a472538
More refactoring
can-anyscale Apr 13, 2023
5d4cc62
Undo more changes
can-anyscale Apr 13, 2023
e0b1e9e
For testing
can-anyscale Apr 13, 2023
5160e02
Remove debugging info
can-anyscale Apr 13, 2023
a95c675
Fix tests
can-anyscale Apr 14, 2023
e0249d2
Exit buildkite job using buildkite return code
can-anyscale Apr 4, 2023
229f9f1
Fix lints
can-anyscale Apr 5, 2023
2952e2d
Rebase
can-anyscale Apr 8, 2023
a764f76
Rebase
can-anyscale Apr 10, 2023
e5b433f
Rebase
can-anyscale Apr 13, 2023
dec0ff6
debugging
can-anyscale Apr 13, 2023
a753fcf
fix sh
can-anyscale Apr 13, 2023
fdef99f
Fix sh again
can-anyscale Apr 13, 2023
bb5a57d
Remove debugging information
can-anyscale Apr 13, 2023
03f0848
Rebase
can-anyscale Apr 17, 2023
72fc32a
Rebase
can-anyscale Apr 17, 2023
eda6a88
Auto-retry for infrastructure errors
can-anyscale Apr 4, 2023
2568b02
Exit buildkite job using buildkite return code
can-anyscale Apr 4, 2023
12ce609
Handle everything through result exceptions
can-anyscale Apr 4, 2023
e07f3e0
Throw and retry on purpose
can-anyscale Apr 4, 2023
ee21c0b
Fix things
can-anyscale Apr 4, 2023
86bffd9
Name consistency
can-anyscale Apr 5, 2023
0b2ab0f
Fix lints
can-anyscale Apr 5, 2023
3320f85
Fix unit tests
can-anyscale Apr 5, 2023
972b885
Move retry logic to sh file
can-anyscale Apr 13, 2023
c316ee4
Rebase
can-anyscale Apr 13, 2023
55676c5
More refactoring
can-anyscale Apr 13, 2023
f322ae3
Undo more changes
can-anyscale Apr 13, 2023
afd9967
Fix tests
can-anyscale Apr 14, 2023
3c9a156
Exit buildkite job using buildkite return code
can-anyscale Apr 4, 2023
32d7f5b
Rebase
can-anyscale Apr 10, 2023
ed6542e
Only failed-fast job can have transient error
can-anyscale Apr 5, 2023
7384a79
Set ray log to stderr
can-anyscale Apr 8, 2023
0df9263
Get ray logs
can-anyscale Apr 8, 2023
630132b
job response
can-anyscale Apr 8, 2023
6311d28
correct update last job result
can-anyscale Apr 9, 2023
8b70ae7
Fix get log group
can-anyscale Apr 9, 2023
1625377
Best attempt to get ray error logs on infra failures
can-anyscale Apr 9, 2023
6a91523
Rebase
can-anyscale Apr 10, 2023
7237ef6
Undo changes to job_manager
can-anyscale Apr 10, 2023
d101138
Use api to download rather than stream ray log files
can-anyscale Apr 10, 2023
0a69db6
Fix lints
can-anyscale Apr 10, 2023
957fde4
Rebase
can-anyscale Apr 17, 2023
04f84b6
Rebase
can-anyscale Apr 17, 2023
b940df3
Rebase
can-anyscale Apr 18, 2023
66eaa38
Rebase
can-anyscale Apr 18, 2023
3eac9ad
Rebase
can-anyscale Apr 18, 2023
443b9f7
Auto-retry for infrastructure errors
can-anyscale Apr 4, 2023
5d04eef
Fix unit tests
can-anyscale Apr 5, 2023
2aba338
Undo more changes
can-anyscale Apr 13, 2023
22dee5f
Fix lints
can-anyscale Apr 5, 2023
64a1c9e
Rebase
can-anyscale Apr 13, 2023
ef09f03
fix sh
can-anyscale Apr 13, 2023
d88b478
Fix sh again
can-anyscale Apr 13, 2023
e234c1d
Exit buildkite job using buildkite return code
can-anyscale Apr 4, 2023
2b478c5
Name consistency
can-anyscale Apr 5, 2023
ee54250
Move retry logic to sh file
can-anyscale Apr 13, 2023
70954cc
More refactoring
can-anyscale Apr 13, 2023
c07d123
Undo more changes
can-anyscale Apr 13, 2023
bdbfb31
Exit buildkite job using buildkite return code
can-anyscale Apr 4, 2023
d4904dc
Log aggregation
can-anyscale Apr 9, 2023
927b90d
Compute unique crash pattern and store to Databricks
can-anyscale Apr 9, 2023
f06b8e5
Rebase
can-anyscale Apr 9, 2023
c347538
Test
can-anyscale Apr 9, 2023
2516036
Rebase
can-anyscale Apr 10, 2023
a2a08eb
Remove debugging info
can-anyscale Apr 10, 2023
ccd1530
Lint
can-anyscale Apr 10, 2023
dc18cc3
@aslonnie's comments
can-anyscale Apr 10, 2023
317705f
Rebase
can-anyscale Apr 16, 2023
bed4b1d
Rebase
can-anyscale Apr 17, 2023
2e4d5b9
Add comments for why we need to look across many ray logs for error p…
can-anyscale Apr 17, 2023
8f95d5c
Rebase
can-anyscale Apr 18, 2023
f5e2a5a
Rebase
can-anyscale Apr 18, 2023
c11e4f6
Rebase
can-anyscale Apr 18, 2023
e7076ea
@aslonnie's comments
can-anyscale Apr 18, 2023
f5043cc
Rebase
can-anyscale Apr 18, 2023
179d48b
Fix lints
can-anyscale Apr 19, 2023
485a199
Simply check that output is none
can-anyscale Apr 19, 2023
1c14992
@krfricke's comments
can-anyscale Apr 21, 2023
25bda89
Add new files
can-anyscale Apr 21, 2023
bdc8246
Fix lints
can-anyscale Apr 21, 2023
2 changes: 1 addition & 1 deletion release/ray_release/anyscale_util.py
@@ -7,7 +7,7 @@
 from anyscale.sdk.anyscale_client.sdk import AnyscaleSDK


-LAST_LOGS_LENGTH = 10
+LAST_LOGS_LENGTH = 30
Collaborator Author

This needs to be this long to capture the stack trace.

can-anyscale marked this conversation as resolved.


def find_cloud_by_name(
5 changes: 4 additions & 1 deletion release/ray_release/job_manager/anyscale_job_manager.py
@@ -332,7 +332,10 @@ def _get_logs():
 )
 print("", flush=True)
 output = buf.getvalue().strip()
-if "### Starting ###" not in output:
+# Many Ray components have their own separate logs (e.g. dashboard,
+# gcs_server, etc.), so the interesting errors are not always in the
+# job logs. If the job has no logs, check other Ray logs for error patterns.
+if not output:
     output = self._get_ray_error_logs()
 assert output, "No logs fetched"
 return "\n".join(output.splitlines()[-LAST_LOGS_LENGTH * 3 :])
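For readers skimming this hunk, the fallback and truncation can be sketched in isolation. This is a minimal standalone sketch, not the job manager's code: `last_logs` and its `fetch_ray_error_logs` parameter are hypothetical names, the latter standing in for `self._get_ray_error_logs()`.

```python
from typing import Callable

LAST_LOGS_LENGTH = 30  # value set in anyscale_util.py by this PR


def last_logs(job_output: str, fetch_ray_error_logs: Callable[[], str]) -> str:
    """Return the tail of the job logs, falling back to other Ray logs.

    `fetch_ray_error_logs` is a hypothetical stand-in for
    `self._get_ray_error_logs()`; it is only called when the job
    produced no output at all.
    """
    output = job_output.strip()
    if not output:
        output = fetch_ray_error_logs()
    assert output, "No logs fetched"
    # Keep only the last LAST_LOGS_LENGTH * 3 lines of whatever was fetched.
    return "\n".join(output.splitlines()[-LAST_LOGS_LENGTH * 3 :])
```

With `LAST_LOGS_LENGTH = 30`, at most the last 90 lines are kept, which is the window the crash pattern is later computed from.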
100 changes: 100 additions & 0 deletions release/ray_release/log_aggregator.py
@@ -0,0 +1,100 @@
import re
from typing import List

TRACEBACK_PATTERN = "Traceback (most recent call last)"


class LogAggregator:
    def __init__(self, log: str):
        self.log = log
Comment on lines +7 to +9
Contributor

Just a nit, but it doesn't look like we actually need a class here (we could just have functions instead) - but fine with me


    def compute_crash_pattern(self) -> str:
        stack_trace = LogAggregator._compute_stack_trace(self.log.splitlines())
        # truncate short enough to store in databases, but long enough to keep the
        # pattern unique
        return LogAggregator._compute_signature(stack_trace)[:4000]

    @staticmethod
    def _compute_signature(stack_trace: List[str]) -> str:
        """
        Compute signature pattern from stack trace, by removing factors such as date,
        time, temp directory, line numbers, etc. This helps to aggregate similar logs
        into the same bug patterns
        """
        massaged_trace = []
        for line in stack_trace:
            line = re.sub(r"\d", "", line.strip())
            if line == "Traceback (most recent call last):":
                continue
            file_line = re.search(r'File "(.*)", (.*)', line)
            if file_line:
                # append the file's base name and caller information; the result string
                # is not something meaningful to humans, we just need something that
                # uniquely represents the stack trace
                line = f'{file_line.group(1).split("/")[-1]}{file_line.group(2)}'
            massaged_trace.append(line)
        return "".join(massaged_trace)

    @staticmethod
    def _compute_stack_trace(logs: List[str]) -> List[str]:
        """
        Extract stack trace pattern from the logs. Stack trace pattern often matches
        the following:
        ERROR ...
        Traceback (most recent call last):
            File "...", line ..., in ...
                ...
        Exception: exception error
        """
        error_stacktrace = []
        stacktrace = []
        i = 0
        while i < len(logs):
            stack = []
            trace = error_stacktrace
            # Search for lines that are either
            #   ... ERROR ...
            # or
            #   ... ERROR ...
            #   Traceback (most recent call last):
            if "ERROR" in logs[i]:
                stack.append(logs[i])
                next = i + 1
                if i + 1 < len(logs) and TRACEBACK_PATTERN in logs[i + 1]:
                    stack.append(logs[i + 1])
                    next = i + 2
            # Or if the line with ERROR does not exist, just search for the line with
            # Traceback (most recent call last):
            elif TRACEBACK_PATTERN in logs[i]:
                stack.append(logs[i])
                trace = stacktrace
                next = i + 1
            # Or else, skip this line and continue
            else:
                i = i + 1
                continue
            # If the line that contains ERROR, Traceback, etc. is found, scan the logs
            # until the line no longer has indentation. This is because stack trace
            # is always indented, and stops when the line is no longer indented
            while next < len(logs):
                if logs[next].startswith((" ", "\t")):
                    stack.append(logs[next])
                    next = next + 1
                else:
                    break
            # Finished capturing the entire stack trace
            if next < len(logs):
                stack.append(logs[next])
            if stack:
                trace.append(stack)
            i = next + 1

        # Favor stack trace that contains the ERROR keyword
        if error_stacktrace:
            return error_stacktrace[-1]

        # Otherwise any stack trace is fine
        if stacktrace:
            return stacktrace[-1]

        return []
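To make the effect of the signature step concrete, here is a minimal function-style sketch of the same normalization (in the spirit of the review nit about not needing a class). The name `compute_signature` and the two sample traces are illustrative assumptions, not part of this PR; the regular expressions are the ones used in `_compute_signature`.

```python
import re
from typing import List


def compute_signature(stack_trace: List[str]) -> str:
    """Function-style sketch of LogAggregator._compute_signature."""
    parts = []
    for line in stack_trace:
        # Drop digits so line numbers, dates and ids do not affect the result.
        line = re.sub(r"\d", "", line.strip())
        if line == "Traceback (most recent call last):":
            continue
        match = re.search(r'File "(.*)", (.*)', line)
        if match:
            # Keep only the file's base name plus the caller information.
            line = f'{match.group(1).split("/")[-1]}{match.group(2)}'
        parts.append(line)
    return "".join(parts)


# Two made-up traces that differ only in temp directory, line number and value.
run_a = [
    "Traceback (most recent call last):",
    '  File "/tmp/run_1/job.py", line 12, in main',
    "ValueError: bad value 7",
]
run_b = [
    "Traceback (most recent call last):",
    '  File "/tmp/run_2/job.py", line 98, in main',
    "ValueError: bad value 42",
]
assert compute_signature(run_a) == compute_signature(run_b)
```

Both traces collapse to the same string, which is what lets results from different runs be grouped under one crash pattern.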
4 changes: 4 additions & 0 deletions release/ray_release/reporter/db.py
@@ -7,6 +7,7 @@
 from ray_release.result import Result
 from ray_release.config import Test
 from ray_release.logger import logger
+from ray_release.log_aggregator import LogAggregator


 class DBReporter(Reporter):
@@ -40,6 +41,9 @@ def report_result(self, test: Test, result: Result):
             "return_code": result.return_code,
             "smoke_test": result.smoke_test,
             "extra_tags": result.extra_tags or {},
+            "crash_pattern": LogAggregator(
+                result.last_logs or ""
+            ).compute_crash_pattern(),
         }

         logger.debug(f"Result json: {json.dumps(result_json)}")
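Once `crash_pattern` is stored with each result, similar failures can be grouped downstream. The following is only a sketch of that idea under stated assumptions: the rows, test names, and the helper `count_crash_patterns` are hypothetical and not part of this PR or of the reporter.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical rows shaped like the result JSON above; names and patterns
# are made up for illustration.
ROWS = [
    {"name": "test_a", "crash_pattern": "job.pyline , in mainValueError: bad value "},
    {"name": "test_b", "crash_pattern": "job.pyline , in mainValueError: bad value "},
    {"name": "test_c", "crash_pattern": "worker.pyline , in runKeyError: "},
    {"name": "test_d", "crash_pattern": ""},  # no stack trace found in the logs
]


def count_crash_patterns(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count how many results share each non-empty crash pattern."""
    return Counter(row["crash_pattern"] for row in rows if row["crash_pattern"])


pattern_counts = count_crash_patterns(ROWS)
for pattern, count in pattern_counts.most_common():
    print(f"{count} x {pattern!r}")
```

Results without a detectable stack trace get an empty pattern, so the sketch skips them rather than counting them as one shared failure.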
68 changes: 68 additions & 0 deletions release/ray_release/tests/test_log_aggregator.py
@@ -0,0 +1,68 @@
from ray_release.log_aggregator import LogAggregator


def test_compute_stack_pattern():
    assert (
        LogAggregator(
            "\n".join(
                [
                    "haha",
                    "Traceback (most recent call last):",
                    ' File "/tmp/something", line 584',
                    "Exception: yaya45",
                    "hehe",
                ]
            )
        ).compute_crash_pattern()
        == "somethingline Exception: yaya"
    )


def test_compute_signature():
    assert (
        LogAggregator._compute_signature(
            [
                "Traceback (most recent call last):",
                ' File "/tmp/something", line 584',
                "Exception: yaya45",
            ]
        )
        == "somethingline Exception: yaya"
    )


def test_compute_stack_trace():
    trace = [
        "Traceback (most recent call last):",
        ' File "/tmp/something", line 584, in run_release_test',
        " raise pipeline_exception",
        "ray_release.exception.JobNoLogsError: Could not obtain logs for the job.",
    ]
    error_trace = [
        "[2023-01-01] ERROR: something is wrong",
        "Traceback (most recent call last):",
        ' File "/tmp/something", line 584, in run_release_test',
        " raise pipeline_exception",
        "ray_release.exception.JobStartupTimeout: Cluster did not start.",
    ]
    error_trace_short = [
        "[2023-01-01] ERROR: something is wrong",
        ' File "/tmp/something", line 584, in run_release_test',
        " raise pipeline_exception",
        "ray_release.exception.JobStartupTimeout: Cluster did not start.",
    ]
    assert LogAggregator._compute_stack_trace(["haha"] + trace + ["hehe"]) == trace
    assert (
        LogAggregator._compute_stack_trace(["haha"] + error_trace + ["hehe"])
        == error_trace
    )
    assert (
        LogAggregator._compute_stack_trace(["haha"] + error_trace_short + ["hehe"])
        == error_trace_short
    )
    assert (
        LogAggregator._compute_stack_trace(
            ["haha"] + trace + ["w00t"] + error_trace + ["hehe"]
        )
        == error_trace
    )