Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI][Green-Ray][4] Compute and store unique crash pattern from logs #34200

Merged
merged 104 commits into from
Apr 21, 2023

Conversation

can-anyscale
Copy link
Collaborator

@can-anyscale can-anyscale commented Apr 9, 2023

Why are these changes needed?

This PR computes and aggregate unique crash patterns from logs, then store them in Databricks. Later on, this will help us build a dashboard for heat map of errors from aggregated logs, help us prioritize the most impactful errors to fix.

Checks

@can-anyscale can-anyscale force-pushed the can-ci-errors-04 branch 6 times, most recently from adf46b3 to 532ddfd Compare April 10, 2023 16:48
@can-anyscale can-anyscale changed the title Can ci errors 04 [CI][Green-Ray][4] Compute and store unique crash pattern from logs Apr 10, 2023
@can-anyscale can-anyscale changed the base branch from master to can-ci-errors-03 April 10, 2023 17:06
@can-anyscale can-anyscale marked this pull request as ready for review April 10, 2023 17:07
@can-anyscale can-anyscale requested a review from a team as a April 10, 2023 17:07
Copy link
Collaborator

@aslonnie aslonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it be easier to get this info if we ask the test to print/log this in a more structured form, rather than trying to parse the log?

release/ray_release/tests/test_db_reporter.py Outdated Show resolved Hide resolved
@can-anyscale
Copy link
Collaborator Author

This logic mainly attempts to capture logs from ray logs (https://docs.ray.io/en/latest/ray-observability/ray-logging.html#logging-directory-structure). 100% agree there are observability improvement/loves to do more in these logs.

release/ray_release/reporter/db.py Outdated Show resolved Hide resolved
release/ray_release/reporter/db.py Outdated Show resolved Hide resolved
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, but the stack trace parsing does not seem to be related to the DB reporter. Let's move it into a separate Python module!

Comment on lines 20 to 22
def compute_crash_pattern(self, logs: str) -> str:
stack_trace = self._compute_stack_trace(logs.splitlines())
return self._compute_signature(stack_trace)[:4000] # limit of databrick field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really have anything to do with the Databricks reporter. Can we move the crash pattern parsing into a separate python file?

release/ray_release/tests/test_db_reporter.py Outdated Show resolved Hide resolved
@krfricke krfricke added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 21, 2023
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
@can-anyscale can-anyscale removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 21, 2023
Signed-off-by: Cuong Nguyen <[email protected]>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Ping me for merge when CI passes

Comment on lines +7 to +9
class LogAggregator:
def __init__(self, log: str):
self.log = log
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit, but it doesn't look like we actually need a class here (we could just have functions instead) - but fine with me

@krfricke krfricke merged commit f6963dc into master Apr 21, 2023
@krfricke krfricke deleted the can-ci-errors-04 branch April 21, 2023 16:59
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…ay-project#34200)

This PR computes and aggregate unique crash patterns from logs, then store them in Databricks. Later on, this will help us build a dashboard for heat map of errors from aggregated logs, help us prioritize the most impactful errors to fix.

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: elliottower <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…ay-project#34200)

This PR computes and aggregate unique crash patterns from logs, then store them in Databricks. Later on, this will help us build a dashboard for heat map of errors from aggregated logs, help us prioritize the most impactful errors to fix.

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Jack He <[email protected]>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…ay-project#34200)

This PR computes and aggregate unique crash patterns from logs, then store them in Databricks. Later on, this will help us build a dashboard for heat map of errors from aggregated logs, help us prioritize the most impactful errors to fix.

Signed-off-by: Cuong Nguyen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants