Long running Lambda e2e tests are failing #249

Open
tillrohrmann opened this issue Jan 15, 2024 · 7 comments

Comments

@tillrohrmann
Contributor

tillrohrmann commented Jan 15, 2024

Currently, our long-running Lambda e2e tests are failing according to our dev-alerts channel. After failing, they seem to recover after some time (and retries). @jackkleeman mentioned that pods are sometimes created without an internet connection, and that this is why the tests fail and then recover on a retry.

I think it would be great to solve this problem because it creates false positives and makes people take failing long-running tests less seriously than they should.

@jackkleeman
Contributor

Actually, I don't think the tests are failing. There are just pods restarting, which is a separate alert.

@tillrohrmann
Contributor Author

@jackkleeman I am seeing these e2e tests failing repeatedly. Any new ideas on how to fix the problem?

@jackkleeman
Contributor

@tillrohrmann the tests are not failing! It is just a restarting-pods alert. It's an alert I added a few days back following your suggestion about detecting panics in Restate. But the tests succeed despite the restarts. I didn't have the bandwidth for this this week, but I can remove the alert if you like. Most likely the whole infrastructure is about to change, and it's not worth investigating the network partitions that are causing these alerts.

@tillrohrmann
Contributor Author

So because of the missing internet connection, the binary is panicking, and on restart it usually gets resolved? Or is there an easy way to distinguish between the "no internet connection" case and a "Restate panic"? Maybe a pragmatic solution would be to disable these alerts for the Lambda tests, where the internet connection problem can arise.

@jackkleeman
Contributor

It appears that the binary eventually exits, yes, and after some restarts it is resolved. And no, without parsing logs it is not possible to figure out what caused a Restate binary to restart.
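
For illustration only, here is a minimal sketch of what such log parsing could look like. Everything in it is an assumption and not taken from the actual setup: the function name `classify_restart` is hypothetical, the network error strings are guesses at what a connectivity failure might log, and the only grounded pattern is `panicked at`, which is the prefix Rust's default panic handler prints.

```python
import re

# Prefix printed by Rust's default panic handler.
PANIC_PATTERN = re.compile(r"panicked at")

# ASSUMPTION: example strings a connectivity failure might produce.
# The real log lines would need to be checked against actual pod logs.
NETWORK_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"dns error",
        r"network is unreachable",
        r"connection refused",
        r"connection timed out",
    )
]


def classify_restart(log_text: str) -> str:
    """Return 'network', 'panic', or 'unknown' for one container's log text."""
    # Network indicators take precedence, because a missing connection
    # may itself lead to a panic or exit later in the same log.
    if any(pattern.search(log_text) for pattern in NETWORK_PATTERNS):
        return "network"
    if PANIC_PATTERN.search(log_text):
        return "panic"
    return "unknown"
```

The log text of the restarted container could be fetched with something like `kubectl logs --previous <pod>`, and an alert would then only need to fire when the result is `panic`.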

@slinkydeveloper
Collaborator

Is this still relevant?

@tillrohrmann
Contributor Author

We haven't improved the situation yet, so I would say it is still relevant.


3 participants