Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cx_limit_integration_test flakes #11841

Closed
alyssawilk opened this issue Jul 1, 2020 · 7 comments · Fixed by #12233 or #12298
Closed

cx_limit_integration_test flakes #11841

alyssawilk opened this issue Jul 1, 2020 · 7 comments · Fixed by #12233 or #12298
Assignees

Comments

@alyssawilk
Copy link
Contributor

2020-07-01T13:04:39.4991615Z [ RUN ] IpVersions/ConnectionLimitIntegrationTest.TestBothLimits/IPv4
2020-07-01T13:04:39.4992248Z test/integration/cx_limit_integration_test.cc:74: Failure
2020-07-01T13:04:39.4993196Z Value of: fake_upstreams_[0]->waitForRawConnection(raw_conns.back())
2020-07-01T13:04:39.4994095Z Actual: false (Timed out waiting for raw connection)
2020-07-01T13:04:39.4994618Z Expected: true
2020-07-01T13:04:39.4995531Z Stack trace:
2020-07-01T13:04:39.4996224Z 0x1d3caf5: __llvm_coverage_mapping
2020-07-01T13:04:39.4999246Z 0x1d43958: __llvm_coverage_mapping
2020-07-01T13:04:39.4999853Z 0x59ff1b6: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
2020-07-01T13:04:39.5000495Z 0x59ea751: testing::internal::HandleExceptionsInMethodIfSupported<>()
2020-07-01T13:04:39.5001236Z 0x59d4d82: testing::Test::Run()
2020-07-01T13:04:39.5001716Z 0x59d5888: testing::TestInfo::Run()
2020-07-01T13:04:39.5002321Z ... Google Test internal frames ...
2020-07-01T13:04:39.5002653Z
2020-07-01T13:04:39.5003721Z [ FAILED ] IpVersions/ConnectionLimitIntegrationTest.TestBothLimits/IPv4, where GetParam() = 4-byte object <00-00 00-00> (16148 ms)

@alyssawilk
Copy link
Contributor Author

Fails consistently on windows too. have a PR out for that just to get windows CI green

@alyssawilk
Copy link
Contributor Author

I'm wondering if there's a race, where the client picks up the disconnect before the server decrements the overload counter, so the next connection fails. If so I'm not sure if there's any counters we should explicitly wait for, or if we just try connecting until one gets through? cc @tonya11en

@alyssawilk alyssawilk changed the title cx_limit_integration_test flakes (IpVersions/ConnectionLimitIntegrationTest.TestBothLimits) cx_limit_integration_test flakes Jul 1, 2020
@alyssawilk
Copy link
Contributor Author

Another failure on ASAN

2020-07-01T14:08:07.8198111Z [ RUN ] IpVersions/ConnectionLimitIntegrationTest.TestGlobalLimit/IPv4
2020-07-01T14:08:07.8198489Z test/integration/cx_limit_integration_test.cc:69: Failure
2020-07-01T14:08:07.8198946Z Value of: raw_conns.front()->close()
2020-07-01T14:08:07.8199350Z Actual: false (The connection disconnected unexpectedly, and allow_unexpected_disconnects_ is false.
2020-07-01T14:08:07.8200325Z See https://github.com/envoyproxy/envoy/blob/master/test/integration/README.md#unexpected-disconnects)
2020-07-01T14:08:07.8200710Z Expected: true
2020-07-01T14:08:07.8200956Z Stack trace:
2020-07-01T14:08:07.8201486Z 0x168862fa: Envoy::(anonymous namespace)::ConnectionLimitIntegrationTest::doTest()
2020-07-01T14:08:07.8202167Z 0x168978b2: Envoy::(anonymous namespace)::ConnectionLimitIntegrationTest_TestGlobalLimit_Test::TestBody()
2020-07-01T14:08:07.8202655Z 0x2724666a: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
2020-07-01T14:08:07.8203246Z 0x271f4f36: testing::internal::HandleExceptionsInMethodIfSupported<>()
2020-07-01T14:08:07.8203583Z 0x271a1527: testing::Test::Run()
2020-07-01T14:08:07.8203891Z 0x271a4230: testing::TestInfo::Run()
2020-07-01T14:08:07.8204185Z ... Google Test internal frames ...
2020-07-01T14:08:07.8204383Z
2020-07-01T14:08:07.8205367Z [2020-07-01 14:08:02.001][15][critical][assert] [source/common/network/connection_impl.cc:85] assert failure: !ioHandle().isOpen() && delayed_close_timer_ == nullptr. Details: ConnectionImpl was unexpectedly torn down without being closed.
2020-07-01T14:08:07.8205947Z AddressSanitizer:DEADLYSIGNAL

I think this is the double close so I'll fix this one

alyssawilk added a commit that referenced this issue Jul 1, 2020
Risk Level: n/a
Testing: n/a
Docs Changes: n/a
Release Notes: n/a
part of #11841

Signed-off-by: Alyssa Wilk <[email protected]>
@mattklein123 mattklein123 added the help wanted Needs help! label Jul 1, 2020
@alyssawilk
Copy link
Contributor Author

2020-07-21T00:06:09.0956342Z [ RUN ] IpVersions/ConnectionLimitIntegrationTest.TestBothLimits/IPv6
2020-07-21T00:06:09.0957118Z TestRandomGenerator running with seed -1566620465
2021-05-28T20:16:05.5739433Z $TEST_TMPDIR defined: output root default is '/build/tmp' and max_idle_secs default is '15'.
2020-05-28T20:16:05.5748004Z WARNING: The following rc files are no longer being read, please transfer their contents or import their path into one of the standard rc files:
2020-07-21T00:06:09.0957678Z test/integration/cx_limit_integration_test.cc:73: Failure
2020-07-21T00:06:09.0958504Z Value of: fake_upstreams_[0]->waitForRawConnection(raw_conns.back())
2020-07-21T00:06:09.0959084Z Actual: false (Timed out waiting for raw connection)
2020-07-21T00:06:09.0959544Z Expected: true
2020-07-21T00:06:09.0959913Z Stack trace:
2020-07-21T00:06:09.0960452Z 0x1b8aae4: Envoy::(anonymous namespace)::ConnectionLimitIntegrationTest::doTest()
2020-07-21T00:06:09.0961250Z 0x1b8e8bd: Envoy::(anonymous namespace)::ConnectionLimitIntegrationTest_TestBothLimits_Test::TestBody()
2020-07-21T00:06:09.0962010Z 0x40213b4: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
2020-07-21T00:06:09.0962659Z 0x40126ee: testing::internal::HandleExceptionsInMethodIfSupported<>()
2020-07-21T00:06:09.0963228Z 0x3ffccc3: testing::Test::Run()
2020-07-21T00:06:09.0963699Z 0x3ffd8ad: testing::TestInfo::Run()
2020-07-21T00:06:09.0964272Z ... Google Test internal frames ...
2020-07-21T00:06:09.0964581Z
2020-07-21T00:06:09.0965912Z [2020-07-21 00:05:50.632][16][critical][assert] [source/common/network/connection_impl.cc:85] assert failure: !ioHandle().isOpen() && delayed_close_timer_ == nullptr. Details: ConnectionImpl was unexpectedly torn down without being closed.

@davinci26
Copy link
Member

davinci26 commented Jul 21, 2020

I was looking into this yesterday and I will continue to look into it today.

I have the debug logs from a repro and I have uploaded them here

The repro is from a windows machine but I dont think it should matter

@mattklein123
Copy link
Member

This is still broken. I am looking into this further.

@mattklein123 mattklein123 reopened this Jul 25, 2020
@mattklein123 mattklein123 self-assigned this Jul 25, 2020
mattklein123 added a commit that referenced this issue Jul 26, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a link sink to get removed by one
   thread while getting written to be another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used or death test
stderr workaround for coverage which has been removed because
it no longer logs to a file.

Fixes #11841

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this issue Jul 28, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a log sink to get removed by one
   thread while getting written to by another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has 
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes #11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <[email protected]>
KBaichoo pushed a commit to KBaichoo/envoy that referenced this issue Jul 30, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a log sink to get removed by one
   thread while getting written to by another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes envoyproxy#11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Kevin Baichoo <[email protected]>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this issue Aug 7, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a log sink to get removed by one
   thread while getting written to by another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has 
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes envoyproxy#11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <[email protected]>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this issue Aug 7, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a log sink to get removed by one
   thread while getting written to by another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes envoyproxy#11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: chaoqinli <[email protected]>
@wrowe
Copy link
Contributor

wrowe commented Aug 18, 2020

This appears to be resolved now on Windows with the work on simtime, proposing to reenable the test on windows with #12695

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants