Fix build and C++ tests for FreeBSD #10480

Merged
merged 19 commits into from
Jun 28, 2024

Conversation

hcho3
Collaborator

@hcho3 hcho3 commented Jun 24, 2024

Closes #10468
Closes #10467
Closes #10466

cc @yurivict

@hcho3
Collaborator Author

hcho3 commented Jun 24, 2024

@trivialfis Many C++ tests are failing on FreeBSD. It appears that all tracker C++ tests have a race condition where the tracker is stopped before workers can send a shutdown signal.

$ ./build/testxgboost --gtest_catch_exceptions=0 --gtest_filter=AllgatherTest.Basic
Note: Google Test filter = AllgatherTest.Basic
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AllgatherTest
[ RUN      ] AllgatherTest.Basic
[01:25:58] INFO: /home/phcho/Desktop/xgboost/tests/cpp/collective/test_worker.h:119: Using 2 workers for test.
[01:25:58] RabitTracker::Stop()
[01:25:58] RabitTracker::Stop(): Shutdown()
[01:25:58] Task t:0 got rank 0
[01:25:58] Task t:1 got rank 1
[01:25:58] RabitComm::Shutdown(): ConnectTrackerImpl()
[01:25:58] RabitComm::Shutdown(): ConnectTrackerImpl()
[01:25:58] WARNING: /home/phcho/Desktop/xgboost/src/collective/socket.cc:134: socket.cc(153): Failed to connect to:10.0.2.15:39699 Error:
- [socket.cc:152|01:25:58]: connect failed. system error:Connection refused
[01:25:58] WARNING: /home/phcho/Desktop/xgboost/src/collective/socket.cc:134: socket.cc(177): Failed to connect to:10.0.2.15:39699 Error:
- [socket.h:357|01:25:58]: Socket error. system error:Connection refused
[01:25:58] /home/phcho/Desktop/xgboost/src/collective/result.cc:78: 
- [comm.cc:40|01:25:58]: Failed to connect to the tracker.
- [socket.cc:187|01:25:58]: Failed to connect to 10.0.2.15:39699
- [socket.h:357|01:25:58]: Socket error. Connection refused
[01:25:58] /home/phcho/Desktop/xgboost/src/collective/result.cc:78: 
- [comm.cc:40|01:25:58]: Failed to connect to the tracker.
- [socket.cc:187|01:25:58]: Failed to connect to 10.0.2.15:39699
- [socket.cc:152|01:25:58]: connect failed. Connection refused
Stack trace:
  [bt] (0) 0x705d38 <dmlc::LogMessageFatal::~LogMessageFatal()+0x48> at /home/phcho/Desktop/xgboost/build/testxgboost
  [bt] (1) 0x84a43e <xgboost::collective::SafeColl(xgboost::collective::Result const&)+0x7e> at /home/phcho/Desktop/xgboost/build/testxgboost
  [bt] (2) 0xe26059 <xgboost::collective::WorkerForTest::~WorkerForTest()+0x39> at /home/phcho/Desktop/xgboost/build/testxgboost
  [bt] (3) 0xe1dd15 <xgboost::collective::(anonymous namespace)::Worker::~Worker()+0x15> at /home/phcho/Desktop/xgboost/build/testxgboost
  [bt] (4) 0xe1d400 <xgboost::collective::AllgatherTest_Basic_Test::TestBody()::$_0::operator()(std::__1::basic_string<char, $_0::char_traits<char>, $_0::allocator<char> >, int, $_0::chrono::duration<long long, $_0::ratio<1, 1> >, int) const+0xa0> at /home/phcho/Desktop/xgboost/build/testxgboost
  [bt] (5) 0xe1d325 <_ZZN7xgboost10collective15TestDistributedIZNS0_24AllgatherTest_Basic_Test8TestBodyEvE3$_0EEviT_ENKUlvE_clEv+0x45> at /home/phcho/Desktop/xgboost/build/testxgboost
  [bt] (6) 0xe1d2b5 <_ZNSt3__18__invokeB8se180100IZN7xgboost10collective15TestDistributedIZNS2_24AllgatherTest_Basic_Test8TestBodyEvE3$_0EEviT_EUlvE_JEEEDTclclsr3stdE7declvalIS6_EEspclsr3stdE7declvalIT0_EEEEOS6_DpOS8_+0x15> at /home/phcho/Desktop/xgboost/build/testxgboost
  [bt] (7) 0xe1d28d <_ZNSt3__116__thread_executeB8se180100INS_10unique_ptrINS_15__thread_structENS_14default_deleteIS2_EEEEZN7xgboost10collective15TestDistributedIZNS7_24AllgatherTest_Basic_Test8TestBodyEvE3$_0EEviT_EUlvE_JETpTnmJEEEvRNS_5tupleIJSB_T0_DpT1_EEENS_15__tuple_indicesIJXspT2_EEEE+0x1d> at /home/phcho/Desktop/xgboost/build/testxgboost

Abort trap (core dumped)

To get this error, I built XGBoost on FreeBSD 14.1 with the following patch:

diff --git a/src/collective/comm.cc b/src/collective/comm.cc
index 543ece639..6c257a703 100644
--- a/src/collective/comm.cc
+++ b/src/collective/comm.cc
@@ -373,6 +373,7 @@ RabitComm::~RabitComm() noexcept(false) {
   TCPSocket err_client;
 
   return Success() << [&] {
+    LOG(CONSOLE) << "RabitComm::Shutdown(): ConnectTrackerImpl()";
     return ConnectTrackerImpl(tracker_, timeout_, retry_, task_id_, &tracker, Rank(), World());
   } << [&] {
     return this->Block();
diff --git a/src/collective/tracker.cc b/src/collective/tracker.cc
index 9441ab449..822a283ab 100644
--- a/src/collective/tracker.cc
+++ b/src/collective/tracker.cc
@@ -355,6 +355,7 @@ Result RabitTracker::Bootstrap(std::vector<WorkerProxy>* p_workers) {
 }
 
 [[nodiscard]] Result RabitTracker::Stop() {
+  LOG(CONSOLE) << "RabitTracker::Stop()";
   if (!this->Ready()) {
     return Success();
   }
@@ -366,6 +367,7 @@ Result RabitTracker::Bootstrap(std::vector<WorkerProxy>* p_workers) {
   }
 
   return Success() << [&] {
+    LOG(CONSOLE) << "RabitTracker::Stop(): Shutdown()";
     // This should have the effect of stopping the `accept` call.
     return this->listener_.Shutdown();
   } << [&] {
diff --git a/tests/cpp/collective/test_allgather.cc b/tests/cpp/collective/test_allgather.cc
index 7764a2adc..71bf8203d 100644
--- a/tests/cpp/collective/test_allgather.cc
+++ b/tests/cpp/collective/test_allgather.cc
@@ -141,7 +141,7 @@ class Worker : public WorkerForTest {
 }  // namespace
 
 TEST_F(AllgatherTest, Basic) {
-  std::int32_t n_workers = std::min(7u, std::thread::hardware_concurrency());
+  std::int32_t n_workers = 2;
   TestDistributed(n_workers, [=](std::string host, std::int32_t port, std::chrono::seconds timeout,
                                  std::int32_t r) {
     Worker worker{host, port, timeout, n_workers, r};

Shall we disable all tests using the tracker until we can figure out how to fix the race condition in the test harness?

@trivialfis
Member

I will try to debug it. It should not be difficult once I can reproduce it.

@hcho3
Collaborator Author

hcho3 commented Jun 24, 2024

@trivialfis Let me know if you need help with setting up a FreeBSD VM.

@trivialfis
Member

These issues are not new; it's just that we started to have C++ tests for networking in this release. I will dive deeper.

@trivialfis
Member

The networking issue should be fixed now. It was an invalid-argument error in the tracker: FreeBSD requires the timeout to be exactly -1 for an infinite timeout, while other platforms accept any negative value.
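For readers who want to see the shape of such a fix, here is a minimal sketch. It assumes the timeout in question is the one handed to poll(2), which the comment above does not state explicitly. The helper names `NormalizePollTimeoutMs` and `PollReadable` are hypothetical and are not taken from the XGBoost sources. The idea is to collapse every negative ("wait forever") value to exactly -1 before the system call, so the same code behaves identically on FreeBSD and on platforms that tolerate other negative values.

```cpp
#include <poll.h>

#include <chrono>
#include <limits>

// Hypothetical helper, not part of XGBoost: convert a timeout to the
// millisecond argument expected by poll(2). Any negative duration is treated
// as "infinite" and mapped to exactly -1, the only negative value that is
// portable across platforms.
inline int NormalizePollTimeoutMs(std::chrono::milliseconds timeout) {
  if (timeout.count() < 0) {
    return -1;
  }
  // Clamp large values so the conversion to int cannot overflow.
  constexpr int kMax = std::numeric_limits<int>::max();
  if (timeout.count() > kMax) {
    return kMax;
  }
  return static_cast<int>(timeout.count());
}

// Hypothetical wrapper: wait until `fd` is readable or the timeout expires.
// Returns the raw poll(2) result (>0 ready, 0 timed out, -1 error).
inline int PollReadable(int fd, std::chrono::milliseconds timeout) {
  pollfd pfd{};
  pfd.fd = fd;
  pfd.events = POLLIN;
  return ::poll(&pfd, 1, NormalizePollTimeoutMs(timeout));
}
```

Normalizing at the call boundary keeps the rest of the code free to use any negative sentinel for "no timeout" while only one place has to know about the platform difference.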

@hcho3 hcho3 marked this pull request as ready for review June 27, 2024 08:31
@hcho3 hcho3 changed the title from "[WIP] Fix build for FreeBSD" to "Fix build and C++ tests for FreeBSD" Jun 27, 2024
@trivialfis
Member

I'm still running into a flaky hang for column sampler. Debugging.

@trivialfis
Member

It's probably just my virtual machine (qemu). The OS seems overwhelmed by the number of threads in some tests. There's also a flaky test for column split, probably due to timeout; I haven't gotten to the bottom of it yet. If it fails on the CI, I will disable it for now and get back to it when I sort out some other priorities.

@hcho3
Collaborator Author

hcho3 commented Jun 28, 2024

@trivialfis Can we merge this for now? The CI is green.

@hcho3 hcho3 merged commit 09d32f1 into dmlc:master Jun 28, 2024
30 checks passed
@hcho3 hcho3 deleted the fix_freebsd branch June 28, 2024 08:48
hcho3 added a commit to hcho3/xgboost that referenced this pull request Jun 29, 2024
@trivialfis trivialfis mentioned this pull request Jul 12, 2024
15 tasks