Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41004: [C++][FS][Azure] Don't run TestGetFileInfoGenerator() with Valgrind #41163

Merged
merged 1 commit into from
Apr 12, 2024

Conversation

kou
Copy link
Member

@kou kou commented Apr 12, 2024

Rationale for this change

GetFileInfo() with generator reports false positive memory leak in Azure SDK for C++.

What changes are included in this PR?

Don't run TestGetFileInfoGenerator() with Valgrind.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

… with Valgrind

Because GetFileInfo() with generator reports false positive memory
leak in Azure SDK for C++.
@kou
Copy link
Member Author

kou commented Apr 12, 2024

@github-actions crossbow submit test-conda-cpp-valgrind

Copy link

⚠️ GitHub issue #41004 has been automatically assigned in GitHub to PR creator.

Copy link

Revision: 5f4255f

Submitted crossbow builds: ursacomputing/crossbow @ actions-29b32e4857

Task Status
test-conda-cpp-valgrind GitHub Actions

@kou
Copy link
Member Author

kou commented Apr 12, 2024

+1

@kou kou merged commit 6e8ac43 into apache:main Apr 12, 2024
34 of 35 checks passed
@kou kou deleted the ci-azurefs-valgrind branch April 12, 2024 02:37
@kou kou removed the awaiting committer review Awaiting committer review label Apr 12, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6e8ac43.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 20 possible false positives for unstable benchmarks that are known to sometimes produce them.

@@ -57,6 +57,7 @@
#cmakedefine ARROW_GCS
#cmakedefine ARROW_HDFS
#cmakedefine ARROW_S3
#cmakedefine ARROW_TEST_MEMCHECK
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have ARROW_VALGRIND, I don't understand why we would need to introduce this new constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. I missed it. Can we move ARROW_VALGRIND to config.h.cmake from add_definitions(-DARROW_VALGRIND)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an information that is useful to third-party code, so I don't understand why that would be useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not be useful to third-party code but it helps us to know what macros are defined in a build.
If we use add_definitions(-DARROW_VALGRIND), we need to use ninja -v and find a -DARROW_VALGRIND from a long command line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 12, 2024
@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

@kou Did we try to understand why a memory leak was reported? We shouldn't just silence the errors that annoy us.

cc @felipecrv

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 12, 2024
@kou
Copy link
Member Author

kou commented Apr 12, 2024

@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

Ok... This seems to assume that Valgrind is buggy and doesn't handle a calloc() call correctly, while our Azure FS implementation (and/or the Azure SDK) is spotless and cannot contain any bugs. I'm quite skeptical.

There are couple search results online for this, but very few, so I'm not sure it's really an issue in libxml or in the Azure SDK. Have you tried opening a bug report there? Perhaps they know more about the issue.

Really, deciding that errors are "false positives" without any further analysis does not strike me as a serious way to deal with CI issues.

And even if we decide it's a false positive, we can add a Valgrind suppression for the case of xmlNewGlobalState, which would still help us find other issues:
https://github.com/apache/arrow/blob/main/cpp/valgrind.supp

raulcd pushed a commit that referenced this pull request Apr 12, 2024
…Valgrind (#41163)

### Rationale for this change

`GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++.

### What changes are included in this PR?

Don't run `TestGetFileInfoGenerator()` with Valgrind.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #41004

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@felipecrv
Copy link
Contributor

Ok... This seems to assume that Valgrind is buggy and doesn't handle a calloc() call correctly, while our Azure FS implementation (and/or the Azure SDK) is spotless and cannot contain any bugs

The reason I didn't make progress on the issue was also this sense that this seems to be an actual leak and not a misreport that we can just suppress.

But I suspect it's something in the Azure SDK (exception-safety failure?) because GetFileInfoWithSelectorFromContainer doesn't do complicated memory management. For instance, the lambdas don't even hold shared/strong references to anything.

Status GetFileInfoWithSelectorFromContainer(
const Blobs::BlobContainerClient& container_client, const Core::Context& context,
Azure::Nullable<int32_t> page_size_hint, const FileSelector& select,
FileInfoVector* acc_results) {
ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));

@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

But I suspect it's something in the Azure SDK (exception-safety failure?) because GetFileInfoWithSelectorFromContainer doesn't do complicated memory management.

That may definitely be possible (in which case the issue should be reported upstream nevertheless), but it would be nice to investigate a bit to make sure we're not missing anything on our side.

@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

Note that xmlNewGlobalState seems to be some kind of legacy function ("backwards compatibility
of libxml2 to pre-thread-safe behaviour"):
https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/globals.c#L837-845

We see this in xmlGetThreadLocalStorage: xmlNewGlobalState is only called if the macro USE_TLS is not defined.
https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/globals.c#L872-895

USE_TLS in turn gets conditionally defined here:
https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/globals.c#L99-114

Where does libxml2 come in our case? Is it vendored by the Azure SDK? Something else?

@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

Where does libxml2 come in our case? Is it vendored by the Azure SDK? Something else?

Answering myself: it's version 2.12.6 from conda-forge.

@felipecrv
Copy link
Contributor

And the Azure SDK depends on it:

_deps/azure_sdk-build/sdk/storage/azure-storage-common/azure-storage-common-cppConfig.cmake
35:  find_dependency(LibXml2)

@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

Judging by the disassembly of xmlGetThreadLocalStorage on conda-forge's libxml2, it seems that USE_TLS doesn't get enabled in that package.

@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

Ok, that's because --with-tls doesn't get passed here:
https://github.com/conda-forge/libxml2-feedstock/blob/main/recipe/build.sh

@pitrou
Copy link
Member

pitrou commented Apr 12, 2024

I've reported conda-forge/libxml2-feedstock#117

@kou
Copy link
Member Author

kou commented Apr 13, 2024

I've also investigated this more.

I think that --with-tls isn't the cause of this.

I could reproduce this by the following program that only uses libxml2 and std::thread:

// g++ -fsanitize=address -g3 -O0 -o aaa aaa.cxx $(pkg-config --cflags --libs libxml-2.0) && ./aaa

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

#include <libxml/xmlreader.h>

int main(void) {
  xmlInitParser();
  std::mutex mutex;
  std::condition_variable variable;
  std::thread thread([&] {
    xmlTextReaderPtr reader =
      xmlReaderForMemory("<root/>", 7, nullptr, nullptr, 0);
    xmlFreeTextReader(reader);
    variable.notify_one();
    {
      std::unique_lock<std::mutex> lock(mutex);
      variable.wait(lock);
    }
  });
  {
    std::unique_lock<std::mutex> lock(mutex);
    variable.wait(lock);
  }
  xmlCleanupParser();
  variable.notify_one();
  thread.join();
  return 0;
}

The point of this program is that a thread that uses libxml2 API is finished after xmlCleanupParser() is called.

This is also happen in our test. A thread created by arrow::internal::ThreadPool is finished after a xmlCleanupParser() call that is caused in Azure SDK for C++. ( https://github.com/Azure/azure-sdk-for-cpp/blob/067d6acb3b2d1f82b5ad9d258050bc525941d501/sdk/storage/azure-storage-common/src/xml_wrapper.cpp#L398 )

Note that both of them are happen by a C++ destructor. They are called on process exit. Azure::Storage::_internal::XmlGlobalInitializer::~XmlGlobalInitializer() is called before arrow::internal::ThreadPool::~ThreadPool of arrow::io::default_io_context().

If we shutdown a thread of arrow::internal::ThreadPool explicitly before Azure::Storage::_internal::XmlGlobalInitializer::~XmlGlobalInitializer(), the leak isn't reported. For example:

diff --git a/cpp/src/arrow/filesystem/azurefs_test.cc b/cpp/src/arrow/filesystem/azurefs_test.cc
index ed09bfc2fa..d87e3e0731 100644
--- a/cpp/src/arrow/filesystem/azurefs_test.cc
+++ b/cpp/src/arrow/filesystem/azurefs_test.cc
@@ -58,6 +58,7 @@
 #include "arrow/util/logging.h"
 #include "arrow/util/pcg_random.h"
 #include "arrow/util/string.h"
+#include "arrow/util/thread_pool.h"
 #include "arrow/util/unreachable.h"
 #include "arrow/util/value_parsing.h"
 
@@ -371,6 +372,8 @@ class TestGeneric : public ::testing::Test, public GenericFileSystemTest {
     if (azure_fs_) {
       ASSERT_OK(azure_fs_->DeleteDir(container_name_));
     }
+    // Dirty
+    ASSERT_OK(reinterpret_cast<::arrow::internal::ThreadPool *>(io_context_->executor())->Shutdown());
   }
 
  protected:
@@ -379,7 +382,8 @@ class TestGeneric : public ::testing::Test, public GenericFileSystemTest {
     random::pcg32_fast rng((std::random_device()()));
     container_name_ = PreexistingData::RandomContainerName(rng);
     ASSERT_OK_AND_ASSIGN(auto options, MakeOptions(env_));
-    ASSERT_OK_AND_ASSIGN(azure_fs_, AzureFileSystem::Make(options));
+    io_context_ = std::make_unique<io::IOContext>();
+    ASSERT_OK_AND_ASSIGN(azure_fs_, AzureFileSystem::Make(options, *io_context_));
     ASSERT_OK(azure_fs_->CreateDir(container_name_, true));
     fs_ = std::make_shared<SubTreeFileSystem>(container_name_, azure_fs_);
   }
@@ -417,6 +421,7 @@ class TestGeneric : public ::testing::Test, public GenericFileSystemTest {
 
  private:
   std::string container_name_;
+  std::unique_ptr<io::IOContext> io_context_;
 };
 
 class TestAzuriteGeneric : public TestGeneric {

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024
… with Valgrind (apache#41163)

### Rationale for this change

`GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++.

### What changes are included in this PR?

Don't run `TestGetFileInfoGenerator()` with Valgrind.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41004

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@pitrou
Copy link
Member

pitrou commented Apr 15, 2024

I think that --with-tls isn't the cause of this.

Well, we can wait for an updated conda-forge package to make sure that's the case.

@felipecrv
Copy link
Contributor

Looks like yet another case of a library polluting the thread-local storage area :)

@pitrou
Copy link
Member

pitrou commented Apr 15, 2024

It's more a case of DLL finalization coming before whatever unit of work is still being executed in some worker thread, AFAIU. We had a similar problem with S3 (but resulting in worse misbehavior aka crashes) and had to jump through hoops to (hopefully) solve the problem.

kou added a commit to kou/arrow that referenced this pull request Apr 18, 2024
…Valgrind again

We use conda's libxml2. It didn't use `--with-tls` but 2.12.6-2 or
later uses `--with-tls`. It may suppress a detected leak.
@kou
Copy link
Member Author

kou commented Apr 21, 2024

It seems that --with-tls is related: #41279 (comment)

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
… with Valgrind (apache#41163)

### Rationale for this change

`GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++.

### What changes are included in this PR?

Don't run `TestGetFileInfoGenerator()` with Valgrind.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41004

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
… with Valgrind (apache#41163)

### Rationale for this change

`GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++.

### What changes are included in this PR?

Don't run `TestGetFileInfoGenerator()` with Valgrind.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41004

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
… with Valgrind (apache#41163)

### Rationale for this change

`GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++.

### What changes are included in this PR?

Don't run `TestGetFileInfoGenerator()` with Valgrind.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41004

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
… with Valgrind (apache#41163)

### Rationale for this change

`GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++.

### What changes are included in this PR?

Don't run `TestGetFileInfoGenerator()` with Valgrind.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41004

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
… with Valgrind (apache#41163)

### Rationale for this change

`GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++.

### What changes are included in this PR?

Don't run `TestGetFileInfoGenerator()` with Valgrind.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41004

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants