Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40037: [C++][FS][Azure] Make attempted reads and writes against directories fail fast #40119

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Feb 18, 2024

Rationale for this change

Prevent confusion if a user attempts to read or write a directory.

What changes are included in this PR?

  • Make ObjectAppendStream::Flush a noop if ObjectAppendStream::Init has not run successfully. This avoids an unhandled error when the destructor calls flush.
  • Check blob properties for directory marker metadata when initialising ObjectInputFile or ObjectAppendStream.
  • When initialising ObjectAppendStream call GetFileInfo if it is a flat namespace account.

Are these changes tested?

Add new tests DisallowReadingOrWritingDirectoryMarkers and DisallowCreatingFileAndDirectoryWithTheSameName to cover the new fail fast behaviour.
Also updated WriteMetadata to ensure that my changes to Flush didn't break setting metadata without calling Write on the stream.

Are there any user-facing changes?

Yes. Invalid read and write operations will now fail fast and gracefully. Previously could get into a confusing state where there were files and directories at the same path and there were some un-graceful failures.

@Tom-Newton
Copy link
Contributor Author

@felipecrv I expect you will be interested in reviewing this.

cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 19, 2024
DCHECK_GE(content_length_, 0);
pos_ = content_length_;
Status Init(const bool truncate,
std::function<Status()> ensure_not_flat_namespace_directory) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can inject AzureFileSystem *azure_file_system here and not have to allocate a closure for this. You would call AzureFileSystem::Impl::EnsureNotFlatNamespaceDirectory(location) via azure_file_system->impl_ (accessible because the handles produced by the azure file system can be friends with the filesystem class).

Copy link
Contributor Author

@Tom-Newton Tom-Newton Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra info. I was planning to do this but I was struggling with the friends thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trick is to forward-declare the class as well.

diff --git a/cpp/src/arrow/filesystem/azurefs.h b/cpp/src/arrow/filesystem/azurefs.h
index 2a131e40c..d48ef9dd7 100644
--- a/cpp/src/arrow/filesystem/azurefs.h
+++ b/cpp/src/arrow/filesystem/azurefs.h
@@ -44,6 +44,7 @@ class DataLakeServiceClient;

 namespace arrow::fs {

+class ObjectAppendStream;
 class TestAzureFileSystem;

 /// Options for the AzureFileSystem implementation.
@@ -180,6 +181,7 @@ class ARROW_EXPORT AzureFileSystem : public FileSystem {

   explicit AzureFileSystem(std::unique_ptr<Impl>&& impl);

+  friend class ObjectAppendStream;
   friend class TestAzureFileSystem;
   void ForceCachedHierarchicalNamespaceSupport(int hns_support);

Copy link
Contributor Author

@Tom-Newton Tom-Newton Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my main problem was that ObjectAppendStream is defined inside an anonymous namespace but I still haven't got it working as you describe.

Are you suggesting to use AzureFileSystem *azure_file_system or AzureFileSystem:Impl *azure_file_system as the argument to ObjectAppendStream::Impl. I don't know how I can get a AzureFileSystem pointer from inside AzureFileSystem::Impl and using AzureFileSystem::Impl as the argument leads to incomplete type errors which I don't think I can avoid.

Also if you wouldn't mind I would be interested to know what the disadvantage of a lambda function is compared to what you proposed.

Sorry about my lacking C++ knowledge here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if you wouldn't mind I would be interested to know what the disadvantage of a lambda function is compared to what you proposed.

To create the std::function, you heap allocate an object with copies of the values in the capture list and generate a lot more extra code in the binary:

class function {
  T valuesfromthecpapturelist;
  
  RetType operator()(ArgsType ...) {...};
}

When you think about an std::function this way (a pair of context data and a function), you realize the class you already serves that purpose.

But hey, this is becoming challenging, so I won't hold the PR anymore because of this. Moving to Init() was a big step in the right direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining

cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
DCHECK_GE(content_length_, 0);
pos_ = content_length_;
Status Init(const bool truncate,
std::function<Status()> ensure_not_flat_namespace_directory) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if you wouldn't mind I would be interested to know what the disadvantage of a lambda function is compared to what you proposed.

To create the std::function, you heap allocate an object with copies of the values in the capture list and generate a lot more extra code in the binary:

class function {
  T valuesfromthecpapturelist;
  
  RetType operator()(ArgsType ...) {...};
}

When you think about an std::function this way (a pair of context data and a function), you realize the class you already serves that purpose.

But hey, this is becoming challenging, so I won't hold the PR anymore because of this. Moving to Init() was a big step in the right direction.

Comment on lines 1606 to 1607
TYPED_TEST(TestAzureFileSystemOnAllScenarios,
OpenOutputStreamWithMissingContainer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is causing the linter to fail @Tom-Newton. Please fix and I will merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@felipecrv felipecrv merged commit 8a62f30 into apache:main Feb 21, 2024
34 of 35 checks passed
@felipecrv felipecrv removed the awaiting committer review Awaiting committer review label Feb 21, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Feb 21, 2024
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

@austin3dickey
Copy link
Contributor

My apologies for the noise. Looking into this now.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30.

There was 1 benchmark result with an error:

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

@austin3dickey
Copy link
Contributor

That should be the last one. Sorry again about the noise!

@felipecrv
Copy link
Contributor

@austin3dickey no worries :)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…nst directories fail fast (apache#40119)

### Rationale for this change
Prevent confusion if a user attempts to read or write a directory. 

### What changes are included in this PR?
- Make `ObjectAppendStream::Flush` a noop if `ObjectAppendStream::Init` has not run successfully. This avoids an unhandled error when the destructor calls flush. 
- Check blob properties for directory marker metadata when initialising `ObjectInputFile` or `ObjectAppendStream`. 
- When initialising `ObjectAppendStream` call `GetFileInfo` if it is a flat namespace account.

### Are these changes tested?
Add new tests `DisallowReadingOrWritingDirectoryMarkers` and `DisallowCreatingFileAndDirectoryWithTheSameName` to cover the new fail fast behaviour. 
Also updated `WriteMetadata` to ensure that my changes to Flush didn't break setting metadata without calling `Write` on the stream. 

### Are there any user-facing changes?
Yes. Invalid read and write operations will now fail fast and gracefully. Previously could get into a confusing state where there were files and directories at the same path and there were some un-graceful failures. 

* Closes: apache#40037

Authored-by: Thomas Newton <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…nst directories fail fast (apache#40119)

### Rationale for this change
Prevent confusion if a user attempts to read or write a directory. 

### What changes are included in this PR?
- Make `ObjectAppendStream::Flush` a noop if `ObjectAppendStream::Init` has not run successfully. This avoids an unhandled error when the destructor calls flush. 
- Check blob properties for directory marker metadata when initialising `ObjectInputFile` or `ObjectAppendStream`. 
- When initialising `ObjectAppendStream` call `GetFileInfo` if it is a flat namespace account.

### Are these changes tested?
Add new tests `DisallowReadingOrWritingDirectoryMarkers` and `DisallowCreatingFileAndDirectoryWithTheSameName` to cover the new fail fast behaviour. 
Also updated `WriteMetadata` to ensure that my changes to Flush didn't break setting metadata without calling `Write` on the stream. 

### Are there any user-facing changes?
Yes. Invalid read and write operations will now fail fast and gracefully. Previously could get into a confusing state where there were files and directories at the same path and there were some un-graceful failures. 

* Closes: apache#40037

Authored-by: Thomas Newton <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][FS][Azure] Prevent reading or writing directory marker blobs
3 participants