-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40037: [C++][FS][Azure] Make attempted reads and writes against directories fail fast #40119
GH-40037: [C++][FS][Azure] Make attempted reads and writes against directories fail fast #40119
Conversation
@felipecrv I expect you will be interested in reviewing this. |
DCHECK_GE(content_length_, 0); | ||
pos_ = content_length_; | ||
Status Init(const bool truncate, | ||
std::function<Status()> ensure_not_flat_namespace_directory) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can inject AzureFileSystem *azure_file_system
here and not have to allocate a closure for this. You would call AzureFileSystem::Impl::EnsureNotFlatNamespaceDirectory(location)
via azure_file_system->impl_
(accessible because the handles produced by the azure file system can be friends with the filesystem class).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extra info. I was planning to do this but I was struggling with the friends thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trick is to forward-declare the class as well.
diff --git a/cpp/src/arrow/filesystem/azurefs.h b/cpp/src/arrow/filesystem/azurefs.h
index 2a131e40c..d48ef9dd7 100644
--- a/cpp/src/arrow/filesystem/azurefs.h
+++ b/cpp/src/arrow/filesystem/azurefs.h
@@ -44,6 +44,7 @@ class DataLakeServiceClient;
namespace arrow::fs {
+class ObjectAppendStream;
class TestAzureFileSystem;
/// Options for the AzureFileSystem implementation.
@@ -180,6 +181,7 @@ class ARROW_EXPORT AzureFileSystem : public FileSystem {
explicit AzureFileSystem(std::unique_ptr<Impl>&& impl);
+ friend class ObjectAppendStream;
friend class TestAzureFileSystem;
void ForceCachedHierarchicalNamespaceSupport(int hns_support);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my main problem was that ObjectAppendStream
is defined inside an anonymous namespace but I still haven't got it working as you describe.
Are you suggesting to use AzureFileSystem *azure_file_system
or AzureFileSystem:Impl *azure_file_system
as the argument to ObjectAppendStream::Impl
. I don't know how I can get a AzureFileSystem
pointer from inside AzureFileSystem::Impl
and using AzureFileSystem::Impl
as the argument leads to incomplete type
errors which I don't think I can avoid.
Also if you wouldn't mind I would be interested to know what the disadvantage of a lambda function is compared to what you proposed.
Sorry about my lacking C++ knowledge here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if you wouldn't mind I would be interested to know what the disadvantage of a lambda function is compared to what you proposed.
To create the std::function
, you heap allocate an object with copies of the values in the capture list and generate a lot more extra code in the binary:
class function {
T valuesfromthecpapturelist;
RetType operator()(ArgsType ...) {...};
}
When you think about an std::function
this way (a pair of context data and a function), you realize the class you already serves that purpose.
But hey, this is becoming challenging, so I won't hold the PR anymore because of this. Moving to Init()
was a big step in the right direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining
DCHECK_GE(content_length_, 0); | ||
pos_ = content_length_; | ||
Status Init(const bool truncate, | ||
std::function<Status()> ensure_not_flat_namespace_directory) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if you wouldn't mind I would be interested to know what the disadvantage of a lambda function is compared to what you proposed.
To create the std::function
, you heap allocate an object with copies of the values in the capture list and generate a lot more extra code in the binary:
class function {
T valuesfromthecpapturelist;
RetType operator()(ArgsType ...) {...};
}
When you think about an std::function
this way (a pair of context data and a function), you realize the class you already serves that purpose.
But hey, this is becoming challenging, so I won't hold the PR anymore because of this. Moving to Init()
was a big step in the right direction.
TYPED_TEST(TestAzureFileSystemOnAllScenarios, | ||
OpenOutputStreamWithMissingContainer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is causing the linter to fail @Tom-Newton. Please fix and I will merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
My apologies for the noise. Looking into this now. |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 8a62f30. There was 1 benchmark result with an error:
There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
That should be the last one. Sorry again about the noise! |
@austin3dickey no worries :) |
…nst directories fail fast (apache#40119) ### Rationale for this change Prevent confusion if a user attempts to read or write a directory. ### What changes are included in this PR? - Make `ObjectAppendStream::Flush` a noop if `ObjectAppendStream::Init` has not run successfully. This avoids an unhandled error when the destructor calls flush. - Check blob properties for directory marker metadata when initialising `ObjectInputFile` or `ObjectAppendStream`. - When initialising `ObjectAppendStream` call `GetFileInfo` if it is a flat namespace account. ### Are these changes tested? Add new tests `DisallowReadingOrWritingDirectoryMarkers` and `DisallowCreatingFileAndDirectoryWithTheSameName` to cover the new fail fast behaviour. Also updated `WriteMetadata` to ensure that my changes to Flush didn't break setting metadata without calling `Write` on the stream. ### Are there any user-facing changes? Yes. Invalid read and write operations will now fail fast and gracefully. Previously could get into a confusing state where there were files and directories at the same path and there were some un-graceful failures. * Closes: apache#40037 Authored-by: Thomas Newton <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
…nst directories fail fast (apache#40119) ### Rationale for this change Prevent confusion if a user attempts to read or write a directory. ### What changes are included in this PR? - Make `ObjectAppendStream::Flush` a noop if `ObjectAppendStream::Init` has not run successfully. This avoids an unhandled error when the destructor calls flush. - Check blob properties for directory marker metadata when initialising `ObjectInputFile` or `ObjectAppendStream`. - When initialising `ObjectAppendStream` call `GetFileInfo` if it is a flat namespace account. ### Are these changes tested? Add new tests `DisallowReadingOrWritingDirectoryMarkers` and `DisallowCreatingFileAndDirectoryWithTheSameName` to cover the new fail fast behaviour. Also updated `WriteMetadata` to ensure that my changes to Flush didn't break setting metadata without calling `Write` on the stream. ### Are there any user-facing changes? Yes. Invalid read and write operations will now fail fast and gracefully. Previously could get into a confusing state where there were files and directories at the same path and there were some un-graceful failures. * Closes: apache#40037 Authored-by: Thomas Newton <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Rationale for this change
Prevent confusion if a user attempts to read or write a directory.
What changes are included in this PR?
ObjectAppendStream::Flush
a noop ifObjectAppendStream::Init
has not run successfully. This avoids an unhandled error when the destructor calls flush.ObjectInputFile
orObjectAppendStream
.ObjectAppendStream
callGetFileInfo
if it is a flat namespace account.Are these changes tested?
Add new tests
DisallowReadingOrWritingDirectoryMarkers
andDisallowCreatingFileAndDirectoryWithTheSameName
to cover the new fail fast behaviour.Also updated
WriteMetadata
to ensure that my changes to Flush didn't break setting metadata without callingWrite
on the stream.Are there any user-facing changes?
Yes. Invalid read and write operations will now fail fast and gracefully. Previously could get into a confusing state where there were files and directories at the same path and there were some un-graceful failures.