lightning: support opening empty files and skip directory objects in S3 #33029

Merged: 11 commits, Mar 18, 2022

@dsdashun dsdashun commented Mar 14, 2022

What problem does this PR solve?

Issue Number: close #31824

Problem Summary: When an empty directory object matches the file pattern during a Parquet import, Lightning reports an error.

What is changed and how it works?

  • When walking a directory in S3, skip empty directory objects and continue.
  • Open() in S3 now supports opening empty files, so its behavior matches the local file system.
  • Seek() in S3 now checks the real offset and rejects negative values.
  • The Lightning S3 integration test has been refactored, and a test case has been added for a manually configured file path pattern with empty directories.
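
To illustrate the Seek() offset check from the list above, here is a minimal Go sketch. The helper `resolveSeek` and its parameters are hypothetical and are not taken from the PR; it only shows the idea of resolving a request to an absolute ("real") offset and rejecting a negative result.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// resolveSeek turns (offset, whence) into an absolute position within an
// object of the given size. Hypothetical helper for illustration only.
func resolveSeek(cur, size, offset int64, whence int) (int64, error) {
	var target int64
	switch whence {
	case io.SeekStart:
		target = offset
	case io.SeekCurrent:
		target = cur + offset
	case io.SeekEnd:
		target = size + offset
	default:
		return 0, errors.New("invalid whence")
	}
	// The "real offset" check: a position before the start is an error.
	if target < 0 {
		return 0, fmt.Errorf("seek to negative position %d", target)
	}
	return target, nil
}

func main() {
	pos, err := resolveSeek(0, 100, -10, io.SeekEnd)
	fmt.Println(pos, err)

	_, err = resolveSeek(0, 100, -1, io.SeekStart)
	fmt.Println(err)
}
```

How positions at or beyond the end of the object are handled is left out of this sketch on purpose, since the description above only mentions negative values.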

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

* Also refined the S3 integration test

ti-chi-bot commented Mar 14, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • glorv
  • gozssky


@ti-chi-bot ti-chi-bot added the release-note-none Denotes a PR that doesn't merit a release note. label Mar 14, 2022

CLAassistant commented Mar 14, 2022

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot ti-chi-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 14, 2022
@dsdashun

/cc @gozssky @glorv

* Refined the test for introducing some empty files.
@dsdashun dsdashun changed the title lightning: skip all the empty data files for importing lightning: skip opening all the empty files for making parquet regions Mar 14, 2022
@dsdashun

The bug is triggered when both of the following situations occur:

  • Situation 1: the data file type is parquet, and mydump.makeParquetFileRegion() calls ExternalStorage.Open() on the file path.
  • Situation 2: in S3, calling Open() on any empty object fails.

If either situation is avoided, the problem is fixed.

Previous solutions focused on avoiding situation 1, that is, on not calling ExternalStorage.Open() on empty objects in mydump.makeParquetFileRegion(). This leaves the current storage implementations intact. However, it adds complexity to mydump.makeParquetFileRegion(), and the size check only fixes a problem in the S3 ExternalStorage implementation, so the fix is tightly coupled to one storage implementation. For other ExternalStorage implementations, calling Open() on an empty item is fine. For example, opening an empty file on the local file system works, so the size check is unnecessary there.

I have now decided to change the approach and avoid situation 2 instead, that is, to make Open() succeed for every object.

There are two ways to do this. The first is to change the implementation of Open() for S3 so that opening an empty file path returns an eofReadCloser. However, Open() takes no size information among its parameters, so we can only determine whether the file is empty from its path (if the path has a '/' suffix, we assume it is a directory and return an eofReadCloser). The other way is to change the behavior of WalkDir(), which is what I'll do.
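
For reference, the first option could look roughly like the Go sketch below. The name `eofReadCloser` comes from the comment above, but its definition here is an assumption, and `openByPath` and `failingFetch` are hypothetical helpers that stand in for the real storage code.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// eofReadCloser reports end of input on the first read.
// The definition is an assumption made for this sketch.
type eofReadCloser struct{}

func (eofReadCloser) Read([]byte) (int, error) { return 0, io.EOF }
func (eofReadCloser) Close() error             { return nil }

// openByPath is a hypothetical wrapper: paths ending in "/" are treated as
// directory objects and never reach the real fetch function.
func openByPath(path string, fetch func(string) (io.ReadCloser, error)) (io.ReadCloser, error) {
	if strings.HasSuffix(path, "/") {
		return eofReadCloser{}, nil
	}
	return fetch(path)
}

// failingFetch stands in for a storage request that rejects empty objects.
func failingFetch(path string) (io.ReadCloser, error) {
	return nil, fmt.Errorf("cannot open %q", path)
}

func main() {
	rc, err := openByPath("db/table/", failingFetch)
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer rc.Close()
	n, readErr := rc.Read(make([]byte, 8))
	fmt.Println(n, readErr)
}
```

The sketch also shows the limitation mentioned above: an empty object whose path does not end in "/" still goes to the fetch function, because the path alone carries no size information.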

In the definition of ExternalStorage.WalkDir():

// WalkDir traverse all the files in a dir.
//
// fn is the function called for each regular file visited by WalkDir.
// The argument `path` is the file path that can be used in `Open`
// function; the argument `size` is the size in byte of the file determined

This indicates that every path passed to the WalkDir() hook function should be openable with Open() without a guaranteed error. In the S3 implementation, empty objects don't meet this requirement, so I plan to modify WalkDir() for S3 to filter out those empty objects.

This keeps the main logic for restoring and making regions intact and clean. However, it requires every ExternalStorage implementation to ensure that the paths passed to the WalkDir() hook function can be opened with Open() without problems. Currently I only modify the behavior of S3. Other implementations, such as Azure Blob and Google Cloud Storage, should also be checked. If this PR is accepted, those adaptations can be done in other PRs.

The filtering logic for empty files could also be added to the hook function passed to WalkDir() in (*mdLoaderSetup).listFile(). However, this has the same problem as the current solution for avoiding situation 1: the logic in the hook function should be independent of the ExternalStorage implementation, but here the filtering would apply only to the S3 implementation. So this is not very reasonable.

* Revert the makeParquetFileRegion logic
* For S3 WalkDir(), filter out empty files
@dsdashun dsdashun changed the title lightning: skip opening all the empty files for making parquet regions lightning: skip all the empty files for walking a directory in S3 Mar 14, 2022
@dsdashun

/run-unit-test

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 16, 2022

@sleepymole sleepymole left a comment

@dsdashun WalkDir may not be used only by Lightning. To be consistent with normal file system, I think we shouldn't skip all empty objects in WalkDir.

For opening an empty object in S3, we can use S3API.HeadObject to get the size information first.
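
A rough Go sketch of this "check the size first" idea follows. `fakeStore`, `head`, and `getRange` are hypothetical in-memory stand-ins, not AWS SDK calls; the fake only mimics one behavior discussed in this thread, namely that a ranged read on a zero-byte object is rejected.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// fakeStore is a hypothetical in-memory object store that counts requests.
type fakeStore struct {
	objects  map[string][]byte
	requests int // number of simulated round trips
}

var errInvalidRange = errors.New("InvalidRange")

// head returns the object size, like a metadata-only request.
func (s *fakeStore) head(key string) (int64, error) {
	s.requests++
	data, ok := s.objects[key]
	if !ok {
		return 0, errors.New("not found")
	}
	return int64(len(data)), nil
}

// getRange reads from start to the end of the object and rejects a start
// position that is not inside the object.
func (s *fakeStore) getRange(key string, start int64) (io.ReadCloser, error) {
	s.requests++
	data, ok := s.objects[key]
	if !ok {
		return nil, errors.New("not found")
	}
	if start >= int64(len(data)) {
		return nil, errInvalidRange
	}
	return io.NopCloser(bytes.NewReader(data[start:])), nil
}

// openWithHead asks for the size first and skips the ranged read for
// zero-byte objects.
func openWithHead(s *fakeStore, key string) (io.ReadCloser, error) {
	size, err := s.head(key)
	if err != nil {
		return nil, err
	}
	if size == 0 {
		return io.NopCloser(bytes.NewReader(nil)), nil
	}
	return s.getRange(key, 0)
}

func main() {
	s := &fakeStore{objects: map[string][]byte{"empty": {}, "a.csv": []byte("1,2\n")}}
	for _, key := range []string{"empty", "a.csv"} {
		rc, err := openWithHead(s, key)
		if err != nil {
			fmt.Println(key, "failed:", err)
			continue
		}
		rc.Close()
		fmt.Println(key, "opened, total requests so far:", s.requests)
	}
}
```

In this model every non-empty object costs two requests instead of one, which is the kind of overhead a size check before each open introduces.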

@ti-chi-bot ti-chi-bot removed the status/LGT1 Indicates that a PR has LGTM 1. label Mar 16, 2022
@dsdashun

> @dsdashun WalkDir may not be used only by Lightning. To be consistent with normal file system, I think we shouldn't skip all empty objects in WalkDir.
>
> For opening an empty object in S3, we can use S3API.HeadObject to get the size information first.

Hi @gozssky, I think calling S3API.HeadObject in Open() would add an extra request/response round trip to the Open() operation. It might also add more chances of failure for Open() on S3.

I still think changing the behavior of WalkDir() is appropriate, because it has the data size information directly. Would it be OK if I just skip all the empty DIRECTORY objects (that is, empty objects with a "/" suffix) in WalkDir?

@sleepymole

@dsdashun To avoid unnecessary overhead, perhaps we can add a special check for the InvalidRange error and do a retry.

> Would it be OK if I just skip all the empty DIRECTORY objects (that is, empty objects with a "/" suffix) in WalkDir?

Agreed. ExternalStorage acts like a filesystem, so any file name that ends with "/" is inappropriate.
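
The agreed rule can be sketched in Go as follows. `objectInfo` and `walkObjects` are hypothetical names, and the listing is a plain slice instead of a real S3 listing; the point is only that zero-size keys ending in "/" are skipped while empty regular files still reach the callback.

```go
package main

import (
	"fmt"
	"strings"
)

// objectInfo is a hypothetical stand-in for one entry of an object listing.
type objectInfo struct {
	Key  string
	Size int64
}

// walkObjects calls fn for every listed object except empty directory
// objects, that is, zero-size keys that end with "/".
func walkObjects(listing []objectInfo, fn func(path string, size int64) error) error {
	for _, obj := range listing {
		if obj.Size == 0 && strings.HasSuffix(obj.Key, "/") {
			continue // directory placeholder, nothing to open
		}
		if err := fn(obj.Key, obj.Size); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	listing := []objectInfo{
		{Key: "db/", Size: 0},
		{Key: "db/t.parquet", Size: 10},
		{Key: "db/empty.csv", Size: 0},
	}
	_ = walkObjects(listing, func(path string, size int64) error {
		fmt.Println(path, size)
		return nil
	})
}
```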

* For `WalkDir()`, only the empty directories are omitted.
* For `open()`, if the request is a full range request, skip passing Range argument,
  and use response.ContentLength as the object size information
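
One way to picture the full-range rule from this commit message is the Go sketch below. `rangeHeader` is a hypothetical helper, and the conventions it uses (an exclusive end offset, and an end of 0 meaning "read to the end") are assumptions of this sketch rather than something confirmed by the discussion above.

```go
package main

import "fmt"

// rangeHeader returns the HTTP Range header value for a read request and
// whether a Range header should be sent at all. Hypothetical helper.
func rangeHeader(start, end int64) (string, bool) {
	switch {
	case end > start:
		// HTTP byte ranges are inclusive on both ends, hence end-1.
		return fmt.Sprintf("bytes=%d-%d", start, end-1), true
	case start == 0:
		// Full-range request: send no Range header, so a zero-byte
		// object cannot be rejected for an unsatisfiable range.
		return "", false
	default:
		return fmt.Sprintf("bytes=%d-", start), true
	}
}

func main() {
	for _, r := range [][2]int64{{0, 0}, {10, 20}, {5, 0}} {
		header, send := rangeHeader(r[0], r[1])
		fmt.Printf("start=%d end=%d send=%v header=%q\n", r[0], r[1], send, header)
	}
}
```

When no Range header is sent, the object size can be taken from the response's content length, as the commit message describes, instead of being parsed from a range-specific response header.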
@ti-chi-bot ti-chi-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 17, 2022

sre-bot commented Mar 17, 2022

@sleepymole sleepymole left a comment

rest LGTM

@dsdashun

/run-integration-br-test

@sleepymole

/run-integration-br-test

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 18, 2022
@dsdashun dsdashun changed the title lightning: skip all the empty files for walking a directory in S3 lightning: support opening empty files and skip directory objects in S3 Mar 18, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 18, 2022

glorv commented Mar 18, 2022

/merge

@ti-chi-bot

This pull request has been accepted and is ready to merge.

Commit hash: eda2f4e

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 18, 2022
@ti-chi-bot

@dsdashun: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

@ti-chi-bot ti-chi-bot merged commit 7309b08 into pingcap:master Mar 18, 2022
@dsdashun dsdashun deleted the fix-31824 branch March 18, 2022 08:01
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
6 participants