
[Datasets] Add fast file metadata provider and refactor Parquet datasource #24094

Merged: 5 commits into ray-project:master on Apr 29, 2022

Conversation

@pdames (Member) commented Apr 21, 2022

Why are these changes needed?

Adds a fast file metadata provider that trades comprehensive file metadata collection for collection speed, and also disables directory path expansion, which can be very slow on some cloud storage providers. This PR also refactors the Parquet datasource to take advantage of both of these changes and of the content-type agnostic partitioning support from #23624.
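A minimal usage sketch of the new provider (hedged: the meta_provider keyword and its wiring through the read APIs are assumptions based on this PR series, and the bucket/file names are hypothetical):

import ray
from ray.data.datasource import FastFileMetadataProvider

# Trade comprehensive metadata collection for speed: file sizes are left
# unknown and directory paths are not expanded, avoiding extra round-trips
# to the object store before reading begins.
ds = ray.data.read_binary_files(
    ["s3://my-bucket/logs/file-0.bin", "s3://my-bucket/logs/file-1.bin"],
    meta_provider=FastFileMetadataProvider(),
)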

This is the second PR of a series originally proposed in #23179.

Related issue number

Partially resolves #22910.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@pdames pdames requested a review from jianoaix April 21, 2022 23:42
@pdames pdames force-pushed the fast-file-metadata branch 4 times, most recently from 519dc7d to c7a61ba on April 25, 2022 23:03
@@ -237,6 +244,32 @@ def expand_paths(
return expanded_paths, file_sizes


class FastFileMetadataProvider(DefaultFileMetadataProvider):
Contributor

I find it weird to have those in a *_datasource.py file. Would it make more sense for all these subclasses of FileMetadataProvider to be placed in file_metadata_provider.py? If not, why?

@pdames (Member Author) commented Apr 26, 2022

I had considered the same, and I think it would be a good idea to move them all over to file_metadata_provider.py as part of this PR.

From an end-user's perspective, they should be able to use from ray.data.datasource import FastFileMetadataProvider regardless of which file the class lives in, so this shouldn't change much for them.

From the perspective of a Ray Data maintainer, the current organization largely stems from organic growth of the code: one-off utility functions gradually evolved into more generic/extensible classes over time, following the existing convention of grouping utility classes with the datasource that uses them. Thus, all file metadata providers in file_based_datasource.py are meant to be used with file-based datasources, while all file metadata providers in parquet_datasource.py are meant to be used specifically with the ParquetDatasource.

The primary downside of preserving this type of grouping is that core dependencies like file_based_datasource.py grow too large over time (now sitting at >700 LoC), which makes it harder to grok the important parts at a glance or during a quick top-to-bottom read-through.

If we were to move all file metadata provider implementations into file_metadata_provider.py, I'd expect it to be just over 300 LoC with all classes organized around the central purpose of providing file metadata. This better normalizes the size of our "large files" and keeps them more focused on a single purpose, so I'm slightly in favor of making this move now if we're in agreement that it's beneficial.

We should probably also do the same with block write path providers in a follow-up PR.
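To make the import-path point concrete, here is an illustrative sketch of the package-level re-export that keeps the public import stable (the exact contents of ray/data/datasource/__init__.py are an assumption):

# ray/data/datasource/__init__.py (illustrative excerpt)
# Re-exporting at the package level means end-users always write
# `from ray.data.datasource import FastFileMetadataProvider`, no matter
# which module actually defines the class.
from ray.data.datasource.file_metadata_provider import (
    FastFileMetadataProvider,
)

__all__ = ["FastFileMetadataProvider"]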

@pdames (Member Author)

I refactored this in the latest commit - let me know what you think.

Contributor

Thanks, I think it makes sense for subclasses of FileMetadataProvider to live in file_metadata_provider.py.


def expand_paths(
self,
paths: List[str],
Contributor

Do all paths need to be for the same block?
If not, why does it make sense to return a BlockMetadata as the value?

@pdames (Member Author)

I assume this comment is meant to apply to the core __call__ and _get_block_metadata methods, which do require that all paths provided are part of the same block. I think this could be made clearer with some docstring updates.
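As a self-contained illustration of that contract (hypothetical names; Ray's actual BlockMetadata and method signatures differ), every path passed in belongs to a single output block, so exactly one metadata record is returned for all of them:

from typing import List, NamedTuple, Optional

class BlockMetadataSketch(NamedTuple):
    # Stand-in for Ray's BlockMetadata, for illustration only.
    num_rows: Optional[int]
    size_bytes: Optional[int]
    input_files: List[str]

def get_block_metadata(
    paths: List[str],
    file_sizes: List[Optional[int]],
) -> BlockMetadataSketch:
    # All input paths are read into the same block, so their sizes are
    # aggregated into one record rather than one record per file.
    known_sizes = [s for s in file_sizes if s is not None]
    return BlockMetadataSketch(
        num_rows=None,  # unknown without reading the files
        size_bytes=sum(known_sizes) if known_sizes else None,
        input_files=paths,
    )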

@pdames (Member Author)

I've added some clarification to the docstrings here in the latest commit.

Contributor

Thanks.
@clarkzinzow Is it true that we may load multiple files into a single block?

@jianoaix (Contributor) left a comment

Looking good overall. My question is whether we can make the inheritance chain shorter and simpler.

logger = logging.getLogger(__name__)


class ParquetBaseDatasource(FileBasedDatasource):
Contributor

This class has only internal methods and only one subclass; can it be folded into its subclass ParquetDatasource? My concern is that a separate base class is a bit of overkill for this case.

@pdames (Member Author)

I think that depends on what folding it into ParquetDatasource actually means, but if we mean getting rid of this class altogether, I don't think that will work moving forward. Some degree of refactoring may be possible, but we will need to keep this class separate in either this PR or the next, since the upcoming Parquet bulk file reader API needs to use ParquetBaseDatasource directly. That API will not be able to use ParquetDatasource because its prepare_read method override is the root cause of the scalability/performance issues cited in #22910 when consuming a large number of Parquet files.

So we can delay separation of this class until the next PR, but I think it still needs to happen.
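A rough sketch of the split being described (hypothetical class and method bodies, not the actual Ray code):

class FileBasedDatasourceSketch:
    def prepare_read(self, paths):
        # Default behavior: one simple read task per file, with no extra
        # metadata resolution up front.
        ...

class ParquetBaseDatasourceSketch(FileBasedDatasourceSketch):
    # Holds only internal Parquet read/serialization logic and keeps the
    # cheap default prepare_read, so a bulk file reader can use it
    # directly on large path lists.
    pass

class ParquetDatasourceSketch(ParquetBaseDatasourceSketch):
    def prepare_read(self, paths):
        # Overrides prepare_read with Parquet-specific metadata resolution
        # (e.g. via PyArrow), the expensive step flagged in #22910 when
        # consuming a large number of files.
        ...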

Contributor

Ok, thanks for the explanation. Let's keep it then.



@DeveloperAPI
class ParquetMetadataProvider(FileMetadataProvider):
Contributor

I have a similar concern about whether this layer is needed, i.e. can DefaultParquetMetadataProvider directly subclass FileMetadataProvider?

@pdames (Member Author)

I have a slight preference to preserve the current layout since the intent is for anyone providing a custom ParquetDatasource metadata provider to implement the interface signature established in ParquetMetadataProvider and, in particular:

  1. Only implement metadata prefetching if their use-case requires it.
  2. Carefully consider the best way to fetch block metadata for their use-case rather than just keeping a default implementation.

For example, my CachedFileMetadataProvider at https://github.com/pdames/deltacat/blob/edec6159c5acda3ede15653b4e92aaa45a43206f/deltacat/io/aws/redshift/redshift_datasource.py#L67-L70 inherits from ParquetMetadataProvider, does not require metadata prefetching, and uses a different implementation for getting block metadata from a prebuilt cache. I also expect this to be the general case for most upcoming data warehouse and data catalog integrations with Ray Datasets, beyond just the Amazon Redshift integration that uses it here.

So, in summary, my slight preference is to keep DefaultParquetMetadataProvider as an internal implementation detail exposed to Ray Data maintainers (continuing to exclude the @DeveloperAPI label from that class), and to expose ParquetMetadataProvider to end-users creating their own ParquetDatasource metadata providers.
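A loose, self-contained sketch of that cache-backed pattern (hypothetical names and signatures; see the linked CachedFileMetadataProvider for the real implementation):

from typing import Any, Dict, List, Optional

class CachedMetadataProviderSketch:
    def __init__(self, meta_cache: Dict[str, Any]):
        # Prebuilt mapping from file path to precomputed block metadata,
        # e.g. populated from a data catalog or warehouse manifest.
        self._meta_cache = meta_cache

    def prefetch_file_metadata(self, pieces: List[Any]) -> Optional[List[Any]]:
        # No prefetching required: all metadata is already in the cache.
        return None

    def get_block_metadata(self, paths: List[str]) -> Any:
        # All paths belong to one block; serve its metadata from the
        # prebuilt cache instead of fetching it from storage.
        return self._meta_cache[paths[0]]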

Contributor

SG, thanks.

@jianoaix (Contributor) left a comment

Thank you Patrick, LG!


@clarkzinzow (Contributor) left a comment

LGTM, awesome work!

@clarkzinzow (Contributor)

Hey @pdames, just reverted a commit that was breaking the Datasets CI job, could you rebase on master one more time? 🙏

@pdames (Member Author) commented Apr 29, 2022

> Hey @pdames, just reverted a commit that was breaking the Datasets CI job, could you rebase on master one more time? 🙏

Done!

@pdames (Member Author) commented Apr 29, 2022

@clarkzinzow Looks like the remaining CI failures are unrelated. Could you give this a final pass?

@clarkzinzow (Contributor)

LGTM, merging!

@clarkzinzow clarkzinzow merged commit 4691d2d into ray-project:master Apr 29, 2022
clarkzinzow pushed a commit referencing this pull request on May 12, 2022: …ata providers. (#24354)

API doc updates for #23179 and #24094. All data docs related to #23179 should be up-to-date once this PR and #24203 are merged.
Successfully merging this pull request may close these issues: [Feature] Ray dataset loading large list of parquet files is extremely slow.