[Datasets] Add fast file metadata provider and refactor Parquet datasource #24094
Conversation
Force-pushed 519dc7d to c7a61ba.
```diff
@@ -237,6 +244,32 @@ def expand_paths(
         return expanded_paths, file_sizes


 class FastFileMetadataProvider(DefaultFileMetadataProvider):
```
I find it weird to have these in a `*_datasource.py` file. Would it make more sense for all of these subclasses of `FileMetadataProvider` to be placed in `file_metadata_provider.py`? If not, why?
I had considered the same, and I think it would be a good idea to move them all over to `file_metadata_provider.py` as part of this PR.

From the perspective of an end-user, they should be able to use `from ray.data.datasource import FastFileMetadataProvider` regardless of which file we put the class in, so this shouldn't change much for them.

From the perspective of a Ray Data maintainer, the current organization largely stems from organic growth of the code, where one-off utility functions gradually evolved into more generic/extensible classes over time, and thus follows the existing convention of grouping these utility classes with the datasource that uses them. Thus, all file metadata providers in `file_based_datasource.py` are meant to be used with file-based datasources, while all file metadata providers in `parquet_datasource.py` are meant to be used specifically with the `ParquetDatasource`.

The primary downside of preserving this type of grouping is that core dependencies like `file_based_datasource.py` grow too large over time (now sitting at >700 LoC), which makes it harder to grok the important parts at a glance or during a quick top-to-bottom read-through. If we were to move all file metadata provider implementations into `file_metadata_provider.py`, I'd expect it to be just over 300 LoC, with all classes organized around the central purpose of providing file metadata. This better normalizes the size of our "large files" and keeps each one focused on a single purpose, so I'm slightly in favor of making this move now if we're in agreement that it's beneficial.

We should probably also do the same with block write path providers in a follow-up PR.
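To illustrate the point about the stable user-facing import path, here is a minimal sketch (the class bodies and module contents are illustrative stubs, not the real Ray implementations): a package-level re-export keeps `from ray.data.datasource import FastFileMetadataProvider` working no matter which internal file the classes live in.

```python
# --- sketch of file_metadata_provider.py after the move (stubs only) ---

class FileMetadataProvider:
    """Base interface for fetching file metadata for paths."""


class DefaultFileMetadataProvider(FileMetadataProvider):
    """Expands directories and fetches per-file metadata."""


class FastFileMetadataProvider(DefaultFileMetadataProvider):
    """Skips directory expansion and per-file metadata fetches."""


# --- sketch of the package __init__.py ---
# from ray.data.datasource.file_metadata_provider import (
#     FastFileMetadataProvider,
# )
#
# End-user code is unchanged across the refactor:
# from ray.data.datasource import FastFileMetadataProvider
```

The re-export makes the file move an internal detail, which is why the user-facing API is unaffected by where the classes physically live.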
I refactored this in the latest commit - let me know what you think.
Thanks, I think it makes sense for subclasses of `FileMetadataProvider` to live in `file_metadata_provider.py`.
|
||
def expand_paths( | ||
self, | ||
paths: List[str], |
Do all paths need to be for the same block? If not, why does it make sense to return `BlockMetadata` as the value?
I assume this comment is meant to apply to the core `__call__` and `_get_block_metadata` methods, which do require that all provided paths are part of the same block. I think this could be made clearer with some docstring updates.
I've added some clarification to the docstrings here in the latest commit.
Thanks.
@clarkzinzow Is it true that we have multiple files to load into a single block?
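The contract being discussed can be sketched as follows (this is a simplified illustration, not the real Ray signatures; `BlockMetadata` here is a toy stand-in for `ray.data.block.BlockMetadata`). Several small files can be coalesced into one read task, so a single call receives all paths for one block and returns a single metadata record summarizing them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockMetadata:
    """Toy stand-in for ray.data.block.BlockMetadata (illustrative)."""
    num_rows: Optional[int]
    size_bytes: Optional[int]
    input_files: List[str]


class OneBlockMetadataProvider:
    """Sketch of the contract: every call covers exactly one block."""

    def __call__(
        self, paths: List[str], file_sizes: List[Optional[int]]
    ) -> BlockMetadata:
        # All `paths` are assumed to belong to the same block (e.g. several
        # small files coalesced into one read task), so a single
        # BlockMetadata summarizes them all.
        known = [s for s in file_sizes if s is not None]
        return BlockMetadata(
            num_rows=None,  # unknown without actually reading the files
            size_bytes=sum(known) if known else None,
            input_files=list(paths),
        )
```

Under this contract, returning one `BlockMetadata` per call is natural: the mapping is one call per block, not one call per file.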
Looking good overall. My question is whether we can make the chain of inheritance shorter/simpler?
```python
logger = logging.getLogger(__name__)


class ParquetBaseDatasource(FileBasedDatasource):
```
This class has only internal methods and only one subclass -- can it be folded into its subclass `ParquetDatasource`? My concern is that it's a bit of overkill for this case.
I think that depends on what folding it into `ParquetDatasource` actually means, but if we mean getting rid of this class altogether, I don't think that will work moving forward. Some degree of refactoring may be possible, but we will need to separate this class in either this PR or the next, since the upcoming Parquet bulk file reader API needs to use `ParquetBaseDatasource` directly. That API will not be able to use `ParquetDatasource` directly, because its `prepare_read` method override is the root cause of the scalability/performance issues cited in #22910 when consuming a large number of Parquet files.

So we can delay separating this class until the next PR, but I think it still needs to happen.
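A stripped-down sketch of the split being described (class bodies are illustrative stubs, not the real Ray code): the base class inherits the cheap generic `prepare_read`, while the subclass overrides it with per-file metadata collection, which is exactly the part the bulk reader needs to bypass.

```python
class FileBasedDatasource:
    def prepare_read(self, paths):
        # Generic path: one simple read task per file, no format-specific
        # metadata fetching, so it scales to very large path lists.
        return [("read", p) for p in paths]


class ParquetBaseDatasource(FileBasedDatasource):
    # Bulk Parquet reader entry point: inherits the cheap prepare_read
    # above, which is why it must remain a separate class.
    pass


class ParquetDatasource(ParquetBaseDatasource):
    def prepare_read(self, paths):
        # Overrides prepare_read with per-file Parquet metadata
        # collection; this override is the scalability bottleneck for
        # large file counts cited in #22910.
        return [("read_with_metadata", p) for p in paths]
```

Folding `ParquetBaseDatasource` away would leave the bulk reader no way to reach the inherited `prepare_read` without also picking up the expensive override.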
Ok, thanks for explanation. Let's keep it then.
```python
@DeveloperAPI
class ParquetMetadataProvider(FileMetadataProvider):
```
I have a similar concern about whether this layer is needed, i.e. can `DefaultParquetMetadataProvider` directly subclass `FileMetadataProvider`?
I have a slight preference for preserving the current layout, since the intent is for anyone providing a custom `ParquetDatasource` metadata provider to implement the interface signature established in `ParquetMetadataProvider` and, in particular, to:

- Only implement metadata prefetching if their use-case requires it.
- Carefully consider the best way to fetch block metadata for their use-case rather than just keeping a default implementation.

For example, my `CachedFileMetadataProvider` at https://github.com/pdames/deltacat/blob/edec6159c5acda3ede15653b4e92aaa45a43206f/deltacat/io/aws/redshift/redshift_datasource.py#L67-L70 inherits from `ParquetMetadataProvider`, does not require metadata prefetching, and uses a different implementation for getting block metadata from a prebuilt cache. I also expect this to be the general case for most upcoming data warehouse and data catalog integrations with Ray Datasets, beyond just the Amazon Redshift integration that uses it here.

So, in summary, my slight preference is to keep `DefaultParquetMetadataProvider` set aside as an internal implementation detail exposed to Ray Data maintainers (and to continue to exclude the `@DeveloperAPI` label from this class), and for `ParquetMetadataProvider` to be exposed to end-users creating their own `ParquetDatasource` metadata providers.
SG, thanks.
Thank you Patrick, LG!
LGTM, awesome work!
Force-pushed ec5499a to e205a49, then e205a49 to c615a6b.
@clarkzinzow Looks like the remaining CI failures are unrelated. Could you give this a final pass?

LGTM, merging!
Why are these changes needed?
Adds a fast file metadata provider that trades comprehensive file metadata collection for speed, and which also disables directory path expansion, which can be very slow on some cloud storage service providers. This PR also refactors the Parquet datasource to take advantage of both of these changes and of the content-type agnostic partitioning support from #23624.
This is the second PR of a series originally proposed in #23179.
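The speed/completeness trade-off can be illustrated with a toy sketch (the filesystem and function names here are hypothetical stand-ins, not the real Ray implementation): the default path expands directories and stats every file, paying one metadata round-trip per path on an object store, while the fast path trusts the given paths as-is and reports sizes as unknown.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryFS:
    """Toy stand-in for a cloud filesystem (purely illustrative)."""

    def __init__(self, files: Dict[str, int]):
        self._files = files  # path -> size in bytes

    def list(self, prefix: str) -> List[Tuple[str, int]]:
        # On a real object store, each listing/stat is a network round-trip.
        return [(p, s) for p, s in self._files.items() if p.startswith(prefix)]


def default_expand_paths(paths, fs):
    # Default behavior: expand directories and stat every file for
    # complete size information.
    expanded, sizes = [], []
    for path in paths:
        for f, size in fs.list(path):
            expanded.append(f)
            sizes.append(size)
    return expanded, sizes


def fast_expand_paths(paths):
    # Fast behavior: skip directory expansion and stat calls entirely,
    # reporting every size as unknown (None).
    return list(paths), [None] * len(paths)
```

Skipping the per-path requests is what makes the fast provider attractive when reading a large, already-known list of files.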
Related issue number
Partially resolves #22910.
Checks

- I've run `scripts/format.sh` to lint the changes in this PR.