
Support pseudo columns #1203

Closed
Igosuki opened this issue Oct 29, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@Igosuki
Contributor

Igosuki commented Oct 29, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I haven't seen this anywhere in the code, so please let me know if I overlooked it.
Spark supports pseudo columns: a pseudo column is a column that has a single value for an entire partition, and it can be used to filter data without reading any physical files.

Describe the solution you'd like
Reading a base path that contains two directories, foo=a and foo=b, gives me a dataframe with the Utf8 column foo merged into the schema, and a partition for each value.
If I then filter on that column, for instance select * from basepathtable where foo = "a", only the files in that directory are scanned.
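The pruning behaviour requested above can be pictured with a small framework-agnostic sketch (plain Python, not the Spark or DataFusion API; `pruned_paths` and the toy layout are illustrative): given Hive-style `foo=a` / `foo=b` directories, a filter on `foo` selects files purely from directory names, without opening any data file.

```python
import os
import tempfile

def pruned_paths(base, column, value):
    """Return only the files under the partition directory whose
    pseudo-column value matches, without opening any data files."""
    selected = []
    for entry in os.listdir(base):
        # Hive-style partition directories are named "column=value".
        if entry == f"{column}={value}":
            part_dir = os.path.join(base, entry)
            selected.extend(
                os.path.join(part_dir, f) for f in os.listdir(part_dir)
            )
    return selected

# Build a toy layout: base/foo=a/data.parquet and base/foo=b/data.parquet.
base = tempfile.mkdtemp()
for v in ("a", "b"):
    d = os.path.join(base, f"foo={v}")
    os.makedirs(d)
    open(os.path.join(d, "data.parquet"), "w").close()

# Filtering on foo = "a" touches only the foo=a directory.
matches = pruned_paths(base, "foo", "a")
```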

Describe alternatives you've considered
Currently, AFAIK, DataFusion requires you to know the full list of partition directories and to read from each one individually.
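The workaround described above can be sketched as follows (plain Python standing in for per-directory reads; `read_partition` and the directory layout are illustrative, not the DataFusion API): the caller enumerates every partition directory itself, reads each one, and re-attaches the partition value that the path encodes.

```python
import os
import tempfile

def read_partition(path):
    # Stand-in for a real per-directory read (e.g. a parquet scan);
    # here we just return the file names found in the directory.
    return [os.path.join(path, f) for f in os.listdir(path)]

def read_all_partitions(base, column):
    """Manually enumerate every partition directory, read each one,
    and re-attach the partition value encoded in the path."""
    rows = []
    for entry in sorted(os.listdir(base)):
        if "=" in entry:
            key, value = entry.split("=", 1)
            if key == column:
                for f in read_partition(os.path.join(base, entry)):
                    rows.append({"file": f, column: value})
    return rows

# Toy layout: base/dt=2021-01-01/part-0.parquet, base/dt=2021-01-02/part-0.parquet.
base = tempfile.mkdtemp()
for v in ("2021-01-01", "2021-01-02"):
    d = os.path.join(base, f"dt={v}")
    os.makedirs(d)
    open(os.path.join(d, "part-0.parquet"), "w").close()

rows = read_all_partitions(base, "dt")
```

With pseudo-column support, this enumeration and value re-attachment would instead happen inside the table provider.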

Additional context
Example PySpark code, where basepath contains directories like dt={*}:

from pyspark.sql.functions import col

df = self.spark.read.option("basePath", basepath).format("parquet").load(path)
filtered = df.filter((col('dt') >= start) & (col('dt') <= end))
@Igosuki Igosuki added the enhancement New feature or request label Oct 29, 2021
@alamb
Contributor

alamb commented Nov 1, 2021

I believe this is exactly the use case in @rdettai's epic #1141 (tracked by #1139).

@rdettai
Contributor

rdettai commented Nov 1, 2021

Right! Note that even after #1141 is merged, the feature will not yet be in the read_xxx APIs; it will be added there with #1185 😊

@Igosuki
Contributor Author

Igosuki commented Nov 1, 2021

Awesome, I was tracking @rdettai's work, but I thought it was mostly about rewriting table providers and enabling remote storage.

@Igosuki
Contributor Author

Igosuki commented Nov 1, 2021

Guess I'll just close this and contribute to the discussion there, then.

@Igosuki Igosuki closed this as completed Nov 1, 2021