
Support pseudo columns #1203

Closed
Igosuki opened this issue Oct 29, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@Igosuki
Contributor

Igosuki commented Oct 29, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I haven't seen this anywhere in the code, so please let me know if I overlooked it.
Spark supports pseudo columns: a pseudo column is a column that has a single value for an entire partition, and it can be used to filter data without reading any physical files.

Describe the solution you'd like
Reading a base path that contains two directories, foo=a and foo=b, gives me a dataframe with the Utf8 column foo merged into the schema, and a partition for each value.
If I then filter on that column, for instance select * from basepathtable where foo = "a", only the files in that directory are scanned.
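The pruning behaviour requested above can be pictured with a small framework-agnostic sketch (plain Python, not the Spark or DataFusion API; `pruned_paths` and the toy layout are illustrative): given Hive-style `foo=a` / `foo=b` directories, a filter on `foo` selects files purely from directory names, without opening any data file.

```python
import os
import tempfile

def pruned_paths(base, column, value):
    """Return only the files under the partition directory whose
    pseudo-column value matches, without opening any data files."""
    selected = []
    for entry in os.listdir(base):
        # Hive-style partition directories are named "column=value".
        if entry == f"{column}={value}":
            part_dir = os.path.join(base, entry)
            selected.extend(
                os.path.join(part_dir, f) for f in os.listdir(part_dir)
            )
    return selected

# Build a toy layout: base/foo=a/data.parquet and base/foo=b/data.parquet.
base = tempfile.mkdtemp()
for v in ("a", "b"):
    d = os.path.join(base, f"foo={v}")
    os.makedirs(d)
    open(os.path.join(d, "data.parquet"), "w").close()

# Filtering on foo = "a" touches only the foo=a directory.
matches = pruned_paths(base, "foo", "a")
```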

Describe alternatives you've considered
Currently, AFAIK, DataFusion requires you to know the full list of partition directories and to read from each one individually.
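The workaround described above can be sketched as follows (plain Python standing in for per-directory reads; `read_partition` and the directory layout are illustrative, not the DataFusion API): the caller enumerates every partition directory itself, reads each one, and re-attaches the partition value that the path encodes.

```python
import os
import tempfile

def read_partition(path):
    # Stand-in for a real per-directory read (e.g. a parquet scan);
    # here we just return the file names found in the directory.
    return [os.path.join(path, f) for f in os.listdir(path)]

def read_all_partitions(base, column):
    """Manually enumerate every partition directory, read each one,
    and re-attach the partition value encoded in the path."""
    rows = []
    for entry in sorted(os.listdir(base)):
        if "=" in entry:
            key, value = entry.split("=", 1)
            if key == column:
                for f in read_partition(os.path.join(base, entry)):
                    rows.append({"file": f, column: value})
    return rows

# Toy layout: base/dt=2021-01-01/part-0.parquet, base/dt=2021-01-02/part-0.parquet.
base = tempfile.mkdtemp()
for v in ("2021-01-01", "2021-01-02"):
    d = os.path.join(base, f"dt={v}")
    os.makedirs(d)
    open(os.path.join(d, "part-0.parquet"), "w").close()

rows = read_all_partitions(base, "dt")
```

With pseudo-column support, this enumeration and value re-attachment would instead happen inside the table provider.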

Additional context
Example PySpark code, where basepath contains directories like dt={*}:

from pyspark.sql.functions import col

df = self.spark.read.option("basePath", basepath).format("parquet").load(path)
filtered = df.filter((col('dt') >= start) & (col('dt') <= end))
@Igosuki Igosuki added the enhancement New feature or request label Oct 29, 2021
@alamb
Contributor

alamb commented Nov 1, 2021

I believe this is exactly the use case in @rdettai's epic #1141 (tracked by #1139).

@rdettai
Contributor

rdettai commented Nov 1, 2021

Right! Note that even after #1141 is merged, the feature will not yet be in the read_xxx APIs; it will be added there with #1185 😊

@Igosuki
Contributor Author

Igosuki commented Nov 1, 2021

Awesome, I was tracking @rdettai's work, but I thought it was mostly about rewriting table providers and enabling remote storage.

@Igosuki
Contributor Author

Igosuki commented Nov 1, 2021

Guess I'll just close this and contribute to the discussion there, then.

@Igosuki Igosuki closed this as completed Nov 1, 2021