Databricks loader: Support for generated columns #951

istreeter · 2022-06-24T22:13:27Z

For optimal table partitioning, we want to use an auto-generated column holding the date of the collector timestamp. I imagine a table definition something like this:

CREATE TABLE events(
  app_id    VARCHAR(255),
  --
  -- lots of other columns here!
  --
  collector_tstamp_date DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
PARTITIONED BY (collector_tstamp_date, event_name)

I have found that with generated columns we occasionally get exceptions with messages like:

Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: [MISSING_COLUMN] org.apache.spark.sql.AnalysisException: Column 'unstruct_event_com_acme_myevent_1' does not exist.

I think it's something to do with how we use the MERGESCHEMA copy option without explicitly setting the table schema, combined with the fact that different batches can have different sets of entities. These seem to be incompatible with generated columns.

The solution I've found is to always specify every single column of the table in the COPY INTO statement. If a column is not in the Parquet file, select it as NULL, for example NULL AS unstruct_event_com_acme_myevent_1.
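To make the workaround concrete, here is a minimal sketch of how such a statement could be assembled. It is not the loader's actual code: the function name `build_copy_into`, its parameters, and the example path are all hypothetical. It also assumes the generated column is left out of the SELECT list so that the table computes it, and it leaves copy options out entirely.

```python
from typing import Iterable, List


def build_copy_into(
    table: str,
    table_columns: Iterable[str],
    batch_columns: Iterable[str],
    source_path: str,
    generated_columns: Iterable[str] = (),
) -> str:
    """Build a COPY INTO statement that names every regular table column.

    Hypothetical helper for illustration only. Identifiers are interpolated
    as-is, so they are assumed to be trusted and already valid column names.
    """
    present = set(batch_columns)
    generated = set(generated_columns)

    select_items: List[str] = []
    for column in table_columns:
        if column in generated:
            # Assumption: generated columns are computed by the table,
            # so they are not selected from the Parquet files.
            continue
        if column in present:
            select_items.append(column)
        else:
            # Column exists in the table but not in this batch.
            select_items.append(f"NULL AS {column}")

    select_list = ",\n    ".join(select_items)
    return (
        f"COPY INTO {table}\n"
        f"FROM (\n"
        f"  SELECT\n"
        f"    {select_list}\n"
        f"  FROM '{source_path}'\n"
        f")\n"
        f"FILEFORMAT = PARQUET"
    )


if __name__ == "__main__":
    # Example batch that contains app_id but not the acme entity column.
    print(
        build_copy_into(
            table="events",
            table_columns=[
                "app_id",
                "unstruct_event_com_acme_myevent_1",
                "collector_tstamp_date",
            ],
            batch_columns=["app_id"],
            source_path="s3://example-bucket/run=1/",  # hypothetical path
            generated_columns=["collector_tstamp_date"],
        )
    )
```

Because the column list comes from the table rather than from the batch, every batch produces a SELECT with the same shape, whichever entities it happens to contain.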

istreeter added a commit that referenced this issue Jun 24, 2022

Databricks loader: Support for generated columns (close #951)

3cb4f6f

istreeter added this to the 4.1.0 milestone Jun 25, 2022

istreeter added a commit that referenced this issue Jun 25, 2022

Databricks loader: Support for generated columns (close #951)

2e49096

This was referenced Jun 25, 2022

Release 4.1.0 #948

Closed

Databricks loader: Add collector_tstamp_date column #943

Closed

istreeter added a commit that referenced this issue Jun 25, 2022

Databricks loader: Support for generated columns (close #951)

1ff123a

istreeter added a commit that referenced this issue Jun 25, 2022

Databricks loader: Support for generated columns (close #951)

25e334e

istreeter added a commit that referenced this issue Jun 25, 2022

Databricks loader: Support for generated columns (close #951)

2079ec2

istreeter added a commit that referenced this issue Jun 25, 2022

Databricks loader: Support for generated columns (close #951)

8a76d03

pondzix pushed a commit that referenced this issue Jun 28, 2022

Databricks loader: Support for generated columns (close #951)

eeebd45

spenes closed this as completed in 07a98f2 Jul 4, 2022
