
Fix timestamp with timezone mapping in iceberg type converter #23534

Open · wants to merge 19 commits into master from add_timestamptz_mapping_to_iceberg_connector
Conversation

@auden-woolfson (Contributor) commented Aug 27, 2024

Description

Fixes bug described in #23529

== RELEASE NOTES ==

Iceberg Connector Changes
* Add logic to the Iceberg type converter for timestamp with time zone :pr:`23534`

@auden-woolfson auden-woolfson added bug iceberg Apache Iceberg related labels Aug 27, 2024
@auden-woolfson auden-woolfson self-assigned this Aug 27, 2024
@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from 8e1716e to 38919d7 Compare August 27, 2024 22:01
@agrawalreetika (Member) left a comment


Can we add some tests with columns of type timestamp with time zone?

@tdcmeehan (Contributor)

+1. Let's add some end to end tests. Additionally, we may want to remove the validation added here, since I believe we should support this properly now: #22926

@tdcmeehan tdcmeehan self-assigned this Aug 28, 2024
@hantangwangd (Member)

Please also add test cases involving timestamp with time zone in filter conditions and partition columns; I'm a little concerned about the behavior in these scenarios.
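(A hedged sketch of what such an end-to-end filter test could look like, assuming a method inside one of the existing Iceberg test classes extending AbstractTestQueryFramework; the table and column names here are hypothetical:)

@Test
public void testTimestampWithTimeZoneFilter()
{
    // hypothetical table; the inserted instant is 1980-12-08 08:10:00 UTC
    assertUpdate("CREATE TABLE test_tstz (ts TIMESTAMP WITH TIME ZONE)");
    assertUpdate("INSERT INTO test_tstz VALUES (TIMESTAMP '1980-12-08 00:10:00 America/Los_Angeles')", 1);
    // the same point in time expressed in UTC should match in a filter after a round trip
    assertQuery("SELECT count(*) FROM test_tstz WHERE ts = TIMESTAMP '1980-12-08 08:10:00 UTC'", "VALUES 1");
    assertUpdate("DROP TABLE test_tstz");
}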

@auden-woolfson (Contributor, Author)

> Please also add test cases involving timestamp with time zone in filter conditions and partition columns; I'm a little concerned about the behavior in these scenarios.

Just to clarify, do you want the timestamp with time zone to be part of the table that is being partitioned, or the type of the partition column itself? Currently it is not supported as one of the types for partition columns.

@hantangwangd (Member)

> Just to clarify, do you want the timestamp with time zone to be part of the table that is being partitioned, or the type of the partition column itself? Currently it is not supported as one of the types for partition columns.

Yes, that's right. But I think it's better for us to first figure out how to handle these cases before we start to support it.

A very important question is: what long values do we actually plan to store in the data files for timestamp with time zone? Presto has a special encoding for timestamp with time zone values, which packs the time zone information together with the UTC value in milliseconds. Meanwhile, the Iceberg spec stores timestamptz data as UTC values in microseconds and does not retain the source time zone.

If we store the data following the Iceberg spec, we will lose the time zone information; if we store the data following Presto's format, we may run into cross-engine compatibility problems.
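(To make the two layouts concrete, a minimal sketch assuming Presto's DateTimeEncoding and TimeZoneKey helpers from presto-common; the class and method names below are illustrative only:)

import static com.facebook.presto.common.type.DateTimeEncoding.packDateTimeWithZone;
import static com.facebook.presto.common.type.DateTimeEncoding.unpackMillisUtc;
import static com.facebook.presto.common.type.TimeZoneKey.getTimeZoneKey;
import static java.util.concurrent.TimeUnit.MILLISECONDS;

public final class TimestampTzEncodings
{
    private TimestampTzEncodings() {}

    // Presto encoding: UTC millis in the high bits, a 12-bit time zone key in the low bits.
    public static long toPrestoEncoding(long utcMillis, String zoneId)
    {
        return packDateTimeWithZone(utcMillis, getTimeZoneKey(zoneId));
    }

    // Iceberg encoding: plain UTC micros since the epoch; the source zone is dropped.
    public static long toIcebergEncoding(long prestoEncoded)
    {
        return MILLISECONDS.toMicros(unpackMillisUtc(prestoEncoded));
    }
}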

cc: @tdcmeehan @agrawalreetika @ZacBlanco

@tdcmeehan (Contributor)

@hantangwangd I don't see this as a choice: we must store the data according to the Iceberg spec, which means we'll lose the embedded time zone information. This is fine--semantically it's the same thing, and the only thing that might be confusing is that the user, when retrieving stored Iceberg timestamps, will see that the time zones have been adjusted to UTC. But the point-in-time values will remain the same, and this is merely a limitation of the Iceberg table format.

@hantangwangd (Member)

@tdcmeehan Completely agree with your viewpoint.

That means that, besides the type conversion, we need transformation logic for timestamp with time zone data when writing/reading, when parsing filter conditions, and when handling partition values. That's not to say all of this work must be completed at once; it can be divided into a series of PRs.

@ZacBlanco (Contributor) left a comment


Minor nits, plus one question about removing the verifyTypeSupported method.

@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from 65d3bed to a562f45 Compare September 11, 2024 23:31
@auden-woolfson auden-woolfson force-pushed the add_timestamptz_mapping_to_iceberg_connector branch from a562f45 to d877bcc Compare September 13, 2024 21:50
@agrawalreetika (Member) left a comment


Overall looks good to me.

  1. Please add a documentation entry in https://prestodb.io/docs/current/connector/iceberg.html#type-mapping
  2. Squash all the commits into one: "Fix timestamp with timezone mapping in iceberg type converter"

@@ -117,6 +117,10 @@ public static Type toPrestoType(org.apache.iceberg.types.Type type, TypeManager
            case TIME:
                return TimeType.TIME;
            case TIMESTAMP:
                Types.TimestampType timestampType = (Types.TimestampType) type.asPrimitiveType();
                if (timestampType.shouldAdjustToUTC()) {
                    return TimestampWithTimeZoneType.TIMESTAMP_WITH_TIME_ZONE;
Member

Add static import for this
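(Presumably meaning something like the following; the package path is an assumption about where TimestampWithTimeZoneType lives:)

import static com.facebook.presto.common.type.TimestampWithTimeZoneType.TIMESTAMP_WITH_TIME_ZONE;

// ...
                if (timestampType.shouldAdjustToUTC()) {
                    return TIMESTAMP_WITH_TIME_ZONE;
                }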

        this.batchReaderEnabledQueryRunner = createIcebergQueryRunner(
                ImmutableMap.of(),
                ImmutableMap.of(),
                ImmutableMap.of(PARQUET_BATCH_READ_OPTIMIZATION_ENABLED, "true"),
Member

I think there is no need to refactor IcebergQueryRunner.createIcebergQueryRunner(...); just setting hive.parquet-batch-read-optimization-enabled to true in extraProperties should be OK.

Contributor (Author)

Maybe there is another way, but adding this to extraProperties actually causes the tests to break, since it is an unused property there. That would set it as a configuration property, but it needs to be a session property. The session is passed to the DistributedQueryRunner builder in its constructor, so we need some way to add properties to that session before building the runner. This is why I decided to make changes to IcebergQueryRunner. Please let me know if this clears things up; if you have another approach in mind here, I would certainly be open to it!
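(For context, a rough sketch of the session-property route, assuming Presto's testSessionBuilder and Session.SessionBuilder.setCatalogSessionProperty; the session property name here is hypothetical and just mirrors the config property above:)

import com.facebook.presto.Session;

import static com.facebook.presto.testing.TestingSession.testSessionBuilder;

public final class BatchReadSessions
{
    private BatchReadSessions() {}

    // A config property is fixed at server startup; a session property has to be
    // set on the Session that the query runner is constructed with.
    public static Session batchReadEnabledSession()
    {
        return testSessionBuilder()
                .setCatalog("iceberg")
                .setSchema("tpch")
                // hypothetical session property name
                .setCatalogSessionProperty("iceberg", "parquet_batch_read_optimization_enabled", "true")
                .build();
    }
}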

            else {
                type.writeLong(blockBuilder, utcMillis);
            }
            type.writeLong(blockBuilder, utcMillis);
Member

Is the actual long data stored in the Parquet file for timestamp with tz in Presto's format? This may lead to many problems. As guided by @tdcmeehan, we should store the data in Iceberg format.

Contributor

We may need to modify the writer as well for this, but I need to verify. I noticed that we were improperly packing the date in the previous logic, which led to incorrect values being read.

For example, the following set of queries produced bad results:

create table t(t TIMESTAMP WITH TIME ZONE);

INSERT INTO t VALUES TIMESTAMP '1980-12-08 0:10:0 America/Los_Angeles';

presto:tpch> SELECT * FROM t;
               t
-------------------------------
 +46764-05-25 18:40:01.825 UTC
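(Hedging a guess at the magnitude here: if Presto's packed encoding, which shifts the UTC millis left by 12 bits to make room for the time zone key, is written to the file and then read back as plain UTC millis, the value comes back roughly 4096x too large. 1980-12-08 08:10 UTC is about 3.45e11 millis since the epoch; times 4096 that is about 1.41e15 millis, on the order of 44,800 years, which lands right around the year 46764 shown above.)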

Member

> We may need to modify the writer as well for this.

Agreed. The bad results in your example seem to be caused by writing the data in the incorrect format. So we need to customize dedicated writer logic for timestamp with tz, which transforms the long values encoded in Presto's format into long values encoded in Iceberg's format.

Contributor (Author)

I made some changes that I thought would fix this: I added the unpackMillisUtc method to a new TimestampWithTimezoneValueWriter in the Parquet code, but we are losing all time zone information in the unpacking bit shift. Theoretically this shouldn't be a problem, since we are supposed to be storing millisUtc in the first section of the bits, with the last 12 reserved for whatever time zone we want to display to the user. However, this is clearly not the case: the filter operations are failing, because the program expects the millis UTC to be transformed based on the time zone info to get the actual millis UTC.

This is some pretty misleading variable naming, and I think it leaves us with two options (see the sketch after this comment):

  1. leave the system the same and build methods on top of it that apply the time zone part of the ts with tz to the millis part to get a correct instant (and change the variable names)
  2. change the way ts with tz is read so that at the time of unpacking we already have the correct millis UTC.

Sorry if this is redundant; I couldn't find documentation about how Iceberg stores ts w/ tz under the hood.
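(For option 2, a minimal sketch of what the read side could look like once the file stores the Iceberg-spec value, assuming Presto's DateTimeEncoding and TimeZoneKey helpers; the helper name is hypothetical. Because Iceberg stores a plain UTC instant, no zone-based adjustment is needed, and the value can be packed with the UTC zone key directly:)

import static com.facebook.presto.common.type.DateTimeEncoding.packDateTimeWithZone;
import static com.facebook.presto.common.type.TimeZoneKey.UTC_KEY;
import static java.util.concurrent.TimeUnit.MICROSECONDS;

public final class IcebergTimestampTzReads
{
    private IcebergTimestampTzReads() {}

    // Hypothetical helper: the file holds UTC micros per the Iceberg spec, so the
    // packed result is already a correct instant, rendered to the user in UTC.
    public static long toPrestoPackedValue(long icebergMicrosUtc)
    {
        return packDateTimeWithZone(MICROSECONDS.toMillis(icebergMicrosUtc), UTC_KEY);
    }
}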

Labels: bug, iceberg (Apache Iceberg related)
Projects: Status: 🏗 In progress
Development: Successfully merging this pull request may close these issues:
Iceberg timestamptz should map to Presto TIMESTAMP WITH TIME ZONE type
5 participants