support PyArrow timestamptz with Etc/UTC #910
Conversation
LGTM! @syun64 Thanks for the quick fix!
Super
Co-authored-by: Fokko Driesprong <[email protected]>
# cast if the two schemas are compatible but not equal
table_arrow_schema = self._table.schema().as_arrow()
if table_arrow_schema != df.schema:
    df = df.cast(table_arrow_schema)
Removed this cast function: to_requested_schema should be responsible for casting the types to the desired schema, instead of casting here.
@Fokko @HonahX - thank you for your reviews. I've updated the integration test to make the check more comprehensive, and removed the pyarrow table casting.
pyiceberg/io/pyarrow.py
Outdated
if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz:
    return values.cast(target_type, safe=False)
if (
    pa.types.is_timestamp(target_type)
    and target_type.unit == "us"
    and pa.types.is_timestamp(values.type)
    and values.type.unit in {"s", "ms", "us"}
):
    if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz:
        return values.cast(target_type)
Does this look correct, that we are casting the types in order to follow the Iceberg spec for Parquet physical and logical types? https://iceberg.apache.org/spec/#parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it looks good. Arrow writes INT64 by default for timestamps: https://arrow.apache.org/docs/cpp/parquet.html#logical-types
pyiceberg/io/pyarrow.py
Outdated
and pa.types.is_timestamp(values.type)
and values.type.unit in {"s", "ms", "us"}
):
    if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz:
Can we add parentheses?
Added
pyiceberg/io/pyarrow.py
Outdated
pa.types.is_timestamp(target_type)
and target_type.unit == "us"
and pa.types.is_timestamp(values.type)
and values.type.unit in {"s", "ms", "us"}
When we're requesting us and the file provides us, I don't think we need to cast?
Suggested change:
- and values.type.unit in {"s", "ms", "us"}
+ and values.type.unit in {"s", "ms"}
Sorry, this is a bit confusing, I agree 😓. I included it here because I wanted to support casting pa.timestamp('us', tz='Etc/UTC') to pa.timestamp('us', tz='UTC') within the same condition.
I think we won't hit this condition if both the input and requested types are pa.timestamp('us'), because we enter this block only if target_type and values.type are not equal:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1315
I think we might need to do some more work here. In Iceberg we're rather strict on the distinction between Timestamp and TimestampTZ. A good way of showing this can be found here:
iceberg-python/pyiceberg/expressions/literals.py, lines 566 to 572 in aceed2a:
@to.register(TimestampType)
def _(self, _: TimestampType) -> Literal[int]:
    return TimestampLiteral(timestamp_to_micros(self.value))

@to.register(TimestamptzType)
def _(self, _: TimestamptzType) -> Literal[int]:
    return TimestampLiteral(timestamptz_to_micros(self.value))
This is when we parse a string from a literal, which often comes from an expression: dt >= '1925-05-22T00:00:00'. If the value has a timezone, even UTC, we reject it as a Timestamp. When it has a timezone, we normalize it to UTC and then store the integer.
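That distinction can be sketched in plain Python. These are hypothetical helpers mirroring the semantics of `timestamp_to_micros` / `timestamptz_to_micros`, not the pyiceberg implementations:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def timestamptz_to_micros(value: str) -> int:
    """Timestamptz: require a zone offset, normalize to UTC, store micros."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        raise ValueError(f"Expected a zoned timestamp: {value}")
    return (dt - EPOCH) // timedelta(microseconds=1)

def timestamp_to_micros(value: str) -> int:
    """Timestamp: reject any zone, even UTC; store naive micros."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is not None:
        raise ValueError(f"Zoned value rejected for Timestamp: {value}")
    return (dt - EPOCH.replace(tzinfo=None)) // timedelta(microseconds=1)
```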
Sounds good @Fokko, thank you for the review. I've made these checks stricter and also clearer for users to follow.
    and values.type.unit == "ns"
):
    return values.cast(target_type, safe=False)
if field.field_type == TimestampType():
Stricter and clearer field_type-driven checks.
tests/io/test_pyarrow.py
Outdated
@@ -1798,3 +1799,35 @@ def test_identity_partition_on_multi_columns() -> None:
        ("n_legs", "ascending"),
        ("animal", "ascending"),
    ]) == arrow_table.sort_by([("born_year", "ascending"), ("n_legs", "ascending"), ("animal", "ascending")])


def test_to_requested_schema_timestamps(
I noticed that there weren't any tests for to_requested_schema although it is a public API, so I added this in.
Do we want to keep to_requested_schema as a public API?
+1 on making to_requested_schema private. IMO this seems to be specific to our internal workflow, so making it private can give us more flexibility in future development.
Actually, the signature of this API has been updated in #786: pa.Table to pa.RecordBatch, which is a breaking change. I think it may be a good time to send a deprecation notice on the old one and make the new one private. WDYT?
I've made the function private in this PR, and have another one up to restore the public API function with the previous signature and a deprecation notice here: #918
LGTM, glad we're getting rid of the dataframe casting.
As a follow-up, let's investigate the level of abstraction on the write path. Currently we are doing schema compatibility checks, schema coercion, bin-packing, etc. at different levels of the stack. It would be good to optimize and see which functions can be pushed up the stack.
For example, here's what the overwrite path looks like:
overwrite -> _dataframe_to_data_files -> write_file -> write_parquet
Merged! Thanks for the great work from @syun64 and reviews from @Fokko @kevinjqliu
Commit history:
- 1ed3abd (Sung Yun) Allow writing `pa.Table` that are either a subset of table schema or in arbitrary order, and support type promotion on write (apache#921)
- 0f2e19e (dependabot) Bump zstandard from 0.22.0 to 0.23.0 (apache#934)
- ec73d97 (dependabot) Bump griffe from 0.47.0 to 0.48.0 (apache#933)
- d05a423 (dependabot) Bump mkdocs-material from 9.5.28 to 9.5.29 (apache#932)
- e27cd90 (Yair Halevi) Allow empty `names` in mapped field of Name Mapping (apache#927)
- 3f44dfe (Soumya Ghosh) Lowercase bool values in table properties (apache#924)
- b11cdb5 (Sung Yun) Deprecate to_requested_schema (apache#918)
- a3dd531 (Honah J) Glue endpoint config variable, continue apache#530 (apache#920)
- 32e8f88 (Sung Yun) support PyArrow timestamptz with Etc/UTC (apache#910)
- f6d56e9 (Sung Yun) fix invalidation logic (apache#911)
- 6488ad8 (dependabot) Bump coverage from 7.5.4 to 7.6.0 (apache#917)
- dceedfa (Sung Yun) Check if schema is compatible in `add_files` API (apache#907)
- aceed2a (dependabot) Bump mypy-boto3-glue from 1.34.136 to 1.34.143 (apache#912)
- 1b9b884 (Fokko Driesprong) PyArrow: Don't enforce the schema when reading/writing (apache#902)
- 8f47dfd (Soumya Ghosh) Move determine_partitions and helper methods to io.pyarrow (apache#906)
- 5aa451d (Soumya Ghosh) Rename data_sequence_number to sequence_number in ManifestEntry (apache#900)
- 77a07c9 (Honah J) Support MergeAppend operations (apache#363)
- 66b92ff (Fokko Driesprong) GCS: Fix incorrect token description (apache#909)
- c25e080 (dependabot) Bump zipp from 3.17.0 to 3.19.1 (apache#905)
- 301e336 (Sung Yun) Cast 's', 'ms' and 'ns' PyArrow timestamp to 'us' precision on write (apache#848)
- 3f574d3 (Fokko Driesprong) Support partial deletes (apache#569)
- cdc3e54 (Fokko Driesprong) Disallow writing empty Manifest files (apache#876)
- b68e109 (dependabot) Bump fastavro from 1.9.4 to 1.9.5 (apache#904)
- 90547bb (dependabot) Bump moto from 5.0.10 to 5.0.11 (apache#903)
- 7dff359 (dependabot) Bump tenacity from 8.4.2 to 8.5.0 (apache#898)
- 4aa469e (dependabot) Bump certifi from 2024.2.2 to 2024.7.4 (apache#899)
- aa7ad78 (dependabot) Bump deptry from 0.16.1 to 0.16.2 (apache#897)
This ended up being a lot simpler than I had thought it would be.
"Etc/UTC" is just another representation of UTC, and should be safe to support. This will allow us to support arrow dataframes produced with datetimes from libraries like pendulum that use this timezone.
Fixes: #863