Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC Support date_bin on timestamps with timezone #9

Closed

Conversation

appletreeisyellow
Copy link
Owner

@appletreeisyellow appletreeisyellow commented May 28, 2024

This PR is a POC to support date_bin on timestamps with timezone. There are two things covered in this PR:

  1. adds a UDF function to_local_time():
    • only accept 1 input in type Timestamp(..., *)
    • return in type Timestamp(..., None)
    • for example:
> select to_local_time('2024-04-01T00:00:20Z'::timestamp AT TIME ZONE 'Europe/Brussels');
+---------------------------------------------+
| to_local_time(Utf8("2024-04-01T00:00:20Z")) |
+---------------------------------------------+
| 2024-04-01T00:00:20                         |
+---------------------------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.

> select to_local_time('2024-04-01T00:00:20'::timestamp AT TIME ZONE 'Europe/Brussels');
+--------------------------------------------+
| to_local_time(Utf8("2024-04-01T00:00:20")) |
+--------------------------------------------+
| 2024-04-01T00:00:20                        |
+--------------------------------------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

> select
  time,
  arrow_typeof(time) as type,
  to_local_time(time) as to_local_time,
  arrow_typeof(to_local_time(time)) as to_local_time_type
from (
  select '2024-04-01T00:00:20Z'::timestamp AT TIME ZONE 'Europe/Brussels' as time
);
+---------------------------+------------------------------------------------+---------------------+-----------------------------+
| time                      | type                                           | to_local_time       | to_local_time_type          |
+---------------------------+------------------------------------------------+---------------------+-----------------------------+
| 2024-04-01T00:00:20+02:00 | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:20 | Timestamp(Nanosecond, None) |
+---------------------------+------------------------------------------------+---------------------+-----------------------------+
1 row(s) fetched.
Elapsed 0.017 seconds.

Combine to_local_time() with date_bin() will look like:

> select date_bin(interval '1 day', to_local_time('2024-04-01T00:00:20Z'::timestamp AT TIME ZONE 'Europe/Brussels'));
+----------------------------------------------------------------------------------------------------+
| date_bin(IntervalMonthDayNano("18446744073709551616"),to_local_time(Utf8("2024-04-01T00:00:20Z"))) |
+----------------------------------------------------------------------------------------------------+
| 2024-04-01T00:00:00                                                                                |
+----------------------------------------------------------------------------------------------------+


> select date_bin(interval '1 day', to_local_time('2024-04-01T00:00:20Z'::timestamp AT TIME ZONE 'Europe/Brussels')) AT TIME ZONE 'Europe/Brussels';
+----------------------------------------------------------------------------------------------------+
| date_bin(IntervalMonthDayNano("18446744073709551616"),to_local_time(Utf8("2024-04-01T00:00:20Z"))) |
+----------------------------------------------------------------------------------------------------+
| 2024-04-01T00:00:00+02:00                                                                          |
+----------------------------------------------------------------------------------------------------+
  1. Support date_bin on timestamps with timezone -- the original date_bin works correctly. the confusion happens at the display of the timestamp when there is a timezone. Even though the underlying time, the UTC time is correct, but when there is a timezone, the display automatically adjust the time, which result the display time to be wrong. Adjusting the underlaying time can make the final display time to our expected result.

@appletreeisyellow appletreeisyellow force-pushed the chunchun/prototype-date_bin-with-timezone branch from cb0e2c4 to d642b49 Compare June 6, 2024 16:07
@appletreeisyellow appletreeisyellow force-pushed the chunchun/prototype-date_bin-with-timezone branch from 34cba19 to ce4e560 Compare June 7, 2024 13:43
Comment on lines -705 to +798
"2020-09-08T00:00:00+05",
"2020-09-08T00:00:00+05",
"2020-09-08T00:00:00+05",
"2020-09-08T00:00:00+05",
"2020-09-08T00:00:00+05",
"2020-09-07T19:00:00+05:00",
"2020-09-07T19:00:00+05:00",
"2020-09-07T19:00:00+05:00",
"2020-09-07T19:00:00+05:00",
"2020-09-07T19:00:00+05:00",
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right.. this is a change in behavior

@appletreeisyellow
Copy link
Owner Author

With the implementation in 54f1bdc, there is a change in behavior #9 (comment), which is not right.

Then I looked into Postgres to see how it works. This is what I found:

  1. psql date_bin() displays the result either with timezone in UTC or without timezone. They don't display it in a different timezone, or maybe I'm missing something
postgres =# SELECT
  date_bin('1 d', '2021-10-31T01:00:00', '1970-01-01T00:00:00'),
  date_bin('1 d', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00'),
  date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
  date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
  date_bin('1 d', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'Europe/Brussels',
  date_bin('1 d', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels'
;
        date_bin        |      date_bin       |      date_bin       |        date_bin        |      timezone       |        timezone
------------------------+---------------------+---------------------+------------------------+---------------------+------------------------
 2021-10-31 00:00:00+00 | 2021-10-31 00:00:00 | 2021-10-31 00:00:00 | 2021-10-30 00:00:00+00 | 2021-10-31 02:00:00 | 2021-10-30 22:00:00+00
(1 row)

This is very different from what DataFusion is currently doing, with cargo build from the current main branch:

> SELECT
  date_bin(interval '1 day', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AS date_bin_1,
  date_bin(interval '1 day', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00') AS date_bin_2,
  date_bin(interval '1 day', '2021-10-31T01:00:00' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00') AS date_bin_3,
  date_bin(interval '1 day', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00') AS date_bin_4,
  date_bin(interval '1 day', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'Europe/Brussels' AS date_bin_5,
  date_bin(interval '1 day', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels' AS date_bin_6
;
+---------------------+---------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| date_bin_1          | date_bin_2          | date_bin_3                | date_bin_4                | date_bin_5                | date_bin_6                |
+---------------------+---------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| 2021-10-31T00:00:00 | 2021-10-31T00:00:00 | 2021-10-30T02:00:00+02:00 | 2021-10-31T02:00:00+02:00 | 2021-10-31T00:00:00+02:00 | 2021-10-31T02:00:00+02:00 |
+---------------------+---------------------+---------------------------+---------------------------+---------------------------+---------------------------+
1 row(s) fetched.
Elapsed 0.031 seconds.
  1. Different number of timestamp casting will result in a different date type in terms of timezone:
postgres =# SELECT
  pg_typeof('2024-03-31T00:30:00') AS type_unknown,
  pg_typeof('2024-03-31T00:30:00' AT TIME ZONE 'Europe/Brussels') AS type_wo_tz, -- 1 casting
  pg_typeof('2024-03-31T00:30:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') AS type_w_tz, -- 2 casting
  pg_typeof('2024-03-31T00:30:00'::timestamp AT TIME ZONE 'UTC') AS type_w_tz, -- 2 casting
  pg_typeof('2024-03-31T00:30:00'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') AS type_wo_tz -- 3 casting
;
 type_unknown |         type_wo_tz          |        type_w_tz         |        type_w_tz         |         type_wo_tz
--------------+-----------------------------+--------------------------+--------------------------+-----------------------------
 unknown      | timestamp without time zone | timestamp with time zone | timestamp with time zone | timestamp without time zone
(1 row)

@appletreeisyellow
Copy link
Owner Author

appletreeisyellow commented Jul 2, 2024

My proposal is:

@alamb
Copy link

alamb commented Jul 3, 2024

My proposal is:

  • For short term fix, use UDF function to_local_time() combined with the current implementation of date_bin() to achieve the desired result

This makes sense to me

If we made the change proposed in apache/arrow-rs#5827, would that "solve" all these issues?

If so, how would someone implement timezone aware date binning?
Would it be to do like what postgres does and cast back to timestamp without timezone?

@appletreeisyellow
Copy link
Owner Author

appletreeisyellow commented Jul 4, 2024

If so, how would someone implement timezone aware date binning?
Would it be to do like what postgres does and cast back to timestamp without timezone?

I'm not sure. I think we need more discussion to define the behavior.

This is how postgres behaves:

  • If the input timestamp has timezone, date_bin result will be in the timezone of the system, instead of the input timezone! postgres date_bin aware the system timezone, but not the input timezone!
    • Currently, DataFusion date_bin returns the same timezone as the input timezone
    • Do we want to match the behavior with postgres?
  • If the input timestamp doesn't have timezone, date_bin result will not have timezone

For example, in postgres:

postgres =# select
  pg_typeof('2021-10-31T01:00:00'::timestamp) as input_datatype,
  date_bin('1 d', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00'),
  pg_typeof(date_bin('1 d', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00')) as result_datetype;
       input_datatype        |      date_bin       |       result_datetype
-----------------------------+---------------------+-----------------------------
 timestamp without time zone | 2021-10-31 00:00:00 | timestamp without time zone
(1 row)

postgres=# show timezone;
    TimeZone
-----------------
 America/Chicago
(1 row)

postgres=# select
  pg_typeof('2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') as input_datatype,
  date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
  pg_typeof(date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00')) as result_datetype;
      input_datatype      |        date_bin        |     result_datetype
--------------------------+------------------------+--------------------------
 timestamp with time zone | 2021-10-30 01:00:00-05 | timestamp with time zone
                                            -- -05 is the system timezone,  it is 01:00 instead of 00:00 because the origin time is '1970-01-01T00:00:00'

# if cast the origin to 'Europe/Brussels' timezone, the result will be at 00:00
postgres=# select date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels');
        date_bin
------------------------
 2021-10-31 00:00:00-05
(1 row)

postgres=# set time zone 'UTC';
SET

postgres=# show timezone;
 TimeZone
----------
 UTC
(1 row)

postgres=# select
  pg_typeof('2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') as input_datatype,
  date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
  pg_typeof(date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00')) as result_datetype;
      input_datatype      |        date_bin        |     result_datetype
--------------------------+------------------------+--------------------------
 timestamp with time zone | 2021-10-30 00:00:00+00 | timestamp with time zone
                                            -- -01 is the system timezone, instead of the input timezone
(1 row)

@appletreeisyellow appletreeisyellow deleted the chunchun/prototype-date_bin-with-timezone branch July 10, 2024 19:34
@appletreeisyellow
Copy link
Owner Author

My proposal is:

Here is the short term fix PR: apache#11347

For long term fix, there are still discussions going on in apache/arrow-rs#5827 which is the first step of the long term fix.

At this point, I conclude the POC is done, so closing this draft PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants