Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rounding of Pandas datetimes in ICON CMORizer #2529

Merged
merged 10 commits into from
Sep 26, 2024
Merged

Conversation

valeriupredoi
Copy link
Contributor

@valeriupredoi valeriupredoi commented Sep 23, 2024

Description

We have been pinning Pandas since at least three minor versions ago, see #2394 due to a bug in Pandas that they seem to be taking forever to fix issue pandas-dev/pandas#57002 and now we are yet again faced with the same issue with the new version of Pandas 2.2.3 - see failed nightly GA. The problem is simple: Pandas are simply unable to round to the next minute eg 12H:12M:39S should be 12H:13M:00S (for daily data), so I got fed up and wrote our own rounding function (that took me surprisingly long to get it right - about an hour, rounded to the next minute :rofl: I think this should not affect the result when Pandas finally fix their bug - but I'll let @schlunma decide 🍺

Closes #2530


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@valeriupredoi valeriupredoi added the bug Something isn't working label Sep 23, 2024
Copy link

codecov bot commented Sep 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.77%. Comparing base (ffa2347) to head (3b6ce2b).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2529   +/-   ##
=======================================
  Coverage   94.77%   94.77%           
=======================================
  Files         249      249           
  Lines       14094    14095    +1     
=======================================
+ Hits        13358    13359    +1     
  Misses        736      736           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@valeriupredoi
Copy link
Contributor Author

ayay - check out the devel test with bleeding edge deps:

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0:00:00.997120
 x: array([cftime.DatetimeProlepticGregorian(2004, 11, 4, 13, 30, 0, 997120, has_year_zero=True)],
      dtype=object)
 y: array([datetime.datetime(2004, 11, 4, 13, 30)], dtype=object)

cftime plugging in some 0.997s out of thin air, they'd get absolutely destroyed if they worked in motorsports 😁

@schlunma schlunma mentioned this pull request Sep 26, 2024
9 tasks
@bouweandela
Copy link
Member

Have you tried this suggestion to work around the issue? pandas-dev/pandas#57002 (comment)

@valeriupredoi
Copy link
Contributor Author

Have you tried this suggestion to work around the issue? pandas-dev/pandas#57002 (comment)

Yes, it don't work, no Pd rounding works, including the one in the cmorizer 🍺

@schlunma
Copy link
Contributor

schlunma commented Sep 26, 2024

This also doesn't sound like a workaround to me...

However, what DOES work is the hint in the description of that issue:

# Round just date works
dates[0].round("30min")
# >Timestamp('2022-01-11 12:00:00')

See here:

>>> datetimes = year_month_day + day_float
>>> print(datetimes)
0   2004-11-04 13:59:59.997120
dtype: datetime64[ns]
>>> rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)
>>> print(rounded_datetimes)
0   2004-11-04 14:00:00
dtype: datetime64[ns]

@schlunma
Copy link
Contributor

Ahh wait, I just realized that I have an old version of pandas, sorry, will update you again once that is updated...

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Sep 26, 2024

You need to update ninja bear to 2.2.3 Manu 😁

@schlunma
Copy link
Contributor

Yes!! It does work with pandas v2.2.3:

>>> import pandas as pd
>>> pd.__version__
'2.2.3'
>>> time_points = [20041104.5833333]
>>> time_str = pd.Series(time_points, dtype=str)
>>> year_month_day_str = time_str.str.extract(r'(\d*)\.?\d*', expand=False)
>>> year_month_day = pd.to_datetime(year_month_day_str, format='%Y%m%d')
>>> day_float_str = time_str.str.extract(
  r'\d*(\.\d*)',...             r'\d*(\.\d*)', expand=False
...         ).fillna('0.0')
>>> day_float = pd.to_timedelta(day_float_str.astype(float), unit='D')
>>> datetimes = year_month_day + day_float
>>> print(datetimes)
0   2004-11-04 13:59:59.997120
dtype: datetime64[ns]

>>> # Old version:
>>> rounded_datetimes = datetimes.round('s')
>>> print(rounded_datetimes)
0   2004-11-04 13:59:59.997120
dtype: datetime64[ns]

>>> # New version:
>>> rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)
>>> print(rounded_datetimes)
0   2004-11-04 14:00:00
dtype: datetime64[ns]

@schlunma
Copy link
Contributor

You want to plug this in @valeriupredoi? You can simply replace the existing l.496 with

rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)

and should work 😁

I can also do it if you want, already got a local copy with this change.

@valeriupredoi
Copy link
Contributor Author

So why is the rounding done in the ICON cmorizer don't work then, and all the attempts I tried two days ago? You mind plopping the correct syntax in this PR, and get rid of the peasant rounding then plss, bud 🍺

@valeriupredoi
Copy link
Contributor Author

I can also do it if you want, already got a local copy with this change.

That'd be fab, cheers, bud! 🍺

@schlunma
Copy link
Contributor

In the pandas issue they say that only rounding the pd.Series objects fail, i.e.,

rounded_datetimes = datetimes.round('s')

This is what we currently have; hence, it fails.

However, rounding individual elements, i.e.,

rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)

works. Please don't ask me why...

@valeriupredoi
Copy link
Contributor Author

that'd be why then - I completely missed the conversion to pd.Series and was trying to round each element of the array, doing exactly what was broken in Pandas 🤦‍♂️

@schlunma schlunma changed the title Put a peasant's rounding of Panda datetimes in Icon cmorizer Fix rounding of Pandas datetimes in ICON CMORizer Sep 26, 2024
Copy link
Contributor

@schlunma schlunma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cheers 🍻

(not sure if that approval counts since I contributed to the code 😁)

@bouweandela
Copy link
Member

Did you time some representative test case @schlunma? Looping over every element of an array in Python and then doing something with it can be really slow.

@schlunma
Copy link
Contributor

Did you time some representative test case @schlunma? Looping over every element of an array in Python and then doing something with it can be really slow.

No, but I don't think it's critical here. All ICON output I've seen so far is scattered across many files with only very few time points (sometimes only one time step is stored). Even the high-res output (20 mins) we have only has 72 time points per file.

However, I am open to any better solution. All solutions I could think of involved Python loops in one way or another.

@valeriupredoi
Copy link
Contributor Author

we can always revert to the original implementation when Pandas finally fix their silly bug. Thanks a lot, Manu! @bouweandela would you be a trooper and merge this, pls 🍺

@valeriupredoi
Copy link
Contributor Author

However, I am open to any better solution. All solutions I could think of involved Python loops in one way or another.

Well, my peasant approach didn't - it involved a list not an array 🤣

@schlunma
Copy link
Contributor

Well, my peasant approach didn't - it involved a list not an array 🤣

But you still had a for loop in there 😝

@valeriupredoi
Copy link
Contributor Author

Well, my peasant approach didn't - it involved a list not an array 🤣

But you still had a for loop in there 😝

An even worse one too 🤣

@bouweandela
Copy link
Member

I did some timings (with pandas 2.1.4) and for 100 points the time it takes is quite marginal, even if the new approach is more than 10 times slower:

In [0]: import pandas as pd
   ...: 
   ...: dates = pd.Series([pd.to_datetime("2022-01-11 12:10:01.123")] * 100)

In [1]: %timeit pd.Series(dt.round('s') for dt in dates)
906 μs ± 16 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [2]: %timeit dates.round("s")
67.3 μs ± 1.15 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@bouweandela bouweandela merged commit f7ab83f into main Sep 26, 2024
6 of 7 checks passed
@bouweandela bouweandela deleted the round_pandas branch September 26, 2024 09:46
@valeriupredoi
Copy link
Contributor Author

Many thanks, gents 🍺

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests failing with known Pandas rounding bug
3 participants