Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: allow even more flexible ISO 8601 datetime parsing #11899

Closed
femtotrader opened this issue Dec 25, 2015 · 5 comments
Closed

PERF: allow even more flexible ISO 8601 datetime parsing #11899

femtotrader opened this issue Dec 25, 2015 · 5 comments
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Milestone

Comments

@femtotrader
Copy link

Hello,

I noticed that there is a huge code speed difference with to_datetime execution when format is not given and when it's given.

I wonder if there is not some room for improvements here!

In [1]: %time df=pd.read_csv("AUDUSD-2014-01.csv", names=['Symbol', 'Date', 'Bid', 'Ask'])
CPU times: user 3.31 s, sys: 481 ms, total: 3.79 s
Wall time: 4.13 s

In [2]: df
Out[274]:
          Symbol                   Date      Bid      Ask
0        AUD/USD  20140101 21:55:34.404  0.88796  0.88922
1        AUD/USD  20140101 21:55:34.444  0.88805  0.88914
2        AUD/USD  20140101 21:55:34.475  0.88809  0.88910
3        AUD/USD  20140101 21:55:48.962  0.88811  0.88908
4        AUD/USD  20140101 21:56:38.293  0.88808  0.88887
...          ...                    ...      ...      ...
1947101  AUD/USD  20140131 21:59:48.048  0.87525  0.87589
1947102  AUD/USD  20140131 21:59:54.599  0.87527  0.87589
1947103  AUD/USD  20140131 21:59:56.927  0.87531  0.87588
1947104  AUD/USD  20140131 21:59:59.365  0.87531  0.87574
1947105  AUD/USD  20140131 22:00:00.038  0.87531  0.87574

[1947106 rows x 4 columns]

In [3]: %time pd.to_datetime(df['Date'])
CPU times: user 11min 44s, sys: 19.4 s, total: 12min 4s
Wall time: 12min 46s
Out[3]:
0         2014-01-01 21:55:34.404
1         2014-01-01 21:55:34.444
2         2014-01-01 21:55:34.475
3         2014-01-01 21:55:48.962
4         2014-01-01 21:56:38.293
                    ...
1947101   2014-01-31 21:59:48.048
1947102   2014-01-31 21:59:54.599
1947103   2014-01-31 21:59:56.927
1947104   2014-01-31 21:59:59.365
1947105   2014-01-31 22:00:00.038
Name: Date, dtype: datetime64[ns]

In [4]: fmt='%Y%m%d %H:%M:%S.%f'

In [5]: %time pd.to_datetime(df['Date'], format=fmt)
CPU times: user 37.3 s, sys: 1.31 s, total: 38.6 s
Wall time: 40 s

In [6]: timedelta(minutes=12, seconds=46) / timedelta(seconds=40)
Out[6]: 19.15

There is x19.15 factor!!!

Sample data can be found here
https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing

See also pydata/pandas-datareader#153

@jreback
Copy link
Contributor

jreback commented Dec 26, 2015

this falls back to dateutil which is in python space for parsing. So it behooves you to specify the format in this case. No real easy way around this, unless the ISO 8601 parser was modified to accept this format (which is not that difficult as its the same format with no separators).

If you are another brave soul would like to submit a pull-request it would be great.

@jreback jreback added Datetime Datetime data dtype Performance Memory or execution speed performance labels Dec 26, 2015
@jreback jreback added this to the Someday milestone Dec 26, 2015
@jreback
Copy link
Contributor

jreback commented Dec 26, 2015

Here's an easy way to repro things like this:
You can also use infer_datetime_format=True which will effectively guess the format for you (if its close to ISO 8601)

In [16]: s = Series(pd.date_range('20130101',freq='ms',periods=10000)).dt.strftime('%Y%m%d %H:%M:%S.%f')

In [17]: s.head()
Out[17]: 
0    20130101 00:00:00.000000
1    20130101 00:00:00.001000
2    20130101 00:00:00.002000
3    20130101 00:00:00.003000
4    20130101 00:00:00.004000
dtype: object

In [18]: %timeit pd.to_datetime(s)
1 loops, best of 3: 1.09 s per loop

In [19]: %timeit pd.to_datetime(s,format='%Y%m%d %H:%M:%S.%f')
10 loops, best of 3: 61.9 ms per loop

In [20]: %timeit pd.to_datetime(s,infer_datetime_format=True)
10 loops, best of 3: 62.8 ms per loop

@femtotrader
Copy link
Author

Thanks @jreback for infer_datetime_format this is good for my use case.

Not sure if I need to send PR to pandas or to dateutil

https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py

@jreback
Copy link
Contributor

jreback commented Dec 26, 2015

I am referring to pandas of course, code is here

@jreback jreback changed the title to_datetime is very slow when no format is given PERF: allow even more flexible ISO 8601 datetime parsing Dec 26, 2015
@jreback
Copy link
Contributor

jreback commented Dec 26, 2015

xref #9714

@jreback jreback modified the milestones: 0.18.0, Someday Jan 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants