PERF: allow even more flexible ISO 8601 datetime parsing #11899

femtotrader · 2015-12-25T09:31:41Z

Hello,

I noticed that there is a huge code speed difference with to_datetime execution when format is not given and when it's given.

I wonder if there is not some room for improvements here!

In [1]: %time df=pd.read_csv("AUDUSD-2014-01.csv", names=['Symbol', 'Date', 'Bid', 'Ask'])
CPU times: user 3.31 s, sys: 481 ms, total: 3.79 s
Wall time: 4.13 s

In [2]: df
Out[274]:
          Symbol                   Date      Bid      Ask
0        AUD/USD  20140101 21:55:34.404  0.88796  0.88922
1        AUD/USD  20140101 21:55:34.444  0.88805  0.88914
2        AUD/USD  20140101 21:55:34.475  0.88809  0.88910
3        AUD/USD  20140101 21:55:48.962  0.88811  0.88908
4        AUD/USD  20140101 21:56:38.293  0.88808  0.88887
...          ...                    ...      ...      ...
1947101  AUD/USD  20140131 21:59:48.048  0.87525  0.87589
1947102  AUD/USD  20140131 21:59:54.599  0.87527  0.87589
1947103  AUD/USD  20140131 21:59:56.927  0.87531  0.87588
1947104  AUD/USD  20140131 21:59:59.365  0.87531  0.87574
1947105  AUD/USD  20140131 22:00:00.038  0.87531  0.87574

[1947106 rows x 4 columns]

In [3]: %time pd.to_datetime(df['Date'])
CPU times: user 11min 44s, sys: 19.4 s, total: 12min 4s
Wall time: 12min 46s
Out[3]:
0         2014-01-01 21:55:34.404
1         2014-01-01 21:55:34.444
2         2014-01-01 21:55:34.475
3         2014-01-01 21:55:48.962
4         2014-01-01 21:56:38.293
                    ...
1947101   2014-01-31 21:59:48.048
1947102   2014-01-31 21:59:54.599
1947103   2014-01-31 21:59:56.927
1947104   2014-01-31 21:59:59.365
1947105   2014-01-31 22:00:00.038
Name: Date, dtype: datetime64[ns]

In [4]: fmt='%Y%m%d %H:%M:%S.%f'

In [5]: %time pd.to_datetime(df['Date'], format=fmt)
CPU times: user 37.3 s, sys: 1.31 s, total: 38.6 s
Wall time: 40 s

In [6]: timedelta(minutes=12, seconds=46) / timedelta(seconds=40)
Out[6]: 19.15

There is x19.15 factor!!!

Sample data can be found here
https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing

See also pydata/pandas-datareader#153

The text was updated successfully, but these errors were encountered:

jreback · 2015-12-26T00:20:04Z

this falls back to dateutil which is in python space for parsing. So it behooves you to specify the format in this case. No real easy way around this, unless the ISO 8601 parser was modified to accept this format (which is not that difficult as its the same format with no separators).

If you are another brave soul would like to submit a pull-request it would be great.

jreback · 2015-12-26T00:27:00Z

Here's an easy way to repro things like this:
You can also use infer_datetime_format=True which will effectively guess the format for you (if its close to ISO 8601)

In [16]: s = Series(pd.date_range('20130101',freq='ms',periods=10000)).dt.strftime('%Y%m%d %H:%M:%S.%f')

In [17]: s.head()
Out[17]: 
0    20130101 00:00:00.000000
1    20130101 00:00:00.001000
2    20130101 00:00:00.002000
3    20130101 00:00:00.003000
4    20130101 00:00:00.004000
dtype: object

In [18]: %timeit pd.to_datetime(s)
1 loops, best of 3: 1.09 s per loop

In [19]: %timeit pd.to_datetime(s,format='%Y%m%d %H:%M:%S.%f')
10 loops, best of 3: 61.9 ms per loop

In [20]: %timeit pd.to_datetime(s,infer_datetime_format=True)
10 loops, best of 3: 62.8 ms per loop

femtotrader · 2015-12-26T09:17:59Z

Thanks @jreback for infer_datetime_format this is good for my use case.

Not sure if I need to send PR to pandas or to dateutil

https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py

jreback · 2015-12-26T12:41:38Z

I am referring to pandas of course, code is here

jreback · 2015-12-26T12:43:26Z

xref #9714

jreback added Datetime Datetime data dtype Performance Memory or execution speed performance labels Dec 26, 2015

jreback added this to the Someday milestone Dec 26, 2015

jreback added Difficulty Intermediate labels Dec 26, 2015

jreback changed the title ~~to_datetime is very slow when no format is given~~ PERF: allow even more flexible ISO 8601 datetime parsing Dec 26, 2015

chris-b1 mentioned this issue Jan 16, 2016

PERF: more flexible iso8601 parsing #12060

Closed

jreback modified the milestones: 0.18.0, Someday Jan 19, 2016

jreback closed this as completed in 5de6b84 Jan 26, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: allow even more flexible ISO 8601 datetime parsing #11899

PERF: allow even more flexible ISO 8601 datetime parsing #11899

femtotrader commented Dec 25, 2015

jreback commented Dec 26, 2015

jreback commented Dec 26, 2015

femtotrader commented Dec 26, 2015

jreback commented Dec 26, 2015

jreback commented Dec 26, 2015

PERF: allow even more flexible ISO 8601 datetime parsing #11899

PERF: allow even more flexible ISO 8601 datetime parsing #11899

Comments

femtotrader commented Dec 25, 2015

jreback commented Dec 26, 2015

jreback commented Dec 26, 2015

femtotrader commented Dec 26, 2015

jreback commented Dec 26, 2015

jreback commented Dec 26, 2015