PERF: more flexible iso8601 parsing #12060

Closed
wants to merge 1 commit into from

Conversation

chris-b1
Contributor

closes #9714
closes #11899
closes #11871

Makes the ISO parser in C handle the following cases. I think 2 & 3 aren't strictly ISO 8601 anymore, but they are close to it and unambiguous.

  1. dates without '-' separator
  2. dates without space before tz
  3. dates without leading 0s in month/day
    (this ONLY works if the date has separators, eg. "2015-1-1" parses, but "201511" doesn't because it's ambiguous)
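
To illustrate why the separator-free case cannot also drop leading zeros, here is a minimal pure-Python sketch (not the PR's C code; the helper name `interpretations` is hypothetical). It assumes a 4-digit year and enumerates every way the remaining digits could be split into a month and an optional day:

```python
from datetime import date


def interpretations(digits):
    """Return every plausible (year, month, day) reading of a digit string.

    Assumes a 4-digit year followed by a 1- or 2-digit month and an
    optional 1- or 2-digit day (day is None when absent).
    """
    year, rest = int(digits[:4]), digits[4:]
    found = []
    for month_len in (1, 2):
        month_txt, day_txt = rest[:month_len], rest[month_len:]
        if len(month_txt) != month_len or len(day_txt) > 2:
            continue  # not enough digits for the month, or too many for the day
        month = int(month_txt)
        day = int(day_txt) if day_txt else None
        try:
            # Validate against the calendar; use day 1 when no day was given.
            date(year, month, 1 if day is None else day)
        except ValueError:
            continue
        found.append((year, month, day))
    return found


print(interpretations("201511"))    # two readings: 2015-1-1 and 2015-11
print(interpretations("20150101"))  # a single reading once zeros are kept
```

With separators present ("2015-1-1") the field boundaries are explicit, so the same ambiguity does not arise.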

asv results - adds a small amount of overhead to the standard case ('2014-01-01')

    before     after       ratio
  [1ae6384 ] [e922b05 ]
     5.17ms     5.34ms      1.03  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601
     5.15ms     5.50ms      1.07  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_format
-  111.89ms     5.27ms      0.05  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_format_no_sep
-     2.02s     5.33ms      0.00  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_nosep
-     2.97s   218.25ms      0.07  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_tz_spaceformat

@jorisvandenbossche
Member

A question: is it worth adding complexity to the ISO 8601 C parser to handle dates without separators (which is actually not really ISO 8601 anymore), if passing the format already parses them just as fast?
Of course, you then need to specify this, but this could also be alleviated by making infer_datetime_format=True the default.

@jorisvandenbossche jorisvandenbossche added the Bug, Datetime, and Performance labels Jan 16, 2016
@jorisvandenbossche jorisvandenbossche added this to the 0.18.0 milestone Jan 16, 2016
-        self.strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in self.rng]
+        self.strings = self.rng.strftime('%Y-%m-%d %H:%M:%S')
+        self.strings_nosep = self.rng.strftime('%Y%m%d %H:%M:%S')
+        self.strings_tz_space = pd.Series(self.rng.strftime('%Y-%m-%d %H:%M:%S')) + ' -0800'
Member

The strftime method on a DatetimeIndex is rather new. I don't really know what our policy on this should be, but I think this will make it difficult to run the asv benchmarks against older versions?

Contributor

i agree here. why don't you leave the generation of the test cases using the list comprehension (how self.strings was)

@chris-b1
Contributor Author

Hard to say if it's worth it (re dates without separators), not sure it's a super common format. But there is a substantial speedup even over providing a format to the strptime parser (20x in that asv bench).

Not that it really matters, but from what I've read, dates without separators would still be considered ISO 8601, e.g. http://www.cl.cam.ac.uk/~mgk25/iso-time.html.
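
As a rough analogy for the strptime comparison above, the following pure-Python sketch contrasts a generic format-driven parse with a fixed-position parse of the separator-free layout used in the benchmark (`%Y%m%d %H:%M:%S`). It is only an illustration under that assumption: `parse_fixed` and `time_both` are hypothetical helpers, the PR's actual fast path is the C ISO parser, and the 20x figure comes from the asv run, not from this snippet.

```python
import timeit
from datetime import datetime

FORMAT = "%Y%m%d %H:%M:%S"


def parse_fixed(text):
    """Parse 'YYYYMMDD HH:MM:SS' by slicing fixed character positions."""
    if len(text) != 17 or text[8] != " " or text[11] != ":" or text[14] != ":":
        raise ValueError("unexpected layout: %r" % text)
    return datetime(int(text[0:4]), int(text[4:6]), int(text[6:8]),
                    int(text[9:11]), int(text[12:14]), int(text[15:17]))


def time_both(sample, number=10000):
    """Return (generic_seconds, fixed_seconds) for `number` parses each."""
    generic = timeit.timeit(lambda: datetime.strptime(sample, FORMAT), number=number)
    fixed = timeit.timeit(lambda: parse_fixed(sample), number=number)
    return generic, fixed


sample = "20140101 12:34:56"
# Both approaches must agree on the result before timing is meaningful.
assert parse_fixed(sample) == datetime.strptime(sample, FORMAT)
```

Calling `time_both(sample)` gives a machine-dependent pair of timings; no particular ratio is implied here.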

@jreback
Contributor

jreback commented Jan 16, 2016

how hard would it be to allow a selection of separators

eg - . / \
(assume that if one is found it would have to repeat for the date portions, iow don't allow mixed seps)

w/o introducing ambiguity?

@jreback
Contributor

jreback commented Jan 16, 2016

also, let's add another issue to make infer_datetime_format=True as the default for to_datetime and read_csv. I don't think this will impact anything (except speed things up) and just make more things work.

@jorisvandenbossche
Member

@chris-b1 Ah, yes, I missed that you also pass this to the ISO parser if that format is given. I misinterpreted the asv output and thought that the generic format parser was as fast as the ISO parser, but it is indeed quite a bit slower.

@chris-b1
Contributor Author

@jreback - I don't think it would be too bad to parse multiple separators, I'll try it out

@jreback
Contributor

jreback commented Jan 19, 2016

@chris-b1 just some minor comments on the benchmarks themselves here.

if the addtl separators are a bit complicated, that can be done in another PR.

pls rebase on master as lots of PEP changes recently and do a git diff master | flake8 --diff check

ping if you want to merge now (else, pls open a new issue for the other sep chars if not)

@chris-b1
Contributor Author

@jreback - updated for that asv note. I was most of the way there already, so also updated to handle

  1. date separator can be any of {'-', '.', '/', '\\', ' '}, as long as consistent
  2. leading 0s can also be omitted from the time components (must have : sep)
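
The two rules above can be sketched in pure Python with a regular expression. This is an illustrative stand-in rather than the C implementation: the name `parse_flexible` is hypothetical, and accepting either a space or `T` before the time is an assumption. A backreference enforces that the date separator is consistent, and `\d{1,2}` lets leading zeros be omitted where a separator delimits the field:

```python
import re
from datetime import datetime

# Date separator: one of - . / \ or space, and it must repeat unchanged.
_PATTERN = re.compile(
    r"(?P<year>\d{4})"
    r"(?P<sep>[-./\\ ])"
    r"(?P<month>\d{1,2})"
    r"(?P=sep)"                      # same separator as between year and month
    r"(?P<day>\d{1,2})"
    r"(?:[ T](?P<hour>\d{1,2}):(?P<minute>\d{1,2})"
    r"(?::(?P<second>\d{1,2}))?)?"   # time fields need ':' between them
)


def parse_flexible(text):
    """Parse a date (and optional time) with a consistent date separator."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("cannot parse %r" % text)
    fields = {name: int(value)
              for name, value in match.groupdict().items()
              if name != "sep" and value is not None}
    return datetime(fields["year"], fields["month"], fields["day"],
                    fields.get("hour", 0), fields.get("minute", 0),
                    fields.get("second", 0))


print(parse_flexible("2015/1/2 3:4:5"))
```

A mixed-separator input such as `"2015-1/2"` fails the backreference and raises `ValueError`.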

Passes the current test suite (adjusted for a couple timezone changes), not sure if there are other edge cases to try?

a couple perf regressions to figure out too

    before     after       ratio
  [567bc5ce] [82c5d7b9]
+  154.12ms      4.24s     27.50  plotting.plot_timeseries_period.time_plot_timeseries_period
+    1.82ms      2.19s   1203.52  timeseries.datetimeindex_infer_dst.time_datetimeindex_infer_dst
-     2.10s     5.14ms      0.00  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_nosep
-     3.13s   223.79ms      0.07  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_tz_spaceformat

@@ -664,7 +722,7 @@ parse_iso_8601_datetime(char *str, int len,
     if (sublen >= 2 && isdigit(substr[0]) && isdigit(substr[1])) {
         out->min = 10 * (substr[0] - '0') + (substr[1] - '0');

-        if (out->hour < 0 || out->min >= 60) {
+        if (out->min < 0 || out->min >= 60) {
Contributor

is out->min < 0 ever True?

@chris-b1
Contributor Author

@jreback - if you want to have a look, I've got everything working now.

@jreback
Contributor

jreback commented Jan 21, 2016

can you add tests for different seps (and invalid ones)?

}
}
if (i == valid_sep_len) {
goto parse_error;
Contributor

does this validate all seps are the same?

Contributor Author

Yes - the first check (sep between year and month) validates that it's in the list - the second check (between month/day) is required to match the first sep
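
The two-step check described in this reply might look roughly like the following in Python. This is a sketch only: `separators_ok` is a hypothetical name, and the separator tuple is taken from the list given earlier in the thread. The `for ... else` branch plays the role of the `i == valid_sep_len` test that jumps to `parse_error` in the C fragment:

```python
VALID_SEPS = ("-", ".", "/", "\\", " ")


def separators_ok(first_sep, second_sep):
    """Return True when both date separators are acceptable.

    first_sep sits between year and month, second_sep between month and day.
    """
    # Check 1: the first separator must be one of the allowed characters.
    for candidate in VALID_SEPS:
        if first_sep == candidate:
            break
    else:
        return False  # loop ran off the end: not in the list
    # Check 2: the second separator must repeat the first one exactly.
    return second_sep == first_sep


print(separators_ok("-", "-"), separators_ok("-", "/"), separators_ok(":", ":"))
```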

@jreback
Contributor

jreback commented Jan 22, 2016

did you figure out the perf regressions?

@chris-b1
Contributor Author

Well, I seemed to fix it, but didn't figure it out.

tz_localize was running very slow, but only on a fresh install of pandas - it wasn't slow if run a second time. It also seemed to be fine when stepping through with pdb. Undoing this change fixed it - very unclear to me how that caused it though.

@jreback
Contributor

jreback commented Jan 22, 2016

hmm, I think this interferes with the infer_datetime_format ? haven't really stepped thru the code that recently.

@chris-b1
Contributor Author

@jreback - updated. Perf issue turned out to be unrelated - just some kind of caching that pytz does (also shows up on a fresh install of 0.17.1).

    before     after       ratio
  [5b5b2fe8] [5be82023]
   925.61us   929.00us      1.00  frame_methods.frame_assign_timeseries_index.time_frame_assign_timeseries_index
   145.86ms   145.63ms      1.00  plotting.plot_timeseries_period.time_plot_timeseries_period
     3.18ms     3.11ms      0.98  timeseries.dataframe_resample_max_numpy.time_dataframe_resample_max_numpy
     3.21ms     3.10ms      0.97  timeseries.dataframe_resample_max_string.time_dataframe_resample_max_string
     2.79ms     2.73ms      0.98  timeseries.dataframe_resample_mean_numpy.time_dataframe_resample_mean_numpy
     2.76ms     2.86ms      1.03  timeseries.dataframe_resample_mean_string.time_dataframe_resample_mean_string
     3.40ms     3.25ms      0.96  timeseries.dataframe_resample_min_numpy.time_dataframe_resample_min_numpy
     3.40ms     3.25ms      0.95  timeseries.dataframe_resample_min_string.time_dataframe_resample_min_string
   667.96us   661.20us      0.99  timeseries.datetimeindex_add_offset.time_datetimeindex_add_offset
     1.19ms     1.16ms      0.98  timeseries.datetimeindex_converter.time_datetimeindex_converter
     1.89ms     1.86ms      0.98  timeseries.datetimeindex_infer_dst.time_datetimeindex_infer_dst
     2.21ms     2.40ms      1.08  timeseries.datetimeindex_normalize.time_datetimeindex_normalize
   131.10us   127.90us      0.98  timeseries.datetimeindex_unique.time_datetimeindex_unique
   401.88us   395.18us      0.98  timeseries.dti_reset_index.time_dti_reset_index
   445.96us   462.25us      1.04  timeseries.dti_reset_index_tz.time_dti_reset_index_tz
    63.58ms    67.03ms      1.05  timeseries.period_setitem.time_period_setitem
     1.04ms     1.07ms      1.03  timeseries.timeseries_1min_5min_mean.time_timeseries_1min_5min_mean
     1.24ms     1.23ms      0.99  timeseries.timeseries_1min_5min_ohlc.time_timeseries_1min_5min_ohlc
    13.13ms    13.95ms      1.06  timeseries.timeseries_add_irregular.time_timeseries_add_irregular
     3.07ms     3.04ms      0.99  timeseries.timeseries_asof.time_timeseries_asof
     2.92ms     2.93ms      1.00  timeseries.timeseries_asof_nan.time_timeseries_asof_nan
    42.46us    43.79us      1.03  timeseries.timeseries_asof_single.time_timeseries_asof_single
    20.10us    23.67us      1.18  timeseries.timeseries_custom_bday_apply.time_timeseries_custom_bday_apply
    32.52us    33.25us      1.02  timeseries.timeseries_custom_bday_apply_dt64.time_timeseries_custom_bday_apply_dt64
    36.82us    41.39us      1.12  timeseries.timeseries_custom_bday_cal_decr.time_timeseries_custom_bday_cal_decr
    29.51us    30.93us      1.05  timeseries.timeseries_custom_bday_cal_incr.time_timeseries_custom_bday_cal_incr
    30.64us    31.04us      1.01  timeseries.timeseries_custom_bday_cal_incr_n.time_timeseries_custom_bday_cal_incr_n
    36.54us    35.03us      0.96  timeseries.timeseries_custom_bday_cal_incr_neg_n.time_timeseries_custom_bday_cal_incr_neg_n
    35.01us    34.81us      0.99  timeseries.timeseries_custom_bday_decr.time_timeseries_custom_bday_decr
    22.65us    24.13us      1.07  timeseries.timeseries_custom_bday_incr.time_timeseries_custom_bday_incr
   275.66us   262.08us      0.95  timeseries.timeseries_custom_bmonthbegin_decr_n.time_timeseries_custom_bmonthbegin_decr_n
   245.19us   255.52us      1.04  timeseries.timeseries_custom_bmonthbegin_incr_n.time_timeseries_custom_bmonthbegin_incr_n
   279.26us   292.99us      1.05  timeseries.timeseries_custom_bmonthend_decr_n.time_timeseries_custom_bmonthend_decr_n
   196.76us   202.95us      1.03  timeseries.timeseries_custom_bmonthend_incr.time_timeseries_custom_bmonthend_incr
   244.31us   237.53us      0.97  timeseries.timeseries_custom_bmonthend_incr_n.time_timeseries_custom_bmonthend_incr_n
     5.78ms     5.70ms      0.99  timeseries.timeseries_datetimeindex_offset_delta.time_timeseries_datetimeindex_offset_delta
    18.93ms    19.09ms      1.01  timeseries.timeseries_datetimeindex_offset_fast.time_timeseries_datetimeindex_offset_fast
    52.19ms    50.24ms      0.96  timeseries.timeseries_datetimeindex_offset_slow.time_timeseries_datetimeindex_offset_slow
    53.03us    59.37us      1.12  timeseries.timeseries_day_apply.time_timeseries_day_apply
    57.71us    58.96us      1.02  timeseries.timeseries_day_incr.time_timeseries_day_incr
    12.00ms    12.26ms      1.02  timeseries.timeseries_infer_freq.time_timeseries_infer_freq
     4.06ms     4.23ms      1.04  timeseries.timeseries_is_month_start.time_timeseries_is_month_start
   681.27ms   691.78ms      1.02  timeseries.timeseries_iter_datetimeindex.time_timeseries_iter_datetimeindex
    12.82ms    12.77ms      1.00  timeseries.timeseries_iter_datetimeindex_preexit.time_timeseries_iter_datetimeindex_preexit
      3.71s      3.65s      0.98  timeseries.timeseries_iter_periodindex.time_timeseries_iter_periodindex
    39.32ms    37.26ms      0.95  timeseries.timeseries_iter_periodindex_preexit.time_timeseries_iter_periodindex_preexit
    65.22us    65.56us      1.01  timeseries.timeseries_large_lookup_value.time_timeseries_large_lookup_value
    17.02ms    16.28ms      0.96  timeseries.timeseries_period_downsample_mean.time_timeseries_period_downsample_mean
     2.52ms     2.99ms      1.18  timeseries.timeseries_resample_datetime64.time_timeseries_resample_datetime64
     7.88ms     7.79ms      0.99  timeseries.timeseries_series_offset_delta.time_timeseries_series_offset_delta
    21.77ms    21.16ms      0.97  timeseries.timeseries_series_offset_fast.time_timeseries_series_offset_fast
    54.98ms    53.63ms      0.98  timeseries.timeseries_series_offset_slow.time_timeseries_series_offset_slow
    48.59us    47.74us      0.98  timeseries.timeseries_slice_minutely.time_timeseries_slice_minutely
     9.76ms     9.74ms      1.00  timeseries.timeseries_sort_index.time_timeseries_sort_index
     8.43ms     8.26ms      0.98  timeseries.timeseries_timestamp_downsample_mean.time_timeseries_timestamp_downsample_mean
    17.30us    16.55us      0.96  timeseries.timeseries_timestamp_tzinfo_cons.time_timeseries_timestamp_tzinfo_cons
    11.15ms    11.18ms      1.00  timeseries.timeseries_to_datetime_YYYYMMDD.time_timeseries_to_datetime_YYYYMMDD
     4.54ms     4.81ms      1.06  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601
     6.95ms     4.70ms      0.68  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_format
-  114.53ms     4.67ms      0.04  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_format_no_sep
-     1.98s     4.65ms      0.00  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_nosep
-     2.91s   210.18ms      0.07  timeseries.timeseries_to_datetime_iso8601.time_timeseries_to_datetime_iso8601_tz_spaceformat

@@ -695,7 +753,7 @@ parse_iso_8601_datetime(char *str, int len,
if (sublen >= 2 && isdigit(substr[0]) && isdigit(substr[1])) {
out->sec = 10 * (substr[0] - '0') + (substr[1] - '0');

if (out->sec < 0 || out->sec >= 60) {
Contributor

are these negative checks still needed?

Contributor Author

@kawochen pointed this out - unless I'm misunderstanding something they will always be false.
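
A quick way to convince oneself of this is to mirror the two-digit arithmetic from the diff in Python and enumerate every pair of digit characters, which is all the `isdigit` guard lets through. `two_digit_value` is a hypothetical helper written for this illustration:

```python
import string


def two_digit_value(c0, c1):
    """Mirror of: 10 * (substr[0] - '0') + (substr[1] - '0')."""
    return 10 * (ord(c0) - ord("0")) + (ord(c1) - ord("0"))


# Only characters '0'..'9' can reach the arithmetic, so try all 100 pairs.
values = [two_digit_value(a, b) for a in string.digits for b in string.digits]
print(min(values), max(values))
```

The smallest value is 0 and the largest is 99, so only the upper-bound comparison (`>= 60`) can ever trigger.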

@@ -461,6 +458,19 @@ def calc_with_mask(carg, mask):

return None

def _format_is_iso(f):
"""
Does format match the iso8601 set that can be handled by the C parser?
Contributor

add here that we require same seps for date fields
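
For readers who want a concrete picture of such a check, here is a hedged sketch of how a format string could be tested against the ISO-like family with a single, repeated date separator. It is not the body of `_format_is_iso` from the PR (which is not shown in this thread); `format_looks_iso`, the separator tuples, and the accepted time separators are assumptions made for illustration:

```python
DATE_SEPS = ("-", ".", "/", "\\", " ", "")  # "" covers the no-separator form
TIME_SEPS = (" ", "T")                       # assumed date/time delimiters


def format_looks_iso(fmt):
    """Return True if fmt is an ISO-like strftime format.

    The format must start at %Y, add fields in order, and use the SAME
    separator between year/month and month/day.
    """
    for date_sep in DATE_SEPS:
        for time_sep in TIME_SEPS:
            pieces = ["%Y", date_sep + "%m", date_sep + "%d",
                      time_sep + "%H", ":%M", ":%S", ".%f"]
            candidate = ""
            for piece in pieces:
                candidate += piece       # grow the format one field at a time
                if fmt == candidate:
                    return True
    return False


print(format_looks_iso("%Y/%m/%d"), format_looks_iso("%Y-%m/%d"))
```

Because each candidate is built from one `date_sep`, a mixed-separator format such as `%Y-%m/%d` never matches.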

@jreback
Contributor

jreback commented Jan 24, 2016

some minor comments. ping when pushed / green.

@chris-b1
Contributor Author

@jreback - updated for your comments

@jreback jreback closed this in 5de6b84 Jan 26, 2016
@jreback
Contributor

jreback commented Jan 26, 2016

@chris-b1 thanks!

your PRs are always great!

@chris-b1 chris-b1 deleted the isoparsing branch September 24, 2016 14:39
Labels
Bug, Datetime, Performance
Projects
None yet
4 participants