
Speed up of CfMessage #197

Closed · Fixed by #221
Snowda opened this issue Jan 26, 2021 · 9 comments
Labels
enhancement New feature or request

Comments


Snowda commented Jan 26, 2021

Hi there,

I need to process NOAA HRRR files and their corresponding forecasts, but I'm running into speed issues when reading the downloaded files for parsing: it takes roughly 40 seconds to load data for a single location. I've already made gains with a few tricks external to the library, but when I run a profiler on the code, about 20% of the total execution time of my task is spent inside functions belonging to CfMessage alone, making it the biggest source of delay not attributable to xarray itself.
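For context, a per-function breakdown like the one quoted above can be produced with the standard-library profiler. The loader and file name below are placeholders, not cfgrib API:

```python
import cProfile
import io
import pstats

def load_hrrr(path):
    # Stand-in for the real work, e.g. xr.open_dataset(path, engine="cfgrib").
    return sum(range(100_000))

profiler = cProfile.Profile()
profiler.enable()
load_hrrr("hrrr.t12z.wrfsfcf06.grib2")  # hypothetical HRRR file name
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
# The report attributes cumulative time per function, which is how a share
# of total runtime can be pinned on CfMessage methods specifically.
```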

I'm very new to the internals of the code here, but at a glance a few areas of potential performance gains may involve:

  • Replacing string keys with assigned integer-based constants
  • Adding optional de-selection, at a higher level, of which COMPUTED_KEYS content is needed, since options like verifying_time or indexing_time have no end-use utility on my end
  • Dropping the LOG initialization inside CfMessage, as it isn't used anywhere else in the file but still carries an initialization overhead
  • The if-else stack inside the build_valid_time function could potentially be replaced with a more divide-and-conquer approach, though this would be a very minor improvement
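The first suggestion can be sketched as follows; the key names and message layout here are hypothetical stand-ins, not cfgrib's actual internals:

```python
from enum import IntEnum

class Key(IntEnum):
    # Hypothetical constants standing in for string keys like "dataDate";
    # hashing and comparing small ints is cheaper than re-hashing strings
    # on every dictionary lookup in a hot loop.
    DATA_DATE = 0
    DATA_TIME = 1
    STEP = 2

# A message stored with integer keys is read back exactly the same way;
# only the key representation changes.
message = {Key.DATA_DATE: 20210126, Key.DATA_TIME: 1200, Key.STEP: 6}

step = message[Key.STEP]
```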
alexamici (Contributor) commented

@Snowda current master has a lot of performance improvements that are not yet released, so your best bet is to try it and see whether your use case benefits from them. Otherwise you can just wait for the upcoming 0.9.9.0 release.

I'm interested in hearing about the bottlenecks that are still in master.

Snowda (Author) commented Jan 26, 2021

I may fiddle with it a little, but I would prefer to update via the package manager. Any ETA on the 0.9.9.0 release?

alexamici (Contributor) commented

Mm... I would have said "imminent" for a couple of weeks now, but distractions pop up all the time. The first half of February is the most reasonable ETA.

Snowda (Author) commented Jan 28, 2021

Ran 0.9.9.0 and got, very roughly, a 6% improvement in execution time. But the bulk of the delay remains here and, if anything, the share of execution time spent in CfMessage functions has gone up.

The main culprits seem to be get_item usage in both from_filestream and build_variable_components, which combined account for 34% of execution time. That makes sense, since from_filestream has a nested for loop and does a get on every iteration.
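The pattern described, and one way to amortize it, can be sketched with toy classes (these are not cfgrib's actual from_filestream or Message types):

```python
class Message:
    """Toy stand-in for a GRIB message with a counted raw lookup."""

    lookups = 0  # class-wide counter of raw get_item calls

    def __init__(self, data):
        self._data = data

    def get_item(self, key):
        Message.lookups += 1
        return self._data[key]

def index_messages(messages, keys):
    # Fetch every key exactly once per message into a plain dict, instead
    # of repeating get_item calls wherever the value is needed inside
    # nested loops; later consumers read the cheap dict.
    return [{key: msg.get_item(key) for key in keys} for msg in messages]

msgs = [Message({"shortName": "t", "level": lvl}) for lvl in (500, 700, 850)]
rows = index_messages(msgs, ["shortName", "level"])
# 3 messages x 2 keys -> exactly 6 raw lookups, however often rows is reused.
```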

@alexamici alexamici added the enhancement New feature or request label Jan 28, 2021
alexamici (Contributor) commented Jan 28, 2021

@Snowda thanks a lot!

I noted your suggestions and labeled the issue as a good enhancement request.

BTW, note that from_filestream is only called the first time you access a GRIB file, to build the on-disk index. Subsequent calls of open_dataset should find the index on disk and not incur the first-time performance penalty.
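The build-once / reuse flow described above can be illustrated with a toy sidecar-index sketch (the real index format and file naming are cfgrib internals; only the caching pattern is mirrored here):

```python
import json
import os
import tempfile

def build_index(grib_path, scan):
    # cfgrib similarly keeps a sidecar index file next to the GRIB file.
    idx_path = grib_path + ".idx"
    if os.path.exists(idx_path):
        with open(idx_path) as f:
            return json.load(f), "cached"   # fast path: reuse on-disk index
    index = scan(grib_path)                 # slow from_filestream-like pass
    with open(idx_path, "w") as f:
        json.dump(index, f)
    return index, "built"

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "data.grib2")
open(path, "w").close()                     # stand-in for a real GRIB file

scans = []
def fake_scan(p):
    scans.append(p)
    return {"messages": 3}

first, how1 = build_index(path, fake_scan)
second, how2 = build_index(path, fake_scan)
# fake_scan ran only once: the second open served the index from disk.
```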

Snowda (Author) commented Jan 28, 2021

Unfortunately, a lot of my issue concerns one-time access only. If I need to revisit a file, it has already been processed into a pandas table, and the cached data is stored with PyArrow / feather-format, which is quicker for me than any gains from an idx lookup.

alexamici (Contributor) commented Jan 28, 2021

@aurghs: "Replacing string keys with assigned integer based constants" this looks like a very promising performance boost once we enable selecting the data type for keys (that we need to do anyway due to #195) and then fetching the string representation only once like we do with the non-dimensional keys. You may have a look at it once you have time.

alexamici (Contributor) commented

@Snowda FYI, we looked into the matter and in our tests you can gain around 15% in performance by not computing a time coordinate that you are not going to use. @aurghs will be working on a PR.

Everything else doesn't appear to be worth the effort in our benchmarks.

Snowda (Author) commented Mar 10, 2021

Great to hear.

Yeah, I've been trying to get something working on my end, and where there's overlap with the ecCodes library things get messy, because it's all string usage that low down, so I completely understand.

What I have noticed regarding time coordinates, though, is heavy use of building datetime objects for each row. When that scales up on large files, it gets expensive. Another option would be a method that returns a basic Unix integer timestamp rather than generating full objects.
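A minimal sketch of the integer-timestamp idea, using only the standard library (the base time and step values are illustrative):

```python
import datetime as dt

base = dt.datetime(2021, 1, 26, 12, tzinfo=dt.timezone.utc)
steps_h = [0, 1, 2, 3]

# Per-row datetime objects: one Python object allocation per row.
per_row = [base + dt.timedelta(hours=h) for h in steps_h]

# Plain integer Unix timestamps: simple arithmetic, no object churn,
# and trivially vectorizable with NumPy later if needed.
base_ts = int(base.timestamp())
as_ints = [base_ts + h * 3600 for h in steps_h]

# Both representations encode the same instants.
assert all(int(d.timestamp()) == t for d, t in zip(per_row, as_ints))
```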

Another option I've been rattling around in my brain is a way to pass a latitude/longitude filter to the initial file opening, rather than post-filtering the fully parsed file (which can contain data spanning the entire world). But I'm unsure about how the Message mechanism works and how data is extracted, so it looks like a lot of refactoring to get such a thing working.
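The bounding-box idea can be sketched independently of the Message mechanism: filter points as messages are decoded instead of materializing the full global grid first. The point layout and coordinates below are purely illustrative:

```python
def in_bbox(lat, lon, bbox):
    # bbox as (lat_min, lat_max, lon_min, lon_max)
    lat_min, lat_max, lon_min, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

def filter_points(points, bbox):
    # points: iterable of (lat, lon, value) tuples decoded from a message;
    # keeping only in-box points avoids building the world-spanning array.
    return [p for p in points if in_bbox(p[0], p[1], bbox)]

points = [(53.3, -6.3, 1.0), (40.7, -74.0, 2.0), (51.5, -0.1, 3.0)]
ireland_uk = (49.0, 56.0, -11.0, 2.0)
subset = filter_points(points, ireland_uk)
# Keeps the Dublin and London points, drops New York.
```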
