
Speed up of CfMessage #197

Closed · Fixed by #221
Snowda opened this issue Jan 26, 2021 · 9 comments
Labels
enhancement New feature or request

Comments


Snowda commented Jan 26, 2021

Hi there,

I need to process NOAA HRRR files and their corresponding forecasts, but I'm running into speed issues when reading the downloaded files for parsing: it takes roughly 40 seconds to load data for a single location. I've already made gains with a few tricks external to the library, but when I run a profiler on the code, about 20% of the total execution time of my task is spent inside functions belonging to CfMessage alone, making it the biggest source of delay not attributable to xarray itself.
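For context, a per-function breakdown like the one quoted above can be produced with the standard-library profiler. The loader and file name below are placeholders, not cfgrib API:

```python
import cProfile
import io
import pstats

def load_hrrr(path):
    # Stand-in for the real work, e.g. xr.open_dataset(path, engine="cfgrib").
    return sum(range(100_000))

profiler = cProfile.Profile()
profiler.enable()
load_hrrr("hrrr.t12z.wrfsfcf06.grib2")  # hypothetical HRRR file name
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
# The report attributes cumulative time per function, which is how a share
# of total runtime can be pinned on CfMessage methods specifically.
```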

I'm very new to the internals of the code here, but at a glance a few areas of potential performance gains may involve:

  • Replacing string keys with assigned integer-based constants
  • Adding optional de-selection, at a higher level, of which COMPUTED_KEYS content is needed, since options like verifying_time or indexing_time have no end-use utility on my end
  • Dropping the LOG initialization inside CfMessage, as it isn't used anywhere else in the file but still carries an initialization overhead
  • The if-else stack inside the build_valid_time function could potentially be replaced with a more divide-and-conquer approach, though this would be a very minor improvement
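The first suggestion can be sketched as follows; the key names and message layout here are hypothetical stand-ins, not cfgrib's actual internals:

```python
from enum import IntEnum

class Key(IntEnum):
    # Hypothetical constants standing in for string keys like "dataDate";
    # hashing and comparing small ints is cheaper than re-hashing strings
    # on every dictionary lookup in a hot loop.
    DATA_DATE = 0
    DATA_TIME = 1
    STEP = 2

# A message stored with integer keys is read back exactly the same way;
# only the key representation changes.
message = {Key.DATA_DATE: 20210126, Key.DATA_TIME: 1200, Key.STEP: 6}

step = message[Key.STEP]
```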
alexamici (Contributor) commented

@Snowda current master has a lot of performance improvements that are not yet released, so your best bet is to try it and see whether your use case benefits from them. Otherwise you can just wait for the upcoming 0.9.9.0 release.

I'm interested in hearing about the bottlenecks that are still in master.

Snowda (Author) commented Jan 26, 2021

I may fiddle with it a little, but I would prefer to update via the package manager. Any ETA on the 0.9.9.0 release?

alexamici (Contributor) commented

Mm... I would have said "imminent" for a couple of weeks now, but distractions pop up all the time. The first half of February is the most reasonable ETA.

Snowda (Author) commented Jan 28, 2021

Ran 0.9.9.0 and got, very roughly, a 6% improvement in execution time. But the bulk of the delay remains here and, if anything, the share of execution time spent in CfMessage functions has gone up.

The main culprits seem to be get_item usage in both from_filestream and build_variable_components, which combined account for 34% of execution time. That makes sense, since from_filestream has a nested for loop and does a get on every iteration.
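The pattern described, and one way to amortize it, can be sketched with toy classes (these are not cfgrib's actual from_filestream or Message types):

```python
class Message:
    """Toy stand-in for a GRIB message with a counted raw lookup."""

    lookups = 0  # class-wide counter of raw get_item calls

    def __init__(self, data):
        self._data = data

    def get_item(self, key):
        Message.lookups += 1
        return self._data[key]

def index_messages(messages, keys):
    # Fetch every key exactly once per message into a plain dict, instead
    # of repeating get_item calls wherever the value is needed inside
    # nested loops; later consumers read the cheap dict.
    return [{key: msg.get_item(key) for key in keys} for msg in messages]

msgs = [Message({"shortName": "t", "level": lvl}) for lvl in (500, 700, 850)]
rows = index_messages(msgs, ["shortName", "level"])
# 3 messages x 2 keys -> exactly 6 raw lookups, however often rows is reused.
```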

@alexamici alexamici added the enhancement New feature or request label Jan 28, 2021
alexamici (Contributor) commented Jan 28, 2021

@Snowda thanks a lot!

I noted your suggestions and labeled the issue as a good enhancement request.

BTW, note that from_filestream is only called the first time you access a GRIB file, to build the on-disk index. Subsequent calls of open_dataset should find the index on disk and not incur the first-time performance penalty.
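The build-once / reuse flow described above can be illustrated with a toy sidecar-index sketch (the real index format and file naming are cfgrib internals; only the caching pattern is mirrored here):

```python
import json
import os
import tempfile

def build_index(grib_path, scan):
    # cfgrib similarly keeps a sidecar index file next to the GRIB file.
    idx_path = grib_path + ".idx"
    if os.path.exists(idx_path):
        with open(idx_path) as f:
            return json.load(f), "cached"   # fast path: reuse on-disk index
    index = scan(grib_path)                 # slow from_filestream-like pass
    with open(idx_path, "w") as f:
        json.dump(index, f)
    return index, "built"

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "data.grib2")
open(path, "w").close()                     # stand-in for a real GRIB file

scans = []
def fake_scan(p):
    scans.append(p)
    return {"messages": 3}

first, how1 = build_index(path, fake_scan)
second, how2 = build_index(path, fake_scan)
# fake_scan ran only once: the second open served the index from disk.
```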

Snowda (Author) commented Jan 28, 2021

Unfortunately, a lot of my issue concerns one-time access only. If I need to revisit a file, it has already been processed into a pandas table, and the cached data is stored with PyArrow / feather-format, which is quicker for me than any gains from an idx lookup.

alexamici (Contributor) commented Jan 28, 2021

@aurghs: "Replacing string keys with assigned integer based constants" this looks like a very promising performance boost once we enable selecting the data type for keys (that we need to do anyway due to #195) and then fetching the string representation only once like we do with the non-dimensional keys. You may have a look at it once you have time.

alexamici (Contributor) commented

@Snowda FYI, we looked into the matter and in our tests you can gain around 15% in performance by not computing a time coordinate that you are not going to use. @aurghs will be working on a PR.

Everything else doesn't appear to be worth the effort in our benchmarks.

Snowda (Author) commented Mar 10, 2021

Great to hear.

Yeah, I've been trying to get something working on my end, and where there's overlap with the ecCodes library things get messy, because it's all string usage that low down, so I completely understand.

What I have noticed regarding time coordinates, though, is heavy use of building datetime objects for each row. When that scales up on large files, it gets expensive. Another option would be a method that returns a basic Unix integer timestamp rather than generating full objects.
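A minimal sketch of the integer-timestamp idea, using only the standard library (the base time and step values are illustrative):

```python
import datetime as dt

base = dt.datetime(2021, 1, 26, 12, tzinfo=dt.timezone.utc)
steps_h = [0, 1, 2, 3]

# Per-row datetime objects: one Python object allocation per row.
per_row = [base + dt.timedelta(hours=h) for h in steps_h]

# Plain integer Unix timestamps: simple arithmetic, no object churn,
# and trivially vectorizable with NumPy later if needed.
base_ts = int(base.timestamp())
as_ints = [base_ts + h * 3600 for h in steps_h]

# Both representations encode the same instants.
assert all(int(d.timestamp()) == t for d, t in zip(per_row, as_ints))
```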

Another option I've been rattling around in my brain is a way to pass a latitude/longitude filter to the initial file opening, rather than post-filtering the fully parsed file (which can contain data spanning the entire world). But I'm unsure about how the Message mechanism works and how data is extracted, so it looks like a lot of refactoring to get such a thing working.
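The bounding-box idea can be sketched independently of the Message mechanism: filter points as messages are decoded instead of materializing the full global grid first. The point layout and coordinates below are purely illustrative:

```python
def in_bbox(lat, lon, bbox):
    # bbox as (lat_min, lat_max, lon_min, lon_max)
    lat_min, lat_max, lon_min, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

def filter_points(points, bbox):
    # points: iterable of (lat, lon, value) tuples decoded from a message;
    # keeping only in-box points avoids building the world-spanning array.
    return [p for p in points if in_bbox(p[0], p[1], bbox)]

points = [(53.3, -6.3, 1.0), (40.7, -74.0, 2.0), (51.5, -0.1, 3.0)]
ireland_uk = (49.0, 56.0, -11.0, 2.0)
subset = filter_points(points, ireland_uk)
# Keeps the Dublin and London points, drops New York.
```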
