Speed up of CfMessage #197
@Snowda currently I'm interested in hearing about the bottlenecks that are still in
I may fiddle with it a little, but I would prefer to update via the package manager. ETA on the 0.9.9.0 release?
Mm... I would have said "imminent" for a couple of weeks now, but distractions pop up all the time. The first half of February is the most reasonable ETA.
Ran 0.9.9.0 and got, very roughly, a 6% improvement in execution time. But the bulk of the delay remains here and, if anything, the overall percentage of execution time represented by CfMessage functions has gone up. The main culprits for bottlenecks seem to be get_item usage both in
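For anyone wanting to reproduce this kind of measurement, the hotspots above were surfaced with Python's standard profiler; a minimal sketch (the workload below is a stand-in, not the actual GRIB-reading call):

```python
import cProfile
import io
import pstats

def profile_top(func, *args, top=10):
    """Run func under cProfile and print the calls with the highest cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(top)
    print(stream.getvalue())
    return result

# Stand-in workload; in practice this would wrap the xarray/cfgrib open call.
profile_top(sorted, list(range(100_000)))
```

Sorting the report by cumulative time is what makes the CfMessage share of the run show up directly.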
@Snowda thanks a lot! I noted your suggestions and labeled the issue as a good enhancement request. BTW, note that
Unfortunately, a lot of my issue is around one-time access only. If I need to revisit the file, it has already been processed into a Pandas table, and the cached data is stored with PyArrow / feather-format, which is proving quicker for me than any gains from an idx lookup.
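That caching pattern can be sketched like this (file names and the toy column are invented; the read/write hooks default to feather, which needs pyarrow, so the usage example swaps in pickle to stay self-contained):

```python
import os

import pandas as pd

def load_table(build_dataframe, cache_path,
               read=pd.read_feather,
               write=pd.DataFrame.to_feather):
    """Return the cached table if present; otherwise build it, cache it, return it."""
    if os.path.exists(cache_path):
        return read(cache_path)
    df = build_dataframe()
    write(df, cache_path)  # feather via pyarrow by default
    return df

# Usage (pickle stand-in so this runs without pyarrow installed);
# build_dataframe would normally be the expensive GRIB -> DataFrame parse.
df = load_table(lambda: pd.DataFrame({"t2m": [280.1, 281.3]}),
                "hrrr_cache.pkl",
                read=pd.read_pickle, write=pd.DataFrame.to_pickle)
```

On the second call the lambda is never invoked, which is exactly why an idx lookup inside the library buys little here.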
@aurghs: "Replacing string keys with assigned integer based constants" looks like a very promising performance boost once we enable selecting the data type for keys (which we need to do anyway due to #195) and then fetch the string representation only once, as we do with the non-dimensional keys. You may have a look at it once you have time.
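The integer-constant idea could look something like this (all names here are hypothetical illustrations, not cfgrib's actual internals):

```python
from enum import IntEnum

class Key(IntEnum):
    """Hypothetical integer identifiers standing in for repeated string keys."""
    SHORT_NAME = 0
    VERIFYING_TIME = 1
    INDEXING_TIME = 2

# The string representation is fetched once, at table-definition time...
KEY_NAMES = {
    Key.SHORT_NAME: "shortName",
    Key.VERIFYING_TIME: "verifying_time",
    Key.INDEXING_TIME: "indexing_time",
}

def get_item(message, key):
    """...so hot-path lookups pass cheap integers and translate only here."""
    return message[KEY_NAMES[key]]

msg = {"shortName": "t2m", "verifying_time": 1612137600}
```

The per-call saving is small, but it compounds over the millions of get_item calls a large file generates.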
Great to hear. Yeah, I've been trying to get something working on my end, and where there is overlap with the ecCodes library things get messy, because it's all string usage down that low, so I completely understand. What I have noticed regarding time coordinates, though, is that there is heavy usage of building datetime objects for each row. When that scales up on large files, it gets expensive. One option would be a method to return a basic unix integer timestamp rather than generating full objects. Another option I've been rattling around in my brain is some way to pass a latitude/longitude filter at initial file opening, rather than post-parsing the full file (which can have data spanning the entire world). But I'm unsure about the operation of the Message mechanism and how that data is extracted, so it looks like a lot of refactoring to get such a thing working.
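To illustrate the timestamp point: the per-row datetime construction can be replaced by one vectorized conversion, or skipped entirely by keeping plain unix integers. A sketch with NumPy (the sample values are invented):

```python
import numpy as np

# Seconds since the epoch for each row, as they might be decoded from GRIB time keys.
raw_seconds = np.array([1612137600, 1612141200, 1612144800], dtype="int64")

# Option 1: keep plain unix integers -- no object allocation at all.
unix_times = raw_seconds

# Option 2: one vectorized cast to datetime64 for the whole column,
# instead of constructing a datetime.datetime object per row.
times = raw_seconds.astype("datetime64[s]")

print(times[0])  # 2021-02-01T00:00:00
```

Either way the O(rows) Python-object churn disappears; xarray handles datetime64 coordinates natively, so option 2 keeps the downstream workflow unchanged.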
Hi there,
I need to process NOAA files for HRRR data and corresponding forecasts, but I'm running into issues with the speed of reading the downloaded files for parsing. It is taking roughly 40 seconds to load data for one location alone. I've already managed to make gains with a few tricks external to the library, but when I run a profiler on the code, about 20% of the total execution time of my task is spent inside functions belonging to CfMessage alone, and it is the biggest source of execution delay not attributable to xarray itself.
I'm very new to the internals of the code here, but at a glance a few areas of potential performance gains may involve:
- COMPUTED_KEYS, as options like verifying_time or indexing_time have no end-use utility on my end.
- LOG initialization inside CfMessage, as it isn't called elsewhere in the file but still carries an initialization overhead.