You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Not sure if this belongs here, but I have a question about Climada's performance and either I am missing something or I believe an improving could be made to the use of hazard intensities. Searching through the repositories' issues I only found a reference to this on #874, stating A significant amount of time is spent slicing the hazard intensity matrix, which is related to what I found.
As I found, currently Hazard.intensity is stored as a "row vector" spicy.sparse matrix in CSR format. I'm making this assumption from the provided example in here, specifically
the following code section in Part 1:
%matplotlib inline
import numpy as np
from climada.hazard import Hazard
from climada.util.constants import HAZ_DEMO_FL
# to hide the warnings
import warnings
warnings.filterwarnings('ignore')
# read intensity from raster file HAZ_DEMO_FL and set frequency for the contained event
haz_ven = Hazard.from_raster([HAZ_DEMO_FL], attrs={'frequency':np.ones(1)/2}, haz_type='FL')
haz_ven.check()
Inspecting haz_ven.intensityyields the following format:
<1x1032226 sparse matrix of type '<class 'numpy.float32'>'
with 161671 stored elements in Compressed Sparse Row format>
Yet, this appears to be a 1 row x 1032226 columns matrix, in essence making the CSR format not the most efficient when slicing/indexing is needed. Why are hazard intensities saved in CSR format and not CSC format? In essence this would making slicing much more efficient. Time-wise would be a massive improvement if hazard matrices are huge. Again, with the above hazard example and with a simple indexing in CSR vs. CSC format:
# taking some random indexes to slice
idx = np.random.randint(0, 1032226, 100)
# get hazard.intensity in csc format
csc_format = haz_ven.intensity.tocsc()
58.3 µs ± 71.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
711 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is ~12 times faster for this very small problem. In the "real" problem I'm currently working I can get even a bigger speed up with a matrix of shape 1x1072057600, which takes about ~4.5s to index ~20k cells, whereas with CSC format I can slice in about 400 µs.
Thus, why are intensities stored in this manner? Do you require them to be this way for other calculations? Is there a faster way to slice intensities given an array of centroids that I'm missing?
The text was updated successfully, but these errors were encountered:
CLIMADA is originally built to be able to handle large amounts of events (think of large probabilistic sets). Thus, slicing over events is important in that setting. Only recently have applications with a focus on singular large events emerged for which a centroid slicing might be more important.
Hi,
Not sure if this belongs here, but I have a question about Climada's performance and either I am missing something or I believe an improving could be made to the use of hazard intensities. Searching through the repositories' issues I only found a reference to this on #874, stating
A significant amount of time is spent slicing the hazard intensity matrix
, which is related to what I found.As I found, currently
Hazard.intensity
is stored as a "row vector"spicy.sparse
matrix in CSR format. I'm making this assumption from the provided example in here, specificallythe following code section in Part 1:
Inspecting
haz_ven.intensity
yields the following format:Yet, this appears to be a 1 row x 1032226 columns matrix, in essence making the CSR format not the most efficient when slicing/indexing is needed. Why are hazard intensities saved in CSR format and not CSC format? In essence this would making slicing much more efficient. Time-wise would be a massive improvement if hazard matrices are huge. Again, with the above hazard example and with a simple indexing in CSR vs. CSC format:
Which yields:
This is ~12 times faster for this very small problem. In the "real" problem I'm currently working I can get even a bigger speed up with a matrix of shape 1x1072057600, which takes about ~4.5s to index ~20k cells, whereas with CSC format I can slice in about 400 µs.
Thus, why are intensities stored in this manner? Do you require them to be this way for other calculations? Is there a faster way to slice intensities given an array of centroids that I'm missing?
The text was updated successfully, but these errors were encountered: