Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hazard.intensity in CSR format and not CSC format #938

Open
mncosta opened this issue Aug 16, 2024 · 1 comment
Open

Hazard.intensity in CSR format and not CSC format #938

mncosta opened this issue Aug 16, 2024 · 1 comment

Comments

@mncosta
Copy link

mncosta commented Aug 16, 2024

Hi,

Not sure if this belongs here, but I have a question about Climada's performance and either I am missing something or I believe an improving could be made to the use of hazard intensities. Searching through the repositories' issues I only found a reference to this on #874, stating A significant amount of time is spent slicing the hazard intensity matrix, which is related to what I found.

As I found, currently Hazard.intensity is stored as a "row vector" spicy.sparse matrix in CSR format. I'm making this assumption from the provided example in here, specifically
the following code section in Part 1:

%matplotlib inline
import numpy as np
from climada.hazard import Hazard
from climada.util.constants import HAZ_DEMO_FL
# to hide the warnings
import warnings
warnings.filterwarnings('ignore')

# read intensity from raster file HAZ_DEMO_FL and set frequency for the contained event
haz_ven = Hazard.from_raster([HAZ_DEMO_FL], attrs={'frequency':np.ones(1)/2}, haz_type='FL')
haz_ven.check()

Inspecting haz_ven.intensityyields the following format:

<1x1032226 sparse matrix of type '<class 'numpy.float32'>'
	with 161671 stored elements in Compressed Sparse Row format>

Yet, this appears to be a 1 row x 1032226 columns matrix, in essence making the CSR format not the most efficient when slicing/indexing is needed. Why are hazard intensities saved in CSR format and not CSC format? In essence this would making slicing much more efficient. Time-wise would be a massive improvement if hazard matrices are huge. Again, with the above hazard example and with a simple indexing in CSR vs. CSC format:

# taking some random indexes to slice
idx = np.random.randint(0, 1032226, 100)

# get hazard.intensity in csc format
csc_format = haz_ven.intensity.tocsc()
# profiling slicing
%timeit csc_format[:, idx]
%timeit haz_ven.intensity[:, idx]

Which yields:

58.3 µs ± 71.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
711 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This is ~12 times faster for this very small problem. In the "real" problem I'm currently working I can get even a bigger speed up with a matrix of shape 1x1072057600, which takes about ~4.5s to index ~20k cells, whereas with CSC format I can slice in about 400 µs.

Thus, why are intensities stored in this manner? Do you require them to be this way for other calculations? Is there a faster way to slice intensities given an array of centroids that I'm missing?

@chahank
Copy link
Member

chahank commented Aug 16, 2024

Dear Miguel,

CLIMADA is originally built to be able to handle large amounts of events (think of large probabilistic sets). Thus, slicing over events is important in that setting. Only recently have applications with a focus on singular large events emerged for which a centroid slicing might be more important.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants