Wildfire data on API has non-canonical format, resulting in erroneous hazard #891

Open
peanutfun opened this issue Jun 13, 2024 · 1 comment
Labels: bug, Data API (Related to the Data API, not to this code base directly)


peanutfun commented Jun 13, 2024

Describe the bug
We had noticed some strange patterns before when plotting the wildfire data from the API. However, the statistics on Hazard.intensity.data looked fine, so we assumed a plotting issue. It turns out that Hazard.intensity is not in canonical format: multiple entries in intensity.data map to the same entry in the intensity matrix, and these duplicates are summed. This causes erroneously high intensities for about 6% of all centroids.

To Reproduce
Steps to reproduce the behavior/error:

  1. Download a wildfire dataset (here: Mexico) and plot it. Observe weird rectangular patterns with high intensity.
  2. Notice that the matrix is not in canonical format, and that its maximum is far larger than the maximum of the underlying data array.
    • To add to the confusion, calling max() implicitly brings the matrix into canonical format.
from climada.util.api_client import Client

client = Client()
wf_mexico = client.get_hazard("wildfire", properties={"country_iso3alpha": "MEX"})

print(f"Canonical format: {wf_mexico.intensity.has_canonical_format}")
print(f"Min data: {wf_mexico.intensity.data.min()}")
print(f"Max data: {wf_mexico.intensity.data.max()}")
print(f"Min value: {wf_mexico.intensity.min()}"). # This implicitly ensures canonical format!
print(f"Max value: {wf_mexico.intensity.max()}")
print(f"Canonical format: {wf_mexico.intensity.has_canonical_format}")
print(f"Min data: {wf_mexico.intensity.data.min()}")
print(f"Max data: {wf_mexico.intensity.data.max()}")

wf_mexico.plot_intensity(0)

Output:

Canonical format: False
Min data: 300.2
Max data: 506.2
Min value: 0.0
Max value: 1695.1
Canonical format: True
Min data: 300.2
Max data: 1695.1

[Image: 240613-wf-issue-01 — intensity plot showing the rectangular high-intensity patterns]

Background

The reason for this is that multiple entries in intensity.data point to the same matrix entry of intensity. scipy's csr_matrix supports this non-canonical layout and sums the duplicate values whenever the matrix is evaluated.

import numpy as np
import pandas as pd

wf_mexico = client.get_hazard("wildfire", properties={"country_iso3alpha": "MEX"})
indices_event_0 = wf_mexico.intensity.indices[
    wf_mexico.intensity.indptr[0] : wf_mexico.intensity.indptr[1]
]
value_counts = pd.Series(indices_event_0).value_counts()
print(value_counts[value_counts > 1])
print(
    "Data pointing to centroid 84509: "
    f"{wf_mexico.intensity.data[np.nonzero(indices_event_0 == 84509)]}"
)
# Even in non-canonical format, data entries are summed
print(f"Intensity for centroid 84509: {wf_mexico.intensity[0, 84509]}")

Output:

84509    4
88529    4
77621    4
84766    4
97340    4
        ..
91851    2
28969    2
90838    2
184      2
81380    2
Name: count, Length: 296, dtype: int64
Data pointing to centroid 84509: [320.3 338.6 319.8 323.8]
Intensity for centroid 84509: 1302.5
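
To get a sense of the scale across all events, a sketch like the following (not part of the original report; it assumes the still non-canonical wf_mexico from above) counts the centroids that receive more than one data entry in some event, which is how the ~6% figure quoted above can be estimated:

# Estimate the share of centroids affected by duplicate entries
affected = set()
for row in range(wf_mexico.intensity.shape[0]):
    cols = wf_mexico.intensity.indices[
        wf_mexico.intensity.indptr[row] : wf_mexico.intensity.indptr[row + 1]
    ]
    uniq, counts = np.unique(cols, return_counts=True)
    affected.update(uniq[counts > 1])
print(f"Affected centroids: {len(affected) / wf_mexico.intensity.shape[1]:.1%}")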

The canonical format can be reached explicitly by calling sum_duplicates(). Summing the duplicate values is the default behavior, and this summation is what causes the inflated intensities.
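
For illustration, here is a minimal standalone sketch of this scipy behavior, independent of the wildfire data:

import numpy as np
from scipy.sparse import csr_matrix

# 1x3 CSR matrix in which two data entries map to column 1 of row 0
mat = csr_matrix(
    (np.array([1.0, 2.0]), np.array([1, 1]), np.array([0, 2])), shape=(1, 3)
)
print(mat.has_canonical_format)  # False: duplicate column indices present
print(mat[0, 1])                 # 3.0: duplicates are summed even on access
mat.sum_duplicates()             # merge duplicates in-place
print(mat.has_canonical_format)  # True
print(mat.data)                  # [3.]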

How to solve this

The following code prunes the duplicate entries by applying an aggregation function. Unfortunately, the duplicate entries are not exact copies (as you can see above) but carry different values, so it is unclear how they should be merged into a single sensible value. Taking the mean or the max are both possible choices.

I want to make clear that this is not a proper fix: the following function modifies the hazard data. There seems to be an underlying problem with how the original wildfire data was merged, or with the hazard definition itself, which should be fixed so that this inconsistent hazard definition does not arise in the first place.

def prune_hazard(hazard, agg):
    """Aggregate duplicate intensity entries in-place, then canonicalize."""
    intensity = hazard.intensity
    num_rows = intensity.shape[0]

    # Iterate over events (rows)
    for row in range(num_rows):
        # Extract indices and data for the event. These slices are views into
        # the CSR arrays, so writing to them modifies the matrix in-place.
        indices_event = intensity.indices[
            intensity.indptr[row] : intensity.indptr[row + 1]
        ]
        data_event = intensity.data[
            intensity.indptr[row] : intensity.indptr[row + 1]
        ]

        # Iterate over duplicate indices and aggregate the data. Each duplicate
        # entry is set to agg(values) / count, so that the summation performed
        # by sum_duplicates() below yields exactly agg(values).
        index_counts = pd.Series(indices_event).value_counts()
        for idx, count in index_counts[index_counts > 1].items():
            data_idx = np.nonzero(indices_event == idx)
            data_event[data_idx] = agg(data_event[data_idx]) / count

    # Bring into canonical format (sums the prepared duplicate entries)
    intensity.sum_duplicates()

# Call like this:
prune_hazard(wf_mexico, np.max)  # or np.mean
wf_mexico.plot_intensity(0)

Result:
[Image: 240613-wf-issue-02 — intensity plot after pruning, without the rectangular artifacts]

Climada Version: develop

System Information (please complete the following information):

  • Operating system and version: macOS
  • Python version: 3.9

@emanuel-schmid @chahank @samluethi

@peanutfun added the bug and Data API labels on Jun 13, 2024

emanuel-schmid commented Jun 17, 2024

🙌 Thanks for the analysis and elaboration!

I guess we'll apply the proposed prune_hazard function to all wildfire datasets and then upload the pruned files.

Are we sure this is a problem of the wildfire data alone, or do we need to check whether other hazard objects are affected too?
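
A hypothetical batch sketch along those lines (the country list is a placeholder, and exporting via Hazard.write_hdf5 is an assumption about the intended workflow):

import numpy as np
from climada.util.api_client import Client

client = Client()
COUNTRY_CODES = ["MEX"]  # placeholder: extend with all published wildfire countries

for code in COUNTRY_CODES:
    hazard = client.get_hazard("wildfire", properties={"country_iso3alpha": code})
    prune_hazard(hazard, np.max)  # or np.mean, see above
    hazard.write_hdf5(f"wildfire_{code}_pruned.hdf5")  # placeholder file name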
