Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/fraction is none #541

Merged
merged 26 commits into from
Nov 4, 2022
Merged

Feature/fraction is none #541

merged 26 commits into from
Nov 4, 2022

Conversation

chahank
Copy link
Member

@chahank chahank commented Sep 30, 2022

Changes proposed in this PR:

  • The hazard.fraction attribute is by default empty. For the impact calculation, fraction is ignored if empty (this might in some very edge cases lead to inaccurate impacts)
  • The hazard.fraction should however be of the same shape as hazard.intensity, even when empty. This prevents code using automatic slicing of all sparse.csr_matrix attributes to fail, while still providing accurate results. For instance, hazard.select or hazard.check.
  • Ignoring fraction when it is anyway not needed allows to divide by a factor 2 the size of files, and speeds-up the impact computations.
  • Not all tests have been updated yet, waiting for confirmation that this is a wanted change. Thus, some unit tests might fail currently.
  • The docstrings and the documentation have not yet been updated.
  • If this PR is eventually accepted, the data on the API should be adapted accordingly in order to significantly reduce the dataset sizes.

The changes are also implemented in the climada_petals hazard modules on the branch https://github.com/CLIMADA-project/climada_petals/tree/feature/fraction_is_none

PR Author Checklist

PR Reviewer Checklist

Chahan Kropf added 3 commits September 26, 2022 18:51
When fraction is empty, it is ignored in the impact calculation. The shape must still be n_events x n_exp
@emanuel-schmid
Copy link
Collaborator

Do I get this right: after the PR, a hazard with a fraction of [0] would yield the exact same impact as one with a fraction of [1]?
In other words we would introduce a discontinuity?
Hm... Somehow this doesn't feel quite right. Couldn't we make the whole fraction None instead?

@chahank
Copy link
Member Author

chahank commented Oct 4, 2022

Do I get this right: after the PR, a hazard with a fraction of [0] would yield the exact same impact as one with a fraction of [1]?
In other words we would introduce a discontinuity?
Hm... Somehow this doesn't feel quite right. Couldn't we make the whole fraction None instead?

It would not really be a discontinuity, but a convention. A Hazard with fraction zero everywhere does not make much sense, as it would mean that basically, it is affecting 0% of the area...

But yes, one could also make it None. This would however require a lot more rewriting of code. This was just a minimalistic proposal.

@emanuel-schmid
Copy link
Collaborator

A Hazard with fraction zero everywhere does not make much sense,

Not from a user's perspective, right, but the maths make perfect sense. Yet.

This would however require a lot more rewriting of code. This was just a minimalistic proposal.

True. But the minimalistic approach is also a bit hacky in this case, right?

Apart from the introduced mathematical discontinuity, the fact that the proposed convention is not completely intuitive (I cannot possibly guess that all zero means all one in a sparse matrix I must know.) bothers me and - more than that - the mashing of control elements within the raw data.

I do expect future development in the impact calculation and the hazard implementation. This change will not unlikely cause headaches then. Even though a cleaner approach means more work it bears the hope that it may eventually pay off.

@emanuel-schmid
Copy link
Collaborator

For reducing the file size, I'd rather suggest to change the hazard's read and write methods and leave the impact calculation as it is.

@emanuel-schmid
Copy link
Collaborator

Or perhaps try to change Hazard.get_fraction() to achieve the goal... ?

@chahank
Copy link
Member Author

chahank commented Oct 5, 2022

For reducing the file size, I'd rather suggest to change the hazard's read and write methods and leave the impact calculation as it is.

I must say I do not understand what this means...

@chahank
Copy link
Member Author

chahank commented Oct 5, 2022

Or perhaps try to change Hazard.get_fraction() to achieve the goal... ?

Also here I do not understand what is suggested...

@chahank
Copy link
Member Author

chahank commented Oct 5, 2022

A Hazard with fraction zero everywhere does not make much sense,

Not from a user's perspective, right, but the maths make perfect sense. Yet.

This would however require a lot more rewriting of code. This was just a minimalistic proposal.

True. But the minimalistic approach is also a bit hacky in this case, right?

Apart from the introduced mathematical discontinuity, the fact that the proposed convention is not completely intuitive (I cannot possibly guess that all zero means all one in a sparse matrix I must know.) bothers me and - more than that - the mashing of control elements within the raw data.

I do expect future development in the impact calculation and the hazard implementation. This change will not unlikely cause headaches then. Even though a cleaner approach means more work it bears the hope that it may eventually pay off.

It might indeed. It will however also lead to bugs on the way. But I can try to change it to None, although it will take quite some time and some review ;).

@emanuel-schmid
Copy link
Collaborator

File size & read write methods:
What I wanted to say is that for reducing file size alone, the impact calculation should not be changed. The Hazard <-> File interactions should not influence the interactions between a Hazard object and any other object.
It's maybe tedious but certainly possible to change write_raster, write_hdf5, read_raster and read_hdf5 while keeping all Hazard objects as they are and only change their respective files.

Hazard.get_fraction()
That is an alternative I was thinking about: instead of treating 0 as 1 during impact calculation, one could transform undefined into 1 within get_fraction, by creating a sparse matrix on the fly that contains 1s for all intensity values where fraction is undefined. And even None in case no fraction is defined at all. (If it's None - skip the multiplication step.)

@chahank
Copy link
Member Author

chahank commented Oct 6, 2022

Thanks for claryfing!!

File size & read write methods: What I wanted to say is that for reducing file size alone, the impact calculation should not be changed. The Hazard <-> File interactions should not influence the interactions between a Hazard object and any other object. It's maybe tedious but certainly possible to change write_raster, write_hdf5, read_raster and read_hdf5 while keeping all Hazard objects as they are and only change their respective files.

The point is not only to reduce file size, but to remove redundant information from an object that does not need it. In addition, it is not only file size, but also very much RAM usage that should be reduced, as well as computation time reduced. Thus, it should be handled at the source, and not hacked into the file read/write.

Hazard.get_fraction() That is an alternative I was thinking about: instead of treating 0 as 1 during impact calculation, one could transform undefined into 1 within get_fraction, by creating a sparse matrix on the fly that contains 1s for all intensity values where fraction is undefined. And even None in case no fraction is defined at all. (If it's None - skip the multiplication step.)

This would not solve the RAM nor the computation time issues. Besides, it would make the object rather inconsistent, since the attribute fraction would be different from what is obtained from get_fraction. The fraction cannot have undefined values as it is a scipy.sparse.csr_matrix object.

On the other hand, if fraction is None, it can be skipped in the multiplication step and it does resolve the memory issues. The idea with using an empty scipy matrix instead was just that it is much much easier to implement. It does not treat '0' as '1'. It would just a convention for the None identifier. It makes no sense to define a Hazard object with fraction being zero everywhere. Thus, having fraction 0 as a surrogate for None is not a big problem in principle. But, if you prefer, we can go for the None implementation. It will just require a lot of if None checks in the read/write methods, the plotting methods, the checking methods, the concatenation/splitting methods. I do not mind, as I said, the idea with using fraction 0 instead makes things much easier, but is not a must.

Comment on lines 417 to 420
if self.hazard.fraction.nonzero()[0].size == 0:
return mdr.multiply(exp_values_csr)

fract = self.hazard.get_fraction(cent_idx)
Copy link
Collaborator

@emanuel-schmid emanuel-schmid Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self.hazard.fraction.nonzero()[0].size == 0:
return mdr.multiply(exp_values_csr)
fract = self.hazard.get_fraction(cent_idx)
fract = self.hazard.get_fraction(cent_idx)
if fract is None:
return mdr.multiply(exp_values_csr)

This plus adding this line at the top of get_fraction:

        if self.fraction.nnz == 0: return None

I'd still prefer to leave as much responsibility as possible to the hazard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that works well!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just that the   hazard.get_fraction is not the standard getter for the attribute hazard.fraction. It is to get the fraction for a subset of centroids. Maybe this should be slightly updated too, so that get_fraction() returns simply fraction (or None)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Good idea. Should be fairly easy too, since the fraction is hardly ever read. I've been looking for occurrences of fraction-readings throughout the code base and apart from this line here I found them only in tests and in hazards for rewriting or plotting.

@emanuel-schmid
Copy link
Collaborator

emanuel-schmid commented Oct 6, 2022

... since the attribute fraction would be different from what is obtained from get_fraction.

But that's fine, even to be expected in OO programming, no? See suggested change in impact_calc, line 417ff .

@peanutfun
Copy link
Member

peanutfun commented Oct 13, 2022

If I could add my bits to this discussion...

Hazard.get_fraction()

I very much like the idea by @emanuel-schmid of a "get" method for the fraction, because this can handle the fraction logic internally without exposing it to the code using the fraction values. However, a "get" method is a very "C++" thing. Python can actually make a getter/setter function look like a regular property with the @property decorator.

Here's how that works:

class Hazard:
    _fraction = None  # NOTE: Now marked private

    @property
    def fraction(self):
        """Return the fraction or construct one from the intensity if it is not set"""
        if _fraction is None:
            row_idx, col_idx, _ = scipy.sparse.find(self.intensity)
            return scipy.sparse.csr_matrix((np.ones(len(row_idx)), (row_idx, col_idx)), shape=self.intensity.shape)
        return self._fraction


# Logic is completely hidden:
hazard = Hazard(...)
fraction = hazard.fraction  # NOTE: `self.fraction` can also be called from within the Hazard object

Edit: For reference: https://www.freecodecamp.org/news/python-property-decorator/

@peanutfun
Copy link
Member

I just discussed my proposal with @chahank and we concluded that it does not solve the issue of reducing RAM usage and memory required to store the hazard object. The solution provided here should be as non-intrusive as possible. We will later update the Hazard.__init__ in the same way as Impact.__init__, such that all attributes have to be set during initialization. This gives the opportunity to build a consistent Hazard object when the intensity is not set.

@chahank
Copy link
Member Author

chahank commented Oct 13, 2022

In 9497976 I set the default for get_fraction to return the whole fraction. Furthermore, in b6021fb I set get_fraction to return None if the fraction is empty, and the impact_calc to then ignore fraction in the computation of the impact.

@chahank
Copy link
Member Author

chahank commented Oct 24, 2022

All tests are passing. Should we proceed with this? I.e., accept the changes, update the docstrings and the tutorials, check climada petals.

@chahank chahank mentioned this pull request Oct 28, 2022
11 tasks
Copy link
Member

@peanutfun peanutfun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I agree that this is the least intrusive solution, as only very few lines of production code change. Well done! I found two small issues, see the comments.

climada/hazard/base.py Outdated Show resolved Hide resolved
climada/hazard/test/test_base.py Outdated Show resolved Hide resolved
@chahank
Copy link
Member Author

chahank commented Oct 28, 2022

Thanks! Agree with both comments and implemented the changes.

@chahank
Copy link
Member Author

chahank commented Oct 31, 2022

I think this PR is mostly ready now, with all tests updated and the tutorials updated too.

@emanuel-schmid : what is missing is a clear handling of the test files. I created a new test file based on the previous one. The changes are that the test file is in .hdf5 format instead of .mat, and the fraction is None. However, the handling of the loading of the test datafiles was not homogeneous through the unit test files. I would like to harmonize this before merging. What is the best practice here?

@peanutfun
Copy link
Member

Just pitching in: I think it's safest to load test files in test fixtures. This can be done by adding a setUp (and tearDown) method to a UnitTest class. See my changes to climada/hazard/test/test_base.py for an example

@chahank
Copy link
Member Author

chahank commented Oct 31, 2022

Thanks! This would be one way, but I think requires quite a lot of rewriting. Let me make my question more precise. Should the test data be loaded:

  • from the API?
  • from the filename defined in the constants?

@chahank
Copy link
Member Author

chahank commented Oct 31, 2022

Just pitching in: I think it's safest to load test files in test fixtures. This can be done by adding a setUp (and tearDown) method to a UnitTest class. See my changes to climada/hazard/test/test_base.py for an example

Maybe it is best to first close this PR, and then make the change to the Hazard __init__, or vice-versa?

@peanutfun
Copy link
Member

Maybe it is best to first close this PR, and then make the change to the Hazard init, or vice-versa?

Why close? Depending on which PR will be merged first, the other one has to be adapted. I think this one can go first, but the other way round also would not be an issue.

@chahank
Copy link
Member Author

chahank commented Nov 3, 2022

@emanuel-schmid : can you please have a final look, in particular for the test file handling?

@emanuel-schmid
Copy link
Collaborator

👍 checking...

@emanuel-schmid
Copy link
Collaborator

🎉

@emanuel-schmid emanuel-schmid merged commit eaefbef into develop Nov 4, 2022
@emanuel-schmid emanuel-schmid deleted the feature/fraction_is_none branch November 4, 2022 13:31
@peanutfun peanutfun mentioned this pull request Jan 13, 2023
11 tasks
simonameiler pushed a commit that referenced this pull request Aug 24, 2023
* Assume fraction is one if None

* Allow fraction to be empty 

When fraction is empty, it is ignored in the impact calculation. The shape must still be n_events x n_exp

* Correct shape to argument

* Remove unecessary else

* Return None in get_fraction

* Set default return full fraction

* Update test TC for fraction None

* Remove test nozero fraction elements

* Make get_fraction private to _get_fraction

* Correct typo get_fraction -> _get_fraction in test

* Add note that fraction is optional in hazard tutorial

* Add note that probabilistic does not mean no storyline

* Fix equation rendering

* Fix linter issues: Single statement per line

Co-authored-by: Lukas Riedel <[email protected]>

* Set test to assert_array_equal for get_fraction

* Update hazard test file to hdf5 without fraction

* Update climada/engine/impact_calc.py

Note, the pylint message is then disabled for the whole function.

Co-authored-by: Lukas Riedel <[email protected]>

* Make pylint ignore warning for one line only

* Update loading of tc hazard file

* Update loading of test hazard file

* Fix test file path

* test file handling consolidation

* hazard.base: get_fraction has become private

Co-authored-by: Chahan Kropf <[email protected]>
Co-authored-by: Lukas Riedel <[email protected]>
Co-authored-by: emanuel-schmid <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants