-
-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add pandas uncertainty array #184
Conversation
ideally would need to revive this pandas PR the backport of the NDArrayMixin is a bit outdated causing some tests to fail in new pandas versions |
@andrewgsavage Thanks - this looks interesting. I admit to not knowing much about Pandas development, but I might suggest that the goal here could be to target Pandas 2+ only (or even 2.1+), or perhaps 1.5+. Would that change what needed to be backported? Perhaps can you give some idea of what versions of Pandas are targeted and/or supported here? |
I was using the same approach as python-db-dtypes-pandas, which uses the backported NDArrayMixin and NDArrayBackedExtensionArray to reduce the code needed for the ExtensionArray. The version I'd copied in the backport file is from a few pandas versions ago. The NDArrayMixin and NDArrayBackedExtensionArray are part of pandas' private namespace pandas.core.arrays._mixins. I swapped to using that, but it is liable to change or move location in future pandas versions. I'd only expect to support the current pandas version. |
@andrewgsavage Thanks -- I guess that I would a bit hesitant about expecting to add I wonder if there might be a simpler way. That is, if I have to admit, I do not have any experience trying to subclass Does that seem reasonable? |
pandas would be an optional dependency, or the code in this PR can be moved to a seperate package if that's preferred. We went with the second option for pint-pandas.. If you go with a seperate package you can make me a maintainer for it as I'm following pandas' develop for pint-pandas anyway. A UArray with For interoperating with pandas, you'd still need code for a ExtensionArray similar to what's in this PR. Here the UncertaintyArray is stroing the ufloats in a object dtype numpy array, which could be changed to a UArray in the future. That is to say this PR does not preclude making a UArray. FYI, pandas and xarray have both created apis to allow other modules like pint-pandas, pint-xarray to be created. These sit outside of pandas or xarray, so the core module maintainers have less to deal with. I've mostly put this together because @MichaelTiemannOSC has been using uncertainties with pint-pandas and wanting to add code to support uncertainties to it, which I thought was better suited to living in an uncertainties module, and would benefit more people if it allowed people to use uncertainties with pandas without needing the unit support from pint-pandas. I do need to run this through linting and such before it's properly reviewed! If we go with a seperate module for uncertainties-pandas, it might be easier to set up the CI for that there since this is using pytest but uncertainties is not at the moment. |
@andrewgsavage OK, thanks. Even as an "optional dependency" (is it fair to ask "Well, is it optional or is it a dependency"?), having that code here implies that the authors/maintainers of uncertainties will assume the responsibility for understanding and maintaining this code. I would guess that it might be more in keeping with the basic goals of this package to support an object for "ndarray of values with uncertainties" by subclassing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's exciting to see this work becoming aligned with mainstream pandas (and other developments). My changes were just enough to make uncertainties work as wrappers within the NumPy / Pandas worlds. This is clearly next-level.
"sum", | ||
"max", | ||
"min", | ||
"mean", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mean
is a non-trivail calculation in the world of uncertainties. I don't think uncertainties
does it. I've created code that does it:
# https://stackoverflow.com/a/74137209/1291237
def umean(unquantified_data):
"""
Assuming Gaussian statistics, uncertainties stem from Gaussian parent distributions. In such a case,
it is standard to weight the measurements (nominal values) by the inverse variance.
Following the pattern of np.mean, this function is really nan_mean, meaning it calculates based on non-NaN values.
If there are no such, it returns np.nan, just like np.mean does with an empty array.
This function uses error propagation on the to get an uncertainty of the weighted average.
:param: A set of uncertainty values
:return: The weighted mean of the values, with a freshly calculated error term
"""
arr = np.array(
[v if isinstance(v, ITR.UFloat) else ITR.ufloat(v, 0) for v in unquantified_data if not ITR.isnan(v)]
)
N = len(arr)
if N == 0:
return np.nan
if N == 1:
return arr[0]
nominals = ITR.nominal_values(arr)
if any(ITR.std_devs(arr) == 0):
# We cannot mix and match "perfect" measurements with uncertainties
# Instead compute the mean and return the "standard error" as the uncertainty
# e.g. ITR.umean([100, 200]) = 150 +/- 50
w_mean = sum(nominals) / N
w_std = np.std(nominals) / np.sqrt(N - 1)
else:
# Compute the "uncertainty of the weighted mean", which apparently
# means ignoring whether or not there are large uncertainties
# that should be created by elements that disagree
# e.g. ITR.umean([100+/-1, 200+/-1]) = 150.0+/-0.7 (!)
w_sigma = 1 / sum([1 / (v.s**2) for v in arr])
w_mean = sum([v.n / (v.s**2) for v in arr]) * w_sigma
w_std = w_sigma * np.sqrt(sum([1 / (v.s**2) for v in arr]))
result = ITR.ufloat(w_mean, w_std)
return result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the sort of thing that __array_function__
would help with, so I could do np.mean(UArray)
without needing to understand the uncertainty logic
"max", | ||
"min", | ||
"mean", | ||
# "prod", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason we cannot reduce with prod.
|
||
def _validate_scalar(self, value): | ||
""" | ||
Validate and convert a scalar value to datetime64[ns] for storage in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should not reference datetime64[ns]
Just looked at this and realised I'd need to add pandas to the testing matrix, which would slow tests down somewhat. |
I would agree, as it would more than double the normal test time (double, because all the tests would have to run with or without uncertainties, and more than double because uncertain magnitudes are slower than float64 magnitudes). |
The tests are very fast right now, so test time is not too bad (though I haven't looked at how much time the new tests take). I would say to weigh how much the code would be coupled to uncertainties (sorry, I haven't looked at it closely yet). Will it need coordinated releases with uncertainties to add new features, or easily remain independent? How convenient would it be to have everything in one package for a user? Already there is numpy support included in an optional way. |
It would probably be best to test with and without pandas. I don't think testing runtime is currently much of a concern. But, as elsewhere, this does not seem as high a priority as getting a release with cleaned-up tests and code base. Can this wait? |
I've made this a standalone module and released it to pypi so that it's available for people to use now |
@andrewgsavage Nice! What is your feeling on whether this should be kept as a separate package vs folded into uncertainties as "available if pandas can be imported"? As a "not very often" pandas user, I don't have a strong preference and I would defer to you. |
Either option work fine. I've a slight preference for a seperate package as it'll make changing things easier while the package isn't as mature as the rest of uncertainties |
@MichaelTiemannOSC here's where I got to with creating a pandas extensionarray.
pandas backports taken from
https://github.com/googleapis/python-db-dtypes-pandas/blob/main/db_dtypes/pandas_backports.py