Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remaking NDData to support LSST needs (APE11) #14

Closed
wants to merge 17 commits into from

Conversation

perrygreenfield
Copy link
Member

This is a first draft of an APE to support the issues raised by LSST and the subsequent email discussions regarding that

@MSeifert04
Copy link

MSeifert04 commented Jul 26, 2016

I'm not sure where the discussion should be, so I'll post it here.

I agree on many points of your APE but I don't really like the proposed implementation.

Let me summarize the current standard:

  • NDDataBase - Interface definition of "common" attributes
  • NDData - Container based on NDDataBase, the restrictions here are: data: numpy-array-like, meta: dictionary-like, uncertainty: NDUncertainty-like
  • NDDataRef - NDData with specified slicing, arithmetics and io-interface.

So the planned refactor would promote NDData (or NDArr) from a pure container to a fully functional implementation. That's not bad but worth mentioning because it does not only break backwards compatibility (possibly) but also alter the basic data structure.

Additionally the current NDData interface would be totally (or mostly) compatible with your suggestions (except for the two-layered structure):

  • bitmasks as mask: No problem there. It wouldn't work with "NDArithmeticMixin" (by default) but the current implementation would these allow with a minimum of tweaking, it may be as easy as altering one parameter ("handle_mask"). For "NDData" the mask is designed as property but there is no reason why a property shouldn't return a callable. So there is currently no problem with just using your "mask" class. As alternative there is a flags attribute in NDDataArray which could be reinterpreted as the bitmask without seriously altering the base "mask" at all.
  • offset: Slicing is implemented in "NDSlicingMixin". It's to some extend customizable and adding a "offset" wouldn't be any problem. It doesn't support all possible numpy indexing possibilities if astropy.WCS is used. But it also is highly customizable so there is actually no problem for example to include an offset instead of slicing the wcs.
  • variance: I actually have a working implementation of a Variance-uncertainty based on NDUncertainty which supports the proposed operations (+, -, /, *). So it's no problem to implement something like that within the current framework.
  • ExposureInfo: I think this would be far too specific for NDData. You assume that the NDData is used for exposures is very strict in my opinion.
  • Two-layer-class-structure: Could you clarify what would be seperated, especially when and why this would be needed or not achievable with the current implementation?

Sorry this was so negative. I think a lot of your APE makes sense, for example the bitmask, exposureinfo and variance uncertainty would be really great additions. But I somehow disagree on the proposed implementation. I think these could also be integrated without changing (much) of NDData itself. Altough I have to admit that I'm probably biased on these points because I did change a lot in NDData 😄

Just to cross-link it here: @crawfordsm already mentioned some or even all of these points in https://groups.google.com/forum/#!topic/astropy-dev/I_co2botSNw

@perrygreenfield
Copy link
Member Author

@MSeifert04, I should say that the primary goal of this effort was to see if we couldn't align NDData better with LSST needs and it was deemed as an important goal by the coordinators, so I gave it a shot. I'll try to address some of the comments, but really, it is probably better if someone representing LSST (e.g., @r-owen) responds to the justification for these since I don't really feel I should speak for them in the end.

First, the main goal is not to break backward compatibility, at least for normal uses. This may not be true for those building subclasses that depend on deeper aspects of NDData and associated classes. Are you in that category?

Some have questioned the need for the dualism of NDDataBase and NDData (is there any existing compelling use of this split yet?).

The need for NDArr:

I'm guessing the main justification LSST has for this is for an object for which mathematical operations have unambiguous results. This isn't true for your proposed solution of using a subclass since meta or other attributes do not have well accepted semantics in this case. Yes, they could use these with conventions (ignore these attributes, or not set them) but then you have mixed usages of these objects that isn't clear from the class alone. Having subclasses that disable attributes or capabilities in the parent class is generally not a good idea. The proposed NDArr class removes these sorts of problems. It seems much cleaner to me than multiple flavors of NDData. Imho, when there are many different dimensions to how a class will be subclassed, e.g. through mixins of different kinds or other means), one is probably asking for a lot of trouble and confusion down the road. Reducing the number of such variants is a good thing if it can be done flexibly.

Regarding masks:

I don't think a property that returns a boolean in some cases and a callable in others is a good solution.

Regarding offsets

Yet one more mixin. My previous concern applies

Regarding ExposureInfo:

I wasn't clear enough. I don't think such an attribute is justified. It is something that LSST asked for, but I feel that it could be handled on their part by a special class or subclass if they need it in their software with a simple function to extract the needed info from an NDData instance (or other constructors if they don't need an NDArr instance).

Again, I welcome LSST comments and any others that have a better solution to their needs than this APE currently provides.

In the end, it is up to the coordinators to decide whether to accept this or some other variant.

Cheers, Perry

@mhvk
Copy link
Contributor

mhvk commented Jul 27, 2016

I think it may be most useful to split the discussion along with the two types: a simple data array with mask and uncertainty on which mathematical operations are possible (NDArr) is extremely useful and we should support it. The larger one is something that maybe astropy could provide a good base for, but some aspects could also be added by a specific LSST subclass.

For NDArr, my guess is that much of it could be made to work from a combination of the various mixin's that we have. It would be lovely to have some specific tests written that would check whether it works correctly, so we can try.

One specific comment: I don't quite see why NDArr would not have a wcs -- it just has to be one that is limited to carry slicing information. (Or, in other words, it would seem to me that the slicing information should be kept in a wcs-like attribute -- no need to reinvent wheels.)

@perrygreenfield
Copy link
Member Author

@mhvk, I don't think WCS as we usually understand it should be part of it since having it makes operations much more fragile or presents many options, none of which are the obvious default. If two NDArr objects with different WCS are combined, what is to be done? Generally resampling is required and that is a complex operation with many issues to decide.

But it is true the slice aspect with regard to the parent is a kind of a wcs, but it isn't any sort of absolute coordinate system. It is much simpler to handle though. It is something to discuss as to whether that information should affect operations on these arrays.

@mhvk
Copy link
Contributor

mhvk commented Jul 28, 2016

@perrygreenfield - yes, I agree. It was more that I felt that even an offset in an original frame is best done via a wcs. But this may just be an implementation detail.

@r-owen
Copy link

r-owen commented Aug 3, 2016

Here is the way LSST views things and what we are trying to emulate with NDArr: offset is a pair of ints that gives the position of the lower left corner of an image or subimage with respect to the parent image. It is 0,0 for the parent image (or a subimage whose lower left corner matches the parent). Pixel position = index on image + offset and is used almost everywhere that a pixel-based position is wanted, e.g. for pixel position of detected sources and to translate pixel to sky via a WCS.

We are keen to avoid having a WCS in NDArr, because as @perrygreenfield says, it makes simple operations such as addition ambiguous (are the arrays aligned via the WCS or via pixels?). Also, NDArr is nicely self-contained; many times a WCS is not required (and may not even be known).

For uniformity and expandability, I suggest that WCS stored as part of meta, e.g. meta.wcs. Other attributes that may prove useful (but should not be required) include exposure date and time, filter name and a PSF model. Having suggested names and implementations for these may provide useful uniformity, even while recognizing that not all will be present for all images.

LSST finds that the metadata is often useful without the pixels, so a single object containing all of it (including the WCS) is a big win. Some use cases:

  • Identify reference objects that overlap an image
  • Identify images taken with a particular filter and with good seeing that overlap each other, for coaddition.

As @mhvk notes, the offset of the NDArr, and also the dimensions (which LSST combines into one object called a bounding box) is needed in order to perform many or most operations with the metadata. I recommend making bounding box a required part of the metadata, and having the slicing operator update it as required (e.g. when obtaining a metadata for a subregion). The bounding box is the only item that requires this treatment. WCS can simply be copied when taking a subregion (since pixel position is defined with respect to the parent image).

APE9.rst Outdated
NDArr will support optional units since the propagation of units is
unambiguous in mathematical operations.

Supported numerical operations for NDData are: (+,-,*,/). While the discussion

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will NDData (or rather NDArr) really support these numerical operations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a typo. It was missed in editing. NDData here should be NDArr.

@crawfordsm
Copy link
Member

Thanks @perrygreenfield and @r-owen. I've made a few comments in line but I've also made a pull request against @perrygreenfield with an alternative solution.

The reason for the alternative solution is that I really don't like the idea of NDData being required to use NDArr. I think that limits the flexibility of NDData, which was one of its features, and if it isn't required to use NDArr, then it increases the ambiguity, which makes compatibility more difficult.

The bounding box/origin just seems like a very restricted case of WCS (where the only transformation is a translation), but I can see where having a shortcut like that could simplify the coding and be very useful so I could see that being included immediately into NDData regardless of this APE.

@perrygreenfield
Copy link
Member Author

@crawfordsm could you provide an example or two where an NDData implementation using NDArr reduces its flexibility. I'm not sure I see the point being made.

APE9.rst Outdated
object that mimics one). [An alternative option is to assume that the mask
attribute of NDArr is a callable whereas the mask attribute of NDData represents
a boolean array; this would be workable, but also likely confusing]
This APE proposes that the new attribute have the name: lone_ranger.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

@crawfordsm
Copy link
Member

The problem that arose previously was that if NDData has to have a specific type of uncertainty. That reduces the flexibility and it is not clear in your APE whether or not that is required as it would look that for NDArr variance is required and if NDData requires a valid NDArr that would imply that it required a variance.

@eteq
Copy link
Member

eteq commented Aug 22, 2016

Quick note: the ordering of APEs has traditionally been based on pull request number. So at some point this will need to be rebased to rename it from "APE9" to "APE11". (#12 is APE9 and #13 is APE10)

@eteq eteq changed the title Remaking NDData to support LSST needs Remaking NDData to support LSST needs (APE11) Aug 22, 2016
@jehturner
Copy link
Member

jehturner commented Sep 8, 2016

Sorry, I started replying to the prior email thread during my vacation in July, but then this APE got circulated before I finished my message and I'm just getting back to it after catching up with other things :-(.

The basic requirements here sound completely fine to me, with a few nice new features -- except that I still disagree with NDData (or a sub-class) not supporting mathematical operations. While I don't doubt there is valuable experience behind the objection to arithmetic, I'm thinking in particular of my own usage here; I would like to work with high-level objects describing exposures and all their logically-associated information while retaining the major convenience of being able to operate on them concisely, in a clear and ergonomic sequence. I really don't want bookeeping or conceptual operations that take several steps to perform in my scientific process. These are things I want to manipulate hands-on. If the default behaviour for arithmetic doesn't make sense, it can be altered in a similar way to your dmask.apply_flags() example or the attribute to disable variance propagation. Otherwise I am just going to end up extending it to work anyway (in fact I will have to if I re-base DataFile on it).

I still haven't thought of a great alternative name, but NDArr seems like it could easily be confused with NDArray. What do people think of something like NDPixData? Then again, people might get used to either.

I'm not clear on the memory implications of all these (strong?) references to parent objects, though this is something LSST must have considered carefully. Obviously in the ideal case, if one were to slice a data array, drop the variance/mask somehow and delete the original reference(s) to the parent, then only the main subarray would survive (might be awkward to implement without shooting oneself in the foot?).

By the way, Gemini has also been (with a bit of encouragement) re-basing a version of the AstroData class used in our pipeline on NDData.

I would favour retaining backwards compatibility insofar as it doesn't distort the design unduly.

I'll make a couple more comments in-line then have a better look over the discussion...

Thanks (& sorry for being a bit flaky),

James.

@jehturner
Copy link
Member

jehturner commented Sep 8, 2016

BTW: +1 to bit planes (I think we could probably deprecate flags?) and variance (insofar as it's the most convenient thing for me and Gemini, other people's use cases notwithstanding).

attribute. Making attributes such as these at the top level could be done
through subclasses, though it may lead to many variants. Perhaps the best way
to deal with this is for applications or libraries to state their requirements
for items required to be in meta and leave it at that.
Copy link
Member

@jehturner jehturner Sep 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have thought it will be useful to have more functionality than just a meta dictionary and WCS. While it's not at all practical to standardize everything comprehensively, some sort of fairly general meta-data abstraction (like Gemini has in our mostly-internal AstroData class) would help write data reduction routines that can be used with data from different instruments etc. (or at least avoid hard-wiring more conventions than are necessary). I thought we also came up with some useful attributes to parse into objects besides WCS at the meeting but will have to dig out the photo. One thing I wonder about, off hand, is tracking associations between NDData objects besides parent-subarray (eg. calibrations that go with a science exposure). By the way, is the implied alternative to an ExposureInfo object here just to keep meta in NDData?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jehturner The devil's in the details. If there is a broad consensus for extra attributes and their meaning, fine. But one could quickly get into a VO-like morass before long. I'd suggest putting that off and settling the other issues first.

@mhvk
Copy link
Contributor

mhvk commented Sep 21, 2017

@parejkoj - my specific suggestion would be to include a bit plane "USER", which is the one that gets affected by setting mask. But really this is jumping far ahead - I would definitely start by just allowing mask to be calculable, and not allowing it to be set.

@jehturner
Copy link
Member

Deriving a binary mask from bitplanes sounds fine to me. I just need the bitplanes...

I should be able to switch to NDArr (or whatever it's called), as and when I find time, as long as the functionality is there. There is already a bit of a proliferation of classes, which might be confusing and offputting for new users.

@jehturner
Copy link
Member

So it seems like there may be some consensus on NDArr, but I'm not clear on where meta is now kept, if NDDataArray and NDDataRef are deprecated? Does NDData have a .arr and we then sub-class that to implement all the functionality that NDDataArray had (or not)?

@mwcraig
Copy link
Member

mwcraig commented Sep 22, 2017

Good point; I'd been so focused on NDArr I forgot about that. A few thoughts:

  • NDDataArray/NDDataRef present almost the same public interface as before, but the internal wiring is a little different.
  • The numerical bits of NDDataArray/NDDataRef are handled by NDArr rather than mixins.
  • The tricky issue then becomes one of slicing and WCS, but that can be resolved in a sensible way I think (and maybe also in a backwards-compatible way).

I'll try to formalize all of the recent suggestions here into an update of the APE over the weekend.

@jehturner
Copy link
Member

That sounds reasonable. While NDArr might be the broadest standard, I think there's also collaborative value in code being able to accept an NDDataArray / NDDataRef-like object as input -- which is more awkward when there are two possible parents that do similar things. Could we therefore get rid of one of them and keep the other (especially if we're getting rid of mixins and are already resigned to some disruption anyway)?

The most obvious justification for having both classes would seem to be compatibility with old code, but I wonder whether one could be sub-classed from the other, so that it doesn't necessarily break but there is really only one implementation and isinstance(x, 'whichever_is_the_parent') will work for both (as well as duck typing, where that's better).

@parejkoj
Copy link

@mwcraig Just curious what the status of the updated APE is? I think we've had enough LSST-astropy agreement now that there shouldn't be many issues from our end, but we'll all probably have to read through the final document again to remind ourselves of everything that we've discussed over the past year+.

@mwcraig
Copy link
Member

mwcraig commented Oct 11, 2017

@parejkoj — I’ve been about 2 days from finishing the APE update for almost 2 weeks. Just wrapped up something that had been consuming a lot of time, so hopeful I’ll have the update ready by the weekend. Then one more broad call for comments, and assuming no major objections, the APE gets merged and the coding starts to implement it!

@mwcraig
Copy link
Member

mwcraig commented Dec 6, 2017

An update to the APE language is pending (perrygreenfield#3) but in the meantime I'll outline the revisions in the comments here.

API additions

  • The attribute origin is added to NDDataBase and its subclasses.
  • These classes are added to the nddata package: NDArr,
    VarianceUncertainty, and a bit plane class BitPlane.
  • A bit_plane property is added to NDDataArray and CCDData.

API changes

  • Only one of mask and bit_plane can be set on those object that have a bit_plane property.
  • If the bit_plane property is set, then mask is calculated from bit_plane (the user must have indicated in some way which bit planes to use).

API not changing

Though it is unusual to call out things that are not changing, it seems appropriate to do so for mask given the extensive discussion of the topic.

  • The meaning of mask will be the same as in current ndddata: a binary mask that follows the numpy masked array convention of True means ignore , False means use.
  • mask is still settable as in the past, though it will raise an exception if one sets mask and then tries to set bit_plane.

API deprecations in astropy 3 for removal in astropy 4

  • The FlagCollection class and the flags attribute of NDDataArray
    and CCDData. This functionality will be replaced by an implementation
    of bit planes.
  • The __array__ and __array_prepare__ methods of NDDataArray
    and CCDData, which allow them to be used as numpy arrays in expressions.

@mwcraig
Copy link
Member

mwcraig commented Dec 6, 2017

The diagram below summarizes the new class structure; most of the changes are in NDDataArray and its subclasses. That is an effort, in part, to ensure backwards compatibility of NDData (the container class) as much as possible.

I do realize the bit plane class is not on here yet...

EDIT: strikeout indicates deprecated attributes/properties/methods. Mint green indicates a change (addition or deletion).

new-nddata

@mwcraig
Copy link
Member

mwcraig commented Dec 15, 2017

@parejkoj -- this is ready for more feedback. I know there still needs to be some work on the details of the bit plane object, but I agree with @TallJimbo's advice above that users should be presented with a way to specify which planes are included in calculating the mask rather than access to individual planes.

Not sure I'll have time to implement anything more than a very basic bit plane before feature freeze for 3.0 next Friday, but my intent is to get as much of the stuff in this APE in there as I can....

@mwcraig
Copy link
Member

mwcraig commented Dec 20, 2017

@jehturner @crawfordsm @MSeifert04 -- comments welcome!

@jehturner
Copy link
Member

@mwcraig: Many thanks for the summary and sorry I can't really do this justice before the code freeze (I only became conscious of the latter recently and just couldn't find much time).

I think this sounds quite sensible overall, at least at a glance. Does this mean that the name NDArr is about to be baked in? I'm still a bit wary of that, but it can't hold things up indefinitely. I'm OK with deprecating flags, even though it means I'll have to rewrite my code (which I have to finish making Python 3 compatible anyway). Any special reason for deprecating __array__ and __array_prepare__? I think those seem useful off hand, albeit not in the spirit of LSST's usage.

Will NDArr contain (or behave like) a MaskedArray? I'd think that could help with compatibility, eg. where existing functions elsewhere know what to do with one (such as LinearLSQFitter as of today).

The way you have divided things up for backwards compatibility seems fine, but I'm again thinking that. in principle, there might be some code that could be compatible with either NDDataArray or NDDataRef (or similar classes, but not NDDataBase) and it's a bit messy asking, for example, "is x an NDData-like thing with arithmetic?", without a common ancestor. But maybe we just want to move away from supporting NDDataRef.

OK, have to go now. Hope that's slightly helpful (& thanks for the work and happy Christmas!).

@pllim
Copy link
Member

pllim commented May 31, 2023

It has been 6 years since the last reply, should we just say this won't go in at this point, or are the interested parties going to revisit this?

@parejkoj
Copy link

Thanks for the reminder. I think we're still interested in looking at differences in how we handle mask bit fields: we've already adopted CCDData for our alert packets, and submitted a PR to include a PSF cutout to support that. I'll see if we can find some time to discuss this more on the LSST side.

@pllim
Copy link
Member

pllim commented Apr 15, 2024

Thank you for your response, @parejkoj !

It has been almost another year since and CoCo reached out to VRO again. It appears while there is definitely interest, unfortunately there is no available bandwidth to pursue this in the near future. Therefore, CoCo has decided to reject this APE for now. However, I will open a follow-up issue (#100) to revisit should both interest and bandwidth return in the future.

Thank you all again and sorry this did not work out this time around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.