Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which (FITS) header should be stored in Spectrum1D.meta (and where)? #617

Open
dhomeier opened this issue Mar 23, 2020 · 16 comments
Open

Which (FITS) header should be stored in Spectrum1D.meta (and where)? #617

dhomeier opened this issue Mar 23, 2020 · 16 comments

Comments

@dhomeier
Copy link
Contributor

Loaders for FITS files are saving the full information from the header cards in Spectrum1D.meta['header']. In #608 (comment) I noted some ambiguity as to which header should be stored.
Most of the default_loaders for such formats, namely apogee, hst_*, muscles, subaru_pfs and sdss spec_loader, are pulling the info from the FITS header of the first HDU, while the spectral data are read from one (or several) of the extension HDUs.
This primary HDU header will usually contain general information on the target, observing run etc., while the HDU with the spectrum will include more specific info on the dataset. In principle both could be of interest. Should there be a general recommendation for new default (or custom) loaders (and a fix to existing ones) how to handle this?

I the following options:

  1. Leave it at the primary HDU header as with the majority of current default_loaders.
    Pros: no API changes
    Cons: discarding potentially useful information; inconsistent with Astropy's io.fits behaviour for Table (see below)

  2. Always read the header for the HDU containing the data.
    This is already implicitly done by generic_spectrum_from_table and spectrum_from_column_mapping as they are loading table.meta under the hood.
    Pros: Matches the Table.meta data; possibility to use io.fits mechanism to filter relevant keywords (i.e. excluding REMOVE_KEYWORDS and all, that have already been used in defining column formats and units).
    Cons: Losing info from the primary HDU header (among the current formats with data in a BINTABLE extension, at least the Apogee, HST, MUSCLES and SDSS also have potentially interesting information in the primary HDU).
    Some formats, e.g. apogee, are parsing spectrum data from several different extensions.

  3. Combine all header info from primary HDU and any extension HDU from which data are read.
    This is already implemented in the jwst_reader by using the (FITS) header.extend method to update the primary header with the data extension header.
    Pros: No or minimal loss of information.
    Cons: Possibly excessive accumulation of values in spectrum.meta['header']; writing the spectrum back will produce a FITS file with different headers (but currently there is no guarantee that any of the loaders will write back identically formatted files anyway).

I tend to the 3rd option, using a similar implementation to jwst_reader. It might still be preferable to use the Astropy Table.meta dict for updating to take advantage of the io.fits mechanism for stripping excess cards.

Addendum
In the above I was assuming that the Spectrum1D format always has the header info collected in one dictionary within the meta dictionary as Spectrum1D.meta['header']; this is also suggested in the docs for creating a custom loader by prescribing
meta = {'header': header}
and followed by most of the default_loaders.
But in fact neither parsing_utils nor jwst_reader follow this scheme, but instead put all saved header cards directly into Spectrum1D.meta.
I therefore suggest to first settle on a consistent scheme for organising the meta dict.

@hcferguson
Copy link
Contributor

I'm a bit worried about carrying around a lot of metadata by default in a data structure that could otherwise be relatively lightweight. Many users will want to do various things with the spectrum, perhaps combining it with data from other files. When writing out the data, they can go back to their FITS files and extract the relevant metadata, but I think it's very hard to anticipate in any general way what they will want.

Perhaps, at least as an interim solution, this could be handled in the docs by adding example or two of how to copy your FITS metadata into the Spectrum1D object.

That leaves the user free to put in the header or headers or select the subset of keywords they want. And also gives them a lot of freedom on the data structure to uses within meta -- e.g. a header string just copied from the FITS header as meta={'header': header}, a dictionary of key-value pairs, or a dictionary key-(value,comment) tuples, or a hierarchical dictionary {ext: {key:(value,comment)}}.

@dhomeier
Copy link
Contributor Author

I'm a bit worried about carrying around a lot of metadata by default in a data structure that could otherwise be relatively lightweight. Many users will want to do various things with the spectrum, perhaps combining it with data from other files. When writing out the data, they can go back to their FITS files and extract the relevant metadata, but I think it's very hard to anticipate in any general way what they will want.

So your suggestion would be as a 4th option to not read any header info (by default) at all?
I would make this at least selectable with a kwarg, as I am sceptical how willing users would be to open their files separately through io.fits just to access the metadata.
If the available/reasonable/useful data are read in, say with
Spectrum1D.read(..., read_header=True)
the dictionary in Spectrum1D.meta can still be filtered, reformatted, stripped down etc.

@hcferguson
Copy link
Contributor

Ah. Good point about having to use io.fits separately to get the metadata. But even if we add an argument read_header=True, there is still the issue of "which FITS header?"

I don't think one can default to storing both the primary and extension header in the same string or flat dictionary, because there can be naming collisions between the two headers. So one either needs dictionary of strings or a nested dictionary, or something like an HDU list.

@dhomeier
Copy link
Contributor Author

Yes, the header.extend method that jwst_reader is currently using does provide some control over that through the update and unique keywords, but it does not include a convenient method to store duplicate cards under unique names.
I had been thinking about (my original) option 4, making spectrum.meta['header'] a list, but that seemed most likely to break current usage. But since the current default_loaders do not return a consistent format right now, maybe it's acceptable.

@hcferguson
Copy link
Contributor

Since we already have a dependency on io.fits to read the file -- and with the same drawback of breaking current usage -- might it make sense to have something like spectrum.meta['fits_headers'] be a list of astropy.io.fits.header.Header objects? (If read_header=True...I'd still suggest not making this the default).

@kelle
Copy link
Member

kelle commented Nov 8, 2021

I think it's totally fine (and good) to collect header/meta data. The problem I always encounter is that there is not enough meta data to fully characterize the data, rather than the other way around.

I don't have strong feelings about read_header = True | False but I do think that when a Spectrum1D object is created, it should contain a header with a minimal amount of information.

I would as much as possible like to take advantage of and mirror the functionality which exists already for headers so I think I am in favor of meta including a list of header objects.

@eteq
Copy link
Member

eteq commented Nov 8, 2021

Having gone back through this with a bit of hindsight, I think I want to vote for @dhomeier's original option 3. My reasoning is twofold:

  1. As I user I don't want to do spectrum.meta['something'][psosibly_somthing_else]['t_exp']. I want spectrum.meta['t_exp']. Anything more feels like putting effort on the user that should be instead in the reader.
  2. A Spectrum1D object is meant to represent either a single spectrum, or a vector of similar spectra, not a file. So it doesn't make sense to have the meta be the metadata of a spectrum, not of all the information that was in one file.

2 in particular leads to the impedance mismatch we're struggling with: FITS or similar files in principle can store multiple spectra in one file, but Spectrum1D by design is not so loose. So I think we necessarily have to do some funneling, and that's OK.

As a middle-ground of #617 (comment) and this option 3, another possibility (Option 3.5?) is to do option 3 and also what @hcferguson suggests in #617 (comment) - then the user can see the "originals" if they really want to, but we tell them the only thing that will be used by downstream tools is what's in meta itself.

As for read_header=False/option 4, I'm broadly against that. I think in general it's better to assume that if the user doesn't know what they're doing, we're better off just carrying as much of the information they want as possible (and using warnings or exceptions only when the information is in conflict). Moreover, I think for the user who really wants/has the time to customize their meta in this way, you tell them instead to just do something like:

spectrum = Spectrum1D.read(...)
spectrum.meta.clear()

That's only one line more, and is more explicit in that it makes it clear that the user is intentionally starting fresh. And that leaves us free to implement option 3 in read and in this case that @hcferguson raises that would be option 4, the user still gets what they want.

@kelle
Copy link
Member

kelle commented Dec 7, 2021

I'm excited to move this discussion forward and am happy with the header data (a dictionary) be in the meta. If we are all in favor, should we document this decision somewhere?

In the docs, I think we need to include info about meta in

  • Basic Spectrum Creation
  • Somewhere in the readers. E.g., does the iraf reader put the IRAF header into the Spectrum1d meta?

If there's code which needs to be written for the writers/readers, I'm happy to contribute, especially if someone pointed me in the right direction. E.g., open some issues and assign me. :)

@kelle
Copy link
Member

kelle commented Dec 21, 2021

I've done some "round trip" testing using the tabular-fits format and it doesn't handle multiple dictionaries in meta well at all. For example, if I have two dictionaries in meta['header'] and meta['header2'], only one of them gets preserved in the resulting FITS file. And when I read it back in, all the header cards are in meta and not in meta['header']. Bottom line, any dictionary sub-structure within the meta dictionary is not currently preserved and it would take some work to make it so.

As a result, I re-affirm my vote that we recommend that all header cards be stored directly in meta.
Update Nov 16, 2023. I change my vote to store the Primary header in meta['header'].

I can also update the tabular-fits writer to be less sensitive to there being a meta['header'] and instead find all the dictionaries in meta.

@hamogu
Copy link
Member

hamogu commented Dec 22, 2021

I agree with putting everthing directly in meta because that's the most usable.

I do want to point out something though to keep in mind for both users and developers: Operations that the user performs on the Spectrum1D object can make the values saved in the header wrong, e.g. I can manually update the wavelength solution or divide the flux by 5. In those cases I should probably also update the meta info (e.g. divide the EXPTIME by 5), but that's easily forgotten and, given the range of keywords used in fits headers, not possible automatically. So, there is an argument not to propagate too much information - propagate only what's right and drop what's dubious unless the user explicitly chooses to keep e.g. the EXPTIME (presumably they know way). People have come to trust fits header and I would hate it to receive a file from a collaborator that looks exactly like a HST/COS spectrum, but has been modified in some way that's not recorded in the headers. So, I think writing out fewer keywords or at least header in a different format is "a good thing".

@tepickering
Copy link
Contributor

tepickering commented Dec 23, 2021

it's always going to be possible for data and metadata to get out of sync somehow (if i had a nickel for every time i've had this happen with a FITS file...). i don't think dropping metadata on the floor is the answer, though. anything in meta should be included in an output file somehow. that's easier in some formats than others with FITS being particularly restrictive.

@kelle
Copy link
Member

kelle commented Nov 15, 2023

There seems to be consensus that the Primary Header, e.g., whatever you'd put in HDU0, "should" be in meta['header'].

There is not a consensus around what to do with extension headers and I am currently ok with that.

I propose that docs be updated to reflect the general expectation that a header be put in meta['header']. I am going to open a dedicated issue with this bytesize suggestion.

@dhomeier
Copy link
Contributor Author

As a result, I re-affirm my vote that we recommend that all header cards be stored directly in meta.

I agree with putting everthing directly in meta because that's the most usable.

I understood those two just to be the opposite, storing the cards directly as keys of meta?

@kelle
Copy link
Member

kelle commented Nov 16, 2023

Yes, you're right, I've changed my mind. I'm gonna edit my comment.

@cshanahan1
Copy link
Contributor

Since there are a lot of open issues on this topic, I went ahead and made a discussion to consolidate these discussions and so we can decide how to proceed - we have a ticket on our board this sprint to start this work. I have a proposed plan there (storing everything in 'meta' as Header objects, and keeping the primary header separate for those formats that want to preserve that structure, or combine them) but I would appreciate some input before I start this work to make sure we are in agreement. Thanks!

#1125

@simontorres
Copy link
Contributor

I think most of the information should go into the primary header, according to my understanding of multi extension files their use case is to store different planes of the same data not just a pile of that, for that case another model should be used. I've also seen multidetector instruments use MEFs, where each detector is stored as a separate extension . In either case the user must be able to control where a given metadata is going to be stored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants