-
-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix default_loaders tabular-fits format and automatic recognition #573
Conversation
@@ -22,7 +22,7 @@ | |||
# need to add `format="my-format"` in the `Spectrum1D.read` call. | |||
def identify_generic_fits(origin, *args, **kwargs): | |||
return (isinstance(args[0], str) and | |||
fits.connect.is_fits(origin, *args) and | |||
fits.connect.is_fits(origin, origin, *args) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the astropy is_fits
function, it seems that the first argument isn't even considered, and the second argument just checks for extension matches. Might it be better to pass an open fits object as the third parameter to check for the fits signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it actually expects a file object (and unfortunately does not work with urlopen
due to the seek
requirements) - thought that the identify_*
functions were already called with the open file, but of course it is explicitly testing for a name string above...
These are all loaders for formats without tests or sample files for the format; now that I've been looking into the other loaders it always seems to be preferable to use the with fits.open()
context, which gives you the signature check for free plus the hdulist for further investigation of the content. Will try to fix the remaining ones as a use case nonetheless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched the other loaders, which access the header for further investigation, to with fits.open
; that leaves subaru_pfs_spec
as only case directly calling is_fits
, and that is tested now.
|
||
meta = {'header': header} | ||
uncertainty = StdDevUncertainty(tab["ERROR"]) | ||
data = tab["FLUX"] | ||
wavelength = tab["WAVELENGTH"] | ||
data = Quantity(tab["FLUX"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this end up unitless? Should it be looking for some unit defined in the header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit is already parsed with the Table.read()
. I've added a check for this to the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I always assumed one would have to use QTable.read
to get columns as quantity objects. Good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact the regular Table Column
supports basic quantity features, but is still subtly different Therefore the cast to Quantity
is still needed - alternatively one could indeed get it directly with QTable.read
.
@@ -69,6 +68,10 @@ def tabular_fits_loader(file_name, column_mapping=None, hdu=1, **kwargs): | |||
|
|||
tab = Table.read(file_name, format='fits', hdu=hdu) | |||
|
|||
# Minimal consistency check for wcs consistency with table data | |||
if wcs.naxis != 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not always guaranteed that the number of world coordinate axes in the fits file will be one (e.g. a data cube, or data with a time dimension). Perhaps it'd be better to check explicitly for a spectral axis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not sure to what extent tabular-fits
does, or is supposed to, support >1D spectra (thought that generic_cube
was for that case, but again I don't have any sample files to test). But if any, I would rather expect it to support the simplest case of a spectral cube or time series with a common (single 1D) spectral axis. Other cases, where each flux column could have its own zero point or completely individual dispersion scale, or possible intermediate cases where some subset of the spectra shares a spectra axis, are IMO really better left to generic_cube
; at least I have no idea how to correctly identify such cases where
-
the flux data have >1 dimension
-
the spectra in the flux cube do not have identical spectral axes and
-
these wavelength/frequency scales are not included in the
BINTABLE
, but must be constructed from a WCS defined in the header.
Perhaps @teuben who wrote the initial version of generic_cube
has a better idea about what possible data formats could be encountered.
So I would tend to just check for
if not (wcs.naxis == 1 and wcs.array_shape[0] == len(tab):
which could be extended to
if not (wcs.naxis == 1 and wcs.array_shape[0] == len(tab) or
wcs.array_shape == tab[fluxname].shape):
just to allow for the case with all individual spectral axes, but note that then the correct column for the flux would have to be identified from either column_mapping
or analogously to _find_spectral_column
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have included the 2nd option, but simply assumed that the flux data would be in the first table column - should of course also work it there is a spectral_axis
, but in that case it's unclear why bother with the WCS at all...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. This is tricky. Ideally, the wcs should tell us the column axis type (e.g. wcs.world_axis_names
, wcs.world_axis_types
, etc). But obviously this requires that the fits standard be well followed in all data files (😆). Maybe @eteq has some insight.
I'm tempted to just say "yeah, you're right -- people should just using generic_cube
", but if that's the case, we should raise an appropriate error when wcs.naxis !=
to point users to generic_cube
for those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is also I cannot even tell if generic_cube
works, as I don't really know how a valid file should look like.
@@ -28,33 +28,52 @@ | |||
|
|||
def spec_identify(origin, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be worth adding to the /io/default_loaders/parsing_utils.py
file. If _spec_pattern
is unique between loader definitions, maybe adding in another parameter so that it can passed to the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To provide a more generic wrapper for is_fits
, with _spec_pattern
to be passed by the loader? That might be useful, although I would in general recommend to try to identify the spectrum types on file content rather than naming schemes. But it's useful to have various file type and path checks provided in one function, agreed.
Note that is_fits
called with fileobj=None
will identify any valid name pattern as a FITS file, whether filepath
actually points to an existing file or not. For _fits_identify_by_name
I have made it a requirement that the file exists, since the loaders will eventually access it anyway.
833c099
to
eecd00d
Compare
@nmearl I have rebased to pull in the updates to the JWST reader from #579, and while at that added a more readable MUSCLES label. Wondering if the |
f142303
to
24f61da
Compare
@dhomeier I don't want to hold up this PR anymore, but I was hoping to get some comments from Erik concerning the checking for the spectral axis. Would you mind opening a PR capturing the conversation about spectral axis checking and what how explicit we should be knowing that we have the |
On the question of checking for spectral axis: I think I like the implementation as it stands here, at least as a solid iteration. I also interpreted
Indicates, I think there's not really a foolproof way to do this. The one case that might be useful to test is a fits file where the flux column is a multi-dimensional array - I've never seen that in the wild, but that doesn't mean it doesn't exist, and it seems natural (and trivial to implement) that that would load into We might consider an additional reader down the line that can handle "multiple spectra that aren't cubes", but I'd say that's a follow-on PR and doesn't need to hold this up. So @dhomeier, if you want to add a quick test of the above case, please do so, but if we'd rather think of this as a separate follow-on as well, I'm fine with merging as this stands. |
(Also, as I side note, some info about the guiding principles separating these loaders should be in the docs, but #394 is really the main issue for that) |
Oh, I have actually produced a couple of those, which might still be out somewhere - the column is not flux but
Since it's not as quick to produce a simple example case and requires a number of changes to When looking at the I've also tested a number of improvements for the auto-detection of |
Can you expand on this a bit? How are you accessing the data within the table? Specutils follows the numpy row-major formalism; it may be that this is just a quirk of data tables being preferentially column-major in general. Otherwise, it sounds like this PR is good to go with the caveat of opening (as I see it) two new issues to be addressed in follow-up PRs:
|
I think it's the consequence of
which is loaded as a
and
which I would actually consider the more natural representation – |
I guess that would require
|
I do not think we want to implement this handling in the intializer. The supported inputs are specifically
I'm not sure I'm seeing what the issue is here? If the flux array is multi-dimensional, the slicing is intended to occur on the elements in the flux array, returning a new |
OK, then it can be handled in
Slicing would straightforwardly extract a desired wavelength range (assuming you have found the corresponding indices). For the 2d spectrum I don't see how to get a comparable functionality, even if you checked for the different
|
I've now implemented this as described above, trying to write it as general as possible, but really only checked and added tests for |
wlu = {'wavelength': u.AA, 'frequency': u.GHz, 'energy': u.eV, | ||
'wavenumber': u.cm**-1} | ||
# Create a small data set with 2D flux + uncertainty | ||
disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually we're going to want to ensure that any spectral axis in frequency space is strictly descending. I'm not sure we need to worry about that case right now, but something to think about.
disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis] | |
disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis] | |
if spectral_axis == 'frequency': | |
disp = disp[::-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually we're going to want to ensure that any spectral axis in frequency space is strictly descending. I'm not sure we need to worry about that case right now, but something to think about.
That's actually everything but wavelength, right? Yes, if there is functionality that depends on descending order in frequency/energy, that's probably good to keep as reminder.
Feel free to add a new section for 1.1 and include a change log entry, otherwise we generally comb through the PRs and add any merged ones to the PR right before we do a release. This looks good! |
@dhomeier can you rebase to get rid of that last merge commit? |
9edd8cd
to
845b9df
Compare
@nmearl sorry, had already updated locally, but I hope everything is in order now. |
Thanks for your work on this @dhomeier! |
Thanks for the feedback @nmearl! |
Trying to clean up with a couple of problems I encountered in
default_loaders
:spectrum_from_column_mapping
was failing due to use of wrongequivalence
keyword, also did not accept unit names the way indicated in the docstring (promoted totabular_fits_loader
).tabular_fits_loader
withoutcolumn_mapping
keyword was still failing on any file I could find, because it unconditionally used even an (essentially empty)WCS
from headers without any wcs info, instead of using dispersion information from the table.I had botched the correct
is_fits
calling method in some loaders.muscles-sed
seemed not to properly handle currentUnit
andQuantity
usage; no tests either.Added header-based identification for the
apogee
formats + tests.In
tabular_fits.py
I am making only a very basic check if thewcs
can represent a 1Dspectral_axis
to fix #531; maybe a closer comparison with the shape of the flux data is in place.