
Fix default_loaders tabular-fits format and automatic recognition #573

Merged
merged 16 commits into from
Mar 31, 2020

Conversation

dhomeier
Contributor

Trying to clean up with a couple of problems I encountered in default_loaders:

  1. spectrum_from_column_mapping was failing due to use of a wrong equivalency keyword, and also did not accept unit names in the way indicated in the docstring (promoted to tabular_fits_loader).

  2. tabular_fits_loader without the column_mapping keyword was still failing on any file I could find, because it unconditionally used the WCS from the header, even an essentially empty one without any WCS info, instead of using the dispersion information from the table.

  3. I had botched the correct is_fits calling method in some loaders.

  4. muscles-sed seemed not to properly handle current Unit and Quantity usage; no tests either.

  5. Added header-based identification for the apogee formats + tests.

In tabular_fits.py I am making only a very basic check whether the WCS can represent a 1D spectral_axis to fix #531; maybe a closer comparison with the shape of the flux data is in order.

@dhomeier dhomeier requested review from nmearl and kelle January 23, 2020 02:10
@@ -22,7 +22,7 @@
 # need to add `format="my-format"` in the `Spectrum1D.read` call.
 def identify_generic_fits(origin, *args, **kwargs):
     return (isinstance(args[0], str) and
-            fits.connect.is_fits(origin, *args) and
+            fits.connect.is_fits(origin, origin, *args) and
Contributor

Looking at the astropy is_fits function, it seems that the first argument isn't even considered, and the second argument just checks for extension matches. Might it be better to pass an open fits object as the third parameter to check for the fits signature?

Contributor Author

You're right, it actually expects a file object (and unfortunately does not work with urlopen due to the seek requirements) - thought that the identify_* functions were already called with the open file, but of course it is explicitly testing for a name string above...
These are all loaders for formats without tests or sample files; now that I've been looking into the other loaders, it generally seems preferable to use the with fits.open() context, which gives you the signature check for free plus the HDU list for further investigation of the content. Will try to fix the remaining ones as a use case nonetheless.

Contributor Author

Switched the other loaders, which access the header for further investigation, to with fits.open; that leaves subaru_pfs_spec as the only case directly calling is_fits, and that is tested now.


 meta = {'header': header}
 uncertainty = StdDevUncertainty(tab["ERROR"])
-data = tab["FLUX"]
 wavelength = tab["WAVELENGTH"]
+data = Quantity(tab["FLUX"])
Contributor

Does this end up unitless? Should it be looking for some unit defined in the header?

Contributor Author

The unit is already parsed with the Table.read(). I've added a check for this to the test.

Contributor

Interesting, I always assumed one would have to use QTable.read to get columns as quantity objects. Good to know.

Contributor Author

In fact the regular Table Column supports basic quantity features, but is still subtly different; therefore the cast to Quantity is still needed. Alternatively one could indeed get it directly with QTable.read.

@@ -69,6 +68,10 @@ def tabular_fits_loader(file_name, column_mapping=None, hdu=1, **kwargs):

     tab = Table.read(file_name, format='fits', hdu=hdu)

+    # Minimal consistency check of the wcs against the table data
+    if wcs.naxis != 1:
Contributor

It's not always guaranteed that the number of world coordinate axes in the fits file will be one (e.g. a data cube, or data with a time dimension). Perhaps it'd be better to check explicitly for a spectral axis?

Contributor Author

I was not sure to what extent tabular-fits does, or is supposed to, support >1D spectra (I thought that generic_cube was for that case, but again I don't have any sample files to test). But if any, I would rather expect it to support the simplest case of a spectral cube or time series with a common (single 1D) spectral axis. Other cases, where each flux column could have its own zero point or a completely individual dispersion scale, or possible intermediate cases where some subset of the spectra shares a spectral axis, are IMO really better left to generic_cube; at least I have no idea how to correctly identify such cases where

  1. the flux data have >1 dimension

  2. the spectra in the flux cube do not have identical spectral axes and

  3. these wavelength/frequency scales are not included in the BINTABLE, but must be constructed from a WCS defined in the header.

Perhaps @teuben who wrote the initial version of generic_cube has a better idea about what possible data formats could be encountered.

So I would tend to just check for

    if not (wcs.naxis == 1 and wcs.array_shape[0] == len(tab)):

which could be extended to

    if not (wcs.naxis == 1 and wcs.array_shape[0] == len(tab) or
            wcs.array_shape == tab[fluxname].shape):

just to allow for the case with all individual spectral axes, but note that then the correct column for the flux would have to be identified from either column_mapping or analogously to _find_spectral_column.

Contributor Author

I have included the 2nd option, but simply assumed that the flux data would be in the first table column - it should of course also work if there is a spectral_axis, but in that case it's unclear why bother with the WCS at all...

Contributor

@nmearl nmearl Feb 3, 2020

Hmm. This is tricky. Ideally, the wcs should tell us the column axis type (e.g. wcs.world_axis_names, wcs.world_axis_types, etc). But obviously this requires that the fits standard be well followed in all data files (😆). Maybe @eteq has some insight.

I'm tempted to just say "yeah, you're right -- people should just use generic_cube", but if that's the case, we should raise an appropriate error when wcs.naxis != 1 to point users to generic_cube for those cases.

Contributor Author

Problem is also I cannot even tell whether generic_cube works, as I don't really know what a valid file should look like.

Resolved review threads (outdated): specutils/io/parsing_utils.py ×2, specutils/tests/test_loaders.py ×5
@nmearl nmearl added this to the v0.8 milestone Jan 23, 2020
@@ -28,33 +28,52 @@

def spec_identify(origin, *args, **kwargs):
Contributor

This may be worth adding to the /io/default_loaders/parsing_utils.py file. If _spec_pattern is unique between loader definitions, maybe add another parameter so that it can be passed to the function.

Contributor Author

To provide a more generic wrapper for is_fits, with _spec_pattern to be passed by the loader? That might be useful, although in general I would recommend trying to identify the spectrum types from file content rather than naming schemes. But it's useful to have the various file type and path checks provided in one function, agreed.

Note that is_fits called with fileobj=None will identify any valid name pattern as a FITS file, whether filepath actually points to an existing file or not. For _fits_identify_by_name I have made it a requirement that the file exists, since the loaders will eventually access it anyway.

@dhomeier
Contributor Author

@nmearl I have rebased to pull in the updates to the JWST reader from #579, and while at it added a more readable MUSCLES label. Wondering if the tabular-fits label should not also be changed to something more descriptive like "Generic FITS Table", but there might be more existing code relying on the old label.

@dhomeier dhomeier force-pushed the loaders-fix-tabular branch 2 times, most recently from f142303 to 24f61da on February 18, 2020 18:23
@nmearl
Contributor

nmearl commented Feb 19, 2020

@dhomeier I don't want to hold up this PR anymore, but I was hoping to get some comments from Erik concerning the checking for the spectral axis. Would you mind opening an issue capturing the conversation about spectral axis checking and how explicit we should be, given that we have the generic_cube loader? Then, I think we can go ahead and merge this.

@eteq
Member

eteq commented Feb 21, 2020

On the question of checking for spectral axis: I think I like the implementation as it stands here, at least as a solid iteration. I also interpreted tabular-fits as being "a table that maps to a single spectrum". And as the 😆 surrounding:

But obviously this requires that the fits standard be well followed in all data files

indicates, I think there's not really a foolproof way to do this.

The one case that might be useful to test is a fits file where the flux column is a multi-dimensional array - I've never seen that in the wild, but that doesn't mean it doesn't exist, and it seems natural (and trivial to implement) that that would load into Spectrum1D as expected.

We might consider an additional reader down the line that can handle "multiple spectra that aren't cubes", but I'd say that's a follow-on PR and doesn't need to hold this up.

So @dhomeier, if you want to add a quick test of the above case, please do so, but if we'd rather think of this as a separate follow-on as well, I'm fine with merging as this stands.

@eteq
Member

eteq commented Feb 21, 2020

(Also, as I side note, some info about the guiding principles separating these loaders should be in the docs, but #394 is really the main issue for that)

@dhomeier
Contributor Author

The one case that might be useful to test is a fits file where the flux column is a multi-dimensional array - I've never seen that in the wild, but that doesn't mean it doesn't exist, and it seems natural (and trivial to implement) that that would load into Spectrum1D as expected.

Oh, I have actually produced a couple of those, which might still be out somewhere - the column is not flux but Intensity (for different angles), but one might expect that this can be loaded with column_mapping.
Turns out this fails in various places:

  1. _find_spectral_column tries to transform the flux column to "Jy" creating an equivalency from spectral_axis - this could be easily overcome by broadcasting the latter to the flux shape.

  2. Spectrum1D expects the last axis of the flux initialiser to match spectral_axis, while Astropy Table is loading it as a (NAXIS1, shape(dtype(flux))) array. This will have to be transposed first. Conversely the writer currently fails when given a spectrum with such a flux of shape (M, NAXIS1), because Table again requires the transposed format. All this would be a bit easier if Spectrum1D and Table/FITS weren't using opposite conventions for the order of axes.

We might consider an additional reader down the line that can handle "multiple spectra that aren't cubes", but I'd say that's a follow-on PR and doesn't need to hold this up.

So @dhomeier, if you want to add a quick test of the above case, please do so, but if we'd rather think of this as a separate follow-on as well, I'm fine with merging as this stands.

Since it's not as quick to produce a simple example case and requires a number of changes to parsing_utils, Spectrum1D.__init__ and the loaders, I concur that this is a PR of its own.
Also this still does not address the wcs verification in tabular_fits, which deals with a spectral_axis that is not part of the table, but encoded as a WCS object.

When looking at the wcs1d-fits examples, their WCS returns a spectral_axis of shape (0, NAXIS1), so maybe tabular-fits should also test for
wcs.naxis == 1 and wcs.array_shape[-1] == len(tab)
but then again in the wcs1d-fits files the flux is stored as NAXIS1x1 dim image, so I don't really know what one should expect for the combination TABLE + WCS, so this can probably wait for the follow-up PR.

I've also tested a number of improvements for the auto-detection of wcs1d-fits, but I'd defer them to their own PR as well.

@dhomeier
Contributor Author

(Also, as I side note, some info about the guiding principles separating these loaders should be in the docs, but #394 is really the main issue for that)

Perhaps also #584 for ranking the loaders by priority.

@nmearl
Contributor

nmearl commented Mar 3, 2020

Spectrum1D expects the last axis of the flux initialiser to match spectral_axis, while Astropy Table is loading it as a (NAXIS1, shape(dtype(flux))) array

Can you expand on this a bit? How are you accessing the data within the table? Specutils follows the numpy row-major formalism; it may be that this is just a quirk of data tables being preferentially column-major in general.

Otherwise, it sounds like this PR is good to go with the caveat of opening (as I see it) two new issues to be addressed in follow-up PRs:

  1. Support broadcasting in _find_spectral_column for file loading.
  2. Extend tabular-fits to include multiple spectra that aren't cubes.

Does this sound right @dhomeier, @eteq?

@dhomeier
Contributor Author

Spectrum1D expects the last axis of the flux initialiser to match spectral_axis, while Astropy Table is loading it as a (NAXIS1, shape(dtype(flux))) array

Can you expand on this a bit? How are you accessing the data within the table? Specutils follows the numpy row-major formalism; it may be that this is just a quirk of data tables being preferentially column-major in general.

I think it's a consequence of io.fits loading it as a recarray; e.g. my file in question is structured like this (the actual second column is intensity, but renamed Flux here to allow generic_spectrum_from_table to find it).

TTYPE1  = 'Wavelength'
TFORM1  = 'E       '
TUNIT1  = 'Angstrom'
TTYPE2  = 'Flux     '          / I_mu/<I>
TFORM2  = '20E     '
TUNIT2  = 'erg/s/cm2/Angstrom'

which is loaded as a FITS_rec of

>>> hdu.data.shape
(314931,)
>>> hdu.data.dtype
dtype((numpy.record, [('Wavelength', '>f4'), ('Flux', '>f4', (20,))]))

and Table.read() makes it a

<Table length=314931>
Wavelength      Flux [20]            
 Angstrom     erg / (Angstrom cm2 s)     
 float32      float32             
---------- --------------------------------
   3.0e+03  9.5420114e-07 ..  1.3871580e+00

which I would actually consider the more natural representation – Wavelength and Flux column having the same length (number of rows). So I'd tend to adapt Spectrum1D.__init__ to handle this for tables in general, since you could not even construct a table from a length-314931 wavelength and a length-20 flux column.

@dhomeier
Contributor Author

dhomeier commented Mar 30, 2020

I guess that would require Spectrum1D.__init__ to distinguish between Table/recarray columns and plain arrays, so the current handling of 2D arrays is not broken. Although the latter seems to have some flaws e.g. with slicing:

>>> sp1d = Spectrum1D(spectral_axis=np.arange(5)*u.nm, flux=np.ones(5)*u.Jy)
>>> sp1d[:2]
<Spectrum1D(flux=<Quantity [1., 1.] Jy>, spectral_axis=<Quantity [0., 1.] nm>)>
>>> sp2d = specutils.Spectrum1D(spectral_axis=np.arange(5)*u.nm, flux=np.ones((3, 5))*u.Jy)
>>> sp2d[:2]
<Spectrum1D(flux=<Quantity [[1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.]] Jy>, spectral_axis=<Quantity [0., 1., 2., 3., 4.] nm>)>

@nmearl
Contributor

nmearl commented Mar 30, 2020

I guess that would require Spectrum1D.__init__ to distinguish between Table/recarray columns and plain arrays

I do not think we want to implement this handling in the initializer. The supported inputs are specifically Quantity objects, and we want to avoid the case where we're bloating the initializer by doing users' data transformations for them (e.g. I can imagine getting future requests for taking pandas dataframes, dask objects, etc.), and we certainly do not want that burden. It should be on the user and the data loaders to change the flux and spectral axis values to the appropriate Quantity format.

Although the latter seems to have some flaws e.g. with slicing:

I'm not sure I'm seeing what the issue is here? If the flux array is multi-dimensional, the slicing is intended to occur on the elements in the flux array, returning a new Spectrum1D.

@dhomeier
Contributor Author

I do not think we want to implement this handling in the initializer. The supported inputs are specifically Quantity objects, and we want to avoid the case where we're bloating the initializer by doing users' data transformations for them (e.g. I can imagine getting future requests for taking pandas dataframes, dask objects, etc.), and we certainly do not want that burden.

OK, then it can be handled in parsing_utils. Using tables or columns seemed a common way to initialise the spectrum (and IMO they are a bit different from pandas frames etc., since they are an Astropy standard data format and already come with Quantity support), but you're right that the docstrings do not mention them in any way.
I think the astropy.nddata.NDData part should be removed then, since they are not accepted either.

Although the latter seems to have some flaws e.g. with slicing:

I'm not sure I'm seeing what the issue is here? If the flux array is multi-dimensional, the slicing is intended to occur on the elements in the flux array, returning a new Spectrum1D.

Slicing would straightforwardly extract a desired wavelength range (assuming you have found the corresponding indices). For the 2d spectrum I don't see how to get a comparable functionality, even if you checked for the different ndim and shape:

>>> sp2.shape
(3, 5)
>>> sp2[:,:2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/derek/lib/python3.8/site-packages/specutils/spectra/spectrum1d.py", line 172, in __getitem__
    return self._copy(
  File "/Users/derek/lib/python3.8/site-packages/specutils/spectra/spectrum1d.py", line 199, in _copy
    return self.__class__(**alt_kwargs)
  File "/Users/derek/lib/python3.8/site-packages/specutils/spectra/spectrum1d.py", line 111, in __init__
    raise ValueError('Spectral axis ({}) and the last flux axis ({}) lengths must be the same'.format(
ValueError: Spectral axis (5) and the last flux axis (2) lengths must be the same

@dhomeier
Contributor Author

I've now implemented this as described above, trying to write it as generally as possible, but really only checked and added tests for ndim=2.
I haven't added a changelog entry as there is no section for 1.0 or 1.1 yet.

wlu = {'wavelength': u.AA, 'frequency': u.GHz, 'energy': u.eV,
       'wavenumber': u.cm**-1}
# Create a small data set with 2D flux + uncertainty
disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis]
Contributor

Eventually we're going to want to ensure that any spectral axis in frequency space is strictly descending. I'm not sure we need to worry about that case right now, but something to think about.

Suggested change
-    disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis]
+    disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis]
+    if spectral_axis == 'frequency':
+        disp = disp[::-1]

Contributor Author

Eventually we're going to want to ensure that any spectral axis in frequency space is strictly descending. I'm not sure we need to worry about that case right now, but something to think about.

That's actually everything but wavelength, right? Yes, if there is functionality that depends on descending order in frequency/energy, that's probably good to keep as a reminder.

@nmearl
Contributor

nmearl commented Mar 30, 2020

I haven't added a changelog entry as there is no section for 1.0 or 1.1 yet.

Feel free to add a new section for 1.1 and include a change log entry; otherwise we generally comb through the PRs and add any merged ones to the change log right before we do a release.

This looks good!

@nmearl
Contributor

nmearl commented Mar 31, 2020

@dhomeier can you rebase to get rid of that last merge commit?

@dhomeier
Contributor Author

@nmearl sorry, had already updated locally, but I hope everything is in order now.

@nmearl nmearl merged commit de2a442 into astropy:master Mar 31, 2020
@nmearl
Contributor

nmearl commented Mar 31, 2020

Thanks for your work on this @dhomeier!

@dhomeier
Contributor Author

Thanks for the feedback @nmearl!

Successfully merging this pull request may close these issues.

Spectrum1D.read fails on reading its own files written as 'tabular-fits' or 'wcs-fits'
3 participants