
Fix default_loaders tabular-fits format and automatic recognition #573

Merged
merged 16 commits into from
Mar 31, 2020

Conversation

dhomeier
Contributor

Trying to clean up with a couple of problems I encountered in default_loaders:

  1. spectrum_from_column_mapping was failing due to use of a wrong equivalency keyword, and also did not accept unit names in the way indicated in the docstring (promoted to tabular_fits_loader).

  2. tabular_fits_loader without the column_mapping keyword was still failing on any file I could find, because it unconditionally used the WCS from the header, even an essentially empty one without any WCS info, instead of using the dispersion information from the table.

  3. I had botched the correct is_fits calling method in some loaders.

  4. muscles-sed seemed not to properly handle current Unit and Quantity usage; no tests either.

  5. Added header-based identification for the apogee formats + tests.

In tabular_fits.py I am making only a very basic check whether the WCS can represent a 1D spectral_axis to fix #531; maybe a closer comparison with the shape of the flux data is in order.

@dhomeier dhomeier requested review from nmearl and kelle January 23, 2020 02:10
@@ -22,7 +22,7 @@
 # need to add `format="my-format"` in the `Spectrum1D.read` call.
 def identify_generic_fits(origin, *args, **kwargs):
     return (isinstance(args[0], str) and
-            fits.connect.is_fits(origin, *args) and
+            fits.connect.is_fits(origin, origin, *args) and
Contributor

Looking at the astropy is_fits function, it seems that the first argument isn't even considered, and the second argument just checks for extension matches. Might it be better to pass an open fits object as the third parameter to check for the fits signature?

Contributor Author

You're right, it actually expects a file object (and unfortunately does not work with urlopen due to the seek requirements) - thought that the identify_* functions were already called with the open file, but of course it is explicitly testing for a name string above...
These are all loaders for formats without tests or sample files; now that I've been looking into the other loaders, it generally seems preferable to use the with fits.open() context, which gives you the signature check for free plus the HDU list for further investigation of the content. Will try to fix the remaining ones as a use case nonetheless.

Contributor Author

Switched the other loaders, which access the header for further investigation, to with fits.open; that leaves subaru_pfs_spec as the only case directly calling is_fits, and that is tested now.


 meta = {'header': header}
 uncertainty = StdDevUncertainty(tab["ERROR"])
-data = tab["FLUX"]
 wavelength = tab["WAVELENGTH"]
+data = Quantity(tab["FLUX"])
Contributor

Does this end up unitless? Should it be looking for some unit defined in the header?

Contributor Author

The unit is already parsed with the Table.read(). I've added a check for this to the test.

Contributor

Interesting, I always assumed one would have to use QTable.read to get columns as quantity objects. Good to know.

Contributor Author

In fact the regular Table Column supports basic quantity features, but is still subtly different; therefore the cast to Quantity is still needed. Alternatively one could indeed get it directly with QTable.read.

@@ -69,6 +68,10 @@ def tabular_fits_loader(file_name, column_mapping=None, hdu=1, **kwargs):

     tab = Table.read(file_name, format='fits', hdu=hdu)

+    # Minimal consistency check of the wcs against the table data
+    if wcs.naxis != 1:
Contributor

It's not always guaranteed that the number of world coordinate axes in the fits file will be one (e.g. a data cube, or data with a time dimension). Perhaps it'd be better to check explicitly for a spectral axis?

Contributor Author

I was not sure to what extent tabular-fits does, or is supposed to, support >1D spectra (I thought that generic_cube was for that case, but again I don't have any sample files to test). But if any, I would rather expect it to support the simplest case of a spectral cube or time series with a common (single 1D) spectral axis. Other cases, where each flux column could have its own zero point or a completely individual dispersion scale, or possible intermediate cases where some subset of the spectra shares a spectral axis, are IMO really better left to generic_cube; at least I have no idea how to correctly identify such cases where

  1. the flux data have >1 dimension

  2. the spectra in the flux cube do not have identical spectral axes and

  3. these wavelength/frequency scales are not included in the BINTABLE, but must be constructed from a WCS defined in the header.

Perhaps @teuben who wrote the initial version of generic_cube has a better idea about what possible data formats could be encountered.

So I would tend to just check for

    if not (wcs.naxis == 1 and wcs.array_shape[0] == len(tab)):

which could be extended to

    if not (wcs.naxis == 1 and wcs.array_shape[0] == len(tab) or
            wcs.array_shape == tab[fluxname].shape):

just to allow for the case with all individual spectral axes, but note that then the correct column for the flux would have to be identified from either column_mapping or analogously to _find_spectral_column.

Contributor Author

I have included the 2nd option, but simply assumed that the flux data would be in the first table column - it should of course also work if there is a spectral_axis, but in that case it's unclear why bother with the WCS at all...

Contributor

@nmearl nmearl Feb 3, 2020

Hmm. This is tricky. Ideally, the wcs should tell us the column axis type (e.g. wcs.world_axis_names, wcs.world_axis_types, etc). But obviously this requires that the fits standard be well followed in all data files (😆). Maybe @eteq has some insight.

I'm tempted to just say "yeah, you're right -- people should just use generic_cube", but if that's the case, we should raise an appropriate error when wcs.naxis != 1 to point users to generic_cube for those cases.

Contributor Author

Problem is also I cannot even tell whether generic_cube works, as I don't really know what a valid file should look like.

Resolved review threads (outdated): specutils/io/parsing_utils.py ×2, specutils/tests/test_loaders.py ×5
@nmearl nmearl added this to the v0.8 milestone Jan 23, 2020
@@ -28,33 +28,52 @@

def spec_identify(origin, *args, **kwargs):
Contributor

This may be worth adding to the /io/default_loaders/parsing_utils.py file. If _spec_pattern is unique between loader definitions, maybe add another parameter so that it can be passed to the function.

Contributor Author

To provide a more generic wrapper for is_fits, with _spec_pattern to be passed by the loader? That might be useful, although in general I would recommend trying to identify the spectrum types from file content rather than naming schemes. But it's useful to have the various file type and path checks provided in one function, agreed.

Note that is_fits called with fileobj=None will identify any valid name pattern as a FITS file, whether filepath actually points to an existing file or not. For _fits_identify_by_name I have made it a requirement that the file exists, since the loaders will eventually access it anyway.

@dhomeier
Contributor Author

@nmearl I have rebased to pull in the updates to the JWST reader from #579, and while at it added a more readable MUSCLES label. Wondering if the tabular-fits label should not also be changed to something more descriptive like "Generic FITS Table", but there might be more existing code relying on the old label.

@dhomeier dhomeier force-pushed the loaders-fix-tabular branch 2 times, most recently from f142303 to 24f61da on February 18, 2020 18:23
@nmearl
Contributor

nmearl commented Feb 19, 2020

@dhomeier I don't want to hold up this PR anymore, but I was hoping to get some comments from Erik concerning the checking for the spectral axis. Would you mind opening an issue capturing the conversation about spectral axis checking and how explicit we should be, given that we have the generic_cube loader? Then, I think we can go ahead and merge this.

@eteq
Member

eteq commented Feb 21, 2020

On the question of checking for spectral axis: I think I like the implementation as it stands here, at least as a solid iteration. I also interpreted tabular-fits as being "a table that maps to a single spectrum". And as the 😆 surrounding:

But obviously this requires that the fits standard be well followed in all data files

indicates, I think there's not really a foolproof way to do this.

The one case that might be useful to test is a fits file where the flux column is a multi-dimensional array - I've never seen that in the wild, but that doesn't mean it doesn't exist, and it seems natural (and trivial to implement) that that would load into Spectrum1D as expected.

We might consider an additional reader down the line that can handle "multiple spectra that aren't cubes", but I'd say that's a follow-on PR and doesn't need to hold this up.

So @dhomeier, if you want to add a quick test of the above case, please do so, but if we'd rather think of this as a separate follow-on as well, I'm fine with merging as this stands.

@eteq
Member

eteq commented Feb 21, 2020

(Also, as I side note, some info about the guiding principles separating these loaders should be in the docs, but #394 is really the main issue for that)

@dhomeier
Contributor Author

The one case that might be useful to test is a fits file where the flux column is a multi-dimensional array - I've never seen that in the wild, but that doesn't mean it doesn't exist, and it seems natural (and trivial to implement) that that would load into Spectrum1D as expected.

Oh, I have actually produced a couple of those, which might still be out somewhere - the column is not flux but Intensity (for different angles), but one might expect that this can be loaded with column_mapping.
Turns out this fails in various places:

  1. _find_spectral_column tries to transform the flux column to "Jy" creating an equivalency from spectral_axis - this could be easily overcome by broadcasting the latter to the flux shape.

  2. Spectrum1D expects the last axis of the flux initialiser to match spectral_axis, while Astropy Table is loading it as a (NAXIS1, shape(dtype(flux))) array. This will have to be transposed first. Conversely the writer currently fails when given a spectrum with such a flux of shape (M, NAXIS1), because Table again requires the transposed format. All this would be a bit easier if Spectrum1D and Table/FITS weren't using opposite conventions for the order of axes.

We might consider an additional reader down the line that can handle "multiple spectra that aren't cubes", but I'd say that's a follow-on PR and doesn't need to hold this up.

So @dhomeier, if you want to add a quick test of the above case, please do so, but if we'd rather think of this as a separate follow-on as well, I'm fine with merging as this stands.

Since it's not as quick to produce a simple example case and requires a number of changes to parsing_utils, Spectrum1D.__init__ and the loaders, I concur that this is a PR of its own.
Also this still does not address the wcs verification in tabular_fits, which deals with a spectral_axis that is not part of the table, but encoded as a WCS object.

When looking at the wcs1d-fits examples, their WCS returns a spectral_axis of shape (0, NAXIS1), so maybe tabular-fits should also test for
wcs.naxis == 1 and wcs.array_shape[-1] == len(tab)
but then again in the wcs1d-fits files the flux is stored as NAXIS1x1 dim image, so I don't really know what one should expect for the combination TABLE + WCS, so this can probably wait for the follow-up PR.

I've also tested a number of improvements for the auto-detection of wcs1d-fits, but I'd defer them to their own PR as well.

@dhomeier
Contributor Author

(Also, as I side note, some info about the guiding principles separating these loaders should be in the docs, but #394 is really the main issue for that)

Perhaps also #584 for ranking the loaders by priority.

@nmearl
Contributor

nmearl commented Mar 3, 2020

Spectrum1D expects the last axis of the flux initialiser to match spectral_axis, while Astropy Table is loading it as a (NAXIS1, shape(dtype(flux))) array

Can you expand on this a bit? How are you accessing the data within the table? Specutils follows the numpy row-major formalism; it may be that this is just a quirk of data tables being preferentially column-major in general.

Otherwise, it sounds like this PR is good to go with the caveat of opening (as I see it) two new issues to be addressed in follow-up PRs:

  1. Support broadcasting in _find_spectral_column for file loading.
  2. Extend tabular-fits to include multiple spectra that aren't cubes.

Does this sound right @dhomeier, @eteq?

@dhomeier
Contributor Author

Spectrum1D expects the last axis of the flux initialiser to match spectral_axis, while Astropy Table is loading it as a (NAXIS1, shape(dtype(flux))) array

Can you expand on this a bit? How are you accessing the data within the table? Specutils follows the numpy row-major formalism; it may be that this is just a quirk of data tables being preferentially column-major in general.

I think it's a consequence of io.fits loading it as a recarray; e.g. my file in question is structured like this (the actual second column is intensity, but renamed Flux here to allow generic_spectrum_from_table to find it).

TTYPE1  = 'Wavelength'
TFORM1  = 'E       '
TUNIT1  = 'Angstrom'
TTYPE2  = 'Flux     '          / I_mu/<I>
TFORM2  = '20E     '
TUNIT2  = 'erg/s/cm2/Angstrom'

which is loaded as a FITS_rec of

>>> hdu.data.shape
(314931,)
>>> hdu.data.dtype
dtype((numpy.record, [('Wavelength', '>f4'), ('Flux', '>f4', (20,))]))

and Table.read() makes it a

<Table length=314931>
Wavelength      Flux [20]            
 Angstrom     erg / (Angstrom cm2 s)     
 float32      float32             
---------- --------------------------------
   3.0e+03  9.5420114e-07 ..  1.3871580e+00

which I would actually consider the more natural representation – Wavelength and Flux column having the same length (number of rows). So I'd tend to adapt Spectrum1D.__init__ to handle this for tables in general, since you could not even construct a table from a length-314931 wavelength and a length-20 flux column.

@dhomeier
Contributor Author

dhomeier commented Mar 30, 2020

I guess that would require Spectrum1D.__init__ to distinguish between Table/recarray columns and plain arrays, so the current handling of 2D arrays is not broken. Although the latter seems to have some flaws e.g. with slicing:

>>> sp1d = Spectrum1D(spectral_axis=np.arange(5)*u.nm, flux=np.ones(5)*u.Jy)
>>> sp1d[:2]
<Spectrum1D(flux=<Quantity [1., 1.] Jy>, spectral_axis=<Quantity [0., 1.] nm>)>
>>> sp2d = specutils.Spectrum1D(spectral_axis=np.arange(5)*u.nm, flux=np.ones((3, 5))*u.Jy)
>>> sp2d[:2]
<Spectrum1D(flux=<Quantity [[1., 1., 1., 1., 1.],
           [1., 1., 1., 1., 1.]] Jy>, spectral_axis=<Quantity [0., 1., 2., 3., 4.] nm>)>

@nmearl
Contributor

nmearl commented Mar 30, 2020

I guess that would require Spectrum1D.__init__ to distinguish between Table/recarray columns and plain arrays

I do not think we want to implement this handling in the initializer. The supported inputs are specifically Quantity objects, and we want to avoid the case where we're bloating the initializer by doing users' data transformations for them (e.g. I can imagine getting future requests for taking pandas dataframes, dask objects, etc.), and we certainly do not want that burden. It should be on the user and the data loaders to change the flux and spectral axis values to the appropriate Quantity format.

Although the latter seems to have some flaws e.g. with slicing:

I'm not sure I'm seeing what the issue is here? If the flux array is multi-dimensional, the slicing is intended to occur on the elements in the flux array, returning a new Spectrum1D.

@dhomeier
Contributor Author

I do not think we want to implement this handling in the initializer. The supported inputs are specifically Quantity objects, and we want to avoid the case where we're bloating the initializer by doing users' data transformations for them (e.g. I can imagine getting future requests for taking pandas dataframes, dask objects, etc.), and we certainly do not want that burden.

OK, then it can be handled in parsing_utils. Using tables or columns seemed a common way to initialise the spectrum (and IMO they are a bit different from pandas frames etc., since they are an Astropy standard data format and already come with Quantity support), but you're right that the docstrings do not mention them in any way.
I think the astropy.nddata.NDData part should be removed then, since they are not accepted either.

Although the latter seems to have some flaws e.g. with slicing:

I'm not sure I'm seeing what the issue is here? If the flux array is multi-dimensional, the slicing is intended to occur on the elements in the flux array, returning a new Spectrum1D.

Slicing would straightforwardly extract a desired wavelength range (assuming you have found the corresponding indices). For the 2d spectrum I don't see how to get a comparable functionality, even if you checked for the different ndim and shape:

>>> sp2.shape
(3, 5)
>>> sp2[:,:2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/derek/lib/python3.8/site-packages/specutils/spectra/spectrum1d.py", line 172, in __getitem__
    return self._copy(
  File "/Users/derek/lib/python3.8/site-packages/specutils/spectra/spectrum1d.py", line 199, in _copy
    return self.__class__(**alt_kwargs)
  File "/Users/derek/lib/python3.8/site-packages/specutils/spectra/spectrum1d.py", line 111, in __init__
    raise ValueError('Spectral axis ({}) and the last flux axis ({}) lengths must be the same'.format(
ValueError: Spectral axis (5) and the last flux axis (2) lengths must be the same

@dhomeier
Contributor Author

I've now implemented this as described above, trying to write it as generally as possible, but really only checked and added tests for ndim=2.
I haven't added a changelog entry as there is no section for 1.0 or 1.1 yet.

wlu = {'wavelength': u.AA, 'frequency': u.GHz, 'energy': u.eV,
       'wavenumber': u.cm**-1}
# Create a small data set with 2D flux + uncertainty
disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis]
Contributor

Eventually we're going to want to ensure that any spectral axis in frequency space is strictly descending. I'm not sure we need to worry about that case right now, but something to think about.

Suggested change
-    disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis]
+    disp = np.arange(1, 1.1, 0.01)*wlu[spectral_axis]
+    if spectral_axis == 'frequency':
+        disp = disp[::-1]

Contributor Author

Eventually we're going to want to ensure that any spectral axis in frequency space is strictly descending. I'm not sure we need to worry about that case right now, but something to think about.

That's actually everything but wavelength, right? Yes, if there is functionality that depends on descending order in frequency/energy, that's probably good to keep as a reminder.

@nmearl
Contributor

nmearl commented Mar 30, 2020

I haven't added a changelog entry as there is no section for 1.0 or 1.1 yet.

Feel free to add a new section for 1.1 and include a change log entry; otherwise we generally comb through the PRs and add any merged ones to the change log right before we do a release.

This looks good!

@nmearl
Contributor

nmearl commented Mar 31, 2020

@dhomeier can you rebase to get rid of that last merge commit?

@dhomeier
Contributor Author

@nmearl sorry, had already updated locally, but I hope everything is in order now.

@nmearl nmearl merged commit de2a442 into astropy:master Mar 31, 2020
@nmearl
Contributor

nmearl commented Mar 31, 2020

Thanks for your work on this @dhomeier!

@dhomeier
Contributor Author

Thanks for the feedback @nmearl!

Successfully merging this pull request may close these issues.

Spectrum1D.read fails on reading its own files written as 'tabular-fits' or 'wcs-fits'
3 participants