
Unnecessary duplicate arrays when using asdf embedded in FITS files #1232

Closed
perrygreenfield opened this issue Nov 15, 2022 · 10 comments · Fixed by #1234
@perrygreenfield
Contributor

JP-2900 reports a failure of calwebb_spec3 when it tries to create source-based products containing slit cutouts from multiple exposures. In one case the asdf extension grew larger than 4 GB, exceeding the size that astropy.io.fits can handle for a binary table entry. This revealed two problems. First, the pipeline was generating cutouts that made no sense (with vertical sizes of hundreds of pixels), illustrating a problem with the segmentation algorithm and the lack of a sensible filter to reject cases of considerable source confusion; this led to trying to stuff many large cutouts into one file. Second, and the problem actually related to ASDF, the same cutout array appeared both in a FITS HDU and in the ASDF content contained in the ASDF HDU, meaning that data was unnecessarily duplicated in the file.

@braingram
Contributor

Is there a minimal case for the asdf portion of this issue?

I'm looking at the jwst.datamodels save code and it appears to be using the asdf validation step to create the hdu list based on the model schema. This step mentions the MultiExposureModel which includes some custom schema modification code.

Unfortunately, if I generate a MultiExposureModel, add an Image as an exposure, and save it, I am not seeing duplication of the data.

@perrygreenfield
Contributor Author

Read the custom schema modification code and tell me what you think it actually does.

What was the actual type of that image?

@braingram
Contributor

braingram commented Nov 15, 2022

Here's the test I ran:
https://gist.github.com/braingram/1e7505fd98f66fc6e0c5b8852bdda2eb

It looks like it's taking the core schema metadata, stripping all fits_keyword and fits_hdu key/value pairs, then copying it (overlaying it) into exposures.meta of the multiexposure schema. This roughly matches the first sentence of the preceding comment, but I don't follow the second sentence.

        # Create a new core.meta that will co-locate
        # with each exposure entry. This is done
        # by saving the meta information in a separate
        # FITS HDU.

Is the duplicated data under the meta key of the exposures or somewhere else in the file?

@braingram
Contributor

Here's a minimal case without jwst

import asdf
import asdf.fits_embed
import astropy.io.fits
import numpy


arr = numpy.zeros((1000, 100), dtype='f4')
# make hdu list with array from tree
hdulist = astropy.io.fits.HDUList()
hdulist.append(astropy.io.fits.PrimaryHDU())
hdu = astropy.io.fits.ImageHDU(name='SCI')
# assign data after creation to avoid a copy
hdu.data = arr
hdulist.append(hdu)

# save tree and hdu list
aif = asdf.fits_embed.AsdfInFits(hdulist, {'arr': arr})
aif.write_to('test1.fits', overwrite=True)

# read in asdf in fits
aif2 = asdf.open('test1.fits', mode='r')
aif2.write_to('test2.fits', overwrite=True)

ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size   = {ff2[-1].size}")

This produces the following output:

original size = 571
resave size   = 400616

Presumably we do not want to lose links between arrays in the tree and HDU items (is this the right term?). This could be made into a test to add to test_fits_embed.

Per some emails with @perrygreenfield, this issue appears to be related to linking HDUs and arrays, and can be improved by changing find_or_create_block_for_array in _EmbeddedBlockManager in fits_embed and util.get_array_base, neither of which handles NDArrayType.

diff --git a/asdf/fits_embed.py b/asdf/fits_embed.py
index b436281..093b416 100644
--- a/asdf/fits_embed.py
+++ b/asdf/fits_embed.py
@@ -92,7 +92,9 @@ class _EmbeddedBlockManager(block.BlockManager):
     def find_or_create_block_for_array(self, arr, ctx):
         from .tags.core import ndarray

-        if not isinstance(arr, ndarray.NDArrayType):
+        #import pdb; pdb.set_trace()
+        #if not isinstance(arr, ndarray.NDArrayType):
+        if True:
             base = util.get_array_base(arr)
             for hdu in self._hdulist:
                 if hdu.data is None:
diff --git a/asdf/util.py b/asdf/util.py
index a45c863..a615da4 100644
--- a/asdf/util.py
+++ b/asdf/util.py
@@ -85,8 +85,9 @@ def get_array_base(arr):
     For a given Numpy array, finds the base array that "owns" the
     actual data.
     """
+    from .tags.core import ndarray
     base = arr
-    while isinstance(base.base, np.ndarray):
+    while isinstance(base.base, (np.ndarray, ndarray.NDArrayType)):
         base = base.base
     return base

My original idea of using numpy.shares_memory instead of comparing bases isn't compatible with test_array_view_different_layout (added in PR #930 to avoid writing invalid data to FITS). I haven't fully replicated this, but the diff above (which doesn't use shares_memory) doesn't cause that test to fail.
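To illustrate the distinction with a standalone numpy sketch (get_array_base is reimplemented here following the plain-ndarray logic in the diff above): a transposed view shares memory with its base but has a different layout, so sharing memory alone doesn't mean the view's bytes can be reused as-is.

```python
import numpy as np

# Reimplementation of util.get_array_base (plain-ndarray version,
# as in the diff above) for illustration.
def get_array_base(arr):
    base = arr
    while isinstance(base.base, np.ndarray):
        base = base.base
    return base

base = np.arange(12, dtype='f4')
view = base.reshape(3, 4).T  # shares memory with base, different layout

print(np.shares_memory(base, view))   # True
print(get_array_base(view) is base)   # True
print(view.flags['C_CONTIGUOUS'])     # False: can't reuse base's bytes as-is
```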

@braingram
Contributor

Duplication can also occur if the data attribute of an HDU is set to a non-ndarray object. This results in a copy (for example, assigning an NDArrayType to hdu.data will result in a copy), and the data will then be duplicated because the link between the item in the hdulist and the array in the tree is broken.
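A standalone sketch of why: the wrapper class here is a hypothetical stand-in for asdf's NDArrayType (an array-like that is not an ndarray subclass); converting it to an ndarray, as an hdu.data assignment would, produces a copy that no longer shares memory with the original.

```python
import numpy as np

class LazyArray:
    """Hypothetical stand-in for NDArrayType: array-like,
    but not an ndarray subclass."""
    def __init__(self, arr):
        self._arr = arr
    def __array__(self, dtype=None):
        return np.asarray(self._arr, dtype=dtype)

arr = np.zeros((4, 4), dtype='f4')
wrapped = LazyArray(arr)

# isinstance checks against np.ndarray fail for the wrapper:
print(isinstance(wrapped, np.ndarray))      # False

# np.array copies by default, so converting the wrapper (as an
# hdu.data assignment would) breaks the memory link:
converted = np.array(wrapped)
print(np.shares_memory(arr, converted))     # False
```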

braingram added a commit to braingram/asdf that referenced this issue Nov 16, 2022
NDArrayType fails checks to isinstance(arr, ndarray) as it is not a
subclass of ndarray. This prevented util.get_array_base from finding
the correct base array which contributed to failures to link hdu
items to arrays within the tree (leading to duplication of data).

The EmbeddedBlockManager in fits_embed also skipped checking for
links between NDArrayType and hdu items which also broke the link
between hdu items and arrays within the tree. This PR removes the
skip.

fixes issue asdf-format#1232
@jdavies-st
Contributor

jdavies-st commented Nov 17, 2022

Here's another minimal case indicating that it seems (I think) to be coming from asdf, independent of anything in jwst.datamodels or stdatamodels:

from jwst import datamodels
from asdf.fits_embed import AsdfInFits
from astropy.io import fits


dm = datamodels.ImageModel((2048, 2048))
dm.save("test.fits")
with fits.open("test.fits") as hdulist:
    print(hdulist.info())

af = AsdfInFits.open("test.fits")
af.write_to("test2.fits")
with fits.open("test2.fits") as hdulist:
    print(hdulist.info())

and one gets as output

Filename: test.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       7   ()      
  1  SCI           1 ImageHDU         9   (2048, 2048)   float32   
  2  ASDF          1 BinTableHDU     11   1R x 1C   [683B]   

Filename: test2.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       7   ()      
  1  SCI           1 ImageHDU         9   (2048, 2048)   float32   
  2  ASDF          1 BinTableHDU     11   1R x 1C   [16777944B]   

Note that initializing the AsdfInFits object from a FITS file bypasses anything that jwst.datamodels and stdatamodels might be doing in terms of validation or shuffling. I think.

EDIT: oops, I realized that above there's already a minimal case with jwst. 😅

@braingram
Contributor

Thanks! Confirmation is always good :)

Here's another minimal case using stdatamodels.

import asdf
import astropy.io.fits
import stdatamodels
import numpy

# --- this bit here replicates test infrastructure already in stdatamodels ---
cfg = asdf.get_config()
path_to_stdatamodels_repo = FILL_THIS_IN_FOR_YOUR_CHECKOUT
schema_dir = f"{path_to_stdatamodels_repo}/tests/schemas"
cfg.add_resource_mapping(asdf.resource.DirectoryResourceMapping(
    schema_dir, "http://example.com/schemas"))

class FitsModel(stdatamodels.DataModel):
    schema_url = "http://example.com/schemas/fits_model"
# --- end of replication --

arr = numpy.zeros((1000, 100), dtype='f4')
m = FitsModel(arr)
m.save('test1.fits')

m2 = FitsModel.from_fits('test1.fits')
m2.save('test2.fits')

ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size   = {ff2[-1].size}")

I believe the fix will have to occur in ASDF (with discussion commencing on this PR #1233) but adding these cases as tests to stdatamodels and perhaps also jwst would help to catch any regressions. I will open PRs with tests for jwst and datamodels (with tests that currently fail and will be marked as skip). Once we have a fix we can unskip these (or wait to accept the PRs until the fix is available).

braingram added a commit to braingram/stdatamodels that referenced this issue Nov 17, 2022
See asdf issue asdf-format/asdf#1232

This test is currently skipped as it will fail with the current
asdf version.
braingram added a commit to braingram/jwst that referenced this issue Nov 17, 2022
When resaving a model to fits, asdf loses track of array-hdu
links and saves arrays in both hdu and as internal blocks in the
asdf tree. This can make an asdf tree that is larger than what
can be stored in a fits hdu.

See asdf issue: asdf-format/asdf#1232

This test is skipped as it fails with the current asdf version.
@perrygreenfield
Contributor Author

I'm trying to think of a case where an inadvertent reference to an HDU array would happen. I'm asking whether any JWST products appear with identical DQ or ERR arrays (e.g., with constant values, or identical defaults). That may affect whether this is an issue in practice, given the only case this machinery is likely to be used for.

@perrygreenfield
Contributor Author

Even if this leads to multiple ASDF attributes pointing to the same HDU (with other HDUs that have the same array contents, and really should have matched, not actually being used), I wonder if that really is a problem. The point of the FITS files is to provide users with something they are comfortable with and know how to interpret. In most cases they would ignore the ASDF content. And code that does use the ASDF content shouldn't care where the array values came from in the FITS file. The main worry with the ASDF is if different components are sharing the same array when they shouldn't. In most (and hopefully all) cases I think tasks make copies of the arrays, but I'm not completely sure of that. Another question for Howard.
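For what it's worth, a check along these lines (a hypothetical plain-numpy helper; the name find_shared is made up) could audit a set of named arrays for unexpected sharing:

```python
import numpy as np

def find_shared(arrays):
    # Hypothetical audit helper: report pairs of named arrays that
    # share memory (e.g. two ASDF attributes backed by the same HDU data).
    names = list(arrays)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if np.shares_memory(arrays[a], arrays[b]):
                pairs.append((a, b))
    return pairs

sci = np.zeros((4, 4), dtype='f4')
tree = {'sci': sci, 'sci_view': sci[:2], 'dq': np.zeros((4, 4), dtype='u2')}
print(find_shared(tree))  # [('sci', 'sci_view')]
```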

@perrygreenfield
Contributor Author

I asked Howard the following questions:

  1. We are looking more at this issue with duplicated arrays. Do products ever get generated with identical default DQ and/or ERR arrays within the same file? Or are they generally written out with populated results when such arrays appear?
  2. A related question: are all data read from FITS files copied before modification? (Some of the solutions being considered may end up with ASDF elements sharing the data from the same HDU when the file is read). Would that cause a problem?

And Howard's responses:

As to the first question, the DQ arrays get populated with something non-default in like the very first step of the pipeline and proceed with more updates from there. So I doubt there are ever any products where multiple DQ arrays would have all default (e.g. zero) values. ERR arrays don't get populated until the ramp_fit step, which is the end of Stage 1 processing (rate products). So it would only be intermediate products saved before that stage (e.g. our 4D "ramp" product) that might still have all default ERR array values. But each product would only have 1 ERR array. It isn't until mid-way through Stage 2 processing that we ever get products containing multiple DQ or ERR arrays, which occurs for spectroscopic modes that have multiple sources split out from the original single image. But by that time the DQ and ERR arrays should never still be at default values.

Most of the time we have steps make a copy of the input data before making changes. Originally we had things make corrections in-place, just to save memory (i.e. no need to have 2 copies of the data in memory all the time). That's fine when running a default pipeline end-to-end from the command line, because you get the correct final results in the output files, but when running interactively from within python, it's not always desirable. Users may not want to have their inputs tampered with, especially if they want to run a step multiple times to tweak params or whatever.

braingram added a commit to braingram/jwst that referenced this issue Nov 18, 2022
When resaving a model to fits, asdf loses track of array-hdu
links and saves arrays in both hdu and as internal blocks in the
asdf tree. This can make an asdf tree that is larger than what
can be stored in a fits hdu.

See asdf issue: asdf-format/asdf#1232

This test is skipped as it fails with the current asdf version.

modify imports and use context manager for test_fits

Co-authored-by: James Davies <[email protected]>

use context manager in test_fits

Co-authored-by: James Davies <[email protected]>

improve test docstring in test_fits
braingram added a commit to braingram/jwst that referenced this issue Nov 22, 2022
braingram added a commit to braingram/asdf that referenced this issue Nov 22, 2022
braingram added a commit to braingram/jwst that referenced this issue Nov 29, 2022
nden pushed a commit to spacetelescope/jwst that referenced this issue Nov 29, 2022
* Add test for duplicate data when resaving model to fits


* bump asdf version to 2.14.1 to fix hdu data

unskip test for issue #7354

fixes #7354
nden pushed a commit to spacetelescope/stdatamodels that referenced this issue Nov 29, 2022
* Add test for asdf data duplication issue with fits

See asdf issue asdf-format/asdf#1232

This test is currently skipped as it will fail with the current
asdf version.

* bump asdf to 2.14.1 to fix hdu data duplication

* update changelog