Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rasterization fails for relatively few staged files inconsistently #7

Open
julietcohen opened this issue Dec 21, 2022 · 3 comments
Open

Comments

@julietcohen
Copy link
Collaborator

Rasterization step logs errors for relatively few staged files compared to the total amount of staged files. Occurs for both the maximum z-level as well as parent z-levels according to log.log. For the lake size change data sample, the GeoPackage files that errored during a certain parsl run are not obviously corrupt, and when rasterization is applied again (outside of the workflow but still with parallelization) the files are successfully rasterized. See this documented here. This was also seen in the ray workflow with IWP dataset, for which Robyn noted:

IWP run update: Of 10,805,019 staged tiles, I managed to rasterize and transfer (to scratch) 10,741,400, which means 63,619 tiles (~0.5%) got lost along the way. This could be that some files didn’t transfer to scratch before the 24 hour job limit ran out, or it could be some other problem. I did see some warnings that implied that some geopackage files were corrupt. Since it’s overall a small percent that are missing, I am going to continue with the next steps so that we can visualize what we already have. We can always go back and compare the list of geotiff tiles to staged tiles to see which ones we are missing and try to rasterize just those ones.

It would be helpful to do more runs with various datasets to determine if the failing files are ever consistent, and if the errors are random (and can therefore be solved by just trying to rasterize these few staged files again with the same approach), or if the files are actually corrupt, or if it has to do with the files not transferring to scratch dir like Robyn suggested.

@julietcohen
Copy link
Collaborator Author

julietcohen commented Dec 21, 2022

Made branch feature-raster-retry (from develop branch) to test modification to RasterTiler.py to integrate 3 tries total for staged files that fail to rasterize. The retries simply apply the same rasterization approach, and includes log statements when retries are applied. The syntax & execution for this is show in the following example:

# define function to exemplify retry approach
def test_func(x, y):
    for attempt in range(2):
        try:
            z = x / y
            print(f'Success: {x} / {y} = {z}')
            return z
        except Exception as e:
            print(f'Error dividing {x} by {y} so trying again.')
        else:
            break
    else:
         print(f'Error dividing {x} by {y}, ran out of retries.')
         return None

Apply function with expected input types that won't error:

test_func(x = 9, y = 3)

output:

Success: 9 / 3 = 3.0
3.0

Apply function with unexpected input types that will error:

test_func(x = 'elephant', y = 'cat')

Output:

Error dividing elephant by cat so trying again.
Error dividing elephant by cat so trying again.
Error dividing elephant by cat, ran out of retries.

This retry feature is applied to the rasterize_vector() function.

rasterize_vector() with retry
    def rasterize_vector(self, path, overwrite=True):
        """
            Given a path to an output file from the viz-staging step, create a
            GeoTIFF and save it to the configured dir_geotiff directory. By
            default, if the output geotiff already exists, it will be
            overwritten. To change this behaviour, set overwrite to False.

            During this process, the min and max values (and other summary
            stats) of the data arrays that comprise the GeoTIFFs for each band
            will be tracked.

            Parameters
            ----------

            path : str
                Path to the staged vector file to rasterize.

            overwrite : bool
                Optional, defaults to True. If set to False, then if there is
                an existing GeoTiff tile at the output path created,
                rasterization will be skipped.

            Returns
            -------

            morecantile.Tile or None
                The tile that was rasterized or None if there was an error.
        """
        for attempt in range(2):
            try:

                # Get information about the tile from the path
                tile = self.tiles.tile_from_path(path)
                out_path = self.tiles.path_from_tile(tile, 'geotiff')

                if os.path.isfile(out_path) and not overwrite:
                    logger.info(f'Skip rasterizing {path} for tile {tile}.'
                                ' Tile already exists.')
                    return None

                bounds = self.tiles.get_bounding_box(tile)

                # Track and log the event
                id = self.__start_tracking('geotiffs_from_vectors')
                logger.info(f'Rasterizing {path} for tile {tile} to {out_path}.')

                gdf = gpd.read_file(path)

                # Check if deduplication should be performed first
                dedup_here = self.config.deduplicate_at('raster')
                dedup_method = self.config.get_deduplication_method()
                if dedup_here and dedup_method is not None:
                    prop_duplicated = self.config.polygon_prop('duplicated')
                    if prop_duplicated in gdf.columns:
                        gdf = gdf[~gdf[prop_duplicated]]

                # Get properties to pass to the rasterizer
                raster_opts = self.config.get_raster_config()

                # Rasterize
                raster = Raster.from_vector(
                    vector=gdf, bounds=bounds, **raster_opts)
                raster.write(out_path)

                # Track and log the end of the event
                message = f'Rasterization for tile {tile} complete.'
                self.__end_tracking(id, raster=raster, tile=tile, message=message)
                logger.info(
                    f'Complete rasterization of tile {tile} to {out_path}.')

                return tile

            except Exception as e:
                logger.info(f'Error rasterizing {path} for tile {tile} so trying again.')

            else:
                break
                
        else:
            message = f'Error rasterizing {path} for tile {tile}, ran out of retries.'
            self.__end_tracking(id, tile=tile, message=message)
            # note that the error = e argument is removed from __end_tracking(),
            # because e is no longer locally defined due to the break just before
            # find way to maintain e in the error message for a better workflow
            return None

We apply this with the virtual env rasterRetry with local develop branch of viz-staging, local feature-raster-retry branch of viz-workflow, sqlite3, and parsl.

see installed packages
Package            Version    Editable project location
------------------ ---------- -------------------------
affine             2.3.1
asttokens          2.0.5
attrs              22.2.0
backcall           0.2.0
bcrypt             4.0.1
certifi            2022.12.7
cffi               1.15.1
charset-normalizer 2.1.1
click              8.1.3
click-plugins      1.1.1
cligj              0.7.2
coloraide          0.18.1
colormaps          0.3
contourpy          1.0.6
cryptography       38.0.4
cycler             0.11.0
debugpy            1.5.1
decorator          5.1.1
dill               0.3.6
entrypoints        0.4
executing          0.8.3
filelock           3.8.2
Fiona              1.8.22
fonttools          4.38.0
geopandas          0.12.2
globus-sdk         3.15.1
idna               3.4
ipykernel          6.15.2
ipython            8.7.0
jedi               0.18.1
jupyter_client     7.4.7
jupyter_core       4.11.2
kiwisolver         1.4.4
matplotlib         3.6.2
matplotlib-inline  0.1.6
morecantile        3.2.5
munch              2.5.0
nest-asyncio       1.5.5
numpy              1.24.0
packaging          22.0
pandas             1.5.2
paramiko           2.12.0
parsl              2022.12.19
parso              0.8.3
pdgraster          0.1.0      /home/jcohen/viz-raster
pdgstaging         0.1.0      /home/jcohen/viz-staging
pexpect            4.8.0
pickleshare        0.7.5
Pillow             9.3.0
pip                22.3.1
prompt-toolkit     3.0.20
psutil             5.9.4
ptyprocess         0.7.0
pure-eval          0.2.2
pycparser          2.21
pydantic           1.10.2
Pygments           2.11.2
PyJWT              2.6.0
PyNaCl             1.5.0
pyparsing          3.0.9
pyproj             3.4.1
python-dateutil    2.8.2
pytz               2022.7
pyzmq              24.0.1
rasterio           1.3.4
requests           2.28.1
Rtree              0.9.7
setproctitle       1.3.2
setuptools         65.5.0
shapely            2.0.0
six                1.16.0
snuggs             1.4.7
stack-data         0.2.0
tblib              1.7.0
tornado            6.2
traitlets          5.7.1
typeguard          2.13.3
typing_extensions  4.4.0
urllib3            1.26.13
wcwidth            0.2.5
wheel              0.37.1
see config
{
    "dir_input": "/home/jcohen/viz-workflow/raster-retry/data_subsample",
    "dir_geotiff": "raster-retry/OUTPUT_GEOTIFFS",
    "dir_web_tiles": "raster-retry/OUTPUT_WEBTILE",
    "dir_staged": "raster-retry/OUTPUT_STAGING_TILES",
    "filename_staging_summary": "raster-retry/staging_summary.csv",
    "filename_rasterization_events": "raster-retry/raster_events.csv",
    "filename_rasters_summary": "raster-retry/raster_summary.csv",
    "version": "DATE",
    "ext_input": ".gpkg",
    "simplify_tolerance": 0.0001,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 11],
    "statistics": [
        {
            "name": "polygon_count",
            "weight_by": "count",
            "property": "centroids_per_pixel",
            "aggregation_method": "sum",
            "resampling_method": "sum",
            "val_range": [0, null],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
            "palette": ["#d9c43f", "#d93fce"]
        },
        {
            "name": "coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
            "palette": ["#d9c43f", "#d93fce"]
        }
    ],
    "deduplicate_at": ["staging"],
    "deduplicate_method": "neighbor",
    "deduplicate_keep_rules": [["staging_filename", "larger"]],
    "deduplicate_overlap_tolerance": 0.1,
    "deduplicate_overlap_both": false,
    "deduplicate_centroid_tolerance": null
  }

Data sample is 3000 randomly sampled polygons from lake size change dataset (1000 from each UTM zone). In order to ensure the retry code is run, we purposefully corrupt one of the GeoPackage files by using sqlite3 to delete one of the mandatory tables after the file is staged. See this in the script here.

@julietcohen
Copy link
Collaborator Author

julietcohen commented Dec 22, 2022

Executed a run with the same environment and config as above, but inserted variable e into log statement that is printed for the initial tries. e cannot be printed in the final attempt's error message printed by __end_tracking() because of the break statement just before.

rasterize_vector() with retry
    def rasterize_vector(self, path, overwrite=True):
        """
            Given a path to an output file from the viz-staging step, create a
            GeoTIFF and save it to the configured dir_geotiff directory. By
            default, if the output geotiff already exists, it will be
            overwritten. To change this behaviour, set overwrite to False.

            During this process, the min and max values (and other summary
            stats) of the data arrays that comprise the GeoTIFFs for each band
            will be tracked.

            Parameters
            ----------

            path : str
                Path to the staged vector file to rasterize.

            overwrite : bool
                Optional, defaults to True. If set to False, then if there is
                an existing GeoTiff tile at the output path created,
                rasterization will be skipped.

            Returns
            -------

            morecantile.Tile or None
                The tile that was rasterized or None if there was an error.
        """
        for attempt in range(2):
            try:

                # Get information about the tile from the path
                tile = self.tiles.tile_from_path(path)
                out_path = self.tiles.path_from_tile(tile, 'geotiff')

                if os.path.isfile(out_path) and not overwrite:
                    logger.info(f'Skip rasterizing {path} for tile {tile}.'
                                ' Tile already exists.')
                    return None

                bounds = self.tiles.get_bounding_box(tile)

                # Track and log the event
                id = self.__start_tracking('geotiffs_from_vectors')
                logger.info(f'Rasterizing {path} for tile {tile} to {out_path}.')

                gdf = gpd.read_file(path)

                # Check if deduplication should be performed first
                dedup_here = self.config.deduplicate_at('raster')
                dedup_method = self.config.get_deduplication_method()
                if dedup_here and dedup_method is not None:
                    prop_duplicated = self.config.polygon_prop('duplicated')
                    if prop_duplicated in gdf.columns:
                        gdf = gdf[~gdf[prop_duplicated]]

                # Get properties to pass to the rasterizer
                raster_opts = self.config.get_raster_config()

                # Rasterize
                raster = Raster.from_vector(
                    vector=gdf, bounds=bounds, **raster_opts)
                raster.write(out_path)

                # Track and log the end of the event
                message = f'Rasterization for tile {tile} complete.'
                self.__end_tracking(id, raster=raster, tile=tile, message=message)
                logger.info(
                    f'Complete rasterization of tile {tile} to {out_path}.')

                return tile

            except Exception as e:
                logger.info(f'Error rasterizing {path} for tile {tile} due to error {e} so trying again.')

            else:
                break
                
        else:
            message = f'Error rasterizing {path} for tile {tile}, ran out of retries.'
            self.__end_tracking(id, tile=tile, message=message)
            return None

I'm curious if there is a benefit to printing the explicit e in the __end_tracking() rather than a general log statement. If there is a benefit to that, this change to rasterize_vector() should be adjusted further to enable printing e after the final try. If printing e in the initial tries' log statements is equally as helpful to printing it in the final __end_tracking() message, then I believe this code is good to go!

For this run, which staged 3000 polygons, then we corrupted 1 before rasterization started:

files in staged dir files in geotiff dir files in web tiles dir
2964 6496 12992

No errors were reported in log.log besides the two initial tries for the purposefully corrupted file, and the error for the final try:

image

For the run with original develop branch RasterTiler.py script without retries, with the same starting 3000 polygons, and none are corrupted:

files in staged dir files in geotiff dir files in web tiles dir
2964 6498 12996

No errors reported in that log.log at all. This means that corrupting only that one staged file resulted in 2 fewer rasters and 4 fewer web tiles.

@julietcohen
Copy link
Collaborator Author

julietcohen commented Dec 22, 2022

Just tested this with the same everything except reduced the amount of retries to 1. I figure this is better because with large datasets such as IWP, even with only ~0.5% of the files initially failing, reducing the amount of retries from 2 to 1 will save time and Delta credits. We don't want to waste resources retrying the same file an unnecessary amount of times.

I would ideally like to test this workflow on a larger amount of data now that the syntax and number of tries has shown to be successful. I could use all the lake change files Ingmar uploaded to the PDG google drive. I am not sure if Robyn already processed them, but I believe she has not. It would be interesting to run all those files on the current develop branch (without the retry) and see how many files fail to rasterize, then process the same files with this branch and see how many more were rasterized. That would allow for a deeper check for this update without using any credits on Delta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Status: No status
Development

No branches or pull requests

1 participant