Integrate check for file processing errors from staging through rasterization #11

Open
julietcohen opened this issue Dec 5, 2022 · 0 comments

Throughout the workflow, it would be helpful to integrate logged checks for files that fail at each stage: from input through staging, from staging through merging (if using the Ray workflow), and from merging through rasterization. Kastan noted that using the Ray workflow resulted in approximately 2,000 out of approximately 8 million files failing to process correctly.

To start, we should implement a minimum viable product (MVP) that compares the filepaths at the conclusion of the staging step with the filepaths at the conclusion of the rasterization step. The initial list of filepaths can be pulled from the staged directory or from the filepaths in staging_summary.csv.
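A rough sketch of what the MVP comparison could look like, using only the standard library. Everything here is an assumption for illustration: the function names, the `.gpkg` and `.tif` suffixes, and the idea that a staged tile and its raster share the same relative path apart from the extension. None of it is taken from the existing codebase.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def tile_keys(root, suffix):
    """Collect the relative, extension-free path of every file under
    `root` whose name ends in `suffix` (hypothetical helper)."""
    root = Path(root)
    return {
        p.relative_to(root).with_suffix("").as_posix()
        for p in root.rglob(f"*{suffix}")
        if p.is_file()
    }


def missing_after_rasterization(staged_dir, raster_dir,
                                staged_suffix=".gpkg", raster_suffix=".tif"):
    """Return staged tiles that have no matching raster tile.

    Assumes (not confirmed by the workflow) that a staged file and its
    raster counterpart differ only by root directory and extension.
    """
    staged = tile_keys(staged_dir, staged_suffix)
    rasterized = tile_keys(raster_dir, raster_suffix)
    missing = sorted(staged - rasterized)
    for key in missing:
        # Log each tile that was staged but never rasterized.
        logger.warning("staged but not rasterized: %s", key)
    return missing
```

If the staged paths come from staging_summary.csv rather than the directory itself, the `staged` set could be built from that file's path column instead, and the set difference would stay the same.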

Ideally, we will eventually implement more rigorous checks that account for every polygon in input files that error during staging and are therefore never fed into the merging or rasterization steps. However, this is more complex than the MVP, considering the following:

  • deduplication of polygons can occur during staging, rasterization, web tiling, etc. based on the user's config
  • polygons that cross tile boundaries are documented in all tiles in which a portion of the polygon is present
  • the input files may contain no polygons whatsoever
  • during staging, the data are in the form of polygons, but after rasterization, they are in the form of grid cells, which makes it impossible to track the files via file sizes
  • the CRS may need to be converted during staging, which changes polygon attributes such as area

Robyn noted that an expansion beyond the MVP might be best executed by using the footprints to determine if we expect a polygon within each file's bounds, and comparing the footprints to their respective processed files using an overlay method from geopandas.
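One possible shape for that footprint check, as a sketch only. It assumes the footprint and the processed polygons are each already loaded as a GeoDataFrame with a defined CRS; the function name and the "any overlap at all" criterion are hypothetical, and a real check would likely need a stricter test given the deduplication and tile-boundary caveats above.

```python
import geopandas as gpd


def footprint_has_processed_polygons(footprint, processed):
    """Return True if any processed polygon overlaps the footprint.

    `footprint` and `processed` are GeoDataFrames with a CRS set
    (an assumption of this sketch).
    """
    # Put both layers in the same CRS before comparing geometries.
    processed = processed.to_crs(footprint.crs)
    overlap = gpd.overlay(
        footprint[["geometry"]],
        processed[["geometry"]],
        how="intersection",
    )
    # An empty intersection means no processed polygon falls in the footprint.
    return not overlap.empty
```

A footprint for which this returns False, while its input file is known to contain polygons, would be a candidate for the error log.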
