ADC documentation options for PDG datasets: IWP & water/glacier mask #6

Open
julietcohen opened this issue Oct 31, 2022 · 50 comments

@julietcohen
Collaborator

@ChandiWitharana and Elias, I’d like your opinions regarding how to archive the PDG data and metadata for the IWP and water/glacier clipped datasets. Elias processed the .shp files last week, and Kastan is running the workflow to stage, rasterize, and create the web tiles. We’re processing both these datasets in advance of NNA (prioritizing the IWP dataset), and we’ll have the .shp, .gpkg, and .tif data files archived on the Arctic Data Center.

Matt suggested 2 ways to archive the data:
[Image: data_pkg_options, a schematic of the two data package options]

Which option is best?

As I document each file type, I'll be checking in with you about metadata and authorship questions.

@julietcohen added the "help wanted" label Oct 31, 2022
@julietcohen self-assigned this Oct 31, 2022
@julietcohen
Collaborator Author

julietcohen commented Nov 1, 2022

Data Processing Methods:

  • The water and glacier directory (described as the water and glacier mask in the schematic above) includes shapefiles for: 1) coastal water (to remove polygons that were incorrectly identified as IWP, when in reality it was sea ice) as well as 2) glaciers (to remove polygons that were incorrectly identified as IWP, when in reality it was a glacier). It does not include surface water such as lakes.
  • glacier shapefile
  • coastal water shapefile
  • These files do not include surface water because MAPLE's processing steps had already masked out lakes before the coastal water and glacier mask was applied
  • Therefore the fully processed files (found on Delta: /scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice) have been cleaned for all 3 types of water (coastal, glaciers, and surface water)

To Do:

  • priority before NNA: archive the IWP data package on ADC
  • within the water and glacier mask data package on ADC, archive the inland water mask applied by MAPLE in addition to the coastal water and glacier mask used by Elias & team (retrieve this mask from the MAPLE workflow?)

@julietcohen
Collaborator Author

julietcohen commented Nov 3, 2022

Questions for PDG team

Please provide either answers or links to resources where I can find the following information:

IWP data

1. Overall temporal coverage of all processed IWP shapefiles (just year(s) will suffice; months & days are optional)
  • I can derive this from the shapefile names, since the year is included
  • Is the temporal coverage 2000-2021 correct? Optional to add months/days too
2. Overall spatial coverage of all processed IWP files
  • Northwest and Southeast coordinates for bounding box
  • Can enter 2 bounding boxes for the region since it spans the dateline
3. Will we include the term "high ice" in the metadata documentation? Or is this just a term used within the PDG team?
4. All authorship information:
  • names
  • emails
  • ORCIDs
  • specifically what data they contributed to (if not all the data types)
  • Chandi is listed as the dataset contact?
5. Confirm units are correct for all data attributes
  • see this draft metadata documentation on ADC test site
  • units of CentroidX and CentroidY attributes (currently input as meters)
  • more detailed descriptions for the Class and Sensor attributes
  • more detailed descriptions for the Length and Width attributes (currently input as major and minor axes of IWP polygons)
6. Review data package title, abstract, methods, and "ethical research practices" drafts
7. Explanation for file naming template for shapefiles, something like sensor_YYYYMMDD... and so on.
    Example files:
GE01_20190805214726_1050010017BFD700_19AUG05214726-M1BS-503537880070_01_P001_u16rf3413_pansh.shp

and the slightly longer filename, with an inserted _R4C1:

WV02_20120813231405_103001001B8A8E00_12AUG13231405-M1BS_R4C1-052719605010_02_P002_u16rf3413_pansh
8. Explanation for numerical directory naming in the intermediate directories of the IWP data, such as: alaska/146_157_iwp/...
9. NSF funding information (award number such as "NSF Award 2240912") and any non-NSF funding information
10. Preferred citation for MAPLE paper (1 is included in the Methods step 2)
11. Software Heritage citation to the MAPLE code (on GitHub?) that pre-processed this data
12. Methods details for pre-processing cleaning scripts (scripts and data dir to document can be found on Delta: /scratch/julietcohen/IWP/supplementary_files, scripts also uploaded to ADC test package)
  • clean_predictions.py
  • split_footprints.py
  • cleaning_data dir
  • add_date_attribute_footprints.py
13. Point to water masks data in Sampling section
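
Item 2 above mentions entering two bounding boxes because the region spans the dateline. A minimal sketch of that split, assuming boxes are given as (west, south, east, north) in decimal degrees. The function name and the example coordinates are hypothetical and are not taken from the ADC editor or the PDG code:

```python
from typing import List, Tuple

# (west, south, east, north) in decimal degrees; this ordering is an assumption.
BBox = Tuple[float, float, float, float]


def split_at_dateline(west: float, south: float, east: float, north: float) -> List[BBox]:
    """Return one box, or two boxes when the region crosses the 180 degree meridian.

    A region is treated as crossing the dateline when its western edge has a
    larger longitude than its eastern edge (for example west=170, east=-160).
    """
    if west <= east:
        # Ordinary box: no split needed.
        return [(west, south, east, north)]
    # Crossing box: one piece up to +180, one piece starting at -180.
    return [(west, south, 180.0, north), (-180.0, south, east, north)]


# Illustrative values only, not the dataset's real extent.
boxes = split_at_dateline(170.0, 66.5, -160.0, 90.0)
```

Each returned box could then be entered as its own geographic coverage, so that neither box has a west edge east of its east edge.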

Separate data package for coastal water, inland water, & glacier mask

  • overall temporal and spatial coverage (just year(s) will suffice, months & days are optional)
  • I found many water masks Elias mentioned he uploaded to Delta along with the shapefiles, in the folders with _water suffix, such as this one: /scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/alaska/190_191_192_water/WV03_20200828223924_104001005E631B00_20AUG28223924-M1BS-504712637050_01_P001_u16rf3413_pansh/WV03_20200828223924_104001005E631B00_20AUG28223924-M1BS-504712637050_01_P001_u16rf3413_pansh_watermask.tif
  • some _watermask files are labeled as 2019 and some are 2020, so the temporal coverage for all water masks would be 2019-2020? Update: New water masks provided by Elias because new IWP data (output of pre-processing) was produced in February, with file hierarchy that needs to be retained when archived on ADC. Water masks are stored in Delta /scratch/bbou/julietcohen/water_masks
  • overall temporal and spatial coverage of MAPLE inland water mask (if available?), this may be the same as the temporal/spatial coverage for coastal water and glacier mask and IWP data Update: the masks for all water features have now been combined, so no need to document separate extents for different water features
  • Can we retrieve the surface water mask used in the MAPLE workflow?

@ChandiWitharana

Either option would be fine by me. I suggest you implement the most effective one. As per the description, that would be Option 2.

@ChandiWitharana

Generally, the temporal coverage of the images falls between 2001 and 2021, but the majority are post 2008 or 2010.

The spatial coverage of ALL processed files generally falls within the Arctic Tundra region and is confined to low-, medium-, and high-ice areas within the Tundra. These terms (high, low, medium) are from Brown et al. 1998.

@julietcohen
Collaborator Author

@ChandiWitharana Thank you for the feedback.

The Abstract and Methods were drafted for option 1, rather than option 2. We can split the package to go with option 2 if @mbjones thinks option 2 makes more sense as well. That would mean submitting 2 more tickets for a total of 3 repositories published by Monday.

@julietcohen
Collaborator Author

I searched for the Brown et al. 1998 paper you mentioned, and found this: https://nsidc.org/sites/default/files/heginbottometal_1993.pdf

Please let me know if you are referring to a different publication; we can include a formal citation for it in the metadata.

@ChandiWitharana

It would be better to use this link (https://nsidc.org/data/ggd318/versions/2):
Brown, J., O. Ferrians, J. A. Heginbottom, and E. Melnikov. (2002). Circum-Arctic Map of Permafrost and Ground-Ice Conditions, Version 2 [Data Set]. Boulder, Colorado USA. National Snow and Ice Data Center. https://doi.org/10.7265/skbg-kf16. Date Accessed 12-08-2022.

@julietcohen
Collaborator Author

julietcohen commented Dec 9, 2022

From Anna:

For those attending the AGU booth, a comment on the ImageryViewer and the IWP in general: Two permafrost science users have sent us some feedback on the IWP dataset in the public version of the ImageryViewer. They find it hard to assess the quality of the IWP dataset due to the poor resolution of the Bing imagery (and the simple fact that the background imagery in the ImageryViewer does not reflect the original Maxar imagery used in developing the IWP dataset). The Maxar license restrictions prevent us from showing the Maxar imagery to the public (it can only be accessed/viewed by NSF Arctic funded researchers). So, two takeaways: 1) The metadata of the IWP dataset needs to contain statistics on the efficiency/accuracy of the algorithm. 2) We need to make it clear somewhere in the ImageryViewer that the background Bing imagery is not what was used to create the datasets shown in the ImageryViewer, due to license restrictions on the satellite imagery.

  • add statistics to metadata package about accuracy of IWP algorithm

@robyngit
Member

⭐️ Note ⭐️ @julietcohen: @dvirlar2 pre-issued a DOI for the IWP dataset once it is published: doi:10.18739/A2KW57K57

@dvirlar2

@robyngit @julietcohen I'm happy to publish the dataset with the above DOI once things are ready to go, just let me know! Our code has changed a little since Juliet was on the curation team, and I wouldn't want y'all to use old code 🙂

@mbjones
Member

mbjones commented Jun 14, 2023

@robyngit Thanks for the DOI. For now, I think we could manually configure the DOI to point at a manually-created landing page for the dataset. Once it is published in the ADC, the DOI would then be updated to point at the ADC landing page. Does that sound reasonable?

@mbjones
Member

mbjones commented Jun 14, 2023

Overview of package entity relationships, with the processing steps we associate with each:

flowchart LR
    A[A. Maxar]-->|MAPLE| B(B. IWP Shapefiles)
    B --> |Staging| C(C. IWP Geopackages)
    C --> |Rasterization| D(D. IWP Geotiffs)
    D --> |Web tiling| E[E. IWP PNGs]
    C --> |3dTiling| F(F. IWP 3DTiles)
    Note[Square boxes\n likely not\n to be archived]

@robyngit
Member

robyngit commented Jun 15, 2023

For the initial release of the IWP layer we are aiming for mid-July or later to correspond with other announcements. Since this might not leave sufficient time to get all of the metadata in order, we discussed initially publishing a minimal version of the data package so at least we can have a DOI in place, that points to relevant information, in case anyone needs to cite or reference the data.

We envisioned that this MVP data package would comprise just 1) citation info, and 2) abstract, and 3) link to file tree for downloads. However, there are more fields that are required in order to publish a package on the ADC, thus I think the package should contain all the fields that are marked as mandatory in the editor, but exclude all of the entity information for now.

I created a test version of this minimal package that is mostly a copy of what @julietcohen already created but without any files (so no python scripts and no data object descriptions). 📑 The MVP test version is available here.

There are some outstanding issues with the metadata we have:

  • People
    • Need the full list of authors (Dataset Creators, PI, Co-PI, Metadata Creators, Custodians, etc.)
  • Location
    • Need feedback on what to include for location, I entered:
      • Description: Pan-arctic
      • Northwest coordinates: 90, -180
      • Southeast coordinates: 66.5, 180
  • Methods
    • Step 2: Need input from Elias ("Script for this post-processing step: {Elias - point to script here}")
    • Step 3, 4, 5: No official releases for viz-* packages
    • Study Extent: Define "high", "medium", and "low" ice regions
  • General Review
    • The metadata in its entirety should be reviewed by the IWP team before release

@ChandiWitharana

The High, Medium, and Low ice regions are a categorization adapted from Brown et al. 2002 for image selection and processing purposes. (Brown, J., O. Ferrians, J. A. Heginbottom, and E. Melnikov. (2002). Circum-Arctic Map of Permafrost and Ground-Ice Conditions, Version 2 [Data Set]. Boulder, Colorado USA. National Snow and Ice Data Center. https://doi.org/10.7265/skbg-kf16. Date Accessed 12-08-2022.)

@ChandiWitharana

From UConn side, the team would be:
Chandi Witharana (PI), Mahendra R. Udawalpola (Postdoc), Amal S. Perera (Postdoc), Amit Hasan (Graduate student), Elias Manos (Graduate student)

@ChandiWitharana

Location: What's given is fine.

@ChandiWitharana

Metadata should be fine

@robyngit
Member

Thanks @ChandiWitharana! What can we include to define what is meant by "high", "medium", and "low" ice regions?

@julietcohen
Collaborator Author

In @robyngit 's to-do list above, methods steps 3-5 require a release for the PDG packages. I will take care of this so we can include it in the metadata.

@dvirlar2

dvirlar2 commented Jun 23, 2023

Preliminary thoughts after viewing the latest version of the package:

Title, Abstract, and Keywords:

  • The title should include some mention that this is the "High Ice" portion of the IWP overall dataset.
    • Edit Jun 26: I totally missed that "High Ice" was already in the title, oops!
  • I'd include an explanation of what "High Ice" means in the context of this dataset. I'd also provide the context of where that definition/delineation comes from (per Chandi's comment above).
    • In general I think it'd be a good idea to add a note that there are also medium and low ice versions of the data (and explain those meanings) that we're also planning on making public. Later, when we do make those public, we can add direct links to them in the abstract, and the High Ice dataset can get a new DOI at that point. Does that sound reasonable to you @mbjones?
  • For keywords, I'd consider adding: CNN, Convolutional Neural Network, ice wedge polygons, and any others that might be relevant.

I have ideas for how to flesh out the methods section, but I'll get to that later when I have more time.

@julietcohen
Collaborator Author

@dvirlar2 Thank you for the feedback 👍🏼

Regarding your first suggestion: The title does already include "high ice Arctic regions". Would you suggest wording it in a different way or is that sufficient?

Regarding your second suggestion: The Sampling section does already include a short description of what "high ice" means and mentions that there are also medium and low ice regions. "The geographic area sampled is the "high ice" regions of the Arctic, which are those the dataset authors identified to contain a relatively high proportion of ice. The study extent encompasses all high ice regions masked for coastal oceans, glaciers, and surface water. Further additions to this dataset will include "medium ice" and "low ice" regions of the Arctic as well. These regions were classified by less ice content." Perhaps this is not sufficient, or we could put it in a different section so it's more obvious?

@julietcohen
Collaborator Author

I can move the high ice / medium ice / low ice descriptions from the Sampling section to the Abstract. That seems to be what you are suggesting, since the Title, Abstract, and Keywords are all you have had time to review so far.

@mbjones
Member

mbjones commented Jun 23, 2023

The high/med/low ice distinctions are not that critical - they essentially signal solely the order in which different spatial regions were processed. It's helpful for people to know that the spatial extent of the dataset will grow over time, but the results in each region are the same, and the divisions between the regions are pretty arbitrary.

@dvirlar2

dvirlar2 commented Jun 26, 2023

Given Chandi's earlier comment and link to the NSIDC dataset, I found this information about the designations between high/med/low ice.

From the user guide:

"High Ice" is characterized by:

  • greater than 20% for lowlands, highlands, and intra- and intermontane depressions characterized by thick overburden cover (>5-10m), and
  • greater than 10% for mountains, highlands ridges, and plateaus characterized by thin overburden cover (<5-10m) and exposed bedrock

Medium ice is characterized by 10-20%, and low ice is 0-10%, with no internal breakdown in terrain like the high ice.

From the Heginbottom paper under the "ATDBs" section:

The relative abundance of ground ice in each map unit is presented in the form of qualitative estimates of the percentage of ice in the upper 10 to 20m of the ground. These estimates include the volume of segregation ice, injection ice and reticulate ice. Three classes are used for ground ice content (high, >20%; medium, 10-20%; and low, <10%) in areas in physiographic class 1, that is for areas of generally thick overburden. For areas of generally thin overburden (physiographic class 2) only two classes of ground ice are mapped, medium to high (>10%) and low (<10%), due in part to paucity of data.

Given the above descriptions, I think it's reasonable to include some combination of the above content that would explain the difference between the High, Medium, and Low ice datasets to users. I think these specific descriptions should go in the Sampling Description section of the dataset like Juliet mentioned above, but there should also be a sentence in the abstract mentioning that an explanation is provided further on in the dataset.

In the Sampling Description section, we should also link to the NSIDC dataset and provide brief direction for users to view the User Guide and Heginbottom paper for more in-depth information.
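
The thresholds quoted above can be expressed as a small lookup. This is only a sketch of the classification as described in the quoted text, not code from the NSIDC dataset or from MAPLE. The function name is hypothetical, and the handling of a value of exactly 10% is an assumption, since the quoted ranges do not say which class a boundary value belongs to:

```python
def ground_ice_class(ice_percent: float, thick_overburden: bool) -> str:
    """Map an estimated ground ice percentage to the class names quoted above.

    thick_overburden=True  -> physiographic class 1 (three classes)
    thick_overburden=False -> physiographic class 2 (two classes)
    """
    if not 0 <= ice_percent <= 100:
        raise ValueError("ice_percent must be between 0 and 100")

    if thick_overburden:
        if ice_percent > 20:
            return "high"
        # Assumption: exactly 10% is treated as "medium" (the quoted text is ambiguous).
        if ice_percent >= 10:
            return "medium"
        return "low"

    # Thin overburden: only two classes are mapped.
    # Assumption: exactly 10% is placed in the upper class here as well.
    return "medium to high" if ice_percent >= 10 else "low"
```

For example, an estimate of 15% gives "medium" in thick-overburden terrain and "medium to high" in thin-overburden terrain.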

@dvirlar2

Given my own confusion reading the dataset title earlier in this thread, and my experience of having a harder time distinguishing between datasets with very similar titles, I would recommend changing the title to something along the lines of

"High Ice: Ice wedge polygon detection in satellite imagery from Arctic regions, Permafrost Discovery Gateway, 2001-2021"

That way, the High, Medium, and Low distinctions are more immediately clear to users. Food for thought

@dvirlar2

dvirlar2 commented Jun 26, 2023

List of Orcid IDs:


Need to verify:


Still need to include, if desired by person:

  • Amit Hasan
  • Mahendra R. Udawalpola

@mbjones
Member

mbjones commented Jun 26, 2023

@dvirlar2 The plan is to update the dataset with new version releases to include all of the high, med, and low ice regions. And we plan to do that soon. So, I think the title should not include that distinction. A proposed title:

"Ice wedge polygon detection in satellite imagery from Pan-Arctic regions, Permafrost Discovery Gateway, 2001-2021"

@julietcohen
Collaborator Author

The ORCiD Daphne suggested for Amal is correct

@amalshehan

amalshehan commented Jun 27, 2023

(1) Explanation for file naming template for shapefiles:
[Sensor]_[Acquisition time stamp]_[Catalog ID]_[Original timestamp]_[Image type (P: panchromatic, M: multispectral)][DG product type (1b: standard, 2A: georectified)][Original Image ID].shp

(2) What is the NSF award number? Such as "NSF Award 2240912" and any non-NSF funding info
NSF Award No: 1720875, 1722572, 1927872, 1927723, 1927729

(3) What are the ORCiD's for Mahendra R. Udawalpola and Amit Hasan?
Mahendra R. Udawalpola : 0000-0002-3521-1508
Amit Hasan : 0000-0001-8774-0228
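
As an illustration of the naming template in (1), the sketch below splits a shapefile name into its parts. The regular expression is inferred only from the two example filenames posted earlier in this thread, so it is an assumption rather than an official PGC or vendor specification. All group names are hypothetical, the optional `_R4C1`-style token is captured as `tile`, and the single letter after the product type (the "S" in M1BS) is not explained in this thread, so it is kept as an uninterpreted `suffix`:

```python
import re

# Pattern inferred from the example filenames in this thread (an assumption).
PATTERN = re.compile(
    r"^(?P<sensor>[A-Z]{2}\d{2})_"            # e.g. GE01, WV02
    r"(?P<acquired>\d{14})_"                  # acquisition time stamp, YYYYMMDDHHMMSS
    r"(?P<catalog_id>[0-9A-F]{16})_"          # catalog ID
    r"(?P<original_time>\d{2}[A-Z]{3}\d{8})-" # original timestamp, YYMONDDHHMMSS
    r"(?P<image_type>[PM])"                   # P: panchromatic, M: multispectral
    r"(?P<product_type>\d[A-Z])"              # e.g. 1B, 2A
    r"(?P<suffix>[A-Z])"                      # meaning not documented in this thread
    r"(?:_(?P<tile>R\d+C\d+))?-"              # optional token such as R4C1
    r"(?P<image_id>.+)$"                      # remainder, treated as the original image ID
)


def parse_iwp_filename(name: str) -> dict:
    """Split a shapefile name into the fields named in the template."""
    stem = name[:-4] if name.lower().endswith(".shp") else name
    match = PATTERN.match(stem)
    if match is None:
        raise ValueError(f"filename does not match the assumed template: {name}")
    return match.groupdict()


fields = parse_iwp_filename(
    "GE01_20190805214726_1050010017BFD700_19AUG05214726-M1BS-503537880070_01_P001_u16rf3413_pansh.shp"
)
```

The year for temporal coverage can be read from the first four characters of `fields["acquired"]`. Names that use a different separator layout will raise `ValueError` instead of being parsed incorrectly.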

@dvirlar2

@mbjones thank you for clarifying that! I got confused between this ticket and our meeting last week on how the datasets were going to be broken up. I agree with the title you proposed 👍🏽

@dvirlar2

dvirlar2 commented Jul 5, 2023

I've added @julietcohen 's test version onto the production site. ADC people can view it here. I haven't checked yet who has access to the test version, but I can do that at a later date.

From a curation standpoint, I've:

  • changed the title per Matt's suggestion above to "Ice wedge polygon detection in satellite imagery from Pan-Arctic regions, Permafrost Discovery Gateway, 2001-2021"
  • edited the abstract to include the full link to where the data will live
  • Added the publisher information
  • Added the NSF awards
  • Added a discipline annotation to the dataset. I added cryology, but I'm contemplating adding data science and soil science. @mbjones , do you have any strong preferences on that?

To-Do:

  • @julietcohen: I didn't add the actual data object for the add_date_attribute_footprints.py file. Since the other objects are going to be dummy entities, I decided to only keep the metadata for all the files, the script included. If it should be added and immediately downloadable from the dataset landing page, and not where the rest of the data will live, let me know and I'll add it back!
  • Author List: From my outside perspective, I think the author list should include @mbjones, @robyngit, and @julietcohen for their roles in leadership and data processing. I'm not sure what role other people on the whole PDG group have played, but it would be good to have a discussion on authorship and to get other people's names and information listed before the deadline.

@dvirlar2

dvirlar2 commented Jul 5, 2023

The following is the placeholder filename for the dummy shapefile in the package:
example_GE01_20110826213903_10504100013F3800_11AUG26213903_M1BS_054019163020_01_P001_u16rf3413_pansh.shp


After reading the structure provided by @amalshehan, I have a few questions:

  • Are there multiple sensors that were used? If so, we should document the acronyms and their meanings.
  • What's the difference between the acquisition and original timestamp? In the example above, the formats are different but the times are the same.
  • What's the difference between standard and georectified images? Would the average, "entry level" person / early career researcher interested in permafrost (but not satellites) know the differences?
  • What is the format for the original image ID? Do we know this?

@julietcohen
Collaborator Author

julietcohen commented Jul 5, 2023

Regarding Daphne's to-do items above:

  • The script add_date_attribute_footprints.py was written by me as a post-processing step, and this script should be archived either:
  1. Wherever we archive the other post-processing materials from Chandi's team that are not already pointed to in the methods: the cleaning_data and water_mask directories.
    or:
  2. Uploaded to the PermafrostDiscoveryGateway/MAPLE_v3 repo
  • For the cleaning_data and water_mask directories: We already agreed to make a separate data package for the water masks, since it will likely be a helpful dataset for other purposes outside of the ice wedge polygon dataset post-processing. We haven't decided where to store cleaning_data. This can go in the same repository as the shapefiles, geopackages, and rasters, or in its own data package on the ADC.

  • We also need to upload the footprints directory, which has the same number of files as the shapefiles directory, and has a hierarchy that is important to maintain.

I would also add Kastan Day to the list of dataset contributors for the geopackages and rasters, and Anna Liljedahl.

@mbjones
Member

mbjones commented Jul 6, 2023

Regarding the discipline choices, let's ask @amalshehan and @ChandiWitharana to review the proposal with a pointer to the ADCAD vocabulary for choices.

@julietcohen
Collaborator Author

julietcohen commented Jul 6, 2023

To Do for the IWP metadata package:

  • Give Anna Liljedahl access
  • Suggested by Anna: add the following people to "people and associated parties", and give them editing access if they don't already have it (we might have to find their ORCiD's):
  • add these 2 awards, sent by Anna: 1927720, 2052107

@dvirlar2

dvirlar2 commented Jul 6, 2023

Access to dataset added via ORCIDs:

I also added Chandi to the list of editors. His orcid is already in the dataset. I'll follow up with Howard and Ronald.

Also, I'd like to re-emphasize that if any of the above people should be listed in the dataset citation, they should be listed under the Data Set Creator section, and not under "people and associated parties" 🙂 If not, then no worries

@julietcohen
Collaborator Author

Thanks Daphne! I emailed Kastan to confirm that's his ORCiD.

You are correct about distinguishing between the Data Set Creator section and the "people and associated parties", sorry to cause confusion there.

@dvirlar2

dvirlar2 commented Jul 7, 2023

ORCID Updates:

  • Howard confirmed the orcid listed above does belong to him. I can add access to the dataset later today

@julietcohen
Collaborator Author

Kastan also confirmed that ORCiD listed above is his

@amalshehan

amalshehan commented Jul 10, 2023

The following is the placeholder filename for the dummy shapefile in the package: example_GE01_20110826213903_10504100013F3800_11AUG26213903_M1BS_054019163020_01_P001_u16rf3413_pansh.shp

After reading the structure provided by @amalshehan, I have a few questions:

  • Are there multiple sensors that were used? If so, we should document the acronyms and their meanings.

@dvirlar2 Based on my discussions with @ChandiWitharana we would like to point to PGC data docs for extra details on file naming as the original data was acquired from PGC and the names were maintained as is. If you think that we should document (archive) this I can respond to the specific details you request above.

The PGC data docs I am referring to are:
PGC Commercial Satellite Imagery Documentation

PDF: PGC Commercial Satellite Imagery Documentation (umn.edu)

  • What's the difference between the acquisition and original timestamp? In the example above, the formats are different but the times are the same.

There is no difference. Original time stamp is given by the vendor and the acquisition time stamp is added by PGC.

  • What's the difference between standard and georectified images? Would the average, "entry level" person / early career researcher interested in permafrost (but not satellites) know the differences?

Georectified images are corrected for any geometric distortions that may be present in the original/standard image due to the approach used to acquire the image.

  • What is the format for the original image ID? Do we know this?

No
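
The statement above that the acquisition and original timestamps carry the same time in two formats can be checked with a short sketch. The function names are hypothetical, and the two-digit year is assumed to be in the 2000s, which is consistent with the 2001-2021 coverage discussed in this thread:

```python
from datetime import datetime

# Month abbreviations as they appear in the filenames (upper case).
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_vendor_timestamp(text: str) -> datetime:
    """Parse an original (vendor) timestamp such as '11AUG26213903'."""
    year = 2000 + int(text[0:2])  # assumption: all imagery is from 2000 onward
    month = MONTHS[text[2:5]]
    return datetime(year, month, int(text[5:7]),
                    int(text[7:9]), int(text[9:11]), int(text[11:13]))


def parse_pgc_timestamp(text: str) -> datetime:
    """Parse an acquisition (PGC) timestamp such as '20110826213903'."""
    return datetime.strptime(text, "%Y%m%d%H%M%S")


# Both values come from the placeholder filename quoted above.
same_time = parse_vendor_timestamp("11AUG26213903") == parse_pgc_timestamp("20110826213903")
```

The month lookup is written out by hand so the result does not depend on the locale settings that `strptime` would use for month abbreviations.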

@amalshehan

Regarding the discipline choices, let's ask @amalshehan and @ChandiWitharana to review the proposal with a pointer to the ADCAD vocabulary for choices.

@dvirlar2,
Based on discussion with @ChandiWitharana, we should add "Data Science". No need to add "Soil Science".

Good to also have Earth Science, Computer Vision, Geo AI, Big Data.

@dvirlar2

Should verify at some point how Torre Jorgenson wants to be identified in the dataset. Seems he goes by Torre among peers, but is professionally known as Mark. For now I'm putting him down as "M. Torre" in this dataset, and including information based on this recent dataset

@dvirlar2

Dataset has been finalized from my POV, and I've sent it to Matt and Juliet to review before sending off to others. Can view things here:

Also, I thought I had sent the Academic Ontology for the dataset annotations, but I see that I did not! @amalshehan For context, this ontology is where we pull our "dataset annotations" from. Earlier I had mentioned cryology, soil science, and data science as possible choices. I ended up going with data science and cryology based off of your earlier comments! Let me know if you have any questions 🙂

@mbjones
Member

mbjones commented Jul 14, 2023

@julietcohen I rearranged the IWP dataset to streamline the directory structure as we discussed. Here's what I did, and the final file layout:

cd /var/data/10.18739/A2KW57K57/
cd iwp_geopackage_high/
mv staged/gpub020/WGS1984Quad .
mv staged/staging_summary.csv .
mv staged /var/data/submission/pdg/ice-wedge-polygon-data/
cd ../iwp_geotiff_high/
mv geotiff/WGS1984Quad .
mv geotiff/raster_events.csv .
mv geotiff/raster_summary.csv .
mv geotiff/raster_summary_duplicate.csv .
rmdir geotiff
cd ..
tree -L 2 .
.
├── cleaning_materials
│   ├── add_date_attribute_footprints.py
│   └── cleaning_data
├── iwp_geopackage_high
│   ├── staging_summary.csv
│   └── WGS1984Quad
├── iwp_geotiff_high
│   ├── raster_events.csv
│   ├── raster_summary.csv
│   ├── raster_summary_duplicate.csv
│   └── WGS1984Quad
├── iwp_shapefile_detections
│   ├── high
│   ├── low
│   └── medium
└── iwp_shapefile_footprints
    ├── high
    ├── low
    └── medium

I also revised the Mermaid diagram to reflect these changes, and worked a bit on the wording in that diagram:

flowchart LR

    A["Maxar <br> (satellite images)"] -->|MAPLE| B("`**/iwp_shapefile_detections/**
    Format: Shapefile
    Irregularly shaped vector files, one per image`")
    B -->|Create Tiles and <br> Identify Duplicates| C("`**/iwp_geopackage_high/**
    Format: GeoPackage
    Evenly-spaced vector tiles, with duplicates flagged`")
    C -->|Rasterize and remove <br> flagged duplicates| D("`**/iwp_geotiff_high/**
    Format: GeoTIFF
    Evenly-spaced raster tiles, with duplicates removed`")

@dvirlar2

dvirlar2 commented Jul 17, 2023

Edits for next release:

  • Make the following change to the sampling description via R, not the web editor. For some reason the changes weren't being retained in the editor (bug report). Sampling description should instead be: "Sampling procedures include collecting satellite imagery via the commercial satellite database at the Polar Geospatial Center, University of Minnesota."
  • Include funding from TACC and ACCESS to represent the high performance computing resources that were used to create this dataset
  • Add descriptive entities to the test package for the four csvs listed above: staging_summary.csv, raster_events.csv, raster_summary.csv, raster_summary_duplicate.csv
    • copy metadata to production version when we're ready to update it

@amalshehan

@dvirlar2 For the IWP mapping the HPC resources used are from TACC allocation DPP20001 and ACCESS allocation DPP190001. Do you need any other details such as the specific systems used?

@julietcohen
Collaborator Author

julietcohen commented Jul 26, 2023

Kenton provided the following to help fill in the ACCESS / TACC grant info:

National Science Foundation - Leadership Resource Allocation (LRAC): Harnessing big satellite imagery, deep learning, and high-performance computing resources to map pan-Arctic permafrost thaw, 2020-2022, 94,000 GPU hours and 180 TB, 37.5 TB Tape on Frontera

National Science Foundation - ACCESS Explore: Permafrost Discovery Gateway Pan-Arctic Dataset Creation, 2022-2023, 380,000 credits

And based on the format of the above ACCESS award, the new allocation info is:

National Science Foundation - ACCESS Discover: Permafrost Discovery Gateway Pan-Arctic Dataset Creation, 2023-2024, 750,000 credits

@julietcohen
Collaborator Author

julietcohen commented Jul 26, 2023

More info from Kenton, the IBM acknowledgement:

IBM-Illinois Discovery Accelerator Institute - Scaling Data-Intensive Discovery Workflows on the Hybrid Cloud, 2021-2023

IBM-Illinois Discovery Accelerator Institute - HDC: A Full-Stack Solution for the Hybrid Cloud Marrying Data and Compute, 2023-2025

@dvirlar2

Thanks @amalshehan and @julietcohen! I think that's all the info I need, but I'll let you know if that changes.

@julietcohen
Collaborator Author

Since this issue has been stagnant for some time, an update:
The "high" ice IWP dataset has been published (https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2KW57K57) and the "low" and "medium" ice portions have been processed (run through the visualization workflow) but not archived at the ADC. All regions are visible on the PDG demo site.

Since one run on Delta processed the high ice (more than half the data), and the other run processed the low and medium ice, deduplication between those 2 tilesets was not executed. This is because the merging step executes deduplication for gpkg files that were staged on different nodes. Because merging so many files takes days and depletes our Delta credits, it would be best to finish developing the Kubernetes and parsl workflow to run on the NCEAS server (or another server, such as Google Cloud Platform) so we can take advantage of fast and powerful hardware without the run time, credit, and memory limitations we experience on Delta.

Tickets describing the progress of the Kubernetes workflow are documented in the viz-workflow repo.
