SatlasPretrain: add new dataset #2248

Draft: wants to merge 20 commits into main

@adamjstewart adamjstewart commented Aug 23, 2024

This PR adds a data loader for the SatlasPretrain dataset.

This is a work in progress:

  • Basic data loader
  • Documentation
  • Tests
  • Switch download from requests to aws
  • Add all checksums
  • Add support for multiple images per sensor per tile
  • Return timestamp info
  • Add support for dynamic and vector labels?
  • Add support for band selection?
  • Add support for time-series images?

References:

@favyen2 @piperwolters can you review this PR as time permits? I'm still in the process of downloading the entire dataset, so it's going to be a bit before I can actually test it myself, but wanted to open a WIP PR anyway. I have a ton of questions for you that I'll leave in-line and we can resolve as we finalize the PR. The first draft will likely have significantly limited functionality compared to your reference implementation, but that's fine for our use case. We can always expand it in the future.

@ando-shah and I are planning on heavily using your dataset for our next paper. Our specific use case requires us to only sample from tiles where all low-resolution products (S1, S2, L) are available. My current plan is to create a custom metadata/train_lowres_matching.json file containing the pared-down list of tiles. We can host this in our repo, or you're also welcome to include this in your metadata.tar.gz file once it's complete.
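The tile filtering described above amounts to a set intersection over per-sensor tile lists. A minimal sketch, assuming each sensor's available tiles can be listed as (col, row) pairs; the function names, the input layout, and the output JSON layout are hypothetical and are not taken from the Satlas metadata files:

```python
import json


def matching_tiles(
    tiles_by_sensor: dict[str, set[tuple[int, int]]],
) -> list[tuple[int, int]]:
    """Return the sorted (col, row) tiles that exist for every sensor."""
    if not tiles_by_sensor:
        return []
    # Keep only the tiles that appear in all per-sensor sets.
    common = set.intersection(*tiles_by_sensor.values())
    return sorted(common)


def to_json(tiles: list[tuple[int, int]]) -> str:
    """Serialize tiles as a JSON list of [col, row] pairs (assumed layout)."""
    return json.dumps([list(tile) for tile in tiles])


# Toy input: sensor names and tile coordinates are made up for illustration.
tiles_by_sensor = {
    'sentinel1': {(1, 2), (3, 4), (5, 6)},
    'sentinel2': {(1, 2), (3, 4)},
    'landsat': {(3, 4), (5, 6), (1, 2)},
}
paired = matching_tiles(tiles_by_sensor)
```

The serialized result could then be written to a file such as metadata/train_lowres_matching.json; the actual format would need to match whatever the data loader expects.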

[Image: satlas]

(yes, the original image is that washed out, the TCI product is not great)

@adamjstewart adamjstewart added this to the 0.6.0 milestone Aug 23, 2024
@github-actions github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets labels Aug 23, 2024

Reference implementation:

* https://github.com/allenai/satlas/blob/main/satlas/model/dataset.py
Collaborator Author

Not sure if it's worth adding this in the docstring or just leaving it in the comments only. Happy to add pointers to the official codebase/data loaders for Satlas which may be preferred by some users.

'metadata': (),
}

# NOTE: 'tci' is RGB (b04-02), not BGR (b02-04)
Collaborator Author

Could optionally add a bands parameter that allows users to specify the order of spectral bands returned by the model, but so far I don't think we need this feature.

        channels.append(torch.tensor(np.array(img, dtype=np.float32)))
    return torch.cat(channels)

def _load_label(self, label: str, col: int, row: int) -> Tensor:
Collaborator Author

TODO: add support for vector labels.

sample: dict[str, Tensor] = {}

for image in self.images:
    sample[f'image_{image}'] = self._load_image(image, col, row)
Collaborator Author

The design decision here is to use image_{landsat,naip,sentinel} as the key so we can retain support for kornia.augmentation.AugmentationSequential auto-detecting the type of key/value pairs. Not sure how important this actually is since many augmentations like Normalize will be unique to each image, but it could be useful for augmentations like RandomCrop.
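To illustrate the naming decision: a prefix-based dispatcher can treat every image_* entry as an image, whereas a bare sensor name would not be recognized. The sketch below is a plain-Python illustration of that idea, not kornia's implementation; infer_data_key, the prefix list, and the mask_land_cover key are hypothetical.

```python
# Hypothetical prefix list, for illustration only.
KNOWN_PREFIXES = ('image', 'mask', 'bbox', 'keypoints')


def infer_data_key(key: str) -> str:
    """Map a sample key to a data type based on its prefix."""
    for prefix in KNOWN_PREFIXES:
        if key.startswith(prefix):
            return prefix
    raise ValueError(f'cannot infer data type for key: {key!r}')


# Keys following the image_{sensor} convention, plus one made-up label key.
sample_keys = ['image_landsat', 'image_naip', 'image_sentinel2', 'mask_land_cover']
inferred = {key: infer_data_key(key) for key in sample_keys}
```

With this convention a spatial augmentation such as RandomCrop could be applied to every image_* entry, while a key like 'landsat' alone would raise a ValueError.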

@github-actions github-actions bot added the testing Continuous integration testing label Aug 26, 2024
@adamjstewart adamjstewart marked this pull request as ready for review August 26, 2024 13:23
@adamjstewart adamjstewart marked this pull request as draft August 27, 2024 14:52
@adamjstewart adamjstewart modified the milestones: 0.6.0, 0.7.0 Aug 27, 2024
@adamjstewart
Collaborator Author

I'm curious if anyone has ever managed to successfully download https://github.com/allenai/satlas/blob/main/satlaspretrain_urls.txt because I have been trying for weeks and the download always dies in the middle.

@calebrob6
Member

Have you tried the aws-cli?

@adamjstewart
Collaborator Author

How do I convert these URLs to s3 equivalents?

@calebrob6
Member

calebrob6 commented Sep 3, 2024

aws s3 cp s3://ai2-public-datasets/satlas/satlas-dataset-v1-sentinel2-a.tar . --no-sign-request

You can also do:
aws s3 ls s3://ai2-public-datasets/satlas/ --no-sign-request

From working with Maxar Open Data on S3 and similar data on Azure -- using the aws s3 CLI and azcopy is much, much better than wget-ing the HTTPS URLs.
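For converting a whole list of URLs, the mapping to the s3:// form used in the commands above could be scripted. This is a rough sketch that assumes the entries in satlaspretrain_urls.txt use the virtual-hosted S3 form https://<bucket>.s3.amazonaws.com/<key>, which I have not verified; https_to_s3 is a hypothetical helper, and regional or path-style hosts are deliberately rejected rather than guessed at.

```python
from urllib.parse import urlparse

S3_HOST_SUFFIX = '.s3.amazonaws.com'


def https_to_s3(url: str) -> str:
    """Convert https://<bucket>.s3.amazonaws.com/<key> to s3://<bucket>/<key>."""
    parsed = urlparse(url)
    host = parsed.netloc
    if not host.endswith(S3_HOST_SUFFIX):
        # Regional or path-style URLs are out of scope for this sketch.
        raise ValueError(f'not a virtual-hosted S3 URL: {url}')
    bucket = host[: -len(S3_HOST_SUFFIX)]
    key = parsed.path.lstrip('/')
    return f's3://{bucket}/{key}'


# Assumed example URL, built from the bucket and key in the command above.
example = https_to_s3(
    'https://ai2-public-datasets.s3.amazonaws.com/satlas/satlas-dataset-v1-sentinel2-a.tar'
)
```

Each resulting URI can then be passed to aws s3 cp with --no-sign-request.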

@ando-shah

+1 to Caleb's suggestion. Best to run in a screen / tmux - mine took a long time, and unzipping s2 took a day or so (on slower disks)!


