Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TileDatasets #1353

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

TileDatasets #1353

wants to merge 5 commits into from

Conversation

calebrob6
Copy link
Member

  • Introduces a new class of datasets called TileDatasets that are indexed by filename, xoffset, yoffset, and patch size.
  • Implements samplers for these
  • Implements a L7IrishDataModule using this scheme

@github-actions github-actions bot added datamodules PyTorch Lightning datamodules datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets scripts Training and evaluation scripts labels May 20, 2023
@adamjstewart
Copy link
Collaborator

Thanks for opening this proof of concept!

There's a few questions here:

  1. Do we want to add this base class to TorchGeo?
  2. Do we want all of our curated benchmark GeoDatasets to subclass this?
  3. Do we want to search for files or pass in a list of files?

My current opinions:

  1. I think we'll likely want something like this in TorchGeo
  2. This one is more nuanced. Using this for all benchmark GeoDatasets lets us avoid reprojection, but prevents us from combining those datasets with other GeoDatasets. We'll have to decide how important this functionality is. Depending on how much time I have this summer, I may try to tackle GeoDataset: avoid unnecessary reprojection #409, which would make the gains less.
  3. I know you've been looking for something like this for a long time. I would like to be consistent here. Either Raster/Vector/Tile search for files, or get passed a list of files, or support both options.

@calebrob6
Copy link
Member Author

Do we want to add this base class to TorchGeo?

Maybe -- although it feels like we should be able to get there by making RasterDataset and the Samplers more complex. I think I sketched out a method for allowing RasterDataset __getitem__ to take in two different types of bounding boxes, the current one, and the one used in this implementation (call these BoundingBox and Patch or something -- note you can convert from Patch to BoundingBox). If we're in a situation where there are overlapping datasets or something you can just stick with the current implementation.

Do we want all of our curated benchmark GeoDatasets to subclass this?

Depends on the first question

Do we want to search for files or pass in a list of files?

We've talked about this a bunch of times before -- I almost never want to search a file system and keep things that match a regex. The nice thing about this is that you can rename fns to uris and pass it lists of COG files on remote servers and it will work, while you definitely can't do that with glob.

@github-actions github-actions bot added the trainers PyTorch Lightning trainers label May 21, 2023
else:
experiment_name = f"{model}_{lr}_{loss}_{wd}_{weights}_{seed}"

config_file = os.path.join("conf", "l7irishtile.yaml")
Copy link
Collaborator

@nilsleh nilsleh May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make one run_{downstream_task}.py script that has the config file name as an additional variable at the beginning of the file? Then we can use this one script for all the different downstream tasks, or are there some differences beyond the config files that we need to account for?

@adamjstewart adamjstewart added this to the 0.5.0 milestone May 25, 2023
@adamjstewart adamjstewart removed this from the 0.5.0 milestone Sep 28, 2023
@isaaccorley
Copy link
Collaborator

I'd like to revive this one but maybe we can also make the samplers work with non georeferenced images. Some of the datasets we have like GID-15, LEVIR-CD, etc. are large images that I may want to sample smaller patches from for training but I don't want to manually preprocess them to a specific patch size beforehand in case I want to run an ablation by varying the patch size.

@calebrob6
Copy link
Member Author

@adamjstewart -- do you have any more recent thoughts on this? I use these types of datasets somewhat frequently in my day-to-day work and would love to have them in TorchGeo, but also don't have a burning passion to merge it

@adamjstewart
Copy link
Collaborator

Overall, I think I would be okay with this approach. It's kind of like what other libraries did before TorchGeo existed. I'm very okay with this for datasets like GID-15 and LEVIR-CD where the images are large but there are no geocoordinates. I'm slightly okay with this for datasets like L7 Irish where the images are large and there are geocoordinates.

Fancy is cool, but fast is better. Especially if you're already regularly using datasets like this. I want to make sure TorchGeo is actually useful to its intended audience, including its maintainers. If the maintainers aren't using the builtin datasets, we're doing something wrong.

Depending on how much time I have this summer, I may try to tackle GeoDataset: avoid unnecessary reprojection #409, which would make the gains less.

Well this did not age well...

it feels like we should be able to get there by making RasterDataset and the Samplers more complex.

I don't think this is possible. If we subclass from GeoDataset, we are declaring that the dataset can interoperate with all other GeoDatasets. TileDataset really does deserve its own base class and samplers. I'm not even sure if samplers are the right approach here. This requires a lot of rethinking. How do people normally use these kind of datasets for ML? Are there other ideas from high-res imagery we can borrow from? Presumably we are not the only ones using large images.

Do we want to search for files or pass in a list of files?

We now support this, yay!

Implements a L7IrishDataModule using this scheme

Note that I converted L7 Irish to an IntersectionDataset in 0.6.0. This significantly cuts down on the code (no more getitem). Just an aside, not really relevant to the discussion.

@calebrob6
Copy link
Member Author

calebrob6 commented Sep 4, 2024

Note that this would fix all issues with ChesapeakeCVPR as ChesapeakeCVPR should be a tile dataset!
The unchipped version of LandCoverAI and OSCD (more generally, any dataset where we currently use NCrops) could also be tile datasets. Any situation where you have image tiles of different shapes, that have no CRS information (or many different CRSs) you can treat as a tile dataset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datamodules PyTorch Lightning datamodules datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets scripts Training and evaluation scripts trainers PyTorch Lightning trainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants