
Benchmarking of GeoDataset for a paper result #81

Closed · 11 tasks

calebrob6 opened this issue Aug 10, 2021 · 13 comments · Fixed by #115
Labels: datasets (Geospatial or benchmark datasets), samplers (Samplers for indexing datasets)

@calebrob6 (Member) commented Aug 10, 2021

Datasets

We want to test several popular image sources, as well as both raster and vector labels.

  • NAIP + Chesapeake
  • Landsat + CDL
  • Sentinel + Canadian Building Footprints

There is also the question of which file formats to test. For example, sampling from GeoJSON can take 3 min per `__getitem__` call, whereas ESRI Shapefile takes only 1 sec (#69 (comment)).

Experiments

  • Data location: local vs. blob storage
  • Sampling strategy: random vs. 2-step random tile/chip vs. non-random grid sampling
  • Warping strategy: on-the-fly vs. pre-processing step
  • I/O strategy: load/warp entire file vs. load/warp/cache entire file vs. load/warp single window

For the warping strategy, we should test the following possibilities:

  • Already in correct CRS/resolution
  • Need to change CRS
  • Need to change resolution
  • Need to change both CRS and resolution

What is the upfront cost of these pre-processing steps?

Example notebook: https://gist.github.com/calebrob6/d9bc5609ff638d601e2c35a1ab0a2dec

@adamjstewart (Collaborator)

I think this will require a significant rework of our __getitem__ implementation. Right now, we warp and then merge/sample from a tile at the same time. If we want to benefit from the 2-step random tile/chip sampling strategy, we'll have to use an LRU cache on the entire tile after warping.

@adamjstewart (Collaborator) commented Aug 13, 2021

I think we can also consider the following I/O strategies:

  • load/warp entire file, no caching (worst case scenario)
  • load/warp entire file, caching (good default)
  • load/warp single window (does not allow for caching)

Merging should happen after the fact so that (tile 1, tile 2, tile 1 + 2) don't end up being 3 different entries in the cache.

I don't think we need to consider situations in which we:

  • warp the file on disk on-the-fly (violates principle of least surprise)
  • don't bother to warp (won't work if not already in correct CRS/res)
  • don't bother to merge (won't be correct near boundaries)
  • always return entire tiles instead of chips (not feasible for model/GPU memory)
  • load the entire dataset into memory beforehand (won't fit in RAM)

These strategies make sense for tile-based raster images, but are slightly more complicated for vector geometries or static regional maps. We may need to change the default behavior based on the dataset.

@adamjstewart (Collaborator)

For timing, we should choose some arbitrary epoch size, then experiment with various batch sizes and see how long it takes to load an entire epoch.
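A timing harness for this can be very small. This is a sketch, assuming a hypothetical `dataloader` that yields batches; epoch size and batch size are the knobs to vary across runs.

```python
import time

def time_epoch(dataloader):
    """Return (seconds, patches) for one full pass over the loader."""
    start = time.perf_counter()
    patches = 0
    for batch in dataloader:
        patches += len(batch)  # count patches, not batches
    return time.perf_counter() - start, patches
```

Dividing patches by seconds gives the patches/sec number used in the experiments below.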

@adamjstewart (Collaborator)

Here's where I'm currently stuck to remind myself when I next pick this up:

Our process right now is:

  1. Open filehandles for raw data (rasterio.open)
  2. Open filehandles for warped VRTs (rasterio.vrt.WarpedVRT)
  3. Merge VRTs to get an array (rasterio.merge.merge)
  4. Return array as tensor

Steps 1 and 2 don't actually do anything and are almost instantaneous. It isn't until you actually try to read() the data that warping occurs, and read() is called inside rasterio.merge.merge. If we want to cache this reading of warped data, we'll have to call vrt.read() ourselves.

Since rasterio.merge.merge only accepts filenames or filehandles as input, we'll basically need to implement our own merge algorithm that takes 1+ cached numpy arrays, creates a new array with the correct dimensions, and indexes the old arrays to copy the data. The hard part here will be keeping track of coordinates, nodata values, and merging correctly. See https://github.com/mapbox/rasterio/blob/master/rasterio/merge.py for the source code, most of which we'll need to do as well.
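The merge-over-cached-arrays idea can be sketched in pure numpy. This is a hypothetical helper, not torchgeo's or rasterio's implementation: it assumes all inputs already share the same CRS and resolution, takes each source as an (array, bounds) pair with bounds = (minx, miny, maxx, maxy), and mimics rasterio.merge.merge's default "first" precedence.

```python
import numpy as np

def merge_cached(sources, bounds, res, nodata=0):
    """Merge cached, already-warped 2D arrays into one array covering `bounds`.

    Earlier sources win where they overlap later ones ("first" method).
    """
    minx, miny, maxx, maxy = bounds
    height = round((maxy - miny) / res)
    width = round((maxx - minx) / res)
    dest = np.full((height, width), nodata, dtype=sources[0][0].dtype)
    filled = np.zeros((height, width), dtype=bool)
    for arr, (sminx, sminy, smaxx, smaxy) in sources:
        # Geographic offsets -> array indices (row 0 corresponds to maxy).
        row0 = round((maxy - smaxy) / res)
        col0 = round((sminx - minx) / res)
        rows = slice(max(row0, 0), min(row0 + arr.shape[0], height))
        cols = slice(max(col0, 0), min(col0 + arr.shape[1], width))
        src = arr[rows.start - row0:rows.stop - row0,
                  cols.start - col0:cols.stop - col0]
        mask = ~filled[rows, cols]          # only fill untouched pixels
        dest[rows, cols][mask] = src[mask]  # slicing returns a view
        filled[rows, cols] |= mask
    return dest
```

The real version would also need per-band data, nodata masks from the sources, and sub-pixel alignment, which is where most of rasterio/merge.py's complexity lives.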

@adamjstewart (Collaborator)

Another hurdle: the size of each array depends greatly on the dataset, but most are around 0.5 GB per file. We can't really assume users have >8 GB of RAM, which greatly limits our LRU cache size. We could use something like psutil to query the system memory, and hard-code the average file size for each dataset if we want to make things more flexible.
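The psutil idea might look like this. A sketch only: `AVG_TILE_BYTES` is the ~0.5 GB per-file estimate from above, the 50% fraction is an arbitrary default, and `suggest_cache_size` is a hypothetical helper.

```python
import psutil

AVG_TILE_BYTES = 500 * 1024**2  # ~0.5 GB per warped tile (estimate above)

def suggest_cache_size(fraction: float = 0.5) -> int:
    """Size the LRU cache to use at most `fraction` of available RAM."""
    available = psutil.virtual_memory().available
    return max(1, int(available * fraction) // AVG_TILE_BYTES)
```

The result could be passed as `maxsize` to an lru_cache wrapping the tile loader, with a per-dataset `AVG_TILE_BYTES` if we want to be more precise.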

@adamjstewart (Collaborator)

For now, I think we can rely on GDAL's internal caching behavior. When I read a VRT the second time around, it seems to be significantly faster. Still not as fast as reading the raw data or as indexing from a loaded array, but good enough for a first round of benchmarking. GDAL also lets you configure the cache size.
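The GDAL cache size can be bounded per-process from rasterio; a configuration sketch, where GDAL_CACHEMAX values below 100000 are interpreted as megabytes.

```python
import rasterio

# All reads inside this context share a ~512 MB GDAL block cache.
with rasterio.Env(GDAL_CACHEMAX=512):
    ...  # open datasets and read warped windows as usual
```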

@adamjstewart (Collaborator)

Preliminary results look very promising!
[figure: preliminary benchmark results]

@calebrob6 (Member, Author)

@adamjstewart, sketch of the full experiment:

  • Get Landsat scenes from several projections and CDL data.
    • Convert all to COG if they aren't already
  • Copy the files to blob storage, local SSD, local HDD
  • Create three GeoDataset instances, one for each dataset location
  • Record the number of patches per second you can read using each of the GeoDatasets with the different types of GeoSamplers
  • Record how long it takes to warp/reproject the Landsat scenes to align them to CDL (or vice versa) "by hand" with gdalwarp.
    • Also record the size of the resulting files.
    • Also record the nasty gdalwarp command you actually have to figure out and execute to do this.
    • Note: We can use this to extrapolate how much preprocessing you would need to do before training with a traditional DL library.
    • Note: We can use this pre-aligned data with a custom dataloader to see how many patches/second you could sample if you did all the preprocessing up front. Hopefully this number is similar to what you get with torchgeo (or at least not much larger).

@adamjstewart (Collaborator) commented Sep 7, 2021

@calebrob6 the above proposal covers the matrix of:

  • Data location: local SSD, local HDD, blob storage
  • Sampling strategy: RandomGeoSampler, RandomBatchGeoSampler, GridGeoSampler
  • I/O strategy: cached, not cached

There are a lot of additional constraints that we're currently skipping:

  • File format: GeoTIFF vs. HDF5, Shapefile vs. GeoJSON
  • Warping strategy: already in correct CRS/res, change CRS, change res, change CRS and res

Do you think it's fine to skip these for the sake of time? I doubt reviewers would outright reject us for not including one of these permutations, and they can always ask us to perform additional experiments if they want.

Also, we should definitely benchmark not only RasterDataset but also VectorDataset (maybe Sentinel + Canadian Building Footprints?). Should I purposefully change the resolution of one of these datasets? Should I purposefully switch to a CRS different than all files or keep the CRS of one of the files?

@adamjstewart (Collaborator)

Also, do we want to compare with different batch_sizes or different num_workers?

@calebrob6 (Member, Author) commented Sep 7, 2021

I'd run the first matrix as quickly as possible because those results are going to be very informative. If that all works out, you can repeat the same with a VectorDataset.

File format: GeoTIFF vs. HDF5, Shapefile vs. GeoJSON

I don't think this is important right now. That is, we can just assume the data is in a good format (COG and Shapefile/GeoPackage).

Warping strategy

In the above sketch you can repeat the experiments with the manually aligned versions of the dataset to test the "already in correct CRS/res" case. The first set of experiments is with "change CRS and res". It might be interesting to see if warping or resampling is more expensive, but not interesting for the paper I think.

Also, do we want to compare with different batch_sizes or different num_workers?

Sure! These experiments should be very quick to run once you have a script for them.

@adamjstewart adamjstewart added datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets labels Sep 8, 2021
@calebrob6 (Member, Author) commented Sep 12, 2021

Some things to discuss soon:

  • How to benchmark CDL/Landsat from blob containers?
  • How to compare patches/sec to something that people might understand (torchvision.datasets.ImageNet seems like a good idea)?
  • How to handle the no warp/reproject case (e.g. do we warp/crop CDL to each of the Landsat scenes so that the pixels align)?

@adamjstewart (Collaborator)

We're following up on this discussion in #1330 (comment)
