
Add load of hydrobasins data #919

Merged: 44 commits, Nov 29, 2023

Conversation

ekatef
Member

@ekatef ekatef commented Nov 5, 2023

Relates to #914

Changes proposed in this Pull Request

Add functionality to load hydrobasins data directly from the data source.

Checklist

  • I consent to the release of this PR's code under the AGPLv3 license and non-code contributions under CC0-1.0 and CC-BY-4.0.
  • I tested my contribution locally and it seems to work fine.
  • Code and workflow changes are sufficiently documented.
  • Newly introduced dependencies are added to envs/environment.yaml and doc/requirements.txt.
  • Changes in configuration options are added in all of config.default.yaml and config.tutorial.yaml.
  • Add a test config or line additions to test/ (note tests are changing the config.tutorial.yaml)
  • Changes in configuration options are also documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
  • A note for the release notes doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.

@ekatef
Member Author

ekatef commented Nov 5, 2023

Currently, the loaded data sources are hardcoded in bundle_config.yaml, which can potentially lead to trouble in case of mismatches with hydro -> resource -> hydrobasins in config.yaml.

A possible improvement we discussed is to take the config parameters into account. However, the hydrobasins files used are generally not very big, and config.default already requests the biggest of them. Would it perhaps be a better solution to simply load all hydrobasins files which can be used? That would also be consistent with the current approach.

@ekatef
Member Author

ekatef commented Nov 5, 2023

As a long-term solution, a new basins dataset by the Potsdam climate institute may be of interest.

Member

@davide-f davide-f left a comment

Thanks Katia :D
I've added some thoughts on the hydrobasin, let me know if they are clear and what is your opinion on that :)

@@ -418,6 +420,11 @@ def dlProgress(count, blockSize, totalSize, roundto=roundto):
if data is not None:
data = urllib.parse.urlencode(data).encode()

if headers:
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
Member

Maybe the content of this may be headers itself, so we make the function quite general.
What do you think?

Member Author

Maybe the content of this may be headers itself, so we make the function quite general. What do you think?

Agree and fixed.
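The generalization agreed on above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: `make_opener` is a made-up helper name showing how a caller-supplied `headers` list can replace the hard-coded Mozilla/5.0 entry.

```python
import urllib.request


def make_opener(headers=None):
    # Hypothetical helper: build an opener and attach any caller-supplied
    # headers, instead of hard-coding the User-agent inside the function.
    opener = urllib.request.build_opener()
    if headers:
        opener.addheaders = headers
    return opener


# A caller that needs a browser-like User-agent passes it explicitly:
opener = make_opener([("User-agent", "Mozilla/5.0")])
```

This keeps the download helper general: other callers can pass different headers, or none at all.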

Comment on lines 340 to 346
progress_retrieve(url, file_path, disable_progress=disable_progress)
progress_retrieve(
url, file_path, headers=True, disable_progress=disable_progress
)

# if the file is a zipfile and unzip is enabled
# then unzip it and remove the original file
if config.get("unzip", False):
if config.get("unzip", False) or bool(re.search(".zip$", file_path)):
Member

Unfortunately, I believe that for hydrobasins, we need a specific different function, because:

  1. the hydrobasins bundles to download depend on the region and level option, which is not easy to handle with bundles without overcomplicating
  2. the output shape files may be merged: for a run on countries across continents, the shape files need to be merged

I'd suggest we focus on creating such function(s) first and then integrate them into the workflow.
What do you think?

Here, the or bool(re.search(".zip$", file_path)) is unnecessary: if you specify unzip: false in the bundle that you added, this line is not needed.

Member Author

Agree that a dedicated hydrobasins function would be helpful. Added one.

Fixed the condition with unzip :)

category: common
destination: "data/hydrobasins"
urls:
direct: https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev06_v1c.zip
Member

I comment here, and below I add some examples.
We could define a new key like "urls: hydrobasins:" to distinguish this download procedure, which is different from the others.
The new key "hydrobasins" requires the definition of a new function like download_and_unzip_hydrobasins.

Then, the url of the hydrobasins may be: "https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev{:02d}_v1c.zip"

Inside the new download_and_unzip_hydrobasins, we need to do url_hbasin = urls["hydrobasins"].format({level from config file})

With this format, if we add one bundle_hydrobasin for every region on the hydrobasins website, this workflow will automatically download all of them. Note that the output file of each databundle shall account for the region (e.g. AF) and shall be filled with the level value similarly to above.

Then, we need to merge them into a unique shape file. To do so, we can create a dedicated rule: the function takes as input the list of hydro shape files, computed with a dedicated function, and outputs the merged file.
Alternatively, the merging process can be included in the retrieve databundle rule, if that's easier to start with for you.
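The URL templating suggested above can be sketched like this (a hypothetical helper; the region suffixes are the HydroBASINS region codes, and `BASE_URL` mirrors the pattern quoted in the comment with a region placeholder added):

```python
# Hypothetical sketch of the suggested URL templating.
BASE_URL = (
    "https://data.hydrosheds.org/file/HydroBASINS/standard/"
    "hybas_{rg}_lev{level:02d}_v1c.zip"
)
REGIONS = ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]


def hydrobasins_urls(level):
    # One URL per HydroBASINS region bundle for the requested level.
    return [BASE_URL.format(rg=rg, level=level) for rg in REGIONS]


urls = hydrobasins_urls(6)
# urls[0] ends with "hybas_af_lev06_v1c.zip"
```

With this, downloading all regions for the configured level is a simple loop over `urls`.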

Member Author

Absolutely agree that a dedicated download function would be helpful, and great point about the merging procedure!

Regarding a special treatment for different regions, not sure I get your point right. Currently, we have a single file of global coverage, and it seems to work pretty well. Why wouldn't we reproduce the same approach, regardless of the requested region? The global dataset is about 5MB, which is ~200 times smaller compared with our environment 🙃

Do you mean that it doesn't look polite if we re-create a chunk of the dataset locally by default? 🙂 In this case I do agree that it makes absolute sense to go for the approach you suggest.

Member

The size of the dataset depends on the level. I've done a scan and the global v12 level may be around 500MB zipped, but it is unlikely to be commonly used.
We can go for it :) I like the proposal!

However, unfortunately there is no global bundle in the hydrobasins, we need to manually download all regions and merge them regardless.

Member Author

Thanks :)

Mmm... Probably, it could make sense to add a warning if the data to be loaded exceeds some 100MB.

No problem with creating a merged dataset :)
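The size-warning idea above could look roughly like this (the threshold, function name, and return convention are all made up for illustration; the PR may implement this differently or not at all):

```python
import logging

SIZE_WARNING_MB = 100  # illustrative threshold from the discussion


def check_download_size(size_bytes):
    # Hypothetical helper: warn when the requested hydrobasins data
    # exceeds the threshold; return True if a warning was emitted.
    size_mb = size_bytes / 1e6
    if size_mb > SIZE_WARNING_MB:
        logging.warning(
            "Hydrobasins download is %.0f MB (> %d MB); "
            "consider a coarser level.",
            size_mb,
            SIZE_WARNING_MB,
        )
        return True
    return False
```

For example, the ~500 MB level-12 bundle mentioned above would trigger the warning, while the ~5 MB coarse bundle would not.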

@ekatef
Member Author

ekatef commented Nov 7, 2023

Thanks Katia :D I've added some thoughts on the hydrobasin, let me know if they are clear and what is your opinion on that :)

Thanks a lot, Davide! It has been tremendously helpful 😄

Fixed technical points and happy to create a dedicated rule. It will be also useful for pre-processing of new PIK dataset 🙂

Member

@davide-f davide-f left a comment

Cool @ekatef :D
With small changes, the new function allows downloading each hydrobasins databundle.
As a next step, we need to make sure all databundles for Earth are downloaded and merged together :) cool! :D
This is a great step towards making PyPSA-Earth better structured :D

Comment on lines 339 to 342
logger.info(f"Downloading resource '{resource}' from cloud '{url}'.")
progress_retrieve(url, file_path, disable_progress=disable_progress)
progress_retrieve(
url, file_path, headers=True, disable_progress=disable_progress
)
Member

Maybe we don't need this change?

Member Author

Thanks for the catch! Fixed.

download_and_unzip_basins(config, rootpath, dest_path, hot_run=True,
disable_progress=False)

Function to download and unzip the data for hydrobasins.
Member

May be nice to add the path to the data and the licensing

Member Author

@ekatef ekatef Nov 15, 2023

Well, it has been definitely a good idea to look into the license conditions:

The following copyright statement must be displayed with, attached to or embodied in (in a reasonably prominent manner) the documentation or metadata of any Licensee Product or Program provided to an End User when utilizing the Licensed Materials: ...

Have added the required statement to the docstring. Do you agree?


basins_fl = snakemake.config["renewable"]["hydro"]["resource"]["hydrobasins"]
level_pattern = r".*?lev(.*)_.*"
level_code = re.findall(level_pattern, basins_fl)
Member

Cool that you figured out how to identify the level from the path!
However, the level used later in level_code for the download should be loaded from the config.yaml file.

Member Author

Have been trying to avoid changing the config :D
But you are absolutely right from the long-term perspective. Adjusted.
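For reference, the regex in the snippet above extracts the two-digit level code from the configured shapefile path (the path below is an example value; the real one comes from the config):

```python
import re

# The pattern from the PR, applied to an example hydrobasins path.
basins_fl = "data/hydrobasins/hybas_world_lev06_v1c.shp"
level_pattern = r".*?lev(.*)_.*"
level_code = re.findall(level_pattern, basins_fl)
# level_code == ["06"], so int(level_code[0]) gives level 6
```

The non-greedy `.*?` skips everything up to the first "lev", and the greedy group backtracks to the last underscore, leaving just the digits.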


# if the file is a zipfile and unzip is enabled
# then unzip it and remove the original file
if config.get("unzip", False):
Member

It is always a zip file in the hydrobasins, right?
If so, maybe this if is not needed

Member Author

Fixed :)

Comment on lines 397 to 470
url, file_path, data=postdata, disable_progress=disable_progress
url,
file_path,
data=postdata,
header=header,
disable_progress=disable_progress,
Member

Maybe this change is not needed, where is header defined?

Member Author

A header is added in the function loading hydrobasins. Removed it from here. Thanks for the catch!

@ekatef
Member Author

ekatef commented Nov 15, 2023

Cool @ekatef :D With small changes, the new function allows downloading each hydrobasins databundle. As a next step, we need to make sure all databundles for Earth are downloaded and merged together :) cool! :D This is a great step towards making PyPSA-Earth better structured :D

Thanks for checking it @davide-f! Have implemented the fixes.

@ekatef ekatef mentioned this pull request Nov 17, 2023
@ekatef
Member Author

ekatef commented Nov 17, 2023

The current approach seems to be functional, but pydevd is complaining about low performance, although the computational times don't feel very bad.

Results of initial testing on level 2:

[screenshot]

To be tested:

  1. on finer resolution for basins shapes;
  2. for the whole workflow.

@ekatef
Member Author

ekatef commented Nov 18, 2023

Testing on the 12th basins level gives the whole coverage:

[screenshot]

with a nice level of regional details:

[screenshot]

The file size for the global dataset at the 12th hydrobasins level is 1.32 GB.

@ekatef
Member Author

ekatef commented Nov 18, 2023

CI currently fails, which may be due to some mismatch in the configuration parameters. Have to check it. Apart from that, it may be a good idea to add a progress bar to track the hydrobasins merge.

@davide-f
Member

After checking the structure of the merged file, it looks like some basins may be duplicated by the current merge approach. In particular, when testing level 4, there is an entry that appears twice with the same id and exactly the same geometry (a basin which belongs to two different macro-regions?):

778,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...
2377,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...

That leads to atlite not being happy, and most likely causes the current CI troubles.

The current merge approach should be supplemented by checking for duplicates.

Interesting and fantastic job!
Do the entries have an ID?
It would be great to remove duplicated ids rather than performing geospatial checks for overlaps...

Member

@davide-f davide-f left a comment

Cool :D converging :)

Comment on lines 151 to 152
hydrobasins: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
Member

Indeed :)
The purpose of the flag urls would be to give one entry for alternative options to download the same bundle.
That's why I'd be prone to keep one entry (hydrobasins) under urls and potentially add multiple entries underneath, such as:

Suggested change
hydrobasins: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
bundle_hydrobasins:
countries: [Earth]
tutorial: false
category: common
destination: "data/hydrobasins"
urls:
hydrobasins:
base_url: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]

Comment on lines 409 to 433
if hot_run:
if os.path.exists(file_path):
os.remove(file_path)

try:
logger.info(
f"Downloading resource '{resource}' for hydrobasins in '{rg}' from cloud '{url}'."
)
progress_retrieve(
url,
file_path,
headers=[("User-agent", "Mozilla/5.0")],
disable_progress=disable_progress,
)

with ZipFile(file_path, "r") as zipfile:
zipfile.extractall(config["destination"])

os.remove(file_path)
logger.info(f"Downloaded resource '{resource}' from cloud '{url}'.")
except:
logger.warning(
f"Failed download resource '{resource}' from cloud '{url}'."
)
return False
Member

My suggestion was to propose calling
download_and_unzip_direct(..., ..., hot_run=hot_run, disable_progress=disable_progress)
within this for loop, rather than copying the code again, but we can go for the duplication if it is easier

Comment on lines 726 to 728
os.path.join(
basins_path, "hybas_world_lev{:02d}_v1c.shp".format(int(hydrobasins_level))
)
Member

This path should be taken from the config

Member Author

My suggestion was to propose calling
download_and_unzip_direct(..., ..., hot_run=hot_run, disable_progress=disable_progress)
within this for loop, rather than copying the code again, but we can go for the duplication if it is easier

Thanks for the explanation! I have investigated a bit, and it looks like there may be a way to avoid duplication, but for that we'd need to encapsulate part of download_and_unzip_direct. The point is that currently all the downloading functions use the same argument list config, rootpath, hot_run=True, disable_progress=False, which makes it possible to use a wildcard when calling the download function:

download_and_unzip = globals()[f"download_and_unzip_{host}"]
if download_and_unzip(
config_bundles[b_name], rootpath, disable_progress=disable_progress
):
downloaded_bundle = True

config is common to all download_and_unzip_* functions, while the processing of the config parameters differs between functions. So, I have added a download_and_unpack function to replace the duplicated functionality in both download_and_unzip_direct and download_and_unzip_hydrobasins. Would you agree with this approach?
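The wildcard dispatch described above can be illustrated with stub functions (the function names follow the naming convention quoted in the snippet; the bodies are placeholders, not the PR's real download logic):

```python
# Stub downloaders sharing the same signature, so the caller can
# dispatch by host name via globals(), as in retrieve_databundle.
def download_and_unzip_direct(config, rootpath, hot_run=True, disable_progress=False):
    return True  # placeholder for the real download logic


def download_and_unzip_hydrobasins(config, rootpath, hot_run=True, disable_progress=False):
    return True  # placeholder


def dispatch(host, config, rootpath, disable_progress=False):
    # Look up the handler by name; every handler must share the
    # same argument list for this to work.
    download_and_unzip = globals()[f"download_and_unzip_{host}"]
    return download_and_unzip(config, rootpath, disable_progress=disable_progress)
```

This is why the shared argument list matters: any signature divergence between handlers would break the single dispatch call site.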

Member Author

This path should be taken from the config

An understanding question: do you mean using snakemake.config["renewable"]["hydro"]["resource"]["hydrobasins"]?

Agree that it's the major parameter transferred to atlite. However, there is also the hydrobasins_level parameter, which basically duplicates hydrobasins. So, it's perfectly possible to create a data mismatch in config.yaml: e.g. set hydrobasins_level: 4 along with hydrobasins: data/hydrobasins/hybas_world_lev06_v1c.shp. If we take the file name from the config, this error will remain unnoticed, while if the actual hydrobasins_level is used to generate the merged file, the prepared data stays consistent and an error will be thrown while building renewable profiles. To me, this design looks more error-safe.

What is your feeling about that? (Sorry for bringing in complexity which is probably not needed here 🙃)

Member

@davide-f davide-f Nov 28, 2023

I quite believe that in our case we should have a file like data/hydrobasins/hybas_world.shp that is independent of the level.
It is the merge of all hydrobasins shapes regardless of the level.
If the level is modified, the function is triggered again; that should simplify the procedure.
Alternatively, we need to create a specific rule that merges the hydrobasins shapes.

@ekatef
Member Author

ekatef commented Nov 21, 2023

After checking the structure of the merged file, it looks like some basins may be duplicated by the current merge approach. In particular, when testing level 4, there is an entry that appears twice with the same id and exactly the same geometry (a basin which belongs to two different macro-regions?):

778,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...
2377,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...

That leads to atlite not being happy, and most likely causes the current CI troubles.
The current merge approach should be supplemented by checking for duplicates.

Interesting and fantastic job! Do the entries have an ID? It would be great to remove duplicated ids rather than performing geospatial checks for overlaps...

Totally agree :D Yeah, the basins have ids, and pandas drop_duplicates seems to do the trick.
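A minimal sketch of that de-duplication (using a plain pandas DataFrame here for illustration; the actual code operates on a GeoDataFrame, and the geometry strings below are truncated placeholders):

```python
import pandas as pd

# Two regional bundles contributing the same basin id, as in the
# level-4 example above.
merged = pd.DataFrame(
    {
        "HYBAS_ID": [1040029810, 1040029810, 1040029811],
        "geometry": ["MULTIPOLYGON (((10.27...", "MULTIPOLYGON (((10.27...", "MULTIPOLYGON (((12.00..."],
    }
)

# Keep the first occurrence of each basin id; no geospatial
# overlap checks needed.
deduped = merged.drop_duplicates(subset="HYBAS_ID")
```

Dropping on the id column is cheap and sidesteps geometry comparisons entirely, which is exactly the point made above.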

@ekatef
Member Author

ekatef commented Nov 22, 2023

Cool :D converging :)

Thanks a lot for the review and absolutely agree ;)

Testing the whole workflow, as currently the PR breaks CI. Once done, I'll introduce the changes you suggest (absolutely agree with them) and shall also try to add a clear error message in case the url is not available.

@ekatef
Member Author

ekatef commented Nov 25, 2023

Cool :D converging :)

Thanks! :)
Have introduced the changes; there are two points I'm not completely sure about and have added comments on them. Happy to discuss them if you feel it'd be easier.

Comment on lines 144 to 157
# global data for hydrobasins
bundle_hydrobasins:
countries: [Earth]
tutorial: false
category: common
destination: "data/hydrobasins"
urls:
hydrobasins:
base_url: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
unzip: true
output:
- data/hydrobasins/*.shp

Member

As comments:

  • the output file could be hybas_world_v1c.shp, which represents the merged file; the value should not be loaded from the config, but defined here
  • we could add a databundle for the tutorial, which is the same as this one but limited to Africa. So the changes would be: tutorial=true, countries=["Africa"], and suffixes=["af"]

Pretty cool! :D
Once this works, we can think of creating the databundles for the tutorials and plan to go for zenodo.
We can try the zenodo sandbox again and, when it works, move to the default.

@davide-f
Member

To avoid the comment being lost, the following comment (also posted above) is repeated here:

I quite believe that in our case we should have a file like data/hydrobasins/hybas_world.shp that is independent of the level.
It is the merge of all hydrobasins shapes regardless of the level.
If the level is modified, the function is triggered again; that should simplify the procedure.
Alternatively, we need to create a specific rule that merges the hydrobasins shapes.

@davide-f davide-f merged commit 69d4064 into pypsa-meets-earth:main Nov 29, 2023
4 checks passed
@davide-f
Member

Merging :D

@ekatef ekatef deleted the load_hydrobasins branch December 26, 2023 08:51