
Add load of hydrobasins data #919

Merged: 44 commits, Nov 29, 2023

Conversation

ekatef
Member

@ekatef ekatef commented Nov 5, 2023

Relates to #914

Changes proposed in this Pull Request

Add functionality to load hydrobasins data directly from the data source.

Checklist

  • I consent to the release of this PR's code under the AGPLv3 license and non-code contributions under CC0-1.0 and CC-BY-4.0.
  • I tested my contribution locally and it seems to work fine.
  • Code and workflow changes are sufficiently documented.
  • Newly introduced dependencies are added to envs/environment.yaml and doc/requirements.txt.
  • Changes in configuration options are added in all of config.default.yaml and config.tutorial.yaml.
  • Add a test config or line additions to test/ (note tests are changing the config.tutorial.yaml)
  • Changes in configuration options are also documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
  • A note for the release notes doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.

@ekatef
Member Author

ekatef commented Nov 5, 2023

Currently, the loaded data sources are hardcoded in bundle_config.yaml, which can potentially lead to trouble in case of mismatches with hydro -> resource -> hydrobasins in config.yaml.

A possible improvement we discussed is to take the config parameters into account. However, the hydrobasins files used are generally not very big, and config.default already requests the biggest of them. Would it perhaps be a better solution to simply load all hydrobasins files which can be used? That would also be consistent with the current approach.

@ekatef
Member Author

ekatef commented Nov 5, 2023

As a long-term solution, a new basins dataset by the Potsdam climate institute may be of interest.

Member

@davide-f davide-f left a comment

Thanks Katia :D
I've added some thoughts on the hydrobasin, let me know if they are clear and what is your opinion on that :)

@@ -418,6 +420,11 @@ def dlProgress(count, blockSize, totalSize, roundto=roundto):
if data is not None:
data = urllib.parse.urlencode(data).encode()

if headers:
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
Member

Maybe the content of this may be headers itself, so we make the function quite general.
What do you think?

Member Author

Maybe the content of this may be headers itself, so we make the function quite general. What do you think?

Agree and fixed.
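The generalization agreed on above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: `make_opener` is a made-up helper name showing how a caller-supplied `headers` list can replace the hard-coded Mozilla/5.0 entry.

```python
import urllib.request


def make_opener(headers=None):
    # Hypothetical helper: build an opener and attach any caller-supplied
    # headers, instead of hard-coding the User-agent inside the function.
    opener = urllib.request.build_opener()
    if headers:
        opener.addheaders = headers
    return opener


# A caller that needs a browser-like User-agent passes it explicitly:
opener = make_opener([("User-agent", "Mozilla/5.0")])
```

This keeps the download helper general: other callers can pass different headers, or none at all.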

Comment on lines 340 to 346
progress_retrieve(url, file_path, disable_progress=disable_progress)
progress_retrieve(
url, file_path, headers=True, disable_progress=disable_progress
)

# if the file is a zipfile and unzip is enabled
# then unzip it and remove the original file
if config.get("unzip", False):
if config.get("unzip", False) or bool(re.search(".zip$", file_path)):
Member

Unfortunately, I believe that for hydrobasins, we need a specific different function, because:

  1. the hydrobasins bundles to download depend on the region and level option, which is not easy to handle with bundles without overcomplicating
  2. the output shape files may be merged: for a run on countries across continents, the shape files need to be merged

I'd suggest we focus on creating such function(s) first and then integrate them into the workflow.
What do you think?

Here, the or bool(re.search(".zip$", file_path)) is unnecessary: if you specify unzip: false in the bundle that you added, this line is not needed.

Member Author

Agree that a dedicated hydrobasins function would be helpful. Added one.

Fixed the condition with unzip :)

category: common
destination: "data/hydrobasins"
urls:
direct: https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev06_v1c.zip
Member

I comment here, and below I add some examples.
We could define a new key like "urls: hydrobasins:" to distinguish this download procedure, which is different from the others.
The new key "hydrobasins" requires the definition of a new function like download_and_unzip_hydrobasins.

Then, the url of the hydrobasins may be: "https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev{:02d}_v1c.zip"

Inside the new download_and_unzip_hydrobasins, we need to do url_hbasin = urls["hydrobasins"].format({level from config file})

With this format, if we add one bundle_hydrobasin for every region on the hydrobasins website, this workflow will automatically download all of them. Note that the output file of each databundle shall account for the region (e.g. AF) and shall be filled with the level value similarly to above.

Then, we need to merge them into a unique shape file. To do so, we can create a dedicated rule: the function takes as input the list of hydro shape files, computed with a dedicated function, and outputs the merged file.
Alternatively, the merging process can be included in the retrieve databundle rule, if that's easier to start with for you.
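The URL templating suggested above can be sketched like this (a hypothetical helper; the region suffixes are the HydroBASINS region codes, and `BASE_URL` mirrors the pattern quoted in the comment with a region placeholder added):

```python
# Hypothetical sketch of the suggested URL templating.
BASE_URL = (
    "https://data.hydrosheds.org/file/HydroBASINS/standard/"
    "hybas_{rg}_lev{level:02d}_v1c.zip"
)
REGIONS = ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]


def hydrobasins_urls(level):
    # One URL per HydroBASINS region bundle for the requested level.
    return [BASE_URL.format(rg=rg, level=level) for rg in REGIONS]


urls = hydrobasins_urls(6)
# urls[0] ends with "hybas_af_lev06_v1c.zip"
```

With this, downloading all regions for the configured level is a simple loop over `urls`.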

Member Author

Absolutely agree that a dedicated download function would be helpful, and great point about the merging procedure!

Regarding a special treatment for different regions, not sure I get your point right. Currently, we have a single file of global coverage, and it seems to work pretty well. Why wouldn't we reproduce the same approach, regardless of the requested region? The global dataset is about 5MB, which is ~200 times smaller compared with our environment 🙃

Do you mean that it doesn't look polite if we re-create a chunk of the dataset locally by default? 🙂 In this case I do agree that it makes absolute sense to go for the approach you suggest.

Member

The size of the dataset depends on the level. I've done a scan and the global v12 level may be around 500MB zipped, but it is unlikely to be commonly used.
We can go for it :) I like the proposal!

However, unfortunately there is no global bundle in the hydrobasins, we need to manually download all regions and merge them regardless.

Member Author

Thanks :)

Mmm... Probably, it could make sense to add a warning if the data to be loaded exceeds some 100MB.

No problem with creating a merged dataset :)
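The size-warning idea above could look roughly like this (the threshold, function name, and return convention are all made up for illustration; the PR may implement this differently or not at all):

```python
import logging

SIZE_WARNING_MB = 100  # illustrative threshold from the discussion


def check_download_size(size_bytes):
    # Hypothetical helper: warn when the requested hydrobasins data
    # exceeds the threshold; return True if a warning was emitted.
    size_mb = size_bytes / 1e6
    if size_mb > SIZE_WARNING_MB:
        logging.warning(
            "Hydrobasins download is %.0f MB (> %d MB); "
            "consider a coarser level.",
            size_mb,
            SIZE_WARNING_MB,
        )
        return True
    return False
```

For example, the ~500 MB level-12 bundle mentioned above would trigger the warning, while the ~5 MB coarse bundle would not.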

@ekatef
Member Author

ekatef commented Nov 7, 2023

Thanks Katia :D I've added some thoughts on the hydrobasin, let me know if they are clear and what is your opinion on that :)

Thanks a lot, Davide! It has been tremendously helpful 😄

Fixed technical points and happy to create a dedicated rule. It will be also useful for pre-processing of new PIK dataset 🙂

Member

@davide-f davide-f left a comment

Cool @ekatef :D
With small changes, the new function allows downloading each hydrobasins databundle.
As a next step, we need to make sure all databundles for Earth are downloaded and merged together :) cool! :D
This is a great step towards making PyPSA-Earth better structured :D

Comment on lines 339 to 342
logger.info(f"Downloading resource '{resource}' from cloud '{url}'.")
progress_retrieve(url, file_path, disable_progress=disable_progress)
progress_retrieve(
url, file_path, headers=True, disable_progress=disable_progress
)
Member

Maybe we don't need this change?

Member Author

Thanks for the catch! Fixed.

download_and_unzip_basins(config, rootpath, dest_path, hot_run=True,
disable_progress=False)

Function to download and unzip the data for hydrobasins.
Member

May be nice to add the path to the data and the licensing

Member Author

@ekatef ekatef Nov 15, 2023

Well, it has been definitely a good idea to look into the license conditions:

The following copyright statement must be displayed with, attached to or embodied in (in a reasonably prominent manner) the documentation or metadata of any Licensee Product or Program provided to an End User when utilizing the Licensed Materials: ...

Have added the required statement to the docstring. Do you agree?


basins_fl = snakemake.config["renewable"]["hydro"]["resource"]["hydrobasins"]
level_pattern = r".*?lev(.*)_.*"
level_code = re.findall(level_pattern, basins_fl)
Member

Cool that you figured out how to identify the level from the path!
However, the level used later in level_code for the download should be loaded from the config.yaml file.

Member Author

Have been trying to avoid changing the config :D
But you are absolutely right from the long-term perspective. Adjusted.
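For reference, the regex in the snippet above extracts the two-digit level code from the configured shapefile path (the path below is an example value; the real one comes from the config):

```python
import re

# The pattern from the PR, applied to an example hydrobasins path.
basins_fl = "data/hydrobasins/hybas_world_lev06_v1c.shp"
level_pattern = r".*?lev(.*)_.*"
level_code = re.findall(level_pattern, basins_fl)
# level_code == ["06"], so int(level_code[0]) gives level 6
```

The non-greedy `.*?` skips everything up to the first "lev", and the greedy group backtracks to the last underscore, leaving just the digits.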


# if the file is a zipfile and unzip is enabled
# then unzip it and remove the original file
if config.get("unzip", False):
Member

It is always a zip file in the hydrobasins, right?
If so, maybe this if is not needed

Member Author

Fixed :)

Comment on lines 397 to 470
url, file_path, data=postdata, disable_progress=disable_progress
url,
file_path,
data=postdata,
header=header,
disable_progress=disable_progress,
Member

Maybe this change is not needed, where is header defined?

Member Author

A header is added in the function loading hydrobasins. Removed it from here. Thanks for the catch!

@ekatef
Member Author

ekatef commented Nov 15, 2023

Cool @ekatef :D With small changes, the new function allows downloading each hydrobasins databundle. As a next step, we need to make sure all databundles for Earth are downloaded and merged together :) cool! :D This is a great step towards making PyPSA-Earth better structured :D

Thanks for checking it @davide-f! Have implemented the fixes.

@ekatef ekatef mentioned this pull request Nov 17, 2023
@ekatef
Member Author

ekatef commented Nov 17, 2023

The current approach seems to be functional, but pydevd is complaining about low performance, although the computational times don't feel very bad.

Results of initial testing on level 2:

[screenshot]

To be tested:

  1. on finer resolution for basins shapes;
  2. for the whole workflow.

@ekatef
Member Author

ekatef commented Nov 18, 2023

Testing on the 12th basins level gives the whole coverage:

[screenshot]

with a nice level of regional details:

[screenshot]

The file size for the global dataset at the 12th hydrobasins level is 1.32 GB.

@ekatef
Member Author

ekatef commented Nov 18, 2023

CI currently fails, which may be due to some mismatch in the configuration parameters. Have to check it. Apart from that, it may be a good idea to add a progress bar to track the hydrobasins merge.

@davide-f
Member

After checking the structure of the merged file, it looks like some basins may be duplicated by the current merge approach. In particular, when testing level 4, there is an entry that appears twice with the same id and exactly the same geometry (a basin which belongs to two different macro-regions?):

778,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...
2377,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...

That leads to atlite not being happy, and most likely causes the current CI troubles.

The current merge approach should be supplemented by checking for duplicates.

Interesting and fantastic job!
Do the entries have an ID?
It would be great to remove duplicated ids rather than performing geospatial checks for overlaps...

Member

@davide-f davide-f left a comment

Cool :D converging :)

Comment on lines 151 to 152
hydrobasins: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
Member

Indeed :)
The purpose of the flag urls would be to give one entry for alternative options to download the same bundle.
That's why I'd be prone to keep one entry (hydrobasins) under urls and potentially add multiple entries underneath, such as:

Suggested change
hydrobasins: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
bundle_hydrobasins:
countries: [Earth]
tutorial: false
category: common
destination: "data/hydrobasins"
urls:
hydrobasins:
base_url: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]

Comment on lines 409 to 433
if hot_run:
if os.path.exists(file_path):
os.remove(file_path)

try:
logger.info(
f"Downloading resource '{resource}' for hydrobasins in '{rg}' from cloud '{url}'."
)
progress_retrieve(
url,
file_path,
headers=[("User-agent", "Mozilla/5.0")],
disable_progress=disable_progress,
)

with ZipFile(file_path, "r") as zipfile:
zipfile.extractall(config["destination"])

os.remove(file_path)
logger.info(f"Downloaded resource '{resource}' from cloud '{url}'.")
except:
logger.warning(
f"Failed download resource '{resource}' from cloud '{url}'."
)
return False
Member

My suggestion was to propose calling
download_and_unzip_direct(..., ..., hot_run=hot_run, disable_progress=disable_progress)
within this for loop, rather than copying the code again, but we can go for the duplication if it is easier

Comment on lines 726 to 728
os.path.join(
basins_path, "hybas_world_lev{:02d}_v1c.shp".format(int(hydrobasins_level))
)
Member

This path should be taken from the config

Member Author

My suggestion was to propose calling
download_and_unzip_direct(..., ..., hot_run=hot_run, disable_progress=disable_progress)
within this for loop, rather than copying the code again, but we can go for the duplication if it is easier

Thanks for the explanation! I have investigated a bit, and it looks like there may be a way to avoid duplication, but for that we'd need to encapsulate part of download_and_unzip_direct. The point is that currently all the downloading functions use the same argument list config, rootpath, hot_run=True, disable_progress=False, which makes it possible to use a wildcard when calling the download function:

download_and_unzip = globals()[f"download_and_unzip_{host}"]
if download_and_unzip(
config_bundles[b_name], rootpath, disable_progress=disable_progress
):
downloaded_bundle = True

config is common to all download_and_unzip_* functions, while the processing of the config parameters differs between functions. So, I have added a download_and_unpack function to replace the duplicated functionality in both download_and_unzip_direct and download_and_unzip_hydrobasins. Would you agree with this approach?
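The wildcard dispatch described above can be illustrated with stub functions (the function names follow the naming convention quoted in the snippet; the bodies are placeholders, not the PR's real download logic):

```python
# Stub downloaders sharing the same signature, so the caller can
# dispatch by host name via globals(), as in retrieve_databundle.
def download_and_unzip_direct(config, rootpath, hot_run=True, disable_progress=False):
    return True  # placeholder for the real download logic


def download_and_unzip_hydrobasins(config, rootpath, hot_run=True, disable_progress=False):
    return True  # placeholder


def dispatch(host, config, rootpath, disable_progress=False):
    # Look up the handler by name; every handler must share the
    # same argument list for this to work.
    download_and_unzip = globals()[f"download_and_unzip_{host}"]
    return download_and_unzip(config, rootpath, disable_progress=disable_progress)
```

This is why the shared argument list matters: any signature divergence between handlers would break the single dispatch call site.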

Member Author

This path should be taken from the config

An understanding question: do you mean using snakemake.config["renewable"]["hydro"]["resource"]["hydrobasins"]?

Agree that it's the major parameter transferred to atlite. However, there is also the hydrobasins_level parameter, which basically duplicates hydrobasins. So, it's perfectly possible to create a data mismatch in config.yaml: e.g. set hydrobasins_level: 4 along with hydrobasins: data/hydrobasins/hybas_world_lev06_v1c.shp. If we take the file name from the config, this error will remain unnoticed, while if the actual hydrobasins_level is used to generate the merged file, the prepared data stays consistent and an error will be thrown while building renewable profiles. To me, this design looks more error-safe.

What is your feeling about that? (Sorry for bringing in complexity which is probably not needed here 🙃)

Member

@davide-f davide-f Nov 28, 2023

I quite believe that in our case we should have a file like data/hydrobasins/hybas_world.shp that is independent of the level.
It is the merge of all hydrobasins shapes regardless of the level.
If the level is modified, the function is triggered again; that should simplify the procedure.
Alternatively, we need to create a specific rule that merges the hydrobasins shapes.

@ekatef
Member Author

ekatef commented Nov 21, 2023

After checking the structure of the merged file, it looks like some basins may be duplicated by the current merge approach. In particular, when testing level 4, there is an entry that appears twice with the same id and exactly the same geometry (a basin which belongs to two different macro-regions?):

778,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...
2377,1040029810,0,1040029810,1040029810,0.0,0.0,275925.8,275925.8,1530,0,1,0,144,"MULTIPOLYGON (((10.277777777777796 34.32...

That leads to atlite not being happy, and most likely causes the current CI troubles.
The current merge approach should be supplemented by checking for duplicates.

Interesting and fantastic job! Do the entries have an ID? It would be great to remove duplicated ids rather than performing geospatial checks for overlaps...

Totally agree :D Yeah, the basins have ids, and pandas drop_duplicates seems to do the trick.
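A minimal sketch of that de-duplication (using a plain pandas DataFrame here for illustration; the actual code operates on a GeoDataFrame, and the geometry strings below are truncated placeholders):

```python
import pandas as pd

# Two regional bundles contributing the same basin id, as in the
# level-4 example above.
merged = pd.DataFrame(
    {
        "HYBAS_ID": [1040029810, 1040029810, 1040029811],
        "geometry": ["MULTIPOLYGON (((10.27...", "MULTIPOLYGON (((10.27...", "MULTIPOLYGON (((12.00..."],
    }
)

# Keep the first occurrence of each basin id; no geospatial
# overlap checks needed.
deduped = merged.drop_duplicates(subset="HYBAS_ID")
```

Dropping on the id column is cheap and sidesteps geometry comparisons entirely, which is exactly the point made above.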

@ekatef
Member Author

ekatef commented Nov 22, 2023

Cool :D converging :)

Thanks a lot for the review and absolutely agree ;)

Testing the whole workflow, as currently the PR breaks CI. Once done, I'll introduce the changes you suggest (absolutely agree with them) and shall also try to add a clear error message in case the url is not available.

@ekatef
Member Author

ekatef commented Nov 25, 2023

Cool :D converging :)

Thanks! :)
Have introduced the changes; there are two points I'm not completely sure about and have added comments on them. Happy to discuss them if you feel it'd be easier.

Comment on lines 144 to 157
# global data for hydrobasins
bundle_hydrobasins:
countries: [Earth]
tutorial: false
category: common
destination: "data/hydrobasins"
urls:
hydrobasins:
base_url: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
unzip: true
output:
- data/hydrobasins/*.shp

Member

As comments:

  • the output file could be hybas_world_v1c.shp, which represents the merged file; the value should not be loaded from the config, but defined here
  • we could add a databundle for the tutorial, which is the same as this one but limited to Africa. So the changes would be: tutorial=true, countries=["Africa"], and suffixes=["af"]

Pretty cool! :D
Once this works, we can think of creating the databundles for the tutorials and plan to go for zenodo.
We can try the zenodo sandbox again and, when it works, move to the default.

@davide-f
Member

To avoid the comment being lost, the following comment (also posted above) is repeated here:

I quite believe that in our case we should have a file like data/hydrobasins/hybas_world.shp that is independent of the level.
It is the merge of all hydrobasins shapes regardless of the level.
If the level is modified, the function is triggered again; that should simplify the procedure.
Alternatively, we need to create a specific rule that merges the hydrobasins shapes.

@davide-f davide-f merged commit 69d4064 into pypsa-meets-earth:main Nov 29, 2023
4 checks passed
@davide-f
Member

Merging :D

@ekatef ekatef deleted the load_hydrobasins branch December 26, 2023 08:51