Add load of hydrobasins data #919
Conversation
Currently, the loaded data sources are hardcoded. A possible improvement we discussed implies taking the config parameters into account. However, the hydrobasin files used are generally not very big.
As a long-term solution, a new basins dataset by the Potsdam climate institute may be of interest.
Thanks Katia :D
I've added some thoughts on the hydrobasin, let me know if they are clear and what is your opinion on that :)
scripts/_helpers.py
Outdated
@@ -418,6 +420,11 @@ def dlProgress(count, blockSize, totalSize, roundto=roundto):
    if data is not None:
        data = urllib.parse.urlencode(data).encode()

    if headers:
        opener = urllib.request.build_opener()
        opener.addheaders = [("User-agent", "Mozilla/5.0")]
Maybe the content of this could be `headers` itself, so we make the function quite general.
What do you think?
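The suggested generalization, passing arbitrary headers through instead of hardcoding a single User-agent pair, can be sketched with the standard library. This is a minimal illustration, not the project's actual helper; `build_request` is a hypothetical name:

```python
import urllib.request


def build_request(url, headers=None, data=None):
    # Hypothetical helper: forward caller-supplied headers to urllib
    # instead of hardcoding [("User-agent", "Mozilla/5.0")].
    return urllib.request.Request(url, data=data, headers=headers or {})


req = build_request(
    "https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev06_v1c.zip",
    headers={"User-agent": "Mozilla/5.0"},
)
print(req.get_header("User-agent"))  # Mozilla/5.0
```

With this shape, a caller that needs no headers simply omits the argument, and the function stays general.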
Maybe the content of this may be headers itself, so we make the function quite general. What do you think?
Agree and fixed.
scripts/retrieve_databundle_light.py
Outdated
progress_retrieve(url, file_path, disable_progress=disable_progress)
progress_retrieve(
    url, file_path, headers=True, disable_progress=disable_progress
)

# if the file is a zipfile and unzip is enabled
# then unzip it and remove the original file
if config.get("unzip", False):
if config.get("unzip", False) or bool(re.search(".zip$", file_path)):
Unfortunately, I believe that for hydrobasins we need a specific, different function, because:
- the hydrobasins bundles to download depend on the region and level options, which is not easy to handle with bundles without overcomplicating
- the output shape files may need to be merged: for a run on countries across continents, the shape files need to be merged
We could focus on creating such function(s) and then integrate them into the workflow.
What do you think?
Here, the `or bool(re.search(".zip$", file_path))`
is unnecessary: if you specify `unzip: false` in the bundle that you added, this line is not needed.
Agree that a dedicated hydro-basins function would be helpful. Added one.
The condition with `unzip` is fixed
:)
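A side note on the pattern discussed above: in `re.search(".zip$", file_path)` the dot is an unescaped metacharacter, so it matches any character before `zip` rather than a literal dot. Escaping it, or using `str.endswith`, is stricter. A small illustration:

```python
import re

# Unescaped dot: "." matches the "g" in "gzip", a likely false positive.
print(bool(re.search(".zip$", "archive.gzip")))    # True
# Escaped dot: only a literal ".zip" suffix matches.
print(bool(re.search(r"\.zip$", "archive.gzip")))  # False
# Simplest alternative, no regex needed:
print("hybas_af_lev06_v1c.zip".endswith(".zip"))   # True
```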
configs/bundle_config.yaml
Outdated
category: common
destination: "data/hydrobasins"
urls:
  direct: https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev06_v1c.zip
I comment here, and below I add some examples.
We could define a new key like "urls: hydrobasins:" to distinguish the download procedure, which is different from the others.
The new "hydrobasins" key requires the definition of a new function like download_and_unzip_hydrobasins.
Then, the url of the hydrobasin may be: "https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev{:02d}_v1c.zip"
Inside the new download_and_unzip_hydrobasins, we need to do url_hbasin = urls["hydrobasins"].format({level from config file}).
With this format, if we add one bundle_hydrobasin for every region on the hydrobasin website, this workflow will automatically download all of them. Note that the output file of each databundle shall account for the region (e.g. AF) and shall be filled with the level value similarly to above.
Then, we need to merge them into a unique shape file. To do so, we can create a dedicated rule: the function takes as input the list of hydro shape files, calculated with a dedicated function, and outputs the merged file.
Alternatively, the merging process can be included into the retrieve databundle rule, if that's easier to start with for you.
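The URL templating idea described above can be sketched as follows. The URL pattern follows the HydroSHEDS naming from the discussion; the helper name and the region/level parameters are illustrative:

```python
# Templated HydroBASINS URL: region code plus zero-padded level.
BASE = (
    "https://data.hydrosheds.org/file/HydroBASINS/standard/"
    "hybas_{}_lev{:02d}_v1c.zip"
)


def hydrobasins_url(region, level):
    # e.g. region="af", level=6 -> ...hybas_af_lev06_v1c.zip
    return BASE.format(region, level)


print(hydrobasins_url("af", 6))
# https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev06_v1c.zip
```

The `{:02d}` placeholder zero-pads the level, so level 6 becomes `lev06` as in the bundle file names.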
Absolutely agree that a dedicated download function would be helpful, and great point about the merging procedure!
Regarding a special treatment for different regions, not sure I get your point right. Currently, we have a single file of global coverage, and it seems to work pretty well. Why wouldn't we reproduce the same approach, regardless of the requested region? The global dataset is about 5MB, which is ~200 times smaller compared with our environment 🙃
Do you mean that it doesn't look polite if we'd be re-creating a chunk of the data set locally by default? 🙂 In this case I do agree that it makes absolute sense to go for the approach you suggest.
The size of the dataset depends on the level. I've done a scan, and the global v12 level may be about 500MB zipped, but it is unlikely to be commonly used.
We can go for it :) I like the proposal!
However, unfortunately there is no global bundle in the hydrobasins; we need to manually download all regions and merge them regardless.
Thanks :)
Mmm... Probably, it could make sense to add a warning if the data to be loaded exceeds some 100MB.
No problem with creating a merged dataset :)
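The size warning mentioned above could be decided before downloading, e.g. from a Content-Length header obtained with a HEAD request. A minimal sketch of just the threshold logic; the helper name and the 100MB default are assumptions taken from the comment:

```python
def is_large_download(content_length, threshold_mb=100):
    # Hypothetical check: True means the caller should log a warning
    # before fetching. content_length is the reported size in bytes.
    size_mb = int(content_length) / 2**20
    return size_mb > threshold_mb


print(is_large_download(5 * 2**20))    # False: ~5MB, like the global level-6 file
print(is_large_download(500 * 2**20))  # True: ~500MB, like the zipped v12 scan
```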
Thanks a lot, Davide! It has been tremendously helpful 😄 Fixed the technical points and happy to create a dedicated rule. It will also be useful for pre-processing of the new PIK dataset 🙂
Cool @ekatef :D
With little changes, the new function allows downloading each hydrobasin databundle.
As a next step, we need to make sure all databundles for Earth are downloaded and merged together :) cool! :D
This is a great step towards making PyPSA-Earth better structured :D
scripts/retrieve_databundle_light.py
Outdated
logger.info(f"Downloading resource '{resource}' from cloud '{url}'.")
progress_retrieve(url, file_path, disable_progress=disable_progress)
progress_retrieve(
    url, file_path, headers=True, disable_progress=disable_progress
)
Maybe we don't need this change?
Thanks for the catch! Fixed.
scripts/retrieve_databundle_light.py
Outdated
download_and_unzip_basins(config, rootpath, dest_path, hot_run=True,
                          disable_progress=False)

Function to download and unzip the data for hydrobasins.
May be nice to add the path to the data and the licensing
Well, it has been definitely a good idea to look into the license conditions:
The following copyright statement must be displayed with, attached to or embodied in (in a reasonably prominent manner) the documentation or metadata of any Licensee Product or Program provided to an End User when utilizing the Licensed Materials: ...
Have added the required statement to the docstring. Do you agree?
scripts/retrieve_databundle_light.py
Outdated
basins_fl = snakemake.config["renewable"]["hydro"]["resource"]["hydrobasins"]
level_pattern = r".*?lev(.*)_.*"
level_code = re.findall(level_pattern, basins_fl)
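For reference, the pattern above pulls the level code out of the configured path; on a path following the HydroBASINS naming it behaves like this (the example path is illustrative):

```python
import re

# Example path in the HydroBASINS naming scheme from the discussion.
basins_fl = "data/hydrobasins/hybas_world_lev06_v1c.shp"
level_pattern = r".*?lev(.*)_.*"

# The lazy ".*?" stops at the first "lev"; the group captures up to
# the last underscore, which leaves just the two-digit level code.
level_code = re.findall(level_pattern, basins_fl)
print(level_code)  # ['06']
```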
Cool that you got how to identify the level from the path!
The level used later in level_code to download should, however, be loaded from the config.yaml file.
Have been trying to avoid changing the config :D
But you are absolutely right from the long-term perspective. Adjusted.
scripts/retrieve_databundle_light.py
Outdated
# if the file is a zipfile and unzip is enabled
# then unzip it and remove the original file
if config.get("unzip", False):
It is always a zip file in the hydrobasins, right?
If so, maybe this if is not needed
Fixed :)
scripts/retrieve_databundle_light.py
Outdated
url, file_path, data=postdata, disable_progress=disable_progress
url,
file_path,
data=postdata,
header=header,
disable_progress=disable_progress,
Maybe this change is not needed, where is header defined?
A header is added in the function loading hydrobasins. Removed it from here. Thanks for the catch!
Thanks for checking it @davide-f! Have implemented the fixes.
CI currently fails, which may be due to some mismatch in the configuration parameters. Have to check it. Apart from that, it may be a good idea to add a progress bar to track the progress of the hydrobasins merge.
Interesting and fantastic job!
Cool :D converging :)
configs/bundle_config.yaml
Outdated
hydrobasins: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
Indeed :)
The purpose of the urls flag is to give one entry for alternative options to download the same bundle.
That's why I'd be prone to keep one entry (hydrobasins) under urls and potentially add multiple entries underneath, such as:
hydrobasins: https://data.hydrosheds.org/file/HydroBASINS/standard/
suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
bundle_hydrobasins:
  countries: [Earth]
  tutorial: false
  category: common
  destination: "data/hydrobasins"
  urls:
    hydrobasins:
      base_url: https://data.hydrosheds.org/file/HydroBASINS/standard/
      suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
scripts/retrieve_databundle_light.py
Outdated
if hot_run:
    if os.path.exists(file_path):
        os.remove(file_path)

try:
    logger.info(
        f"Downloading resource '{resource}' for hydrobasins in '{rg}' from cloud '{url}'."
    )
    progress_retrieve(
        url,
        file_path,
        headers=[("User-agent", "Mozilla/5.0")],
        disable_progress=disable_progress,
    )

    with ZipFile(file_path, "r") as zipfile:
        zipfile.extractall(config["destination"])

    os.remove(file_path)
    logger.info(f"Downloaded resource '{resource}' from cloud '{url}'.")
except:
    logger.warning(
        f"Failed download resource '{resource}' from cloud '{url}'."
    )
    return False
My suggestion was to call
download_and_unzip_direct(..., ..., hot_run=hot_run, disable_progress=disable_progress)
within this for loop, rather than copying the code again, but we can go for the duplication if it is easier.
scripts/retrieve_databundle_light.py
Outdated
os.path.join(
    basins_path, "hybas_world_lev{:02d}_v1c.shp".format(int(hydrobasins_level))
)
This path should be taken from the config
My suggestion was to propose calling
download_and_unzip_direct(..., ..., hot_run=hot_run, disable_progress=disable_progress)
within this for loop, rather than copying the code again, but we can go for the duplication if it is easier
Thanks for the explanation! Have investigated a bit, and it looks like there may be a way to avoid duplication, but for that we'd need to encapsulate a part of download_and_unzip_direct. The point is that currently all the downloading functions use the same argument list config, rootpath, hot_run=True, disable_progress=False, which makes it possible to use a wildcard when calling the download function:
pypsa-earth/scripts/retrieve_databundle_light.py
Lines 676 to 680 in 5f77c98
download_and_unzip = globals()[f"download_and_unzip_{host}"]
if download_and_unzip(
    config_bundles[b_name], rootpath, disable_progress=disable_progress
):
    downloaded_bundle = True
config is common for all download_and_unzip_* functions, while processing of the config parameters differs between functions. So, I have added a download_and_unpack function to replace the duplicated functionality in both download_and_unzip_direct and download_and_unzip_hydrobasins. Would you agree with this approach?
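The name-based dispatch referenced above, resolving `download_and_unzip_<host>` via `globals()`, can be illustrated in isolation. The function bodies here are stand-ins, not the real implementations:

```python
def download_and_unzip_direct(config, rootpath, disable_progress=False):
    # Stand-in: the real function fetches config["urls"]["direct"].
    return "direct"


def download_and_unzip_hydrobasins(config, rootpath, disable_progress=False):
    # Stand-in: the real function iterates over region suffixes.
    return "hydrobasins"


def dispatch(host, config, rootpath):
    # The shared signature across download_and_unzip_* functions is what
    # lets the caller pick a downloader purely by the host name.
    download_and_unzip = globals()[f"download_and_unzip_{host}"]
    return download_and_unzip(config, rootpath)


print(dispatch("hydrobasins", {}, "."))  # hydrobasins
```

The trade-off is the one discussed in the thread: the dispatch only works while every `download_and_unzip_*` function keeps the same argument list, so host-specific configuration has to be handled inside each function.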
This path should be taken from the config
An understanding question: do you mean using snakemake.config["renewable"]["hydro"]["resource"]["hydrobasins"]?
Agree that it's the major parameter transferred to atlite. However, there is also the hydrobasins_level parameter, which basically duplicates hydrobasins. So, it's perfectly possible to create a data mismatch in the config.yaml: e.g. set hydrobasins_level: 4 along with hydrobasins: data/hydrobasins/hybas_world_lev06_v1c.shp. If we take the file name from the config, this error will remain unnoticed, while if the actual hydrobasins_level is used to generate the merged file, we keep the prepared data consistent and an error will be thrown while building renewable profiles. To me, this design looks more error-safe.
What is your feeling about that? (Sorry for bringing in complexity which is probably not needed here 🙃)
I quite believe that in our case we should have a file like data/hydrobasins/hybas_world.shp that is independent of the level.
It is the merge of all hydrobasin shapes regardless of the level.
If the level is modified, the function is triggered again; that should simplify the procedure.
Alternatively, we need to create a specific rule that merges the hydrobasins shapes.
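Assuming the per-region layout discussed in this thread, the input list for the merge and the level-independent output name could be derived as below. Names and paths are illustrative; the actual merge of the shapefiles would be done with geopandas (read each regional file, concatenate, write the world file):

```python
# Region suffixes from the HydroBASINS standard bundles.
SUFFIXES = ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]


def regional_shapefiles(level, basins_path="data/hydrobasins"):
    # One shapefile per HydroBASINS region at the requested level,
    # following the hybas_<region>_lev<NN>_v1c naming.
    return [
        f"{basins_path}/hybas_{rg}_lev{level:02d}_v1c.shp" for rg in SUFFIXES
    ]


# Level-independent merged output, as suggested in the review:
MERGED = "data/hydrobasins/hybas_world.shp"

files = regional_shapefiles(6)
print(len(files))  # 9
print(files[0])    # data/hydrobasins/hybas_af_lev06_v1c.shp
```

Keeping the merged name level-free means a level change simply re-triggers the merge rule without touching downstream config paths.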
Totally agree :D Yeah, the basins have ids, and pandas
Thanks a lot for the review and absolutely agree ;) Testing the whole workflow, as currently the PR breaks CI. Once done, I'll introduce the changes you suggest (absolutely agree with them) and shall also try to add a clear error message in case the url is not available.
Co-authored-by: Davide Fioriti <[email protected]>
Thanks! :)
configs/bundle_config.yaml
Outdated
# global data for hydrobasins
bundle_hydrobasins:
  countries: [Earth]
  tutorial: false
  category: common
  destination: "data/hydrobasins"
  urls:
    hydrobasins:
      base_url: https://data.hydrosheds.org/file/HydroBASINS/standard/
      suffixes: ["af", "ar", "as", "au", "eu", "gr", "na", "sa", "si"]
  unzip: true
  output:
    - data/hydrobasins/*.shp
As comments:
- the output file could be hybas_world_v1c.shp, which represents the merged file; the value should not be loaded from the config, but defined here
- we could add a databundle for the tutorial, which is the same as this one but limited to Africa. So the changes would be: tutorial=true, countries=["Africa"], and suffixes=["af"]
Pretty cool! :D
Once this works, we can think of creating the databundles for the tutorials and plan to go for zenodo.
We can try the zenodo sandbox again, and when it works we can move to the default.
To avoid the comment being lost, the following comment is also added above: I quite believe that in our case we should have a file like data/hydrobasins/hybas_world.shp that is independent of the level.
Merging :D
Relates to #914
Changes proposed in this Pull Request
Add functionality to load hydrobasins data directly from the data source.
Checklist
- envs/environment.yaml and doc/requirements.txt.
- config.default.yaml and config.tutorial.yaml.
- test/ (note tests are changing the config.tutorial.yaml)
- doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
- doc/release_notes.rst is amended in the format of previous release notes, including reference to the requested PR.