
integrate CLMS HRL VPP #460

Closed
jdries opened this issue Jun 22, 2023 · 34 comments · Fixed by #751, #767 or #769

@jdries
Contributor

jdries commented Jun 22, 2023

https://land.copernicus.eu/pan-european/biophysical-parameters/high-resolution-vegetation-phenology-and-productivity
problem here is that there's one collection, with multiple producttypes, and each type has a single band
this should become one collection with multiple bands

@jdries
Contributor Author

jdries commented Nov 14, 2023

@JohanKJSchreurs This is a good candidate to test our new STAC api.
This collection already exists in opensearch catalog as well, mostly a matter of porting metadata to STAC.

@jdries
Contributor Author

jdries commented Nov 21, 2023

Collection id: copernicus_r_3035_x_m_hrvpp-vpp_p_2017-now_v01_openeo

Example collection metadata that we would also target:
https://collections.eurodatacube.com/stac/vegetation-phenology-and-productivity-parameters-season-1.json

Python code to get the product metadata:
https://github.com/eea/clms-hrvpp-tools-python/blob/main/HRVPP_opensearch_demo/HRVPP%20catalogue%20and%20download%20demo.ipynb

Opensearch collections:
https://phenology.hrvpp2.vgt.vito.be/collections

@jdries
Contributor Author

jdries commented Dec 8, 2023

@JohanKJSchreurs
Contributor

We are implementing this in the stac-catalog-builder project.

The issue linked below is the main one that contains a breakdown of the parts/features we need for this integration
VitoTAP/stac-catalog-builder#16

@JohanKJSchreurs
Contributor

JohanKJSchreurs commented Mar 6, 2024

The implementation in the stac-catalog-builder is complete: VitoTAP/stac-catalog-builder#16

Some small improvements can still be done, to update the collections & items in the STAC API with improved information, but those should be separate GH issues.

At present, three of the VPP collections have been converted and uploaded to the development environment of the terra-stac-api at VITO.
The largest collection contains a very large number of products (6.5 million), so we will need to download it in several pieces because a single run takes too long.

@JohanKJSchreurs
Contributor

Fourth collection has also been uploaded to the STAC API.

Overview:

| collection | download + conversion | upload to dev STAC API | number of products | number of STAC items |
| --- | --- | --- | --- | --- |
| copernicus_r_3035_x_m_hrvpp-st_p_2017-now_v01 | done | done | 388_008 | 194_004 |
| copernicus_r_3035_x_m_hrvpp-vpp_p_2017-now_v01 | done | done | 150_849 | 10_778 |
| copernicus_r_utm-wgs84_10_m_hrvpp-st_p_2017-now_v01 | done | done | 470_066 | 235_058 |
| copernicus_r_utm-wgs84_10_m_hrvpp-vi_p_2017-now_v01 | did not finish (process got killed) | TO DO | 6_564_209 | unknown at present |
| copernicus_r_utm-wgs84_10_m_hrvpp-vpp_p_2017-now_v01 | done | done | 182_784 | 13_056 |

@JohanKJSchreurs
Contributor

How we can solve the long download of the large collection:

We can add options to the command to specify which time slice to download.
Right now it tries to download the entire collection, that is to say the entire period.

We already divide that period into smaller time slots in order to limit the number of products in each query to a reasonable number. So if we add options for a start and end date, we could do a partial download.
That way we can upload and test the collection with a more limited set of STAC items.

Furthermore, with some additional work we could download the whole collection in several parts, and in each run save out the STAC items for just those slices.
We can already upload the STAC items in several parts or sets.
At present, the collection.json file is created and overwritten on disk on every partial download, but the collection file would still contain the same data anyway. (It does not link to its STAC items in this case because there are far too many items for a static STAC collection.) However, with a little extra work we could split up the command so that one command downloads/creates the "empty" collection and another command downloads/creates the STAC items.
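The time-slicing idea above can be sketched with a small helper (a hypothetical function for illustration, not the actual stac-catalog-builder code): split the collection's temporal extent into consecutive chunks, then run one partial download/upload per chunk.

```python
from datetime import date, timedelta

def date_slices(start: date, end: date, days_per_slice: int):
    """Split [start, end) into consecutive slices of at most `days_per_slice` days."""
    slices = []
    current = start
    while current < end:
        slice_end = min(current + timedelta(days=days_per_slice), end)
        slices.append((current, slice_end))
        current = slice_end
    return slices

# Each (start, end) pair could then drive one partial download/upload run,
# passed to the command as --start/--end style options.
slices = date_slices(date(2017, 1, 1), date(2017, 7, 1), 60)
```

The last slice is simply clipped to the end date, so no products are queried twice and none are skipped.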

@VictorVerhaert

Update:
Trying out some of the suggestions Johan gave to build the last collection. So far I always encounter an error stating that a request is too large.

I have been trying to load the uploaded STAC collections in openEO, without luck.
Even with the isoformat fix on openeo dev I still encounter an IllegalArgumentException:
j-240321bb64b14251b296ace46577d0b4. @bossie could you have another look at this?

@VictorVerhaert

It seems that the eo:bands are missing from the STAC items, which openEO needs.

@JeroenVerstraelen
Contributor

JeroenVerstraelen commented Mar 25, 2024

  • STAC API does not work with openeo yet
    • eo:bands are missing

This will take about 2 weeks of debugging.

We already have 4/5 collections, but this does require someone to help test the STAC API and its integration with openEO.

@VictorVerhaert

VictorVerhaert commented Mar 25, 2024

I'll keep editing this comment as bugs are found.

I ran load_stac against several different sources, as I noticed the errors I encounter often differ.
Tests ran on https://openeo.vito.be/:

| STAC API | collection | job id | result | STAC request URL |
| --- | --- | --- | --- | --- |
| https://stac.terrascope.be | terrascope-s2-toc-v2 | j-240325bbab144c388897826ebd12a00f | java.lang.IllegalArgumentException: requirement failed: Server doesn't support ranged byte reads | https://stac.terrascope.be/search?limit=20&bbox=5.0%2C51.2%2C5.01%2C51.21&datetime=2017-06-01T00%3A00%3A00Z%2F2017-07-29T23%3A59%3A59.999000Z&collections=terrascope-s2-toc-v2 |
| https://stac.terrascope.be | terrascope-s2-ndvi-v2 | j-2403251d212041a69f875afcb6476848 | error without error log in the editor | |
| https://stac-openeo-dev.vgt.vito.be | copernicus_r_utm-wgs84_10_m_hrvpp-st_p_2017-now_v01 | j-240325eb608247938a2ef79efbf66f0d | java.lang.IllegalArgumentException: requirement failed | https://stac-openeo-dev.vgt.vito.be/search?limit=20&bbox=5.0%2C51.2%2C5.01%2C51.21&datetime=2017-06-01T00%3A00%3A00Z%2F2017-07-29T23%3A59%3A59.999000Z&collections=copernicus_r_utm-wgs84_10_m_hrvpp-st_p_2017-now_v01 |
| https://stac-openeo-dev.vgt.vito.be | TEST_Landsat_three-annual_NDWI_v1 | j-240325604db14055b31b93ab58894870 | OpenEOApiException(status_code=400, code='NoDataAvailable', message='There is no data available for the given extents.', id='no-request') | https://stac-openeo-dev.vgt.vito.be/search?limit=20&bbox=5.0%2C51.2%2C5.01%2C51.21&datetime=2000-06-01T00%3A00%3A00Z%2F2000-07-29T23%3A59%3A59.999000Z&collections=TEST_Landsat_three-annual_NDWI_v1 |

comments:

  • the https://stac.openeo.vito.be/api.html source contains no collections and cannot be tested
  • (third line) the copernicus hrvpp-st_p collection shows no eo:bands property for its assets. I have yet to discover why these were not included in the upload. Only the name property is used from eo:bands, so a quick fix might be to fall back on another name field on openEO's side
  • the TEST_landsat... collection is the only one containing eo:bands information on the stac-openeo-dev source
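The fallback mentioned in the second bullet could look like this (a hypothetical helper for illustration, not actual openEO backend code): prefer the name declared in eo:bands, and fall back to another field such as the asset title or the asset key when eo:bands is absent.

```python
def band_name(asset_key: str, asset: dict) -> str:
    """Band name from eo:bands if present, else fall back to the asset title or key."""
    eo_bands = asset.get("eo:bands") or []
    if eo_bands and "name" in eo_bands[0]:
        return eo_bands[0]["name"]
    # Fallback for items uploaded without eo:bands metadata.
    return asset.get("title") or asset_key

# With eo:bands present the declared name wins; without it, we fall back.
with_bands = {"eo:bands": [{"name": "PPI"}], "title": "Plant Phenology Index"}
without_bands = {"title": "Plant Phenology Index"}
```

The asset dicts and the "PPI" band are made-up examples; the point is only the lookup order.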

bossie added a commit that referenced this issue Mar 29, 2024
#460

Traceback (most recent call last):
  File "batch_job.py", line 1347, in <module>
    main(sys.argv)
  File "batch_job.py", line 1014, in main
    run_driver()
  File "batch_job.py", line 985, in run_driver
    run_job(
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 56, in memory_logging_wrapper
    return function(*args, **kwargs)
  File "batch_job.py", line 1078, in run_job
    result = ProcessGraphDeserializer.evaluate(process_graph, env=env, do_dry_run=tracer)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 377, in evaluate
    result = convert_node(result_node, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 402, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1572, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1572, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 416, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 402, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1604, in apply_process
    return process_function(args=ProcessArgs(args, process_id=process_id), env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 2216, in load_stac
    return env.backend_implementation.load_stac(url=url, load_params=load_params, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1079, in load_stac
    pyramid_factory = jvm.org.openeo.geotrellis.file.PyramidFactory(
  File "/opt/spark3_4_0/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1587, in __call__
    return_value = get_return_value(
  File "/opt/spark3_4_0/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.openeo.geotrellis.file.PyramidFactory.
: java.lang.IllegalArgumentException: requirement failed
	at scala.Predef$.require(Predef.scala:268)
	at org.openeo.geotrellis.file.PyramidFactory.<init>(PyramidFactory.scala:47)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
@bossie
Collaborator

bossie commented Mar 29, 2024

FYI, this error when querying https://stac.terrascope.be:

java.lang.IllegalArgumentException: requirement failed: Server doesn't support ranged byte reads

is because the underlying assets require authentication: if you click an asset's href in your browser, it redirects you to a login page, and it's that response which lacks an Accept-Ranges: bytes header, hence the error.
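The check that fails here can be illustrated with a small helper (a sketch of the condition, not the backend's actual Scala code): the asset response must advertise byte-range support via the Accept-Ranges header.

```python
def supports_ranged_reads(headers: dict) -> bool:
    """True if the HTTP response headers advertise byte-range support."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("accept-ranges", "").strip().lower() == "bytes"

# A direct asset response advertises ranges; a login-page redirect response does not.
asset_ok = supports_ranged_reads({"Accept-Ranges": "bytes", "Content-Type": "image/tiff"})
login_page = supports_ranged_reads({"Content-Type": "text/html"})
```

This is why authenticating (or using an S3 endpoint, as discussed below) makes the error go away: the real asset response carries the header, while the login page does not.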

@VictorVerhaert

VictorVerhaert commented Mar 29, 2024

Update: S3 seems to work on CDSE
This leaves the following TODOs for the integration:

  • mount buckets on terrascope (optional, should the collections be required there)
  • test full collection on CDSE (WIP)
  • rebuild all the collections with S3 links and upload to https://stac-openeo.vgt.vito.be/api.html#/ (production STAC API, currently empty)
  • adjust stac-builder to build largest collection without OOM (WIP)

@VictorVerhaert

VictorVerhaert commented Apr 4, 2024

The following collections are now available on https://stac.openeo.vito.be/ and should work on CDSE-staging:

  • copernicus_r_3035_x_m_hrvpp-vpp_p_2017-now_v01
  • copernicus_r_utm-wgs84_10_m_hrvpp-vpp_p_2017-now_v01
  • copernicus_r_utm-wgs84_10_m_hrvpp-st_p_2017-now_v01
  • copernicus_r_3035_x_m_hrvpp-st_p_2017-now_v01

I am still trying to build copernicus_r_utm-wgs84_10_m_hrvpp-vi_p_2017-now_v01, but I am making good progress by speeding up the pipeline with thread and process pools, combined with memory cleanup.

@JeroenVerstraelen
Contributor

@bossie Define collections that are STAC-based in the layer catalog (layercatalog), so that load_collection calls load_stac.

@VictorVerhaert

Important note:
The way the STAC API works, only items whose datetime property (often equal to the start_datetime) lies within the temporal_extent of load_stac are loaded.
For yearly assets, this means that the 1st of January must be included in the temporal_extent for the item to be loaded.
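The selection rule described above can be sketched as a plain predicate (an illustration of the filtering semantics, not the actual backend code): an item is picked up only when its single datetime falls inside the requested extent, regardless of the period the asset actually covers.

```python
from datetime import date

def item_selected(item_datetime: date, extent_start: date, extent_end: date) -> bool:
    """An item is loaded only if its datetime lies within the requested temporal extent."""
    return extent_start <= item_datetime < extent_end

# A yearly asset whose item datetime is January 1st:
yearly_item = date(2018, 1, 1)
# An extent starting on (or before) Jan 1 selects the item...
included = item_selected(yearly_item, date(2018, 1, 1), date(2019, 1, 1))
# ...but an extent starting in February misses it, even though the asset covers all of 2018.
excluded = item_selected(yearly_item, date(2018, 2, 1), date(2019, 1, 1))
```

So a request for, say, March through December of 2018 returns nothing for yearly products unless the extent is widened to include January 1st.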

@bossie
Collaborator

bossie commented Apr 11, 2024

Note: the assets are in the HRVPP bucket on S3 endpoint http://data.cloudferro.com (not externally accessible).

@bossie
Collaborator

bossie commented Apr 26, 2024

@VictorVerhaert copernicus_r_utm-wgs84_10_m_hrvpp-vpp_p_2017-now_v01 can be reingested as filtering by property (in this case: "season") should work without adverse side-effects (in this case: empty results).

I did notice that proj:epsg and proj:bbox are both in 4326, whereas the actual assets are in UTM, so this might be something you want to look into, e.g. https://stac.openeo.vito.be/search?limit=20&bbox=5.0%2C51.2%2C5.01%2C51.21&datetime=2017-07-01T00%3A00%3A00Z%2F2018-07-30T23%3A59%3A59.999000Z&collections=copernicus_r_utm-wgs84_10_m_hrvpp-vpp_p_2017-now_v01&fields=%2Bproperties

For some reason loading these collections seems to take a really long time, much longer than I remember. 🤔 Maybe something similar to #250?

@bossie
Collaborator

bossie commented Apr 26, 2024

TODO: incorporate property filters defined in creo_layercatalog.json to support collections per season.

@VictorVerhaert

copernicus_r_utm-wgs84_10_m_hrvpp-vpp_p_2017-now_v01 has been reuploaded with the season property. @bossie

@bossie
Collaborator

bossie commented Apr 29, 2024

Confirmed: works (for "s1" and "s2"):

data_cube = (connection
             .load_collection(collection_id, bands=["SPROD", "TPROD", "QFLAG"], properties={"season": lambda s: s == "s1"})
             .filter_temporal(["2017-07-01", "2018-07-31"])
             .filter_bbox([5.00, 51.20, 5.01, 51.21])
             .save_result("GTiff"))

jdries added a commit to Open-EO/openeo-geotrellis-kubernetes that referenced this issue Apr 30, 2024
@jdries
Contributor Author

jdries commented Apr 30, 2024

committed collection config for seasonal collections

bossie added a commit to Open-EO/openeo-geopyspark-driver-testdata that referenced this issue Apr 30, 2024
bossie added a commit that referenced this issue Apr 30, 2024
bossie added a commit that referenced this issue Apr 30, 2024
@bossie
Collaborator

bossie commented Apr 30, 2024

Still needs work w.r.t. the bands order defined in creo_layercatalog.json.

Adapt related test:

bands=["SPROD", "TPROD"]) # TODO: remove other bands from layercatalog.json, then drop this bands argument

@bossie bossie reopened this Apr 30, 2024
bossie added a commit that referenced this issue May 2, 2024
bossie added a commit to Open-EO/openeo-geopyspark-driver-testdata that referenced this issue May 2, 2024
bossie added a commit that referenced this issue May 2, 2024
@bossie bossie linked a pull request May 2, 2024 that will close this issue