# load_stac gives no data cross backend #786
Some context. The original job by Darius C. was as follows:

```python
# Now we extract the same input cube with openeo gfmap
import openeo
# Note: import added for completeness; BackendContext/Backend come from the
# openeo_gfmap package referenced in the comment above.
from openeo_gfmap import Backend, BackendContext

connection = openeo.connect("openeofed.dataspace.copernicus.eu").authenticate_oidc()
backend_context = BackendContext(Backend.FED)

EXTENT = dict(zip(["west", "south", "east", "north"],
                  [5.318868004541495, 50.628576059801816,
                   5.3334400271343725, 50.637843899562576]))
EXTENT['crs'] = "EPSG:4326"
STARTDATE = '2022-01-01'
ENDDATE = '2022-03-31'

s2_cube = connection.load_collection("SENTINEL2_L2A", spatial_extent=EXTENT,
                                     temporal_extent=[STARTDATE, ENDDATE], bands=["B04"])
meteo_cube = connection.load_collection("AGERA5", spatial_extent=EXTENT,
                                        temporal_extent=[STARTDATE, ENDDATE],
                                        bands=["temperature-mean"])

s2_cube = s2_cube.aggregate_temporal_period(period='month', reducer='median', dimension='t')
meteo_cube = meteo_cube.aggregate_temporal_period(period='month', reducer='mean', dimension='t')

inputs = s2_cube.merge_cubes(meteo_cube)

job = inputs.create_job(
    out_format="NetCDF",
    title="Test extraction job",
    job_options={
        "split_strategy": "crossbackend",
        "driver-memory": "2G",
        "driver-memoryOverhead": "2G",
        "driver-cores": "1",
        "executor-memory": "1800m",
        "executor-memoryOverhead": "1900m",
    },
)
job.start_and_wait()
```

The openeofed aggregator split up this job between Terrascope and CDSE, where it essentially replaced the AGERA5 `load_collection` with a `load_stac` on the Terrascope dependency job's partial results. Unfortunately, that `load_stac` gave no data.
In the end, the Terrascope dependency job did finish; at this moment, however, the main CDSE job had already run without its results.
It should be noted that the dependent job ran on CDSE, but this environment doesn't have the necessary infrastructure to poll its dependency job.
This seems to be the point where it decides that CDSE does not in fact support dependencies: `openeo-geopyspark-driver/openeogeotrellis/backend.py`, lines 1765 to 1809 at `facd818`. This also explains why the CDSE batch job did not fail fast: it skipped putting the poll message on Kafka altogether.
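The linked code isn't reproduced here, but the gist, as described, can be sketched as follows (hypothetical names, not the actual openeo-geopyspark-driver code):

```python
# Hypothetical sketch of the decision path described above. The point: if the
# deployment is not configured/able to track dependencies, the poll message is
# never scheduled, so the main job neither waits for its dependency nor fails
# fast -- it just runs with whatever data is there.
def schedule_dependency_polling(job_id: str, dependencies: list, producer, supports_dependencies: bool):
    if not supports_dependencies:  # e.g. CDSE lacked the polling infrastructure
        return  # silently skipped: no poll message, no fail-fast
    for dep in dependencies:
        # on supported deployments, a tracker picks this message up and polls
        # the dependency until it is finished before starting the main job
        producer.send("job-dependency-poll", {"job_id": job_id, "dependency": dep})
```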
I was afraid there would be something like that somewhere in the code path. I'm really not a fan of such bazooka-style configs.
Reversing the arguments of `merge_cubes`:

```python
inputs = meteo_cube.merge_cubes(s2_cube)
```

In this case, dependencies look like this:

```python
[{'partial_job_results_url': 'https://openeo.dataspace.copernicus.eu/openeo/1.1/jobs/j-2405304298c14e609b85ffc70b5b6382/results/OTc2MGU1OWItNTY2ZC00MmQxLWI2NWItMzRlZDE4NzlkYThh/a21f52059a2d099abf9ff73539d4618b?expires=1717678778&partial=true'}]
```

It will happily accept dependencies that look like:

```python
[{'collection_id': 'SENTINEL1_GRD', 'batch_request_ids': ['9ecd1d54-3b44-492f-9021-6eed00ea8a30'], 'results_location': 's3://openeo-sentinelhub/9ecd1d54-3b44-492f-9021-6eed00ea8a30', 'card4l': False}]
```

or even:

```python
[{'collection_id': 'SENTINEL1_GRD'}]
```

even though the ES mapping type for the dependencies field doesn't obviously allow that. Maybe @JanssenBrm can explain this?
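One way the observed behaviour could arise (a sketch only; the actual EJR mapping is not shown in this issue): with a strict Elasticsearch mapping that only knows the SHub-style fields, a document introducing `partial_job_results_url` would be rejected with a 400 Bad Request (`strict_dynamic_mapping_exception`), while SHub-style documents, even partial ones, pass.

```python
# Illustrative only -- an assumed mapping, not the real EJR one. With
# "dynamic": "strict" on the dependencies object, Elasticsearch rejects
# documents containing unmapped fields with a 400 Bad Request.
dependencies_mapping = {
    "dependencies": {
        "type": "object",
        "dynamic": "strict",
        "properties": {
            "collection_id": {"type": "keyword"},
            "batch_request_ids": {"type": "keyword"},
            "results_location": {"type": "keyword"},
            "card4l": {"type": "boolean"},
            # no "partial_job_results_url" -> indexing such a document fails
        },
    }
}
```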
Possible course of action: poll the dependency from within the main batch job (see below).
A new version has been deployed that should fix this issue from the EJR point of view. Can you verify whether you still see the bad request error?
EJR is fixed and the dependencies are good. This job ran to completion; the only difference is the order of the arguments of `merge_cubes`:

```python
import openeo

connection = openeo.connect("openeofed.dataspace.copernicus.eu").authenticate_oidc()

EXTENT = dict(zip(["west", "south", "east", "north"],
                  [5.318868004541495, 50.628576059801816,
                   5.3334400271343725, 50.637843899562576]))
EXTENT['crs'] = "EPSG:4326"
STARTDATE = '2022-01-01'
ENDDATE = '2022-03-31'

s2_cube = connection.load_collection("SENTINEL2_L2A", spatial_extent=EXTENT,
                                     temporal_extent=[STARTDATE, ENDDATE], bands=["B04"])
meteo_cube = connection.load_collection("AGERA5", spatial_extent=EXTENT,
                                        temporal_extent=[STARTDATE, ENDDATE],
                                        bands=["temperature-mean"])

s2_cube = s2_cube.aggregate_temporal_period(period='month', reducer='median', dimension='t')
meteo_cube = meteo_cube.aggregate_temporal_period(period='month', reducer='mean', dimension='t')

inputs = meteo_cube.merge_cubes(s2_cube)

job = inputs.create_job(
    out_format="NetCDF",
    title="Test extraction job",
    job_options={
        "split_strategy": "crossbackend",
        "driver-memory": "2G",
        "driver-memoryOverhead": "2G",
        "driver-cores": "1",
        "executor-memory": "1800m",
        "executor-memoryOverhead": "1900m",
    },
)
job.start_and_wait()
```
To avoid confusion and wrong results in the short term, main jobs on CDSE will now fail fast with an error.
429 errors are being handled in Open-EO/openeo-geotrellis-extensions#299.
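That PR isn't quoted here; for reference, a generic client-side mitigation for 429 responses (illustrative, not the actual fix) is to retry with backoff, honouring `Retry-After` when present:

```python
import time
import requests

def get_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0) -> requests.Response:
    """GET a URL, retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # honour the server's Retry-After header if present, else back off exponentially
        delay = float(resp.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} attempts: {url}")
```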
Darius is unblocked, so this became less urgent.
As discussed: as a first implementation, try polling from within the batch job. This then becomes a default implementation that requires no extra infrastructure and is effectively platform-agnostic.
Random ramblings to summarize things and refresh my memory, covering both the case of the OG SHub dependencies and the case of unfinished (partial) job results.

The easiest way to do the polling in the main batch job seems to be to just poll the dependency's partial job results URL until it is no longer partial; see the sketch below.
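A minimal sketch of that polling loop, under two assumptions not spelled out in this issue: the dependency is exposed via a `partial_job_results_url` as seen earlier, and the partial-results STAC document carries an `openeo:status` field ("running"/"finished"/...) per the openEO partial results mechanism.

```python
import time
import requests

def wait_for_dependency(partial_job_results_url: str, interval: float = 60.0,
                        timeout: float = 6 * 3600) -> dict:
    """Block until the dependency job's results are no longer partial."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(partial_job_results_url)
        resp.raise_for_status()
        stac = resp.json()
        status = stac.get("openeo:status", "finished")
        if status == "finished":
            return stac  # results complete: safe to load_stac from here
        if status in ("error", "canceled"):
            raise RuntimeError(f"dependency job ended with status {status!r}")
        time.sleep(interval)  # still running: wait and poll again
    raise TimeoutError("dependency job did not finish in time")
```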
Original job ran successfully on openeofed-staging.
The aggregator delegates a `load_collection` to Terrascope and then loads it with `load_stac`, but this gives a "NoDataAvailable" error. Adding a spatial and temporal extent to the `load_stac` node in the process graph JSON makes the job come through (a sketch of such a node follows below), while the equivalent call from the Python client:

```python
meteo_cube = connection.load_stac("https://stac.openeo.vito.be/collections/agera5_daily",
                                  spatial_extent=EXTENT, temporal_extent=[STARTDATE, ENDDATE],
                                  bands=["2m_temperature_mean"])
```

gives a 429 error.
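For reference, a `load_stac` node with explicit extents could look roughly like this in the process graph (a sketch only; the node id is made up, and the values mirror the snippet above):

```json
{
  "loadstac1": {
    "process_id": "load_stac",
    "arguments": {
      "url": "https://stac.openeo.vito.be/collections/agera5_daily",
      "spatial_extent": {
        "west": 5.318868004541495,
        "south": 50.628576059801816,
        "east": 5.3334400271343725,
        "north": 50.637843899562576
      },
      "temporal_extent": ["2022-01-01", "2022-03-31"],
      "bands": ["2m_temperature_mean"]
    }
  }
}
```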