load_stac: support loading unfinished results #489

Closed
jdries opened this issue Aug 10, 2023 · 11 comments · Fixed by #497
jdries commented Aug 10, 2023

Using load_stac, we can normally also load openEO-generated results via a signed URL.
Using the 'partial' query parameter, it seems possible to get a canonical link to the results of a job that is still running:
https://api.openeo.org/#tag/Data-Processing/operation/list-results
(support for 'partial' needs to be added to our backend)

So if our backend receives a process graph with a load_stac, it should only start the actual processing when all dependencies are finished!
This is an important piece of the puzzle for federated processing, because it allows the aggregator to perform job splitting without keeping track of the various jobs itself: it can just schedule all jobs and forward them to the respective backends. Hence it also no longer needs a long-lived token.
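
For illustration, a minimal sketch (hypothetical backend URL, job id and token) of how a client could obtain such a canonical link via the 'partial' query parameter and feed it to load_stac:

```python
import requests

# Hypothetical values, for illustration only.
backend = "https://openeo.example.org/openeo/1.1"
job_id = "j-0123456789abcdef0123456789abcdef"
headers = {"Authorization": "Bearer <token>"}

# With ?partial=true the backend may return results metadata for a job
# that is still running, including a signed "canonical" link.
resp = requests.get(
    f"{backend}/jobs/{job_id}/results",
    params={"partial": "true"},
    headers=headers,
)
resp.raise_for_status()
canonical_url = next(
    link["href"] for link in resp.json()["links"] if link.get("rel") == "canonical"
)

# The signed URL can then be used as the "url" argument of a load_stac
# node in a process graph sent to another (or the same) backend.
print(canonical_url)
```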

bossie commented Aug 10, 2023

As discussed: works similarly to SHub batch process polling.

SHub batch processes: when starting a batch job, detect a load_collection(sh_collection, big_bbox) -> schedule the corresponding batch processes, poll them asynchronously and start the batch job once all dependencies have "status" == "DONE".

load_stac: when starting a batch job, detect a load_stac(canonical_partial_job_results) -> if "openeo:status" == "running", poll asynchronously and start the batch job once all dependencies have "openeo:status" == "finished".
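
A minimal, synchronous sketch of that polling loop (the actual implementation polls asynchronously from the job tracker; names and intervals are made up):

```python
import time
import requests

def await_job_results_dependencies(urls, poll_interval=60, max_polls=1440):
    """Block until every partial job results URL reports
    "openeo:status": "finished" (the field lives in the root of the
    STAC Collection document behind the canonical URL)."""
    pending = set(urls)
    for _ in range(max_polls):
        for url in list(pending):
            collection = requests.get(url).json()
            status = collection.get("openeo:status", "finished")
            if status == "finished":
                pending.discard(url)
            elif status in ("error", "canceled"):
                raise RuntimeError(f"dependency {url} ended with status {status!r}")
        if not pending:
            return  # all dependencies finished: the batch job can start
        time.sleep(poll_interval)
    raise TimeoutError(f"dependencies still running after polling: {pending}")
```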

bossie commented Aug 17, 2023

"openeo:status" is in the root of the STAC Collection object.

Some considerations/decisions that make sense at this time but might need revisiting:

  • since these dependencies are batch jobs and, by definition, can take a bit of time, this flow only makes sense in the context of a batch job (~ SHub batch processes)
  • what about a load_stac(canonical_partial_job_results) in a /result context? Never wait and continue with (partial) results, or abort?
  • re: the initial check: if "openeo:status" is missing or == "finished", just proceed; if "error" or "canceled", continue with partial results or abort?
  • re: the polling check: if "openeo:status" is missing, just proceed; if "error" or "canceled", continue with partial results or abort?

bossie linked a pull request Aug 25, 2023 that will close this issue
bossie added a commit that referenced this issue Aug 25, 2023
```
Traceback (most recent call last):
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/openeo_driver/users/auth.py", line 88, in decorated
    return f(*args, **kwargs)
  File "/home/bossie/PycharmProjects/openeo/venv38/lib/python3.8/site-packages/openeo_driver/views.py", line 865, in queue_job
    backend_implementation.batch_jobs.start_job(job_id=job_id, user=user)
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/backend.py", line 1827, in start_job
    self._start_job(job_id, user, _get_vault_token)
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/backend.py", line 2185, in _start_job
    args.append(serialize_dependencies())
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/backend.py", line 1957, in serialize_dependencies
    dependencies = dependencies or job_info.get('dependencies') or []
UnboundLocalError: local variable 'dependencies' referenced before assignment
```
bossie added a commit that referenced this issue Aug 25, 2023
@bossie
Copy link
Collaborator

bossie commented Aug 25, 2023

Currently implemented as such:

  • load_stac of partial job results that have already finished with "openeo:status": "error" or "openeo:status": "canceled": just proceed with the batch job; this allows the user to load partial results (their responsibility).
  • load_stac of unfinished partial job results that eventually fail with "error" or "canceled": fail the batch job; otherwise the job would proceed with partial results and the user would never know (unless they check the logs).
  • load_stac of partial job results in /result: never wait, proceed with partial results (same reasoning: their responsibility).
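
A hypothetical condensation of those rules (illustrative only, not the actual driver code):

```python
def on_dependency_status(status, context):
    """Sketch only. status: the "openeo:status" value read from the
    dependency's partial job results (None if absent); context: "batch"
    for a batch job, "sync" for a /result request."""
    if context == "sync":
        return "proceed"  # never wait; (partial) results are the user's responsibility
    if status in (None, "finished", "error", "canceled"):
        # already terminal when the job is started: just proceed,
        # possibly with partial results (again the user's responsibility)
        return "proceed"
    # "running": await it; if it later ends in "error"/"canceled",
    # fail the batch job so the failure doesn't go unnoticed
    return "await"
```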

bossie added a commit that referenced this issue Aug 25, 2023
We schedule batch processes ourselves, then pass on their source_location (and card4l flag) to the batch job that is subsequently started upon their completion ("DONE"). The load_collections that get invoked while evaluating the process graph in the batch job only rely on source_location and card4l, not on the arguments actually passed to load_collection.

In the case of unfinished job results dependencies, however, there's no need to serialize them and pass them on to the subsequent batch job, because the load_stacs evaluated in that batch job will simply act upon the very same "running" URL that made it a dependency in the first place (but this time it will actually read the job results, because they will have "finished" in the meantime).
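
A hypothetical condensation of that difference (illustrative only; the real serialize_dependencies in backend.py differs):

```python
def serialize_dependencies_sketch(dependencies):
    # SHub batch process dependencies carry a resolved source_location
    # (and card4l flag) that the batch job needs; unfinished job results
    # dependencies carry nothing extra, as load_stac will simply re-read
    # the original "partial" URL once the dependency has finished.
    return [
        {"source_location": d["source_location"], "card4l": d["card4l"]}
        for d in dependencies
        if "source_location" in d  # skip job results dependencies
    ]
```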
bossie added a commit that referenced this issue Aug 25, 2023
…489

```
>       batch_processes = reduce(partial(dict_merge_recursive, overwrite=True),
                                 (batch_request_details(dependency) for dependency in batch_process_dependencies))
E       TypeError: reduce() of empty sequence with no initial value
```
bossie added a commit that referenced this issue Aug 25, 2023
bossie added a commit that referenced this issue Aug 25, 2023
bossie added a commit that referenced this issue Aug 28, 2023
bossie added a commit that referenced this issue Aug 28, 2023
…finished-results

start job: await unfinished job results dependencies #489
bossie added a commit that referenced this issue Aug 29, 2023
bossie added a commit that referenced this issue Aug 29, 2023
@bossie
Copy link
Collaborator

bossie commented Aug 29, 2023

Did some testing: loading partial job results from openeo-dev takes a very long time (but not always):

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='openeo-dev.vito.be', port=443): Max retries exceeded with url: /openeo/jobs/j-0977e9aed9da431f872b716c20e500cf/results/...?expires=1693901122&partial=true (Caused by ReadTimeoutError("HTTPSConnectionPool(host='openeo-dev.vito.be', port=443): Read timed out. (read timeout=60)"))

@bossie
Copy link
Collaborator

bossie commented Aug 30, 2023

Not sure what the deal is with:

  1. creating job A on openeo-dev but deliberately not starting it (= openeo:status: running);
  2. creating job B on openeo-dev that load_stac's the canonical URL of A's partial results and starting it.

In step 2, the web app driver will attempt to fetch A's openeo:status, but this request never gets a response. Requests from outside of the application (think: curl) also stall. 🤔

bossie reopened this Aug 30, 2023
bossie commented Aug 30, 2023

If I load_stac a cdse-staging job on openeo-dev, things behave better.

  1. create original job A on cdse-staging (back-end needs to support ?partial and have a public URL) but don't start it:
```json
{
  "process_graph": {
    "load1": {
      "arguments": {
        "id": "SENTINEL2_L2A",
        "spatial_extent": {
          "coordinates": [
            [
              [
                14.20922527067026,
                40.855657765536336
              ],
              [
                14.20922527067026,
                40.95056915081699
              ],
              [
                14.316342442933973,
                40.95056915081699
              ],
              [
                14.316342442933973,
                40.855657765536336
              ],
              [
                14.20922527067026,
                40.855657765536336
              ]
            ]
          ],
          "type": "Polygon"
        },
        "temporal_extent": [
          "2022-04-17T00:00:00Z",
          "2022-04-17T00:00:00Z"
        ]
      },
      "process_id": "load_collection"
    },
    "save2": {
      "arguments": {
        "data": {
          "from_node": "load1"
        },
        "format": "GTIFF"
      },
      "process_id": "save_result",
      "result": true
    }
  }
}
```
  2. get the canonical partial URL of the results of A
  3. create and start job B on openeo-dev (needs to support async_task) that load_stac's A:
```json
{
  "process_graph": {
    "load1": {
      "arguments": {
        "url": "https://openeo-staging.dataspace.copernicus.eu/openeo/1.1/jobs/j-735dfda9c31849efba3895cf8b8cf64c/results/...&partial=true"
      },
      "process_id": "load_stac"
    },
    "save2": {
      "arguments": {
        "data": {
          "from_node": "load1"
        },
        "format": "GTiff"
      },
      "process_id": "save_result",
      "result": true
    }
  }
}
```
  4. start job A; B proceeds as soon as A is done:

OpenEO batch job results statuses for batch job j-7c5e66fbba9f402b9ce7c19f3d42c29c: {'https://openeo-staging.dataspace.copernicus.eu/openeo/1.1/jobs/j-735dfda9c31849efba3895cf8b8cf64c/results/...&partial=true': 'running'}
...
OpenEO batch job results statuses for batch job j-7c5e66fbba9f402b9ce7c19f3d42c29c: {'https://openeo-staging.dataspace.copernicus.eu/openeo/1.1/jobs/j-735dfda9c31849efba3895cf8b8cf64c/results/ZGY3ZWE0NWQtZWNjNC00NTNmLThhZjktZGU4Y2ZiMTA1OGIx/...&partial=true': None}
...
Submitting job with command ['/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/deploy/submit_batch_job_spark3.sh', 'openEO batch_test_debug_load_stac_partial_job_results: load_stac_j-7c5e66fbba9f402b9ce7c19f3d42c29c_user 7d523381374b62e6f2aad1f2f9fb6eddf624d382a8f71db6dcf5788e3aae0af3@egi.eu', '/data/projects/OpenEO/j-7c5e66fbba9f402b9ce7c19f3d42c29c_rj0d0rou.in', '/data/projects/OpenEO/j-7c5e66fbba9f402b9ce7c19f3d42c29c', 'out', 'log', 'job_metadata.json', '[email protected]', 'openeo.keytab', 'vdboschj', '1.1.0', '8G', '2G', '3G', '5', '2', '2G', 'default', 'false', '[]', 'custom_processes.py', '100', '7d523381374b62e6f2aad1f2f9fb6eddf624d382a8f71db6dcf5788e3aae0af3@egi.eu', 'j-7c5e66fbba9f402b9ce7c19f3d42c29c', '0.0', '1', 'default', '/data/projects/OpenEO/j-7c5e66fbba9f402b9ce7c19f3d42c29c_rqtx6d0z.properties', '', 'INFO', '']

However, B does not yet run to completion; there are 2 errors in its logs:

OpenEO batch job failed: java.lang.IllegalArgumentException: requirement failed: Server doesn't support ranged byte reads

```
Traceback (most recent call last):
  File "batch_job.py", line 1291, in <module>
    main(sys.argv)
  File "batch_job.py", line 1028, in main
    run_driver()
  File "batch_job.py", line 999, in run_driver
    run_job(
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 52, in memory_logging_wrapper
    return function(*args, **kwargs)
  File "batch_job.py", line 1092, in run_job
    result = ProcessGraphDeserializer.evaluate(process_graph, env=env, do_dry_run=tracer)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 348, in evaluate
    result = convert_node(result_node, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1480, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 380, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 368, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1512, in apply_process
    return process_function(args=ProcessArgs(args, process_id=process_id), env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 2091, in load_stac
    return env.backend_implementation.load_stac(url=url, load_params=load_params, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1078, in load_stac
    pyramid = pyramid_factory.datacube_seq(projected_polygons, from_date, to_date, metadata_properties,
  File "/opt/spark3_4_0/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/opt/spark3_4_0/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o1635.datacube_seq.
: java.io.IOException: Exception while determining data type of collection https://openeo-staging.dataspace.copernicus.eu/openeo/1.1/jobs/j-735dfda9c31849efba3895cf8b8cf64c/results/...&partial=true and item https://openeo-staging.dataspace.copernicus.eu/openeo/1.1/jobs/j-735dfda9c31849efba3895cf8b8cf64c/results/assets/.../openEO_2022-04-17Z.tif?expires=1694006445. Detailed message: requirement failed: Server doesn't support ranged byte reads
	at org.openeo.geotrellis.layers.FileLayerProvider.determineCelltype(FileLayerProvider.scala:662)
	at org.openeo.geotrellis.layers.FileLayerProvider.readKeysToRasterSources(FileLayerProvider.scala:690)
	at org.openeo.geotrellis.layers.FileLayerProvider.readMultibandTileLayer(FileLayerProvider.scala:862)
	at org.openeo.geotrellis.file.PyramidFactory.datacube(PyramidFactory.scala:111)
	at org.openeo.geotrellis.file.PyramidFactory.datacube_seq(PyramidFactory.scala:84)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: requirement failed: Server doesn't support ranged byte reads
	at scala.Predef$.require(Predef.scala:281)
	at org.openeo.geotrellis.CustomizableHttpRangeReader.totalLength$lzycompute(CustomizableHttpRangeReader.scala:31)
	at org.openeo.geotrellis.CustomizableHttpRangeReader.totalLength(CustomizableHttpRangeReader.scala:10)
	at geotrellis.util.StreamingByteReader.ensureChunk(StreamingByteReader.scala:109)
	at geotrellis.util.StreamingByteReader.get(StreamingByteReader.scala:130)
	at geotrellis.raster.io.geotiff.reader.GeoTiffInfo$.read(GeoTiffInfo.scala:127)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:211)
	at geotrellis.raster.geotiff.GeoTiffReprojectRasterSource.$anonfun$tiff$1(GeoTiffReprojectRasterSource.scala:46)
	at scala.Option.getOrElse(Option.scala:189)
	at geotrellis.raster.geotiff.GeoTiffReprojectRasterSource.tiff$lzycompute(GeoTiffReprojectRasterSource.scala:43)
	at geotrellis.raster.geotiff.GeoTiffReprojectRasterSource.tiff(GeoTiffReprojectRasterSource.scala:40)
	at geotrellis.raster.geotiff.GeoTiffReprojectRasterSource.$anonfun$cellType$1(GeoTiffReprojectRasterSource.scala:50)
	at scala.Option.getOrElse(Option.scala:189)
	at geotrellis.raster.geotiff.GeoTiffReprojectRasterSource.cellType(GeoTiffReprojectRasterSource.scala:50)
	at org.openeo.geotrellis.layers.BandCompositeRasterSource.$anonfun$cellType$1(FileLayerProvider.scala:79)
	at cats.data.NonEmptyList.map(NonEmptyList.scala:87)
	at org.openeo.geotrellis.layers.BandCompositeRasterSource.cellType(FileLayerProvider.scala:79)
	at org.openeo.geotrellis.layers.FileLayerProvider.determineCelltype(FileLayerProvider.scala:656)
	... 16 more
```
Failed status sync for job_id='j-7c5e66fbba9f402b9ce7c19f3d42c29c': unexpected KeyError: 'batch_request_id'

```
Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_tracker_v2.py", line 383, in update_statuses
    self._sync_job_status(
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_tracker_v2.py", line 460, in _sync_job_status
    dependency_sources = list(set(get_dependency_sources(job_info)))
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_registry.py", line 433, in get_dependency_sources
    return [source for dependency in (job_info.get("dependencies") or []) for source in sources(dependency)]
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_registry.py", line 433, in <listcomp>
    return [source for dependency in (job_info.get("dependencies") or []) for source in sources(dependency)]
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_registry.py", line 427, in sources
    subfolder = dependency.get("subfolder") or dependency["batch_request_id"]
KeyError: 'batch_request_id'
```
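
For context: the GeoTrellis streaming GeoTIFF reader needs HTTP range requests on the asset URLs. A quick way to check whether a server honors them (asset URL is illustrative):

```python
import requests

# Illustrative asset URL; use a real signed asset href in practice.
url = "https://example.org/openeo/jobs/j-.../results/assets/openEO_2022-04-17Z.tif"

resp = requests.get(url, headers={"Range": "bytes=0-3"}, stream=True)
# 206 Partial Content means ranged byte reads are supported;
# 200 means the server ignored the Range header.
print(resp.status_code, resp.headers.get("Accept-Ranges"))
print(resp.raw.read(4))  # a GeoTIFF starts with b'II*\x00' or b'MM\x00*'
```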

bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Aug 31, 2023
Open-EO/openeo-geopyspark-driver#489

bossie added a commit that referenced this issue Aug 31, 2023
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Sep 1, 2023
bossie commented Sep 1, 2023

A load_stac on Terrascope from an unfinished job on CDSE works after the latest fixes. 🥳

bossie commented Sep 1, 2023

With both jobs on Terrascope, it still hangs.

TimingLogger when starting the load_stac batch job (backend.py):

load_stac(https://openeo-dev.vito.be/openeo/1.1/jobs/j-dcc5727733a34956a34ca97a16ef57ef/results/...&partial=true): extract "openeo:status": start 2023-09-01 06:38:21.490097

TimingLogger in the subsequent get partial job results call (views.py):

backend_implementation.batch_jobs.get_job_info(job_id='j-dcc5727733a34956a34ca97a16ef57ef', user_id='7d523381374b62e6f2aad1f2f9fb6eddf624d382a8f71db6dcf5788e3aae0af3@egi.eu'): start 2023-09-01 06:39:14.346796

In both cases, there's no corresponding "end/elapsed" log.

Added some more logging to the implementation of get_job_info.
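
For reference, TimingLogger (from the openeo client library) logs a "<title>: start <timestamp>" line on entry and a matching "end/elapsed" line on exit, so a lone "start" line means the wrapped call never returned. A minimal usage sketch:

```python
import logging
from openeo.util import TimingLogger

logger = logging.getLogger(__name__)

# Logs "get_job_info: start <timestamp>" on entry and an
# "end <timestamp>, elapsed <duration>" line on exit;
# if the body hangs, only the "start" line ever appears.
with TimingLogger(title="get_job_info", logger=logger):
    pass  # the suspect call would go here
```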

bossie commented Sep 1, 2023

I think the mutex introduced in 656b7ce leads to a deadlock.

GeoPySparkBackendImplementation is a singleton, therefore GpsBatchJobs is a singleton, therefore DoubleJobRegistry is a singleton, and the mutex prevents it from being used as a context manager in multiple threads at the same time.

The mutex is acquired for the whole of start_job, which includes an HTTP call to get the batch job details; the handler of that request will also try to acquire the same mutex.
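
A minimal repro of that pattern (made-up names; running it hangs forever):

```python
import threading

lock = threading.RLock()  # reentrant, but only within a single thread

def get_job_info_endpoint():
    with lock:  # blocks forever: start_job's thread still holds the lock
        return {}

def start_job():
    with lock:
        # Stands in for the HTTP call to fetch job details: the request
        # is served by a *different* thread of the same process, which
        # needs the same lock -> cross-thread deadlock, RLock or not.
        handler = threading.Thread(target=get_job_info_endpoint)
        handler.start()
        handler.join()

start_job()  # never returns
```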

bossie commented Sep 4, 2023

Confirmed and subsequently fixed by splitting up the DoubleJobRegistry context manager blocks in _start_job().
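
Roughly, the fix changes the locking from one long-held block to two short ones (hypothetical structure, not the actual _start_job code):

```python
import threading
from contextlib import contextmanager

_lock = threading.RLock()

@contextmanager
def job_registry():
    with _lock:  # stand-in for the DoubleJobRegistry context manager
        yield {}

def _start_job(job_id):
    # Before: a single `with job_registry()` wrapped this whole body,
    # so the lock was held across the blocking HTTP calls below.
    with job_registry() as registry:   # acquire briefly to read job info
        job_info = dict(registry, job_id=job_id)

    poll_dependencies(job_info)        # HTTP calls happen lock-free

    with job_registry() as registry:   # re-acquire briefly to persist state
        registry.update(job_info)

def poll_dependencies(job_info):
    pass  # placeholder for the partial job results status requests
```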

Why was this RLock introduced @soxofaan ?

bossie added a commit that referenced this issue Sep 4, 2023
bossie commented Sep 5, 2023

Both the original and the load_stac job on Terrascope work as well.
