UDP: define minimally required additional job options #545

jdries · 2024-09-18T14:03:46Z

Some UDPs depend on very specific job options to run successfully. For instance:

"driver-memory": "4g",
"executor-memory": "1500m",
"python-memory": "5g",
"udf-dependency-archives": ["https://artifactory.vgt.vito.be/artifactory/auxdata-public/openeo/onnx_dependencies_1.16.3.zip#onnx_deps"]

Here, udf-dependency-archives is really mandatory.
As for the other options, these can usually be considered as lower bounds. (I don't immediately know of a case where an upper bound would be relevant.)
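To make the "lower bound" reading concrete, here is a minimal sketch of how a client or back end might combine user-supplied job options with the minimal ones declared by a UDP. The function names and the merge rules (memory values raised to the minimum, list entries appended when missing) are assumptions for illustration, not existing openEO behaviour.

```python
import re

# Multipliers for the memory suffixes used in the example above ("4g", "1500m").
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memory(value: str) -> int:
    """Convert a string such as "1500m" or "4g" to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kmg])", value.strip().lower())
    if not match:
        raise ValueError(f"Unsupported memory value: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def apply_minimal_job_options(user_options: dict, minimal_options: dict) -> dict:
    """Return a copy of user_options that satisfies minimal_options (hypothetical rules)."""
    merged = dict(user_options)
    for key, minimum in minimal_options.items():
        if key not in merged:
            # Option not set by the user: take the minimal value as-is.
            merged[key] = minimum
        elif key.endswith("-memory"):
            # Memory options are lower bounds: keep the larger of the two values.
            if parse_memory(merged[key]) < parse_memory(minimum):
                merged[key] = minimum
        elif isinstance(minimum, list):
            # Mandatory list entries (e.g. dependency archives) are appended if missing.
            merged[key] = list(merged[key]) + [m for m in minimum if m not in merged[key]]
    return merged


minimal = {
    "driver-memory": "4g",
    "executor-memory": "1500m",
    "python-memory": "5g",
    "udf-dependency-archives": [
        "https://artifactory.vgt.vito.be/artifactory/auxdata-public/openeo/onnx_dependencies_1.16.3.zip#onnx_deps"
    ],
}
user = {"driver-memory": "8g", "executor-memory": "1g"}
merged = apply_minimal_job_options(user, minimal)
```

With these inputs, the user's larger driver-memory is kept, executor-memory is raised to the minimum, and the missing options are filled in from the minimal set.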

Here's an example, where I called it 'minimal_job_options':
https://github.com/ESA-APEx/apex_algorithms/blob/dd81a53463e5c913e09329dc02832e8db5a6350e/openeo_udp/worldcereal_inference.json#L1416


m-mohr · 2024-09-18T15:46:31Z

Quick note: Should probably be aligned with #471

soxofaan · 2024-09-19T07:33:39Z

about udf-dependency-archives: I think it even makes sense to attach this kind of info directly to the UDF, instead of indirectly associating it through the UDP that contains the UDF.

This is kind of related to the idea being brainstormed at Open-EO/openeo-processes#374 to upgrade the current "just a string" udf argument of run_udf to a richer construct (e.g. an array or object).

Another path could be to use the context argument of run_udf, currently documented as:

> Additional data such as configuration options to be passed to the UDF.

The current interpretation is that the context is passed directly as an argument to the entrypoint function of the UDF, but in a more relaxed interpretation it could also serve to define extra runtime settings like udf-dependency-archives.

The nice thing about using run_udf's context is that no change is required at the openEO API/processes specification level.
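As a rough illustration of the context idea, the sketch below builds a run_udf process graph node whose context carries the dependency archives. The "runtime_settings" key and the helper name are hypothetical; nothing in the current specification tells a back end to interpret the context this way.

```python
def run_udf_node(udf_code: str, data_from_node: str, runtime_settings: dict) -> dict:
    """Build a run_udf node that smuggles runtime settings through `context`."""
    return {
        "process_id": "run_udf",
        "arguments": {
            "data": {"from_node": data_from_node},
            "udf": udf_code,
            "runtime": "Python",
            # Hypothetical convention: a reserved key inside the context that the
            # back end would read instead of (or before) passing it to the UDF.
            "context": {"runtime_settings": runtime_settings},
        },
    }


node = run_udf_node(
    udf_code="def apply_datacube(cube, context):\n    return cube\n",
    data_from_node="load1",
    runtime_settings={
        "udf-dependency-archives": [
            "https://artifactory.vgt.vito.be/artifactory/auxdata-public/openeo/onnx_dependencies_1.16.3.zip#onnx_deps"
        ]
    },
)
```

One open point with this approach is that the UDF itself would also receive these settings in its context unless the back end strips them first.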

jdries · 2024-09-19T11:02:50Z

@soxofaan indeed, but we still need to configure the other job options as well.

m-mohr · 2024-09-20T00:38:26Z

+1 on adding the UDF options to the run_udf call.

Adding job options to processes seems to mix concerns. The process could in principle also be executed in other modes, so what happens then? What if I load the process into the Web Editor, change the extent to something reasonably small or utterly large, and execute it then? So my thinking is that adding such options to a process is not in line with the initial vision for openEO, especially when it's for CPU/memory consumption, which was always meant to be abstracted away. I think if job options are important and you want a job to be executed as a job, you need to actually share the job metadata, not just the process, so pretty much a partial body for the POST /jobs request with process and the other additional properties.
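A partial POST /jobs body along those lines could be sketched as follows. The helper name is made up, and placing the options as top-level additional properties (rather than nested under some key) is an assumption; back ends may expect a different layout.

```python
import json


def job_request_body(process: dict, job_options: dict, title=None) -> dict:
    """Assemble a shareable (partial) body for POST /jobs."""
    if "process" in job_options:
        raise ValueError("job options must not override the 'process' property")
    body = {"process": process}
    if title is not None:
        body["title"] = title
    # Assumption: job options travel as additional top-level properties.
    body.update(job_options)
    return body


# Trivial placeholder process; a real UDP would go here.
process = {
    "process_graph": {
        "add": {"process_id": "add", "arguments": {"x": 1, "y": 2}, "result": True}
    }
}
body = job_request_body(
    process,
    {"driver-memory": "4g", "executor-memory": "1500m"},
    title="Shared job metadata",
)
payload = json.dumps(body, indent=2)
```

Sharing this document instead of the bare process keeps the resource settings out of the process itself, at the cost of being tied to batch-job execution.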

Thinking about it a bit more now, could job options also be provided as process? For example:

configure_runtime(options) -> bool
configure_runtime({"driver-memory": "4g", "executor-memory": "1500m", ...})

Just a spontaneous idea. Still somewhat mixing concerns, but doesn't need a spec change and is more visible to users. Thoughts?
(If that's fully embraced, we would not even need #471 as options could be described in the process parameter schema. But there's no distinction between job/sync/services unless we implement #429)

soxofaan · 2024-09-20T15:02:31Z

> Thinking about it a bit more now, could job options also be provided as process? For example:
> configure_runtime(options) -> bool
> configure_runtime({"driver-memory": "4g", "executor-memory": "1500m", ...})
> Just a spontaneous idea. Still somewhat mixing concerns, but doesn't need a spec change and is more visible to users. Thoughts?

This feels a bit too procedural/stateful to me, and as such it conflicts with the openEO concept of expressing your workflow as a graph of linked processing nodes. How would these configure_runtime nodes be connected to the other nodes of the graph? And related: how would conflicts be resolved if multiple configure_runtime nodes are active, not only because the user added several, but also because additional ones are pulled in indirectly through UDPs?
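To illustrate the connectivity problem, here is a small sketch that lists nodes in a flat process graph that nothing refers to via from_node and that are not the result node. The configure_runtime process is the hypothetical one proposed above, and the helper names are made up for this example.

```python
def referenced_nodes(value) -> set:
    """Collect node ids referenced through {"from_node": ...} in an argument value."""
    refs = set()
    if isinstance(value, dict):
        if "process_graph" in value:
            # Child process graphs have their own node scope; skip them here.
            return refs
        if isinstance(value.get("from_node"), str):
            refs.add(value["from_node"])
        for item in value.values():
            refs |= referenced_nodes(item)
    elif isinstance(value, list):
        for item in value:
            refs |= referenced_nodes(item)
    return refs


def unconnected_nodes(process_graph: dict) -> list:
    """Return ids of nodes that are neither referenced nor marked as result."""
    used = set()
    for node in process_graph.values():
        used |= referenced_nodes(node.get("arguments", {}))
    return sorted(
        node_id
        for node_id, node in process_graph.items()
        if node_id not in used and not node.get("result", False)
    )


graph = {
    "load": {
        "process_id": "load_collection",
        "arguments": {"id": "SENTINEL2_L2A", "spatial_extent": None, "temporal_extent": None},
    },
    "configure": {
        "process_id": "configure_runtime",  # hypothetical process from this thread
        "arguments": {"options": {"driver-memory": "4g"}},
    },
    "save": {
        "process_id": "save_result",
        "arguments": {"data": {"from_node": "load"}, "format": "GTiff"},
        "result": True,
    },
}
dangling = unconnected_nodes(graph)
```

In this graph only the configure node ends up in the list, which is the "unconnected node" situation in question: a back end would have to treat such nodes specially instead of ignoring them.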

m-mohr · 2024-09-22T00:18:34Z

That's a good question and there's no obvious solution yet. Having unconnected nodes is probably a bit of a hassle... On the other hand, adding job metadata to the process is also not very clean, as pointed out above. So maybe it's really about sharing jobs (i.e. job metadata) instead of processes in this case? Similar things could come up for web services, where you also wouldn't want to add the metadata for the service creation to the process, but would instead probably share web service metadata.

jdries assigned soxofaan and m-mohr Sep 18, 2024
