New process graph structure #160
A couple of comments to the above:
Thanks, @mkadunc!
If I understand you correctly, all three points are basically saying that we should remove the
We have basically three "magic" keywords at the moment:
Updated on 14/12/2018 with a new approach, as the old approach was very hard to parse/execute. It's still a graph, but the callbacks are handled differently. "Reference" implementations in JS for building and parsing/executing are available in the JS client repository for now. This will be implemented as described here in openEO API v0.4 if no other comments are received. Awaiting your feedback, thanks.

@mkadunc As you brought up re-using callbacks: If you re-use a callback in a client, the processes get duplicated in the graph, as the callbacks are just generated as a child process graph of the calling method. If this gets too messy, the user has to store the process graph and reference it. The issue is that the builder doesn't know whether a callback function is called twice or not.
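To illustrate the duplication issue, here is a toy sketch with an invented mini-builder (not the actual JS client API): re-using the same native JS function as a callback still emits a separate copy of its sub-graph at every call site, because the builder expands the function each time it is passed in.

```javascript
// Toy illustration with an invented mini-builder (not the actual JS client API).
function buildCallbackGraph(fn) {
  // pretend-"compile" the JS function into a sub-graph representation
  return { callback: { expression: fn.toString() } };
}

function addReduceNode(graph, nodeId, reducer) {
  graph[nodeId] = {
    process_id: "reduce",
    arguments: { reducer: buildCallbackGraph(reducer) }
  };
  return graph;
}

// one native JS function, used as a callback twice
const mean = (values) => values.reduce((a, b) => a + b, 0) / values.length;

let graph = {};
graph = addReduceNode(graph, "reduce1", mean);
graph = addReduceNode(graph, "reduce2", mean);

// the two nodes hold separate (but identical) copies of the callback sub-graph
console.log(graph.reduce1.arguments.reducer !== graph.reduce2.arguments.reducer); // true
console.log(JSON.stringify(graph.reduce1.arguments.reducer) ===
            JSON.stringify(graph.reduce2.arguments.reducer)); // true
```

This is exactly the situation described above: the builder cannot tell that both `reducer` arguments came from the same function, so deduplication would require the user to store the sub-graph and reference it.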
Possible typo:
In case it's useful, a compact pseudo-representation of the graph:
(you get something more JSON-like if you substitute
Thanks, @mkadunc.
Sure, my example given above is just based on a not-very-elegant JS process graph builder written as a proof of concept. In the long run you'd generate something that feels more "native", e.g. a class with methods generated from the process definitions returned by the connected back-end. Still, you need to pass the data, which you omitted. You can do some "magic" there to pass it automatically, but that may fail in some cases. For example, your first call to export could be a problem, as it returns a boolean instead of a collection, so the following reducer would probably fail.
Nope, I missed the WebSocket output thing...
Don't worry - I wasn't trying to define a new encoding, just rewrote the JSON into some pseudo-language to get a better overview of the process graph.
Right now we have magic strings coming into the callback process, which limits reuse of stored (sub)processes - i.e. the callback process needs all its arguments to have identical names in all contexts where it's executed. I agree about the need to copy graphs - as it is now, the callback sub-process is parameterized with magic strings (free symbols), which might get resolved/bound to different things in different situations. We could improve this by defining callbacks as pure functions (subprocesses) that:
In the example above, the callback can be called from anywhere, as it doesn't depend on any data values being available outside of its definition. In the example, the parameter of the callback, "pixelPair", is defined implicitly. Maybe there's value in defining function parameters explicitly, in which case the

Another nice feature is that "dimension_data" - the magic internal variable that only exists within the implementation of 'reduce' - is only used once, and only inside the arguments of the 'reduce' call. There are a couple of nice benefits of having callbacks be equivalent to pre-defined processes and/or UDFs:
No, this is not the way stored process graphs work as of now. Process graphs by themselves don't have "non-filled"/flexible parameters, as this would lead to a non-validating process graph and would therefore be rejected. To fill those flexible parameters, we introduced process graph variables in API v0.3: https://open-eo.github.io/openeo-api/v/0.4.0/processgraphs/#variables . So you'd need to define these and set these variables in the get_processgraph process. Nevertheless, it is true that the
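For illustration, a "flexible" argument using such a variable might look like the following hypothetical sketch. Only `variable_id` is taken from the linked documentation; the surrounding node, the process name and the `type` field are assumptions made up for this example.

```javascript
// Hypothetical sketch of a process graph variable (see the variables page
// linked above). Node id, process name and the `type` field are invented.
const storedGraph = {
  filter1: {
    process_id: "filter_temporal",
    arguments: {
      // this argument is left "flexible" and must be filled by the caller
      extent: { variable_id: "temporal_extent", type: "array" }
    },
    result: true
  }
};

// A caller (e.g. via get_processgraph) would bind a concrete value to
// "temporal_extent" before the stored graph can validate and execute.
console.log(storedGraph.filter1.arguments.extent.variable_id); // "temporal_extent"
```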
I think this is already the case. On the client side, in the JS builder, a callback is a native JS function, which can be re-used. In the process graph they are still duplicated. It's not completely clear to me whether you are speaking about the process graph or the client implementation here.
We have the same idea here. That's also how I'd like it to be, and I think this is already the case, but it may need some fine-tuning. For example, I removed the binding from the last proposal, but it may still be useful and it seems it should be added again. That could be an optional
Both - I mostly want to get as much clarity as possible in terms of the conceptual domain model (of processes, nodes and variables) and the conceptual model of the expression syntax that we're building. Things such as:
After re-reading your examples in light of your comments, I agree that your current proposal is a good way forward, and maybe we can find opportunities for consolidation of concepts later. Regarding the
Another thought about

TL;DR: I don't think we need binding - our choice of significant parameter names means that the higher-order functions (

P.S.: If we support anonymous/inline callbacks, binding can quite easily be achieved manually by wrapping the function call in an inline expression:
longer version

Maybe best to demonstrate with a couple of examples. Start with Java / JavaScript / TypeScript, where order defines the meaning of params. The callback is a function that takes some number of parameters of a certain type, in order (names are there just for convenience):

```typescript
// parameter names "value" and "index" are irrelevant
function map<T, U>(array: Array<T>, callbackfn: (item: T, index: number) => U): U[] {
  const ret: U[] = [];
  for (let i = 0; i < array.length; i++) {
    const element = array[i];
    ret.push(callbackfn(element, i)); // name is irrelevant; 'element' is bound to first param of callbackfn
  }
  return ret;
}

const myArray: number[] = [5, 4, 2];
const changedArray: string[] = map(myArray,
  // the names p1 and p2 are local to the callback function
  // we need explicit binding of some names to indices, then use the names in the callback expression
  (p1, p2): string => String(p1) + "_" + p2
);
```

An extreme version of parameter-name-is-not-important are LISPs' lambda expressions, or Mathematica's "pure functions", which don't even bother with parameter names and allow one to write the callback with just the parameter position in the input:

```mathematica
(* callback is a Function that will be called with 2 parameters; their names are irrelevant *)
MapIndexed[callback_Function, array_List] := Table[callback[array[[idx]], idx], {idx, Length[array]}];

myArray = {5, 4, 2};
changedArray = MapIndexed[
  (* the following line is the complete definition of callback - '#1' and '#2' are parameter references *)
  (* '&' indicates that this is a pure function *)
  (* no explicit binding *)
  (ToString[#1] <> "_" <> ToString[#2]) &,
  myArray
]
```

In openEO, or another language with significant parameter names (and insignificant order), the interface of the callback function will prescribe names; this eliminates the need for new local names in the definition of the callback, and is close to LISPs' or Mathematica's pure functions. The example here uses some fictional notation for parameterized types, and an anonymous declaration of the callback function:

```
// callback will be evaluated with parameters "item" and "index"; their order is not important
function map<T,U>(array: Array<T>, callback: Function<index:number, item:T>) {
  let ret = [];
  for (let i = 0; i < array.length; i++) {
    ret = array_push(
      "array": ret,
      "element": callback("item": get_element("array": array, "index": i),
                          "index": i)
    );
  }
  return ret;
}

myArrayB = map("array": myArrayA,
  "callback":
    // start anonymous function block;
    // signature of the function is inferred from
    // the definition of 'map' function
    {
      return toString(item) + "_" + index;
    }
);
```

This is equivalent to the following version with explicit declaration of the callback function:

```
// parameter names are important! order is not
function myCb(index: number, item: number) {
  return toString(item) + "_" + index;
}

myArrayB = map("array": myArrayA, "callback": myCb);
```
Thanks Miha, I really appreciate all your thoughts. By the way, the decision for named parameters and insignificant order comes directly from the decision for JSON objects and its restrictions. I'm not sure whether we are speaking about the same things regarding bindings. It seems your example is client or back-end code? As you can see in the JS client code example in the initial post, there are no bindings in it, so I came to the same conclusion. What I was speaking about later was the process graph. I still think we need "bindings" there. Or, let's not call them bindings; they just state where the data for each parameter comes from. And I think we still need that, otherwise it is not clear what the data flow is. The names used in the process graph are specified in the process definition. See

Regarding your other questions:
There are no real variables; the data is passed directly between processes. We can't really control the client side, though.
Using
For callbacks this depends on the input data / context. For nodes in a set of processes, each gets triggered once.
It's an idealized client code. Or, if you want, a programming-language-like representation of the process graph.
We might not - in the same way as the client code for callback directly references properties of the
I think we are on the same page. I'm not trying to push for any particular change, just asking for clarification.
Okay, I wasn't speaking about client code here, as it is hard to control what the "external" languages support, whether they are strict about scoping (Java) or not so much (JavaScript with

An example:
In the reducer callback, JavaScript would technically allow me to use the dateFilter variable, because the scope of dateFilter is not restricted. Nevertheless, currently this would fail when generating a process graph, as the current architecture only allows strict scoping, so you are only allowed to use whatever is passed in
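The mismatch can be sketched in plain JavaScript. The builder function, variable names and data below are invented for illustration (this is not the openEO client API); the point is that a JS closure can capture an outer variable such as `dateFilter`, while a process-graph builder only sees the arguments explicitly passed to the callback and has no way to serialize the captured value into the JSON graph.

```javascript
// Hypothetical illustration: closures work in JS, but captured variables
// have no representation in a serialized process graph.
function buildReducer(dateFilter) {
  // In plain JavaScript this closure over `dateFilter` works fine:
  return (data) => data.filter(d => d.date >= dateFilter).length;
}

const reducer = buildReducer("2018-01-01");
const result = reducer([
  { date: "2017-06-01", value: 1 },
  { date: "2018-06-01", value: 2 }
]);
console.log(result); // 1 - only one element passes the date filter

// A graph builder, however, only receives the callback's own parameters;
// the captured `dateFilter` would be lost (or rejected) under strict scoping.
```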
Probably a misunderstanding. I was speaking about the process graph; there is no
No, that's not allowed. Strict scoping, see example above.
Yes, due to the strict scoping. (And thanks for the good examples, this will help me writing better documentation.)
No, again, strict scoping. You can only access what is passed by the process that calls the callback, and what is passed there is defined in the process description. In my first comment it would be

Hope this clarifies things. I'll try to explain it better when porting this to the API documentation. Do you think the approach is reasonable, or should we switch to less strict scoping? I'm convinced at the moment that strict scoping makes things easier and less prone to errors, but are there any concerns you have? Sure, we need to ensure that all relevant information required in the callbacks is passed to them. @mkadunc
Thanks. This clears things up, and I fully agree that strict scope (i.e. no access into parent or child contexts) is the best approach to take right now. The only drawback I can think of ATM is the inability to pass custom values to code inside a callback; right now we pass only those arguments that the outer function (e.g.

The following is possible in JavaScript, but not in openEO:

```javascript
const foo = [1, 2, 3];
const bar = 7;
const fifu = map(foo, params => params.dim_data + bar);
```

If I wanted to do this in openEO, I'd need to do something analogous to this:

```javascript
// add a new single-element axis
const newcube = append_dimension(foo, "variable", "foo");
// add a new element to the axis, with all values initialized to bar
const foobar = extend_dimension(foo, "variable", "bar", bar);
const fifu = reduce(foobar, "variable", params => params.dim_data[0] + params.dim_data[1]);
```

If we were to solve this problem, I'd consider doing it explicitly, e.g. with a parameter on the definition of the custom callback that one could populate manually in order to provide the "context" into the callback. E.g. like this:

```javascript
const fifu = apply_dimensions(cube: a, dimensions: ALL, callback: {
  // explicitly specify which local data will be 'injected' into the callback
  context: { boo: bar },
  // use only params and context in the callback expression (no access to data outside)
  expression: (params, context) => params.dim_data + context.boo
});
```
Not sure if my example above is representative of what openEO use-cases require, so maybe supporting a way of passing context into callbacks is not yet necessary.
Yes, this could be useful, but I'm not sure whether anybody needs it yet. I don't see a disadvantage in introducing a context mechanism/parameter, except that it leads to a bit of additional documentation. I opened a separate issue for this: Open-EO/openeo-processes#25
Mostly incorporated into the API documentation. It would be great if someone could check https://open-eo.github.io/openeo-api/v/0.4.0/processgraphs/ and the OpenAPI specification.
All assignees, it would be very helpful if you could review this issue carefully. Thanks!
Updated on 14/12/2018 with a new approach as the old approach was very hard to parse/execute.
Proposal for a new process graph structure
Over the last few days we discovered several issues (see below) with the current process graph definition. All of the issues discussed in the VITO sprint are solved by the new graph-based approach described below. This proposal is a bit different from the one discussed at VITO, but it keeps the promising idea of actually making the process graph a real graph-like structure. It radically changes the process graph structure, but gives us a lot more flexibility. In the following I'll provide an example:
Algorithm
Assume we would like to execute the following (not very meaningful) algorithm:
Process chain in blue, callbacks in yellow. Process descriptions are available here: http://processes.openeo.org/sprint
JavaScript client code example
Could be generated by the following client code:
Process graph
This translates into the following JSON encoding for the process graph:
Processes
So each process is assigned a unique graph id (e.g. `export2`) and arguments can expect data from another process by its graph id in the `from_node` property. They can expect either the result implicitly (no `receive` property available) or expect "internal" data from a process, such as a single element (`element`) in `filter` or dimension data (`dim_data`) and the dimension (`dimension`) in `reduce`. The parameters which are passed to callbacks are specified in the JSON schema of the process parameters. For examples see the following two processes, especially the schema for `reducer` and `expression`:

(Remark: Having the callback parameters in the JSON schema may be a problem for validation, so we may need to move them one level up and define them directly in the process parameters, but we'll discover that in the coming months. It doesn't change anything major in the approach discussed here.)
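To make the shape concrete, here is a hypothetical minimal process graph in the proposed structure. The node ids, the collection name and the exact wiring of the callback's internal data are invented for illustration (see the linked process descriptions for the real signatures); only `from_node`, `callback` and `result` are taken from the proposal above.

```javascript
// Hypothetical minimal process graph in the proposed structure.
const processGraph = {
  load1: {
    process_id: "get_collection",
    arguments: { name: "Sentinel-2" }
  },
  reduce1: {
    process_id: "reduce",
    arguments: {
      data: { from_node: "load1" },  // expects the result of node 'load1'
      dimension: "temporal",
      reducer: {
        callback: {                  // a callback is a process graph of its own
          min1: {
            process_id: "min",
            // assumed wiring: the callback consumes the internal dimension data
            arguments: { data: { from_node: "dim_data" } },
            result: true             // result node of the callback sub-graph
          }
        }
      }
    },
    result: true                     // exactly one result node in the top-level graph
  }
};

console.log(Object.keys(processGraph)); // node ids: load1, reduce1
```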
Callbacks are a process graph on their own and are set using the `callback` property.

As multiple end nodes are possible (see example above), for web services or stored process graphs it can be important to have exactly one end node whose result can be referenced and used. Therefore one node needs to have the `result` flag set to `true`. `callback` has a result node similar to our "main" process graph.

Processing the process graph
I made implementations in JS for both generating (client-side) and parsing/executing (server-side) process graphs. Both solutions work and can be found here: https://github.com/Open-EO/openeo-js-client/tree/new-processgraph-builder
As a back-end you need to go through all nodes/processes in the list and record for each node which nodes it passes data to and which nodes it expects data from. In another iteration the back-end can find all start nodes for processing by checking for zero dependencies (i.e. `node.expectsFrom.length === 0` in JS). You can now start and execute the start nodes (in parallel if possible). Results can be passed to the nodes that were identified beforehand. For each node that depends on multiple inputs you need to check whether all dependencies have already finished, and only execute once the last dependency is ready.
Please be aware that the result node (result set to true) is not necessarily the last node that is executed. The author of the process graph may choose to set a non-end node as the result node!
Issues (as discussed in the VITO sprint)
Over the last few days we discovered several issues (see below) with the current process graph definition. More complex process graphs won't work with our current JSON encoding, as the JSON encoding was basically a tree, so for instance branching from one imported collection into some "parallel" processing chains doesn't work well. In the end that led to limitations such as not being able to pass two bands into a multiply(x, y) method.
`get` and `set` for a first draft.