Language-agnostic UDFs #87

Closed
m-mohr opened this issue Jun 4, 2018 · 21 comments

m-mohr (Member) commented Jun 4, 2018

So we are working on different implementations for UDFs.

It would be good to exchange ideas in the coming weeks and add them here.

I am currently stripping down the API to only include endpoints that are considered to be at least in a release-candidate state for the release of v0.3. UDFs still seem to be in early alpha, at least in the API definition, so I removed the current specification:

openapi: 3.0.1
tags:
  - name: UDF Runtime Discovery
    description: >-
      Discovery of programming languages and their runtime environments to
      execute user-defined functions at the back-end.
paths:
  /udf_runtimes:
    get:
      summary: >-
        Returns the programming languages including their environments and UDF
        types supported.
      description: >-
        Describes how custom user-defined functions can be exposed to the data
        and which programming languages and environments are supported by the
        back-end.
      tags:
        - UDF Runtime Discovery
      security:
        - {}
        - Bearer: []
      responses:
        200:
          description: Description of UDF runtime support
          content:
            application/json:
              schema:
                type: array
                items:
                  type: object
                  description: >-
                    A map with language identifiers such as `R` as keys and an
                    object that defines available versions, extension packages,
                    and UDF schemas.
                  additionalProperties:
                    type: object
                    properties:
                      udf_types:
                        type: array
                        items:
                          $ref: '#/components/schemas/udf_type'
                      versions:
                        type: object
                        description: >-
                          A map with version identifiers as keys and an object
                          value that specifies which extension packages are
                          available for the particular version.
                        additionalProperties:
                          description: >-
                            Extension package identifiers that should include
                            their version number such as `'sf__0.5-4'`
                          properties:
                            packages:
                              type: array
                              items:
                                type: string
                          type: object
              examples:
                response:
                  value:
                    R:
                      udf_types:
                        - reduce_time
                        - reduce_space
                        - apply_pixel
                      versions:
                        3.1.0:
                          packages:
                            - Rcpp_0.12.10
                            - sp_1.2-5
                            - rmarkdown_1.6
                        3.3.3:
                          packages:
                            - Rcpp_0.12.10
                            - sf_0.5-4
                            - spacetime_1.2-0
        4XX:
          $ref: '#/components/responses/client_error_auth'
        5XX:
          $ref: '#/components/responses/server_error'
  '/udf_runtimes/{lang}/{udf_type}':
    parameters:
      - name: lang
        in: path
        description: Language identifier such as `R`
        required: true
        schema:
          type: string
          enum:
            - python
            - R
      - name: udf_type
        in: path
        description: >-
          The UDF types define how UDFs can be exposed to the data, how they can
          be parallelized, and how the result schema should be structured.
        required: true
        schema:
          type: string
          enum:
            - apply_pixel
            - apply_scene
            - reduce_time
            - reduce_space
            - window_time
            - window_space
            - window_spacetime
            - aggregate_time
            - aggregate_space
            - aggregate_spacetime
    get:
      summary: Returns the process description of UDF schemas.
      description: >-
        Returns the process description of UDF schemas, which offer different
        possibilities how user-defined scripts can be applied to the data.
      tags:
        - UDF Runtime Discovery
      security:
        - {}
        - Bearer: []
      responses:
        200:
          description: Process description
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/udf_description'
              examples:
                response:
                  value:
                    process_id: /udf/R/reduce_time
                    description: >-
                      Applies the given R script on all time series of the input
                      imagery. The R script gets pixel values (all bands) of
                      complete time series as input and must result in a single
                      value or tuple for multiple bands.
                    args:
                      imagery:
                        description: input image time series
                      script:
                        description: R script that will be applied over time series
        4XX:
          $ref: '#/components/responses/client_error_auth'
        5XX:
          $ref: '#/components/responses/server_error'
components:
  schemas:
    udf_type:
      type: string
      description: >-
        The UDF types define how UDFs can be exposed to the data, how they can
        be parallelized, and how the result schema should be structured.
      enum:
        - apply_pixel
        - apply_scene
        - reduce_time
        - reduce_space
        - window_time
        - window_space
        - window_spacetime
        - aggregate_time
        - aggregate_space
        - aggregate_spacetime
    udf_description:
      description: >-
        Defines and describes a UDF using the same schema as the description of
        processes offered by the back-end.
      type: object
      required:
        - process_id
        - description
      properties:
        process_id:
          type: string
          description: The unique identifier of the process.
        description:
          type: string
          description: >-
            A short and concise description of what the process does and how the
            output looks like.
        link:
          type: string
          description: >-
            Reference to an external process definition if the process has been
            defined over different back ends within OpenEO
        args:
          type: object
          additionalProperties:
            type: object
            required:
              - description
            properties:
              description:
                type: string
                description: A short and concise description of the process argument.
              required:
                type: boolean
                default: true
                description: Defines whether an argument is required or optional.
            additionalProperties: true
      example:
        process_id: udf/R/reduce_time
        description: >-
          Applies an R function independently over all input time series that
          produces a zero-dimensional value (scalar or multi-band tuple) as
          output (per time series).
        args:
          imagery:
            description: input (image) time series
            required: true
          script:
            description: Script resource that has been uploaded to user space before.
            required: true
m-mohr added the "in discussion", "udfs and UDF runtime discovery" and "work in progress" labels on Jun 4, 2018
m-mohr added this to the v0.4 milestone on Jun 4, 2018
jdries commented Jun 4, 2018 via email

m-mohr added a commit that referenced this issue Jun 4, 2018
m-mohr (Member, Author) commented Jun 4, 2018

@GreatEmerald shared in our chat:

During the Proba-V Symposium some of our team got to talk to Leslie Gale from Space Applications, who shared a bit about what they have achieved in the EOPEN project so far. There was a demonstrator of how they tackled the issue of UDFs: they have a web front-end for generating Docker definition files. The user selects which dependencies to deploy, and the site generates a boilerplate Docker definition file that can then be edited, or the user's script can be uploaded to be included in it. Those files then get uploaded to a back-end, where the processing is done.
Leslie said that we could just reuse the same solution in openEO as well; the code is open and out there. Perhaps there would also be a way to integrate EOPEN as a front-end in openEO.

GreatEmerald (Member) commented:

Yes, I'll post more info on that once I get a reply.

Looking at the issue of language-specific UDF support vs. something based on Docker, it seems to me that the former is currently aimed at relatively basic processing (e.g. computing vegetation indices), as opposed to something complex (e.g. running custom time series breakpoint analysis with specific R package versions).

On the one hand, as far as I can tell, the simpler language-specific approach is what was initially envisioned for openEO; on the other, a Docker-based solution could be quite a bit more flexible (if more difficult for the user to set up).

huhabla commented Jun 7, 2018 via email

pramitghosh commented Jun 7, 2018

In order to maintain interoperability of the UDFs with the back-ends, I believe some conventions need to be agreed upon. I was thinking more in terms of a file-based system for transferring data to and from the back-ends for executing the UDFs. For example, the I/O for rasters could be single-band GeoTIFFs (or Cloud Optimized GeoTIFFs, as suggested by @m-mohr) in a specific directory structure and/or naming convention, with some more generic (say ASCII) files for metadata. Something similar could be devised for feature and time series data too. A brief description of the strategy I am planning is here: https://github.com/pramitghosh/OpenEO.R.UDF#general-strategy.

The back-ends need to provide the input in consistent formats, and the UDFs likewise need to write their outputs to disk in consistent formats so that the back-ends can read them back again. Keeping the formats generic would help ensure compatibility. I am already coordinating with @flahn on this for the R back-end, and it would be good to discuss these issues with everyone for the other back-ends too. The issue of external dependencies could be solved, at least for the R implementation, if the user provides them as a comma-separated string in a text file along with the script, for example.
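
To give an idea of the kind of layout I have in mind, it could look something like this (purely illustrative; none of the file names or the structure are fixed yet):

udf_job/
  legend.csv         (metadata: band names, timestamps, mapping to files)
  script.R           (the user's UDF script)
  packages.txt       (external dependencies as a comma-separated string)
  in/
    t2017001_b1.tif  (one single-band GeoTIFF per time step and band)
    t2017001_b2.tif
    ...
  out/               (the UDF writes its results here, same conventions)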

I was wondering if @huhabla is thinking in similar directions as well for the UDF functionality using Python.

Some important points to ponder, and some personal opinions on using Docker and language-agnostic UDFs:

  1. One thing I wanted to point out is the use of Docker for the UDFs. Since UDFs (whether using Docker or not) involve reading and writing data to disk, they are bound to be time-consuming in my opinion. On top of this, introducing Docker might add further performance overhead, which would in turn probably affect the user monetarily.
  2. Furthermore, Docker might be too fancy for a small subset of potential openEO users without advanced programming knowledge who would prefer to keep their UDFs simple.
  3. Correct me if I'm wrong, but I believe making UDFs language-agnostic would imply that users could, in principle, structure their output in any format using any language of their choice, which the back-end would then have to read back in. This could be a problem: if there is some inconsistency in the UDF output structure/format such that it is not parsable by the back-end that called it, the error would only surface after the whole UDF has been processed, which means lost computing time.

I would love to have everyone's opinion on these, at least on the strategy for the UDFs for now, since the UDFs are intricately connected to a number of other components as evident from today's (7.6.18) telco. Thanks!

huhabla commented Jun 7, 2018 via email

pramitghosh commented:

Dear Sören,

Thanks, @huhabla, for describing your approach for UDFs in Python. You are right that the back-ends usually have their own "native" file formats which they are more comfortable with, and which might not be GeoTIFF. As long as the format is supported by GDAL/OGR we are good to go. I will try to incorporate this point in the R implementation too.

Just to further clarify some points regarding your Python implementation, could you please comment on whether I got the following points right:

  1. You are also using a file-based system for I/O to and from the UDF environment, but the data formats are not necessarily GeoTIFFs and could be anything readable by GDAL/OGR. These binary files are accompanied by JSON files containing some metadata on them, which would be parsed by your UDF implementation.
  2. Internally you are converting these binary files into ADTs using standard Python-specific implementations, such as NumPy arrays for rasters, on which the user's Python script containing the UDF would be run. (However, whatever comes in or goes out is binary.) See the sketch after this list for how I picture this.
  3. I'm not sure about this, but if the user has their UDF in some other language, are you exporting the NumPy arrays (or other Python-specific data structures) to some generic format on which the UDF is then run?
  4. After the UDF has run, the output is again written in binary (along with a JSON?) for the back-end to read back.
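
Roughly, for points 1, 2 and 4, I picture something like this (a sketch only; the file names, metadata fields and the median reduction are made up, and gdal refers to the GDAL Python bindings):

import json
import numpy as np
from osgeo import gdal

# 1. Read the JSON sidecar the back-end wrote next to the binary raster file
with open("raster_metadata.json") as f:  # hypothetical file name
    metadata = json.load(f)

# 2. Convert the GDAL-readable binary file into a NumPy array for the UDF
dataset = gdal.Open(metadata["file"])
array = dataset.ReadAsArray()  # shape: (bands or time, y, x)

result = np.median(array, axis=0)  # stand-in for the user's UDF logic

# 4. Write the result back as a binary file for the back-end to read
driver = gdal.GetDriverByName("GTiff")
out = driver.Create("result.tif", result.shape[1], result.shape[0], 1, gdal.GDT_Float32)
out.GetRasterBand(1).WriteArray(result)
out.FlushCache()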

If the above points are right, I would say I am thinking along the same lines (except for point 3) when looking at the UDF implementations externally as a black box, apart from a few file format differences: e.g. I currently use a specific generic format (GeoTIFF) for rasters, which I could change to any GDAL/OGR-readable one without too much hassle, and CSV instead of JSON for storing the metadata, which could also be changed to make the two implementations conform more closely, even if not exactly.

However, one thing I am a bit concerned about is the interfacing of the UDF implementations with the different back-ends. I think once we find common ground on the I/O formats and structure, it will be easier for the back-end devs to make their back-ends communicate with both UDF implementations.

As a side note: is converting multi-temporal, multi-band GeoTIFFs to simple ADTs like arrays and lists a good idea? Will this not blow up already large data significantly?

Thanks!

jdries commented Jun 8, 2018

Note that this is the first issue where we actually start to assume that back-ends write files to disk at some point, except perhaps for creating a result that can be downloaded. This has a pretty major impact on the ability to do synchronous calls and web services.

In the proposal we had the concept of a 'file-based API'. Should we perhaps see this issue as a first part of that API? And should this also affect other parts of the API? For instance, when working file-based, it does make sense to have an OpenSearch catalogue allowing you to search for scenes that need to be provided as input. This is different from the current 'datacube' approach.

Another option might be to find a way to stream tiles that are in memory in the back-end into the Docker container directly.
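
As a rough illustration of that last option: in-memory tiles could be piped into a container over stdin and the result read back from stdout, something like the sketch below (the image name my-udf-runtime and the raw-bytes protocol are made up):

import subprocess
import numpy as np

# Hypothetical in-memory tile (bands, y, x) living in the back-end
tile = np.zeros((4, 256, 256), dtype="float32")

# Pipe the raw tile bytes into a containerized UDF and read the result back
proc = subprocess.run(
    ["docker", "run", "-i", "--rm", "my-udf-runtime"],
    input=tile.tobytes(),
    stdout=subprocess.PIPE,
    check=True,
)
result = np.frombuffer(proc.stdout, dtype="float32").reshape(256, 256)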

huhabla commented Jun 8, 2018 via email

huhabla commented Jun 8, 2018 via email

huhabla commented Jun 8, 2018 via email

pramitghosh commented:

Dear Sören,

Thanks for the detailed explanation of the Python implementation. Yes, the user's UDF need not worry about file handling (I was a bit worried about the part concerning UDFs in other languages).

So, what I understand now is that the data is converted to Python objects and the user's UDF runs on those objects. The results are converted back to GDAL-readable formats for the back-end; this is done by the program execute_udf. Also, the whole process happens in memory, and the data is converted to Python objects by the back-end.

For the R UDFs, I am taking a similar direction, except for the following major differences:

  • The data from the back-end is written to disk physically (as Jeroen @jdries mentioned) at a location where both the back-end and the UDF server have read/write access.
  • The data here is also converted to objects in R (which provides the users writing the UDFs a consistent structure to work on), but this conversion is independent of the back-end, since it is done by a separate R package under development (https://github.com/pramitghosh/OpenEO.R.UDF). This package has to be (pre-)installed on the servers executing R UDFs. The conversion of the results back to GDAL-readable formats for the back-end will be taken care of by this package too.
  • As for the UDFs themselves, the user writes their function definition in a script file and calls their own function in the same file (just as in the Python UDF examples you provided here: https://github.com/Open-EO/openeo-udf/tree/master/src/openeo_udf/functions). The only difference in the R implementation is that the user calls their own function as an argument to a function run_UDF() defined in the package mentioned above.

So, the actual UDF "function", for example, could look as simple as this:

my_func = function(obj) {median(obj)}

and this could be called as

run_UDF(legend_name = "legend.csv", function_name = my_func, drop_dim = 4)

The arguments legend_name and drop_dim could eventually be omitted so that the call looks like

run_UDF(my_func)

So, from the R perspective, the exporting and importing is done by the back-end (I am already coordinating with @flahn regarding this) but the conversion into objects is done by the server executing the R UDF. I think writing files to disk could have some advantages in the future, such as keeping memory usage lower, applying other tools to the files in place (some sort of pre-/post-processing, for example), processing the individual files in parallel, etc.

Looking forward to hearing everyone's opinions and/or suggestions on this approach.

Thanks!

huhabla commented Jun 11, 2018 via email

edzer (Member) commented Jun 12, 2018

Yes, of course, but at the cost of no longer being language-agnostic. It would be a pity if we dropped that goal now.

huhabla commented Jun 12, 2018

I am not sure I understand this correctly. Why would we drop this goal now? The openEO UDF Swagger 2.0 description is IMHO language-agnostic. The Python reference implementation is based on the Swagger description; it ensures that the same Python UDF code will run on different back-ends without any modification. The idea is that Python UDFs work on Python openEO UDF API objects (SpatialExtent, RasterCollectionTile, VectorCollectionTile, StructuredData, UdfData, which make use of numpy, geopandas and shapely), not on back-end-specific datatypes. If a back-end has a Python interface, it will be easier to implement Python UDF support in that back-end by using the Python openEO UDF API reference library.
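
For illustration, a UDF in this model might look roughly like this (a sketch only; the accessor names raster_collection_tiles and data are illustrative shorthand, not the exact reference library API):

import numpy as np

def reduce_time_median(udf_data):
    # udf_data stands for a UdfData object holding RasterCollectionTile
    # instances whose data attribute is a (time, y, x) NumPy array.
    for tile in udf_data.raster_collection_tiles:
        tile.data = np.median(tile.data, axis=0, keepdims=True)

The back-end only needs to map its internal tiles to these API objects and back; the UDF itself never sees back-end-specific datatypes.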

edzer (Member) commented Jun 12, 2018

OK, I probably misunderstood. Let's have a chat some time next week!

m-mohr (Member, Author) commented Jun 12, 2018

It seems like there is a lot of confusion regarding the UDFs. This should be presented and discussed with all interested parties in the coming weeks, maybe during one of the next dev telcos?

m-mohr (Member, Author) commented Jun 13, 2018

This will be presented and discussed in the next dev telco on 21/06/2018 at 2 pm.

m-mohr modified the milestones: v0.4, v0.5 on Sep 12, 2018
m-mohr modified the milestones: v0.5, v1.0 on Dec 6, 2018
m-mohr (Member, Author) commented Dec 10, 2018

After the discussions at VITO, we decided to add /udf_runtimes again with the following content:

  • List of runtimes:
    • identifier plus
    • programming language, version, libraries + versions or
    • docker identifier
  • a default environment

m-mohr modified the milestones: v1.0, v0.4 on Dec 10, 2018
m-mohr (Member, Author) commented Jan 15, 2019

In the last dev telco we discussed changing the approach a bit and grouping the runtimes by programming language. Each language can have multiple versions and a default version (usually the latest).
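
Reusing the runtime data from the specification example above, a GET /udf_runtimes response grouped by language could then look roughly like this (an illustrative sketch; the exact field names were still under discussion at this point):

{
  "R": {
    "default": "3.3.3",
    "versions": {
      "3.1.0": {
        "libraries": ["Rcpp_0.12.10", "sp_1.2-5", "rmarkdown_1.6"]
      },
      "3.3.3": {
        "libraries": ["Rcpp_0.12.10", "sf_0.5-4", "spacetime_1.2-0"]
      }
    }
  }
}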

m-mohr (Member, Author) commented Feb 8, 2019

UDF runtimes are implemented in the API. All further work will be tackled in openeo-udf, openeo-r-udf, openeo-processes etc.
