Replies: 23 comments 22 replies
-
Our engine, Starry DB, has already implemented a similar set of APIs, and we have reaped huge benefits from it. Two of us completed the development and launch of this Spark native engine in half a year. If possible, we would like to share all of this with everyone.
-
This could be very helpful for JVM frontends.
Curious: did you mean a Java + Scala SDK, since it's Scala syntax?
-
cc: @pedroerp @mbasmanova
-
We had discussions in the past about creating Java/JNI bindings, but I don't think anyone has actually done the work. Agreed it would be useful as more Java engines try to leverage Velox. The one complication is that Velox doesn't really have a single public API (rather, there are multiple APIs working in different layers), so I suppose the first discussion would be to agree on the exact APIs we would like to expose. It would also be nice if we made them (at least mostly) compatible across C++ and pyVelox also.
-
+1 for the Velox API discussion. At IBM we have lots of engineers writing functions, and having a Java API would be super useful.
-
I think the API topic (in general, not specifically for Java) should be discussed in the next monthly meeting. @pedroerp I currently don't see anything planned; when will the next one be?
-
@aditi-pandit My understanding of the proposal is to provide a Java API for PlanNode.h, not for writing functions. @hn5092 Would you clarify?
-
True, it would be a good topic to discuss in that forum. @sanumandla has the schedule; I think that event expired, but we should probably bring it back to life.
-
The most urgent Java API need from Gluten is a wrapper to access the RowVector. One major performance issue in Gluten today is UDFs. A simple UDF can be easy to port with the help of an LLM today, but as soon as the UDF uses a Java package like org.apache.http.client.utils.URIBuilder, you end up having to search for or port the whole package. If Velox had a Java API to access the RowVector data, we would still need to port row-based UDFs to columnar ones, but it would be much easier. We could also improve performance by removing the R2C/C2R (row-to-columnar / columnar-to-row) conversions or similar logic.
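To make the idea above concrete, here is a minimal sketch of what a Java view over one RowVector column might look like for row-based UDF evaluation. This is purely illustrative: none of these types (`VarcharColumn`, `InMemoryVarcharColumn`, `RowUdfSketch`) exist in Velox; a real binding would read and write native Velox memory via JNI instead of a Java list.

```java
// Hypothetical sketch, NOT an existing Velox API. A real implementation would
// wrap a native pointer to the underlying Velox buffer.
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

interface VarcharColumn {
    int size();
    String get(int row);          // would read C++ memory via JNI in a real binding
    void set(int row, String v);  // would write back without an R2C/C2R conversion
}

// Pure-Java stand-in so the sketch is runnable without native code.
class InMemoryVarcharColumn implements VarcharColumn {
    private final List<String> data;
    InMemoryVarcharColumn(List<String> data) { this.data = new ArrayList<>(data); }
    public int size() { return data.size(); }
    public String get(int row) { return data.get(row); }
    public void set(int row, String v) { data.set(row, v); }
}

class RowUdfSketch {
    // A row-based UDF applied over a column: the existing Java logic stays
    // row-by-row; only the data access goes through the columnar view.
    static void applyUdf(VarcharColumn col, UnaryOperator<String> udf) {
        for (int i = 0; i < col.size(); i++) {
            col.set(i, udf.apply(col.get(i)));
        }
    }

    public static void main(String[] args) {
        VarcharColumn col = new InMemoryVarcharColumn(List.of("http://a", "http://b"));
        applyUdf(col, s -> s.replace("http://", "https://"));
        System.out.println(col.get(0)); // https://a
    }
}
```

The point of the sketch is that existing row-oriented Java UDF code (including its third-party dependencies) could stay in Java unchanged, while reads and writes go directly against Velox-owned columnar memory.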
-
The Java API for PlanNode is certainly a very important and fundamental one, and it is also a set of APIs that we have implemented. However, in my imagination, Velox could do much more. For example, by passing a Java list to it, it could perform calculations directly, without the need for engines like Spark, like a high-performance collection framework. Or, similar to DuckDB, it could directly read Parquet files and then perform computations, because many Java programs really need this functionality, and the current Java/Scala lambda expressions cannot provide such efficient operations.
-
Our design entails a mapping between the entire vector batch and the underlying C++ Velox vector, with direct data read and write operations. We have replaced Spark's ParquetFormat vector batch with our own vector implementation to address the issue of Velox's Parquet reader not supporting many types; this replacement only involved modifications to the batch code. For UDFs, we can implement a vectorized eval on the Spark side that performs row-wise evaluation on the batch, and then insert the data into the Velox batch using our batch API.
-
@hn5092 Thank you for sharing. Are you working with Gluten or with your own solution?
This seems interesting. IIUC it's a pure-Java implementation of ParquetFormat backed by Velox's in-memory data format, is that correct? Did you try comparing this ParquetFormat with Velox's native Parquet reader/writer in terms of performance?

Furthermore, I think we are basically discussing both the Velox query plan and the Velox data format in this topic; perhaps the two can be split. My personal feeling is that implementing a plan API would help Java + Velox development in any case. For the data format, perhaps we should first figure out the relationship between the new Java Velox format and the existing Velox + Arrow solution. Sharing data between Java and C++ may require a well-designed Velox C ABI, which could overlap functionally with the Velox Arrow bridge a little. Do you have some thoughts on this part?
-
We already have vector-based UDFs in Gluten. The current solution is to convert to Arrow format, then use Arrow's Java API to access the data. This part could be upstreamed to Spark, but it would not be very useful because in Spark only the Parquet scan outputs the columnar format. The initial plan in Velox was to have data-level consistency, so we could get one data format with two wrappers (Arrow's RecordBatch and Velox's RowVector), but that doesn't work well yet. @pedroerp should we add a Java wrapper for Velox's data format, or should we leverage Arrow's Java wrapper?
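As a rough illustration of the "one data format, two wrappers" idea, the sketch below exposes a single shared buffer through both an Arrow-flavored and a Velox-flavored read path, with no copy in between. All names here are made up for illustration; neither view corresponds to a real Arrow or Velox class, and real wrappers would sit over native columnar memory rather than a Java array.

```java
// Hypothetical sketch of one shared data format behind two wrapper styles.
class TwoWrappersSketch {
    // The shared, engine-neutral storage (stands in for native columnar memory).
    static final class SharedIntBuffer {
        final int[] values;
        SharedIntBuffer(int[] values) { this.values = values; }
    }

    // Arrow-flavored view: getObject-style access.
    static final class ArrowStyleView {
        private final SharedIntBuffer buf;
        ArrowStyleView(SharedIntBuffer buf) { this.buf = buf; }
        Integer getObject(int index) { return buf.values[index]; }
    }

    // Velox-flavored view: valueAt-style access over the same memory.
    static final class VeloxStyleView {
        private final SharedIntBuffer buf;
        VeloxStyleView(SharedIntBuffer buf) { this.buf = buf; }
        int valueAt(int row) { return buf.values[row]; }
    }

    public static void main(String[] args) {
        SharedIntBuffer buf = new SharedIntBuffer(new int[] {10, 20, 30});
        ArrowStyleView arrow = new ArrowStyleView(buf);
        VeloxStyleView velox = new VeloxStyleView(buf);
        // Both wrappers see the same data: no R2C/C2R conversion in between.
        System.out.println(arrow.getObject(1) + " " + velox.valueAt(1)); // 20 20
    }
}
```

If this consistency held at the buffer level, the Arrow-to-Velox conversion step that Gluten pays today would disappear.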
-
The data format one is more important to Gluten; Gluten doesn't need a Java wrapper for the Velox query plan.
-
@hn5092 you may want to move this topic to a discussion.
-
Additionally, I haven't yet seen any necessary benefits that Arrow provides for Java APIs. If there are advantages I'm not aware of, please let me know.
-
@mbasmanova: The IBM use cases I'm familiar with are close to what BinWei described above. The existing functions (scalars or aggregates) are written in Java, and we want to wrap the row-by-row logic (though some functions could be columnar as well) in a Velox operator that can be invoked through SQL. Remote UDFs could be used for this too, though a Java API would be more performant. This issue has been revised and rewritten a few times, but when I commented, that was my line of thought. The Velox PlanNode API also came up a few times when we discussed PlanValidation of native plans on the coordinator side.
-
I hope this matter won't be put on hold. In the next few weeks, I'll take the time to submit a draft of the Velox Vector Java API for everyone to review; please take a look and provide your feedback. Upon completion, we will proceed with Phase 2, the development of the Velox PlanNode Java API.
-
I'm +1 for adding bindings to the main APIs we need, but we need to be careful about the goals of the project. Velox is not meant to be a user-facing data processing library by itself; it is not meant to be an alternative to things like Pandas, numpy, DuckDB and similar. The idea is that such projects could be built by leveraging Velox. So, in my opinion, if the goal is to provide a very thin wrapper exposing Velox's main APIs in Java, this would be part of Velox (like PyVelox, although that is still a prototype). If the plan is to provide an API for end users (data scientists and such) so that they could use Velox's power to build data pipelines (read Parquet files, aggregate data, similar to Pandas), I still think that's a very useful and compelling thing to build, but it sounds like a different project altogether.
-
Now, regarding the actual APIs to be exposed. In my view, the main value Velox provides is a way to very efficiently execute query plans. In this sense, an external API should comprise, in this order:
1. An API to define and execute query plans. It happens that many plan nodes take expressions, so we will also need:
2. An API to define expressions. Optionally, we could also expose a way to evaluate expressions as a standalone API, independent of query plans.
3. An API for Vectors (similar to Arrow), since some of the above will rely on the actual data representation.
I think it even makes sense to tackle things in this order (although 1. is arguably the hardest to expose), since Velox's main value is to execute plans, not to represent data (the latter could also be done using something like Arrow). @mbasmanova @xiaoxmeng @oerling @Yuhta @majetideepak , thoughts?
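As a thought experiment on what a thin Java layer over the plan-building API could feel like, here is a runnable sketch. `PlanBuilder`, `tableScan`, `filter`, `project` and `toSpec` are illustrative names only (loosely inspired by Velox's C++ test PlanBuilder), not an existing Java API; a real binding would hold native PlanNode handles via JNI rather than strings.

```java
// Hypothetical sketch of a fluent Java plan-building API; NOT a real Velox API.
import java.util.ArrayList;
import java.util.List;

class PlanBuilderSketch {
    static final class PlanBuilder {
        private final List<String> nodes = new ArrayList<>();
        PlanBuilder tableScan(String table)  { nodes.add("TableScan(" + table + ")"); return this; }
        PlanBuilder filter(String expr)      { nodes.add("Filter(" + expr + ")"); return this; }
        PlanBuilder project(String... exprs) { nodes.add("Project(" + String.join(", ", exprs) + ")"); return this; }
        // In a real binding this would return a handle to a native plan tree,
        // ready to be handed to a Velox task for execution.
        String toSpec() { return String.join(" -> ", nodes); }
    }

    public static void main(String[] args) {
        String spec = new PlanBuilder()
            .tableScan("lineitem")
            .filter("l_quantity > 10")
            .project("l_orderkey", "l_quantity * l_price")
            .toSpec();
        System.out.println(spec);
        // TableScan(lineitem) -> Filter(l_quantity > 10) -> Project(l_orderkey, l_quantity * l_price)
    }
}
```

Note how items 1 and 2 of the list above show up together: plan nodes such as Filter and Project carry expressions, which is why the expression API has to come right after the plan API.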
-
@pedroerp We are using Arrow's Java API to access the data. There are two issues observed:
-
I apologize for the inconvenience, as I have been busy organizing code and dealing with other matters. Today, I open-sourced one of our implementations, which, unfortunately, is based on Spark. Everyone is welcome to use it as a reference. This implementation may not be ideal; it is merely meant to offer a concept. My ultimate goal is for us all to adopt a unified and high-performance API, which would also help to reduce the cost of rebasing each time. The entire implementation is fundamentally designed around the Velox API.
-
Description
Generally speaking, to invoke Velox from Java today, one needs to construct a Substrait execution plan. The overall usage and learning cost is very high, requiring a deep understanding of SQL as a whole and also profound knowledge of Substrait. This means mastering a lot of things, which is very inconvenient for developers, and the complexity implies a huge bug-maintenance cost. Here, I propose the idea of building a Java SDK.
Goals:
What needs to be done:
C++:
a. A Native class; objects implementing it can easily interface with a Java class.
b. Framework access to Java objects should also be simple.
c. High performance.
Java:
a. A Native class: an abstract class over every native object, inside which some checks can be done. Objects implementing it get some default behaviors to prevent crashes of this kind.
b. Function resolution.
c. Transfer of intermediate results.
d. Velox Vector Java API.
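The Java-side Native class described above could be sketched as follows. This is a minimal illustration under the author's stated design, not actual Velox code: the names `NativeSketch` and `FakeVector` are invented, and the native handle is stubbed so the example runs without JNI. The check turns a use-after-release (which would crash the JVM when the handle points at freed C++ memory) into a catchable Java exception.

```java
// Hypothetical sketch of a Java base class for native-backed objects.
abstract class NativeSketch implements AutoCloseable {
    private long handle;          // would hold a C++ pointer in a real binding
    private boolean closed = false;

    protected NativeSketch(long handle) { this.handle = handle; }

    // Subclasses call this before touching native memory; throws instead of
    // letting a dangling pointer reach JNI and crash the process.
    protected long nativeHandle() {
        if (closed) {
            throw new IllegalStateException("native object already released");
        }
        return handle;
    }

    // Would invoke a native destructor via JNI; here it only flips the flag.
    @Override
    public void close() {
        closed = true;
        handle = 0;
    }
}

// Example subclass standing in for a native-backed vector.
class FakeVector extends NativeSketch {
    FakeVector() { super(1L); }
    int size() { nativeHandle(); return 0; } // guard before the (stubbed) native call
}

class NativeSketchDemo {
    public static void main(String[] args) {
        try (FakeVector v = new FakeVector()) {
            System.out.println(v.size()); // 0
        }
    }
}
```

Implementing `AutoCloseable` also lets callers scope native lifetimes with try-with-resources, which is one plausible way to keep Java GC and C++ ownership from disagreeing.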