Add Initial Java Support for GDS to KvikIO #396
base: branch-24.12
Conversation
java/README.md
Outdated
Java Bindings

Summary
These Java KvikIO bindings for GDS currently support only synchronous read and write IO operations using the underlying CuFile API. Batch IO and asynchronous operations are not yet supported.

Dependencies
The Java KvikIO bindings have been developed to work on Linux-based systems and require CUDA to be installed and GDS to be properly enabled. Instructions for how to install and enable GDS can be found on NVIDIA's website. To compile the shared library it is also necessary to have a JDK installed. To run the included example, it is also necessary to install JCuda, as it is used to handle memory allocations and the transfer of data between host and GPU memory. JCuda jar files supporting CUDA 12.x can be found here:
https://repo1.maven.org/maven2/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar
https://repo1.maven.org/maven2/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar

Compilation
To recompile the .so file for your local system, run the following command. Note: update the command to reflect the directories where you have installed CUDA and your JDK.
Let's use markdown formatting here, so that this will look a bit better when rendered.
# Java Bindings
## Summary
These Java KvikIO bindings for GDS currently support only synchronous read and write IO operations using the underlying CuFile API. Batch IO and asynchronous operations are not yet supported.
## Dependencies
The Java KvikIO bindings have been developed to work on Linux-based systems and require CUDA to be installed and GDS to be properly enabled. Instructions for how to install and enable GDS can be found on NVIDIA's website. To compile the shared library it is also necessary to have a JDK installed. To run the included example, it is also necessary to install JCuda, as it is used to handle memory allocations and the transfer of data between host and GPU memory. JCuda jar files supporting CUDA 12.x can be found here:
https://repo1.maven.org/maven2/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar
https://repo1.maven.org/maven2/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar
## Compilation
To recompile the .so file for your local system, run the following command. Note: update the command to reflect the directories where you have installed CUDA and your JDK.
(See https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for further syntax).
java/README.md
Outdated
https://repo1.maven.org/maven2/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar
https://repo1.maven.org/maven2/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar
suggestion: Prefer not to link to explicit files, since these links can easily go out of date without us remembering to update them. For preference, can we link to an appropriate installation document for JCuda?
Thanks for the suggestion. I can link to the installation instructions as well, but I found that those instructions are themselves a bit unclear and potentially out of date. I may keep these links but add a note to confirm they are still up to date, and reference the installation instructions.
I am a bit confused by these bindings. They look to be bindings to the cufile API, and not the kvikio API. Is that deliberate? If so, it seems one will have to replicate here a lot of the implementation that already exists in kvikio.
You are correct, these are directly bound to the underlying cufile API. In the initial discussions with the rest of the GDS team there was some concern about potential performance overhead of Java bindings -> C++ bindings -> C API. I have not investigated the overhead yet, but definitely see the benefit you are suggesting to sharing one underlying set of bindings. If you don't mind, I will add an issue to my backlog to investigate the potential performance differences and, if they are minimal, update these to point to the C++ bindings at a later date.
Sounds good, let's continue with raw Java bindings for now.
Side note: I would be very surprised if the basic C++ API introduced any significant overhead. E.g. the overhead of calling read() vs. calling cuFileRead() is tiny.
/ok to test
A few high level comments before digging too deep into this code:
- I have a weak preference to wrap the C++ bindings rather than re-implement kvikio in Java. This will probably be easier than reimplementing the code as new features are introduced to kvikio, and gives more confidence in testing (bindings are probably less bug-prone than a full reimplementation).
- This will need to pass style checks. See the cuDF guide on Code Formatting. The same guidance applies to using pre-commit with kvikio development.
- This will need a build system, probably Maven. Long commands in the README that explain how to supply all the dependencies are not a good solution if those dependencies can be fetched and built automatically. See how cuDF's Java bindings use Maven, for instance: https://github.com/rapidsai/cudf/tree/branch-24.08/java
  - The end result should be that commands like mvn clean install work properly.
- This will need CI builds and tests. Review the following resources on cuDF's Java CI and try to duplicate them:
  - GitHub Action Job for PRs
  - GitHub Action Job for Nightly Tests
  - Java build/test script
    - This uses conda to supply the libcudf package, and builds the Java bindings from that. If you use direct bindings and do not depend on the kvikio C++ library, then you can do something simpler that only requires installing cuFile from conda.
  - Java dependencies.yaml file key
    - This file key gives a list of what dependency lists to include. It is referenced by the CI build/test script and should include the dependency list cuda, so you get the cufile package. See the rapids-dependency-file-generator README for more information.
  - Java dependencies.yaml list
    - This dependency list should probably include maven and openjdk. Not sure what else.
Give this a start and please feel free to schedule time with me to discuss anything that is unclear!
Is this still planned for 24.08 or should we move to 24.10?
I likely will not finish the updates here for a couple of weeks due to other priorities, so this should be moved to 24.10.
Ok thanks Alex! 🙏 Have moved to 24.10 |
Quick first pass of review -- mostly focused on packaging and CI. Give my suggestions a try and I'll comment /ok to test to run your next commits on CI.
java/README.md
Outdated
#### Run example

cd kvikio/java/
java -cp target/cufile-24.10.0-SNAPSHOT.jar:$HOME/.m2/repository/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar:$HOME/.m2/repository/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar -Djava.library.path=./target bindings.kvikio.example.Main
cufile-24.10.0-SNAPSHOT.jar: cuFile is not versioned like RAPIDS. Is this correct?
Regarding the version: the generated jar is not aware of which version of libcufile was used to build it; that just depends on which version of the CUDA toolkit the host machine has installed. It appears I can't make the jar have no version number at all, so I'm inclined to have the version represent the version of kvikio. If you think there's a way I can inject the version of libcufile into Maven at runtime I am open to that, but I have not found any information on how that would be done so far.
/ok to test
1 similar comment
/ok to test
java/pom.xml
Outdated
<configuration>
  <target>
    <!-- Compile native code using nvcc -->
    <exec executable="/usr/local/cuda/bin/nvcc">
@aslobodaNV The first real failure: this path is not the correct path to nvcc in the CI environment, which installed cuda-nvcc with conda. Can you just call nvcc?

Suggested change:
- <exec executable="/usr/local/cuda/bin/nvcc">
+ <exec executable="nvcc">
Error: Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:3.0.0:run (compile-native-code) on project cufile: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "/usr/local/cuda/bin/nvcc" (in directory "/__w/kvikio/kvikio/java"): error=2, No such file or directory
Error: around Ant part ...<exec executable="/usr/local/cuda/bin/nvcc">... @ 4:49 in /__w/kvikio/kvikio/java/target/antrun/build-main.xml
Similarly, the include paths should be located in the conda environment. The CUDA packages use a special "targets" layout. For example, libcufile-dev will install cufile.h into ${CONDA_PREFIX}/targets/x86_64-linux/include/cufile.h.
The openjdk package installs some headers into ${CONDA_PREFIX}/include and some into ${CONDA_PREFIX}/include/linux.
CMake knows how to detect and parse this layout of the CUDA Toolkit, but I do not know how Maven or other build systems are supposed to consume it. I would ask the Spark team, perhaps they know: @revans2, do you have insight here?
Totally fine with committing the update to just call nvcc instead of the full path; I just didn't happen to have nvcc on my PATH when I wrote that. Let me look into the include paths and reach out to Bobby if I get stuck.
dependencies.yaml
Outdated
- output_types: conda
  packages:
    - maven
    - openjdk=8.*
Is there a reason to use openjdk=8.*? It appears that the latest is openjdk=22.*.
No particular reason, I just happened to have 8 installed and used that. I will update to 22
JDK8 is very much a least common denominator. It is no longer supported without paying a lot of money to Oracle, though:
https://www.oracle.com/java/technologies/java-se-support-roadmap.html
and Oracle is trying very hard to push people away from older versions of Java.
Java 11, 17, and 21 are long-term support versions that are currently GA.
Java generally has really good backwards compatibility, so almost anything compiled with JDK8 can run on anything newer. Newer code is not backwards compatible with older runtimes, but you can ask newer compilers to output binary code that is compatible with older JDK versions. That is what we do with the Spark plugin.
We also support whatever Spark supports for their default binary release. For older versions of Spark that is JDK8. For some of the latest versions it is JDK17, but because of how we ask the compiler to output older binary versions, it really becomes a requirement that they have at least that version of Java installed.
I wouldn't update to JDK22 unless you also decide which version of Java you want to be minimally compatible with and ask the compiler to output compatible .class files.
Also, another point to think about: jcuda still uses openjdk8, probably to have as much compatibility as possible. https://github.com/jcuda/jcuda-main/blob/master/BUILDING.md
Thanks for that context, that makes a lot of sense. I have not been following the broader Java versioning ecosystem so was unaware of much of this. I think it is likely we will want to pick one of the versions with long term support, I would guess 17 or 21, but will think on it some more before making an update.
Overall it looks good, just a few nits, mostly around Java convention. I am curious about your plans to distribute binaries. The Java code itself is going to require the native library to work, but the native library is something the user is expected to compile themselves. This is not a great user experience, but it is a real pain to try to distribute native binaries too. If you want/need some help with this, there are some people on my team who have spent a lot of time making this work well, especially for x86 Linux.
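For reference, one common pattern for this (a rough sketch under assumed library and resource names, similar in spirit to how the cuDF Java bindings load native dependencies, not something this PR implements) is to try System.loadLibrary first and fall back to extracting a bundled .so from the jar:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch only: "cufilejni" and the resource path are assumptions, not the actual artifact layout.
final class NativeLibraryLoader {
  static void load() {
    try {
      // Works when the user built/installed the .so themselves and it is on java.library.path.
      System.loadLibrary("cufilejni");
    } catch (UnsatisfiedLinkError e) {
      try (InputStream in = NativeLibraryLoader.class
              .getResourceAsStream("/native/linux-x86_64/libcufilejni.so")) {
        if (in == null) {
          throw e; // nothing bundled for this platform; surface the original error
        }
        // Extract the bundled library to a temporary file and load it from there.
        Path tmp = Files.createTempFile("libcufilejni", ".so");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        System.load(tmp.toAbsolutePath().toString());
      } catch (Exception ex) {
        throw new RuntimeException("Failed to load bundled cuFile JNI library", ex);
      }
    }
  }
}
```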
driver.close();
}));
initialized = true;
} catch (Throwable t) {
nit: catching Throwable is not ideal. It can include all kinds of unrecoverable errors. At a minimum you probably want to print out why this could not be initialized in the error message. That way you can debug that a dependency was missing/etc.
Would you be able to point me to an example of best practices in this area? Happy to change it further from my latest update.
I marked it as a nit because it is probably fine in this case. In fact, when I went back to look at the CUDF code, I did the exact same thing 5 years ago.
It is just generally not good practice to catch a Throwable, because Throwable includes all Errors, and Errors are generally considered to not be recoverable. But really the main thing here is giving the user enough information that they can debug why it is failing. You are printing the error message, but the full stack trace would be better. That is the main thing that I would suggest you do.
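A minimal sketch of the suggested error handling (the class and native library names here are assumptions for illustration, not the actual kvikio sources):

```java
// Sketch only: illustrates printing the full stack trace when initialization fails.
final class CuFileInitializer {
  private static boolean initialized = false;

  static synchronized void initialize() {
    if (initialized) {
      return;
    }
    try {
      System.loadLibrary("CuFileJNI"); // assumed native library name
      // ... open the cuFile driver, register a shutdown hook that closes it, etc. ...
      initialized = true;
    } catch (Throwable t) {
      // If Throwable must be caught, at least surface the full stack trace so a
      // missing dependency (e.g. libcufile.so not found) can be diagnosed.
      System.err.println("Could not initialize cuFile JNI bindings: " + t.getMessage());
      t.printStackTrace();
    }
  }
}
```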
package bindings.kvikio.cufile;

abstract class CuFileHandle implements AutoCloseable {
It might be nice to have some javadocs here.
I will be sure to get these javadocs that you have requested in before we merge, just wanted to tackle the build issues we were seeing first. Thank you for all your feedback.
package bindings.kvikio.cufile;

public final class CuFileReadHandle extends CuFileHandle {
javadocs would be very nice for this class as it is intended to be used by developers.
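As a sketch of the kind of javadoc being requested (the constructor and read(...) signatures below are assumptions made for illustration and may not match the actual class):

```java
package bindings.kvikio.cufile;

/**
 * File handle for reading data from a file into GPU memory via GPUDirect Storage.
 *
 * <p>The handle owns native cuFile resources and must be closed when no longer
 * needed; since the base class implements {@link AutoCloseable}, try-with-resources
 * is the easiest way to do that.</p>
 */
public final class CuFileReadHandle extends CuFileHandle {

  /**
   * Opens the file at {@code path} for GDS reads.
   *
   * @param path path to an existing file on a GDS-capable filesystem
   */
  public CuFileReadHandle(final String path) {
    super(path); // assumed base-class constructor
  }

  /**
   * Reads {@code size} bytes starting at {@code fileOffset} into device memory.
   *
   * @param deviceAddress raw device pointer to the destination buffer
   * @param size          number of bytes to read
   * @param fileOffset    byte offset into the file
   * @param deviceOffset  byte offset into the device buffer
   */
  public void read(final long deviceAddress, final long size,
                   final long fileOffset, final long deviceOffset) {
    // ... call into the native binding ...
  }
}
```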
package bindings.kvikio.cufile;

public final class CuFileWriteHandle extends CuFileHandle {
Same here, javadocs would be very nice for the developer experience.
I forgot to add that having the APIs take a raw pointer is far from ideal. It might be nice to have a separate library that is compiled against JCuda and the kvikio jar so that you can hide the scary part of pulling the native address out of a …
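As a sketch of that idea, a thin helper compiled against both JCuda and the kvikio jar could hide the address extraction from callers. This assumes JCuda device pointers expose their native address via getNativePointer() and assumes a write(address, size, fileOffset, bufferOffset) signature on the handle; both are assumptions made only for illustration.

```java
import jcuda.driver.CUdeviceptr;
import bindings.kvikio.cufile.CuFileWriteHandle;

/** Hypothetical convenience layer that keeps raw device addresses out of user code. */
public final class JCudaCuFileWriter {
  private JCudaCuFileWriter() {}

  /** Writes {@code size} bytes from a JCuda device buffer to {@code fileOffset}. */
  public static void write(final CuFileWriteHandle handle, final CUdeviceptr buffer,
                           final long size, final long fileOffset) {
    // Assumed: getNativePointer() returns the underlying device address as a long.
    final long deviceAddress = buffer.getNativePointer();
    // Assumed signature: write(deviceAddress, size, fileOffset, bufferOffset).
    handle.write(deviceAddress, size, fileOffset, 0);
  }
}
```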
Generally it looks really good. I am not going to approve it because I don't know enough about cufile to be able to say that this matches all of the requirements. But from a java perspective it looks decent.
java/pom.xml
Outdated
<version>3.8.0</version>
<configuration>
  <source>21</source>
  <target>21</target>
This is where you can set the output target that you want.
java/src/main/native/CMakeLists.txt
Outdated
@@ -0,0 +1,21 @@
cmake_minimum_required(VERSION 3.10)
RAPIDS requires a minimum of cmake 3.26.4 right now, I think. Can we set a higher minimum for these bindings? I don't think we will ever test with a CMake version as old as 3.10.
/ok to test
/ok to test
/ok to test
/ok to test
From the java side of things it looks good. I am not an expert on the APIs or all of the use cases, but the plumbing looks decent.
@aslobodaNV Looks like there are CI build failures. I'm not sure how to diagnose this.
/ok to test
This PR adds initial support for Java bindings to GDS as part of the KvikIO library. It contains the minimal set of bindings required to support synchronous read and write IO operations via GDS, as well as a single example that demonstrates how the bindings can be used alongside other CUDA libraries, such as JCuda. Full support for the GDS CuFile API, including batch and asynchronous IO, has not yet been implemented, and more sophisticated error/exception handling is not yet in place. A README located within kvikio/java details how this new functionality can be compiled and built locally, along with detailed instructions on how to run the included usage example.
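For readers skimming the thread, a rough sketch of the usage pattern the included example demonstrates might look like the following. The binding method signatures, the file path, and the JCuda getNativePointer() call are assumptions for illustration only; the real example lives under kvikio/java.

```java
import jcuda.Pointer;
import jcuda.runtime.JCuda;
import bindings.kvikio.cufile.CuFileReadHandle;
import bindings.kvikio.cufile.CuFileWriteHandle;

public class ExampleSketch {
  public static void main(String[] args) throws Exception {
    final long size = 1024L * 1024L; // 1 MiB

    // Allocate GPU memory with JCuda.
    Pointer devBuffer = new Pointer();
    JCuda.cudaMalloc(devBuffer, size);

    // Write the device buffer to a file through GDS, then read it back.
    try (CuFileWriteHandle w = new CuFileWriteHandle("/tmp/kvikio-example.bin")) {
      w.write(devBuffer.getNativePointer(), size, 0, 0); // assumed signature
    }
    try (CuFileReadHandle r = new CuFileReadHandle("/tmp/kvikio-example.bin")) {
      r.read(devBuffer.getNativePointer(), size, 0, 0); // assumed signature
    }

    JCuda.cudaFree(devBuffer);
  }
}
```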