Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0 #32415

Closed
wants to merge 16 commits into from
Closed
Show file tree
Hide file tree
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion LICENSE-binary
Original file line number Diff line number Diff line change
Expand Up @@ -456,7 +456,6 @@ org.antlr:ST4
org.antlr:stringtemplate
org.antlr:antlr4-runtime
antlr:antlr
com.github.fommil.netlib:core
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, this wasn't in the dev/deps files to begin with. Not totally sure why, but, yes then it only needs to come out here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You bring up a good point: com.github.fommil.netlib is still a transient dependency of Spark through org.scalanlp:breeze. I already submitted and merged a PR there to replace the dependency (scalanlp/breeze#811), but it hasn't been released yet.

For the sake of not regressing performance, let me re-add the dependency on com.github.fommil.netlib:all in pom.xml when enabling-Pnetlib-lgpl and leave a comment to explain when we can remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, breeze uses it. OK yes let's keep this aspect then.

com.google.re2j:re2j
com.thoughtworks.paranamer:paranamer
org.scala-lang:scala-compiler
Expand Down Expand Up @@ -498,6 +497,9 @@ org.slf4j:jul-to-slf4j
org.slf4j:slf4j-api
org.slf4j:slf4j-log4j12
com.github.scopt:scopt_2.12
dev.ludovic.netlib:blas
dev.ludovic.netlib:arpack
dev.ludovic.netlib:lapack

core/src/main/resources/org/apache/spark/ui/static/dagre-d3.min.js
core/src/main/resources/org/apache/spark/ui/static/*dataTables*
Expand Down
6 changes: 3 additions & 3 deletions dev/deps/spark-deps-hadoop-2.7-hive-2.3
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar
api-util/1.0.0-M20//api-util-1.0.0-M20.jar
arpack/1.3.2//arpack-1.3.2.jar
arpack/2.0.0//arpack-2.0.0.jar
arpack_combined_all/0.1//arpack_combined_all-0.1.jar
arrow-format/2.0.0//arrow-format-2.0.0.jar
arrow-memory-core/2.0.0//arrow-memory-core-2.0.0.jar
Expand All @@ -26,7 +26,7 @@ automaton/1.11-8//automaton-1.11-8.jar
avro-ipc/1.10.2//avro-ipc-1.10.2.jar
avro-mapred/1.10.2//avro-mapred-1.10.2.jar
avro/1.10.2//avro-1.10.2.jar
blas/1.3.2//blas-1.3.2.jar
blas/2.0.0//blas-2.0.0.jar
bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar
breeze-macros_2.12/1.0//breeze-macros_2.12-1.0.jar
breeze_2.12/1.0//breeze_2.12-1.0.jar
Expand Down Expand Up @@ -175,7 +175,7 @@ kubernetes-model-policy/5.3.0//kubernetes-model-policy-5.3.0.jar
kubernetes-model-rbac/5.3.0//kubernetes-model-rbac-5.3.0.jar
kubernetes-model-scheduling/5.3.0//kubernetes-model-scheduling-5.3.0.jar
kubernetes-model-storageclass/5.3.0//kubernetes-model-storageclass-5.3.0.jar
lapack/1.3.2//lapack-1.3.2.jar
lapack/2.0.0//lapack-2.0.0.jar
leveldbjni-all/1.8//leveldbjni-all-1.8.jar
libfb303/0.9.3//libfb303-0.9.3.jar
libthrift/0.12.0//libthrift-0.12.0.jar
Expand Down
6 changes: 3 additions & 3 deletions dev/deps/spark-deps-hadoop-3.2-hive-2.3
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ annotations/17.0.0//annotations-17.0.0.jar
antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar
antlr4-runtime/4.8-1//antlr4-runtime-4.8-1.jar
aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar
arpack/1.3.2//arpack-1.3.2.jar
arpack/2.0.0//arpack-2.0.0.jar
arpack_combined_all/0.1//arpack_combined_all-0.1.jar
arrow-format/2.0.0//arrow-format-2.0.0.jar
arrow-memory-core/2.0.0//arrow-memory-core-2.0.0.jar
Expand All @@ -21,7 +21,7 @@ automaton/1.11-8//automaton-1.11-8.jar
avro-ipc/1.10.2//avro-ipc-1.10.2.jar
avro-mapred/1.10.2//avro-mapred-1.10.2.jar
avro/1.10.2//avro-1.10.2.jar
blas/1.3.2//blas-1.3.2.jar
blas/2.0.0//blas-2.0.0.jar
bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar
breeze-macros_2.12/1.0//breeze-macros_2.12-1.0.jar
breeze_2.12/1.0//breeze_2.12-1.0.jar
Expand Down Expand Up @@ -146,7 +146,7 @@ kubernetes-model-policy/5.3.0//kubernetes-model-policy-5.3.0.jar
kubernetes-model-rbac/5.3.0//kubernetes-model-rbac-5.3.0.jar
kubernetes-model-scheduling/5.3.0//kubernetes-model-scheduling-5.3.0.jar
kubernetes-model-storageclass/5.3.0//kubernetes-model-storageclass-5.3.0.jar
lapack/1.3.2//lapack-1.3.2.jar
lapack/2.0.0//lapack-2.0.0.jar
leveldbjni-all/1.8//leveldbjni-all-1.8.jar
libfb303/0.9.3//libfb303-0.9.3.jar
libthrift/0.12.0//libthrift-0.12.0.jar
Expand Down
7 changes: 3 additions & 4 deletions docs/ml-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,12 +62,11 @@ The primary Machine Learning API for Spark is now the [DataFrame](sql-programmin

# Dependencies

MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/) and [netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing[^1]. Those packages may call native acceleration libraries such as [Intel MKL](https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html) or [OpenBLAS](http://www.openblas.net) if they are available as system libraries or in runtime library paths.
MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/), [dev.ludovic.netlib](https://github.com/luhenry/netlib), and [netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing[^1]. Those packages may call native acceleration libraries such as [Intel MKL](https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html) or [OpenBLAS](http://www.openblas.net) if they are available as system libraries or in runtime library paths.

Due to differing OSS licenses, `netlib-java`'s native proxies can't be distributed with Spark. See [MLlib Linear Algebra Acceleration Guide](ml-linalg-guide.html) for how to enable accelerated linear algebra processing. If accelerated native libraries are not enabled, you will see a warning message like below and a pure JVM implementation will be used instead:
However, native acceleration libraries can't be distributed with Spark. See [MLlib Linear Algebra Acceleration Guide](ml-linalg-guide.html) for how to enable accelerated linear algebra processing. If accelerated native libraries are not enabled, you will see a warning message like below and a pure JVM implementation will be used instead:
```
WARN BLAS: Failed to load implementation from:com.github.fommil.netlib.NativeSystemBLAS
WARN BLAS: Failed to load implementation from:com.github.fommil.netlib.NativeRefBLAS
WARN BLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
```

To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
Expand Down
36 changes: 13 additions & 23 deletions docs/ml-linalg-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,29 +21,15 @@ license: |

This guide provides necessary information to enable accelerated linear algebra processing for Spark MLlib.

Spark MLlib defines Vector and Matrix as basic data types for machine learning algorithms. On top of them, [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and [LAPACK](https://en.wikipedia.org/wiki/LAPACK) operations are implemented and supported by [netlib-java](https://github.com/fommil/netlib-Java) (the algorithms may call [Breeze](https://github.com/scalanlp/breeze) and it will in turn call `netlib-java`). `netlib-java` can use optimized native linear algebra libraries (refered to as "native libraries" or "BLAS libraries" hereafter) for faster numerical processing. [Intel MKL](https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html) and [OpenBLAS](http://www.openblas.net) are two popular ones.
Spark MLlib defines Vector and Matrix as basic data types for machine learning algorithms. On top of them, [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and [LAPACK](https://en.wikipedia.org/wiki/LAPACK) operations are implemented and supported by [dev.ludovic.netlib](https://github.com/luhenry/netlib) (the algorithms may also call [Breeze](https://github.com/scalanlp/breeze)). `dev.ludovic.netlib` can use optimized native linear algebra libraries (refered to as "native libraries" or "BLAS libraries" hereafter) for faster numerical processing. [Intel MKL](https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html) and [OpenBLAS](http://www.openblas.net) are two popular ones.

However due to license differences, the official released Spark binaries by default don't contain native libraries support for `netlib-java`.
The official released Spark binaries don't contain these native libraries.

The following sections describe how to enable `netlib-java` with native libraries support for Spark MLlib and how to install native libraries and configure them properly.

## Enable `netlib-java` with native library proxies

`netlib-java` depends on `libgfortran`. It requires GFORTRAN 1.4 or above. This can be obtained by installing `libgfortran` package. After installation, the following command can be used to verify if it is installed properly.
```
strings /path/to/libgfortran.so.3.0.0 | grep GFORTRAN_1.4
```

To build Spark with `netlib-java` native library proxies, you need to add `-Pnetlib-lgpl` to Maven build command line. For example:
```
$SPARK_SOURCE_HOME/build/mvn -Pnetlib-lgpl -DskipTests -Pyarn -Phadoop-2.7 clean package
```

If you only want to enable it in your project, include `com.github.fommil.netlib:all:1.1.2` as a dependency of your project.
The following sections describe how to install native libraries, configure them properly, and how to point `dev.ludovic.netlib` to these native libraries.

## Install native linear algebra libraries

Intel MKL and OpenBLAS are two popular native linear algebra libraries. You can choose one of them based on your preference. We provide basic instructions as below. You can refer to [netlib-java documentation](https://github.com/fommil/netlib-java) for more advanced installation instructions.
Intel MKL and OpenBLAS are two popular native linear algebra libraries. You can choose one of them based on your preference. We provide basic instructions as below.

### Intel MKL

Expand Down Expand Up @@ -72,16 +58,20 @@ sudo yum install openblas

To verify native libraries are properly loaded, start `spark-shell` and run the following code:
```
scala> import com.github.fommil.netlib.BLAS;
scala> System.out.println(BLAS.getInstance().getClass().getName());
scala> import dev.ludovic.netlib.NativeBLAS
scala> NativeBLAS.getInstance()
```

If they are correctly loaded, it should print `com.github.fommil.netlib.NativeSystemBLAS`. Otherwise the warnings should be printed:
If they are correctly loaded, it should print `dev.ludovic.netlib.NativeBLAS = dev.ludovic.netlib.blas.JNIBLAS@...`. Otherwise the warnings should be printed:
```
WARN BLAS: Failed to load implementation from:com.github.fommil.netlib.NativeSystemBLAS
WARN BLAS: Failed to load implementation from:com.github.fommil.netlib.NativeRefBLAS
WARN NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
java.lang.RuntimeException: Unable to load native implementation
at dev.ludovic.netlib.NativeBLAS.getInstance(NativeBLAS.java:44)
...
```

You can also point `dev.ludovic.netlib` to specific libraries names and paths. For example, `-Ddev.ludovic.netlib.blas.nativeLib=libmkl_rt.so` or `-Ddev.ludovic.netlib.blas.nativeLibPath=$MKLROOT/lib/intel64/libmkl_rt.so` for Intel MKL. You have similar parameters for LAPACK and ARPACK: `-Ddev.ludovic.netlib.lapack.nativeLib=...`, `-Ddev.ludovic.netlib.lapack.nativeLibPath=...`, `-Ddev.ludovic.netlib.arpack.nativeLib=...`, and `-Ddev.ludovic.netlib.arpack.nativeLibPath=...`.

If native libraries are not properly configured in the system, the Java implementation (javaBLAS) will be used as fallback option.

## Spark Configuration
Expand Down
13 changes: 0 additions & 13 deletions mllib-local/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -81,19 +81,6 @@
<artifactId>blas</artifactId>
</dependency>
</dependencies>
<profiles>
<profile>
<id>netlib-lgpl</id>
<dependencies>
<dependency>
<groupId>com.github.fommil.netlib</groupId>
<artifactId>all</artifactId>
<version>${netlib.java.version}</version>
<type>pom</type>
</dependency>
</dependencies>
</profile>
</profiles>
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
Expand Down
Loading