
[SPARK-12579][SQL] Force user-specified JDBC driver to take precedence #10519

Closed
JoshRosen wants to merge 1 commit into apache:master from JoshRosen:jdbc-driver-precedence

Conversation

JoshRosen
Contributor

Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not actually be used when it comes time to create a JDBC connection.

In a nutshell, the problem is that multiple JDBC drivers on the classpath may claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.

This patch addresses the issue by first registering the user-specified driver with the `DriverManager`, then iterating over the driver manager's loaded drivers to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).

If a user did not specify a JDBC driver, we call `DriverManager.getDriver` to determine the driver class to use and then pass that class's name to executors. This guards against corner-case bugs where the driver and executor JVMs have different sets of JDBC drivers on their classpaths (previously, `DriverManager.getConnection()` could, in rare cases, pick different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths differed).

This patch is inspired by a similar patch that I made to the `spark-redshift` library (databricks/spark-redshift#143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
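For illustration, here is a minimal sketch of the lookup strategy described above. The name `makeConnectionFactory` and the use of `Class.forName` in place of Spark's internal `DriverRegistry` are assumptions made for a self-contained example; the actual patch also unwraps Spark's internal `DriverWrapper`, as discussed in the review comments below.

```scala
import java.sql.{Connection, Driver, DriverManager}
import java.util.Properties

import scala.collection.JavaConverters._

// Builds a zero-argument connection factory. The returned closure is what gets
// shipped to executors, where it is invoked to open connections.
def makeConnectionFactory(url: String, props: Properties): () => Connection = {
  val userSpecifiedDriverClass = Option(props.getProperty("driver"))
  // Make sure the user's driver class is loaded (and thus registered with DriverManager).
  userSpecifiedDriverClass.foreach(c => Class.forName(c))
  // If no driver was specified, resolve one here on the Spark driver so that all
  // executors agree on the same class even if their classpaths differ.
  val driverClass: String = userSpecifiedDriverClass.getOrElse {
    DriverManager.getDriver(url).getClass.getCanonicalName
  }
  () => {
    // Executor side: re-register, then pick the registered driver whose class name
    // matches, instead of letting DriverManager pick whichever driver accepts the URL first.
    userSpecifiedDriverClass.foreach(c => Class.forName(c))
    val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
      case d if d.getClass.getCanonicalName == driverClass => d
    }.getOrElse {
      throw new IllegalStateException(s"Did not find registered driver with class $driverClass")
    }
    driver.connect(url, props)
  }
}
```

Scanning `DriverManager.getDrivers` and matching on the class name is what gives the user-specified driver precedence; `DriverManager.getConnection()` would otherwise hand the URL to whichever registered driver accepts it first.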

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48446 has finished for PR 10519 at commit 3554d68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

I'd appreciate any feedback on how we can/should test this change and prevent this behavior from regressing in the future.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@JoshRosen
Contributor Author

/cc @yhuai @marmbrus or @rxin for a review pass.

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48647 has finished for PR 10519 at commit 3554d68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      DriverManager.getDriver(url).getClass.getCanonicalName
    }
    () => {
      userSpecifiedDriverClass.foreach(DriverRegistry.register)
Contributor

This is the one that registers the right driver on the executor side, right?

Contributor Author

Yep, that's right: this function gets shipped to executors, where it's called to create connections.
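For instance (a hypothetical usage sketch reusing the `makeConnectionFactory` name from the sketch in the PR description above; `rdd`, `url`, and `props` are assumed to be in scope):

```scala
// Built once on the Spark driver; the closure only captures the resolved
// driver class name and the connection properties.
val getConnection: () => java.sql.Connection = makeConnectionFactory(url, props)

rdd.foreachPartition { rows =>
  // Runs on an executor: the shipped closure opens the connection here.
  val conn = getConnection()
  try {
    rows.foreach(row => ()) // placeholder: read from or write to the database using conn
  } finally {
    conn.close()
  }
}
```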

@yhuai
Contributor

yhuai commented Jan 4, 2016

Looks good to me. Just have a quick clarification question.

    () => {
      userSpecifiedDriverClass.foreach(DriverRegistry.register)
      val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
        case d: DriverWrapper if d.wrapped.getClass.getCanonicalName == driverClass => d
Contributor Author

This is the only real bit of trickiness here and was the part that was missing from my spark-redshift patch.
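Spelled out (a sketch built around the diff fragment above, assuming the surrounding `driverClass`, `url`, and `properties` values; the fallback case and the error message are paraphrased): the match has to look at the class the `DriverWrapper` wraps, because the class actually registered with `DriverManager` is Spark's wrapper rather than the user's driver.

```scala
val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
  // Spark's DriverWrapper makes drivers loaded outside the root classloader
  // visible to DriverManager, so match on the wrapped driver's class name.
  case d: DriverWrapper if d.wrapped.getClass.getCanonicalName == driverClass => d
  // Drivers that were already visible to the root classloader are registered directly.
  case d if d.getClass.getCanonicalName == driverClass => d
}.getOrElse {
  throw new IllegalStateException(s"Did not find registered driver with class $driverClass")
}
driver.connect(url, properties)
```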

@yhuai
Contributor

yhuai commented Jan 4, 2016

LGTM

@yhuai
Contributor

yhuai commented Jan 4, 2016

Merging to master.

@asfgit closed this in 6c83d93 on Jan 4, 2016
@JoshRosen deleted the jdbc-driver-precedence branch on January 4, 2016 at 18:44
@yhuai
Contributor

yhuai commented Jan 4, 2016

Also merging to branch 1.6.

asfgit pushed a commit that referenced this pull request Jan 4, 2016

Author: Josh Rosen <[email protected]>

Closes #10519 from JoshRosen/jdbc-driver-precedence.

(cherry picked from commit 6c83d93)
Signed-off-by: Yin Huai <[email protected]>
JoshRosen added a commit to databricks/spark-redshift that referenced this pull request Jan 6, 2016
…etConnector()

This is a followup to #143 which fixes a corner-case bug in our scanning of registered JDBC drivers: we need to properly handle Spark's `DriverWrapper` drivers, which are used to wrap JDBC drivers in order to make them accessible from the root classloader so that the `DriverManager` can find them.
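For context, the wrapper pattern described here looks roughly like the following (a hypothetical `WrappedDriver` standing in for Spark's internal `DriverWrapper`): the wrapper class itself is visible to the root classloader, so `DriverManager` will accept and use it, while every call is delegated to a driver that was loaded elsewhere.

```scala
import java.sql.{Connection, Driver, DriverPropertyInfo}
import java.util.Properties
import java.util.logging.Logger

// Hypothetical stand-in for Spark's DriverWrapper: delegate every
// java.sql.Driver method to the wrapped driver instance.
class WrappedDriver(val wrapped: Driver) extends Driver {
  override def acceptsURL(url: String): Boolean = wrapped.acceptsURL(url)
  override def connect(url: String, info: Properties): Connection = wrapped.connect(url, info)
  override def getMajorVersion: Int = wrapped.getMajorVersion
  override def getMinorVersion: Int = wrapped.getMinorVersion
  override def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
    wrapped.getPropertyInfo(url, info)
  override def jdbcCompliant(): Boolean = wrapped.jdbcCompliant()
  override def getParentLogger: Logger = wrapped.getParentLogger
}
```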

A simpler, reflection-free version of this change was incorporated into apache/spark#10519

Author: Josh Rosen <[email protected]>

Closes #147 from JoshRosen/jdbc-driver-precedence-round-2.