[Spark] Update sbt testing task of Delta Connect Client to automatically use the latest Spark jars #3670
base: master
Conversation
TEST_PARALLELISM_COUNT=4 build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean connectServer/assembly connectClient/test
TEST_PARALLELISM_COUNT=4 build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean connectServer/test
Temporary change to test connectClient, because Delta integration with Spark master is broken at the moment, so we cannot test connectServer and spark.
val metadataUrl = new URL(latestSparkComponentJarDir + "maven-metadata.xml")

// Fetch and parse the maven-metadata.xml file.
val metadataXml = XML.load(metadataUrl)
For some reason, the XML that we download is incomplete: it's missing the snapshotVersions section that appears in https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-connect_2.13/4.0.0-SNAPSHOT/maven-metadata.xml. I'm not sure why this is the case. I tried using the other `load` APIs in https://javadoc.io/doc/org.scala-lang.modules/scala-xml_2.13/latest/scala/xml/XML$.html, and I'm not sure about changing the version of the scala-xml library in https://github.com/delta-io/delta/blob/master/project/plugins.sbt#L46.
To get the latest jar name, we can either:
1. Extract the `timestamp` and `buildNumber`, and combine them together.
2. Extract the `value` nested in `snapshotVersion`.
I think way 1 works just as well as way 2, so I decided not to spend more time trying to fix the `metadataUrl` fetching to retrieve the full file.
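For reference, a minimal sketch of way 1, using the same scala-xml `XML.load` call; the directory URL and component name here are illustrative, not the exact build code:

```scala
import java.net.URL
import scala.xml.XML

// Illustrative component directory; the build computes one per Spark component.
val latestSparkComponentJarDir =
  "https://repository.apache.org/content/groups/snapshots/org/apache/spark/" +
    "spark-connect_2.13/4.0.0-SNAPSHOT/"
val metadataXml = XML.load(new URL(latestSparkComponentJarDir + "maven-metadata.xml"))

// <versioning><snapshot> carries the timestamp and build number of the latest build,
// and is present even when snapshotVersions is missing.
val timestamp = (metadataXml \\ "timestamp").text     // e.g. "20240904.120142"
val buildNumber = (metadataXml \\ "buildNumber").text // e.g. "123"

// Maven publishes snapshot artifacts as <artifact>-<version>-<timestamp>-<buildNumber>.jar.
val latestJarName = s"spark-connect_2.13-4.0.0-$timestamp-$buildNumber.jar"
```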
    files
  } else {
    destDir.get()
  }
}.taskValue,
(Test / resourceGenerators) += Def.task {
  val src = url("https://repository.apache.org/content/groups/public/org/apache/spark/spark-connect_2.13/4.0.0-preview1/spark-connect_2.13-4.0.0-preview1.jar")
  val dest = (Test / resourceManaged).value / "spark-connect.jar"
  val sparkConnectComponentName = "spark-connect_2.13"
We define sparkComponentName on line 409 to have the suffix `_2.13`, so I'm also changing this to be consistent.
@@ -317,7 +367,6 @@ lazy val connectClient = (project in file("spark-connect/client"))
   val destDir = (Test / resourceManaged).value / "spark"
   if (!destDir.exists()) {
     IO.createDirectory(destDir)
-    val files = mutable.Buffer.empty[File]
No need to maintain a list of files; we write to the directory directly.
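Roughly the shape of that simplification, as a sketch of the task body; `jarUrls` is an assumed placeholder for the resolved nightly jar URLs, not the actual PR code:

```scala
import java.net.URL
import java.nio.file.Files
import sbt._
import sbt.Keys._

(Test / resourceGenerators) += Def.task {
  val destDir = (Test / resourceManaged).value / "spark"
  val jarUrls: Seq[URL] = Seq() // placeholder: resolved nightly jar URLs
  if (!destDir.exists()) {
    IO.createDirectory(destDir)
    // Stream each jar straight into destDir instead of collecting Files
    // in a mutable buffer first.
    jarUrls.map { jarUrl =>
      val dest = destDir / new java.io.File(jarUrl.getPath).getName
      Files.copy(jarUrl.openStream(), dest.toPath)
      dest
    }
  } else {
    (destDir * "*").get()
  }
}.taskValue
```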
Hey @tomvanbussel, could you please take a look at this PR?
@@ -51,7 +51,7 @@ jobs:
   - name: Run Spark Master tests
     # when changing TEST_PARALLELISM_COUNT make sure to also change it in spark_test.yaml
     run: |
       TEST_PARALLELISM_COUNT=4 SHARD_ID=${{matrix.shard}} build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean spark/test
       TEST_PARALLELISM_COUNT=4 build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean connectServer/test
       TEST_PARALLELISM_COUNT=4 build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean connectServer/assembly connectClient/test
Temporary change to test connectClient, because Delta integration with Spark master is broken at the moment, so we cannot test connectServer and spark. I will put things back the way they were before merging.
TEST_PARALLELISM_COUNT=4 SHARD_ID=${{matrix.shard}} build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean spark/test
TEST_PARALLELISM_COUNT=4 build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean connectServer/test
TEST_PARALLELISM_COUNT=4 build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean connectServer/assembly connectClient/test
TEST_PARALLELISM_COUNT=1 build/sbt -DsparkVersion=master "++ ${{ matrix.scala }}" clean connectServer/assembly connectClient/test
Temporary change to test connectClient, because Delta integration with Spark master is broken at the moment, so we cannot test connectServer and spark. I will put things back the way they were before merging.
Nice work! It feels a bit wasteful though to download the full release, only to replace many of the JARs with newer versions. Another downside of this approach is that we cannot use the caching provided by SBT. Would it be possible to instead rely on SBT Assembly to package all of our dependencies into one JAR that we can then load in our tests? (This may require creating another project in the build file.)
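A rough sketch of what that suggestion could look like with sbt-assembly; the project name, jar name, and merge strategy are assumptions for illustration, not an actual proposal from this PR:

```scala
// build.sbt sketch: a separate project whose only job is to assemble the
// Connect server and all of its dependencies into a single jar for tests.
lazy val connectServerAssembly = (project in file("spark-connect/server-assembly"))
  .dependsOn(connectServer)
  .settings(
    assembly / assemblyJarName := "delta-connect-server-assembly.jar",
    assembly / assemblyMergeStrategy := {
      // Illustrative merge strategy; a real setup would need care with
      // META-INF service files.
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }
  )
```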
@tomvanbussel I did think about that approach, but we still need to download the distribution for the bash scripts in sbin and bin that start the Spark Connect Server.
@tomvanbussel Also, as you observed, I downloaded the latest jar for each Spark component. The name I initially gave the Spark jar did not work; what I had to do was download it and then rename it to the name the scripts expect. Only then did it start to work. So I'm not sure if making one huge uber jar would be accepted by https://github.com/apache/spark/blob/master/sbin/start-connect-server.sh.
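The exact jar names were lost from the comment above, but the shape of the workaround, as a hypothetical sketch with placeholder names:

```scala
import sbt._

// Hypothetical: the server scripts only picked up a component jar after it
// was downloaded and renamed. Both file names below are placeholders, not
// the actual names from the PR.
val jarsDir = file("target") / "spark" / "jars"
val downloadedName = "spark-connect_2.13-downloaded.jar" // placeholder
val expectedName = "spark-connect_2.13-expected.jar"     // placeholder
IO.move(jarsDir / downloadedName, jarsDir / expectedName)
```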
@longvu-db We can copy the few bash files that we need to this repo. No need to download a 500MB tar just for those small scripts.
@tomvanbussel But I'm not sure if making one huge Spark jar file would work, given the second reason: spark-submit asks for each Spark component's jars.
@tomvanbussel We are expected to have this list of jars, per https://github.com/apache/spark/blob/9fc58aa4c0753ef42e0172d73e13b45a00a730f9/assembly/pom.xml#L4.
@tomvanbussel But I can try to look into your suggestion of copying the minimal number of needed files (the tar contains unrelated Spark stuff that we don't need). However, I do feel like we need separate jar files for the individual Spark components, instead of one uber jar file, to keep spark-submit working.
Hey @xupefei, since Tom is OOO and you have experience in Delta and Spark Connect, would you be so kind as to take a look at this PR?

TLDR: We are testing the Delta Connect Client against the Spark Connect Server. The last Spark distribution was the Spark 4.0 First Preview in May; we are no longer able to use it, since it contains outdated Spark jars that are no longer binary compatible with the current Delta-Spark.

Right now, I made it work by still downloading the Spark 4.0 Preview distribution and then replacing all of the Spark 4.0 Preview jars with the latest nightly Spark jar releases. The Spark 4.0 Preview distribution is great to have since it contains the necessary bash scripts, in sbin and bin, to start the Spark Connect Server.

The problem that Tom raised is that this Spark 4.0 Preview distribution is 500MB, which might be inefficient, since we only need the scripts from it. I attempted to avoid downloading the Spark 4.0 Preview distribution by using GitHub APIs to download the bash scripts directly. The problem is that this doesn't work, because there are some external library jars that the Spark component jars expect, such as log4j-core. Based on this PR, we can get the list, or from many Spark components' pom.xml; one example is https://github.com/apache/spark/blob/master/core/pom.xml. But it looks like both sources include unnecessary jars; building additional scraping code would take time, while hard-coding the library names and versions in a list somewhere doesn't sound like a good idea, since it can easily get out of date with version upgrades. The great thing about the Spark 4.0 Preview distribution is that it already contains the necessary jars.

My question is: how bad is it that we download 500MB every time we want to test connectClient?
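For concreteness, a sketch of the script-only download idea; the script list below is an unverified assumption, and getting that list right is exactly the hard part:

```scala
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Fetch only the launch scripts from the Spark repo instead of the full
// 500MB distribution. The set of scripts here is assumed, not verified.
val scripts = Seq(
  "sbin/start-connect-server.sh",
  "sbin/spark-daemon.sh",
  "bin/spark-submit",
  "bin/load-spark-env.sh"
)
val rawBase = "https://raw.githubusercontent.com/apache/spark/master/"
val destRoot = Paths.get("target", "spark")
scripts.foreach { path =>
  val dest = destRoot.resolve(path)
  Files.createDirectories(dest.getParent)
  Files.copy(new URL(rawBase + path).openStream(), dest, StandardCopyOption.REPLACE_EXISTING)
  dest.toFile.setExecutable(true)
}
```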
@@ -317,7 +367,6 @@ lazy val connectClient = (project in file("spark-connect/client"))
   val destDir = (Test / resourceManaged).value / "spark"
Perhaps add a TODO
Which Delta project/connector is this regarding?
Description
Currently, in order to test the Delta Connect Client with Spark Connect, we always fetch the Spark 4.0 First Preview jars, which were created in May 2024.
Since then, lots of things have changed in Spark, so the jars are outdated and the testing no longer works, due to binary incompatibility between the latest version of Delta and the Spark 4.0 First Preview.
Therefore, we are updating the `sbt` testing task of the Delta Connect Client project to automatically fetch and use the latest jars from the Spark nightly releases.

How was this patch tested?
Ran `build/sbt -DsparkVersion=master clean connectServer/assembly connectClient/test`.

Observed that the latest jar releases of Spark are downloaded and the Delta Connect Client test suite (`DeltaTableSuite`) ran successfully.

Does this PR introduce any user-facing changes?
No.