
Branch 2.3 merge #264

Merged: 15 commits, Mar 21, 2019

Commits on Mar 5, 2019

  1. [MINOR][DOCS] Clarify that Spark apps should mark Spark as a 'provided' dependency, not package it
    
    ## What changes were proposed in this pull request?
    
    Spark apps do not need to package Spark itself. In fact, doing so can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency.
    
    Packaging Spark makes the app bigger by tens of megabytes. It can also pull in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this.
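    
    A hedged sketch of what the recommended setup looks like in an sbt build (artifact and version are illustrative, not taken from the docs change):
    
    ```scala
    // build.sbt: Spark is supplied by the cluster at runtime, so 'provided'
    // keeps it out of the application's assembled jar.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
    ```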
    
    ## How was this patch tested?
    
    Doc build
    
    Closes apache#23938 from srowen/Provided.
    
    Authored-by: Sean Owen <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    (cherry picked from commit 3909223)
    Signed-off-by: Sean Owen <[email protected]>
    srowen committed Mar 5, 2019
    Commit: c326628

Commits on Mar 6, 2019

  1. [SPARK-24669][SQL] Invalidate tables in case of DROP DATABASE CASCADE

      ## What changes were proposed in this pull request?
    Before dropping a database, refresh the tables of that database so that all cached entries associated with those tables are invalidated.
    We already follow the same approach when dropping a single table.
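    
    As a hedged illustration of the scenario being fixed (names and the local-mode setup are made up for this sketch):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()
    spark.sql("CREATE DATABASE db1")
    spark.sql("CREATE TABLE db1.t(a INT) USING parquet")
    spark.table("db1.t").cache()            // an entry now lives in the relation cache
    spark.sql("DROP DATABASE db1 CASCADE")  // with this fix, db1.t is also refreshed/uncached
    ```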
    
    UT is added
    
    Closes apache#23905 from Udbhav30/SPARK-24669.
    
    Authored-by: Udbhav30 <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    (cherry picked from commit 9bddf71)
    Signed-off-by: Dongjoon Hyun <[email protected]>
    Udbhav30 authored and dongjoon-hyun committed Mar 6, 2019
    Commit: 8b70980
  2. [SPARK-27065][CORE] avoid more than one active task set managers for a stage
    
    ## What changes were proposed in this pull request?
    
    This is another attempt to fix the more-than-one-active-task-set-managers bug.
    
    apache#17208 is the first attempt. It marks the TSM as a zombie before sending a task completion event to the DAGScheduler. This is necessary because when the DAGScheduler gets the task completion event, and it's for the last partition, the stage is finished. However, if it's a shuffle stage with missing map outputs, the DAGScheduler will resubmit it (see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM for a stage, and a failure.
    
    This fix has a hole. Say a stage has 10 partitions and 2 task set managers: TSM1 (zombie) and TSM2 (active). TSM1 has a running task for partition 10, and it completes. TSM2 finishes tasks for partitions 1-9 and thinks it is still active because it hasn't finished partition 10 yet. However, the DAGScheduler gets task completion events for all 10 partitions and thinks the stage is finished. Then the same problem occurs: the DAGScheduler may resubmit the stage and cause the more-than-one-active-TSM error.
    
    apache#21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so it can mark itself as a zombie after partitions 1-9 are completed.
    
    However, apache#21131 still has a hole: TSM2 may be created after the task from TSM1 has completed. Then TSM2 can't get notified about that task completion, which again leads to the more-than-one-active-TSM error.
    
    apache#22806 and apache#23871 were created to fix this hole. However, the fix is complicated, and there are still ongoing discussions.
    
    This PR proposes a simple fix, which is easy to backport: mark all existing task set managers as zombies when trying to create a new task set manager.
    
    After this PR, apache#21131 is still necessary to avoid launching unnecessary tasks and to fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250). apache#22806 and apache#23871 are its follow-ups to fix the hole.
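    
    A minimal, self-contained sketch of the proposed fix (the real logic lives in TaskSchedulerImpl; the names and data structures here are simplified for illustration):
    
    ```scala
    import scala.collection.mutable
    
    class TaskSetManagerSketch(val stageId: Int) { var isZombie = false }
    
    val managersByStage = mutable.Map.empty[Int, mutable.Buffer[TaskSetManagerSketch]]
    
    // The core of the fix: when a new task set manager is created for a stage,
    // every existing manager for that stage is first marked as a zombie, so at
    // most one manager stays active per stage.
    def submitTaskSet(stageId: Int): TaskSetManagerSketch = {
      val existing = managersByStage.getOrElseUpdate(stageId, mutable.Buffer.empty)
      existing.foreach(_.isZombie = true)
      val tsm = new TaskSetManagerSketch(stageId)
      existing += tsm
      tsm
    }
    ```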
    
    ## How was this patch tested?
    
    Existing tests.
    
    Closes apache#23927 from cloud-fan/scheduler.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Imran Rashid <[email protected]>
    (cherry picked from commit cb20fbc)
    Signed-off-by: Imran Rashid <[email protected]>
    cloud-fan authored and squito committed Mar 6, 2019
    Commit: 877b8db

Commits on Mar 7, 2019

  1. [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats
    
    ## What changes were proposed in this pull request?
    `CodeGenerator.updateAndGetCompilationStats` throws an unsupported-operation exception when the code size statistics are empty. This PR adds code to check whether they are empty before aggregating them.
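    
    An illustrative, self-contained sketch of the guard (not the actual Spark source): aggregating empty statistics with `.max` throws, so emptiness is checked first.
    
    ```scala
    // Summarize generated-code sizes as (max, average); calling .max on an empty
    // collection throws UnsupportedOperationException, hence the guard.
    def summarizeCodeSizes(codeSizes: Seq[Int]): (Int, Int) =
      if (codeSizes.isEmpty) (0, 0)                          // sentinel for "no statistics"
      else (codeSizes.max, codeSizes.sum / codeSizes.size)
    ```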
    
    ## How was this patch tested?
    Pass Jenkins.
    
    Closes apache#23947 from maropu/SPARK-21871-FOLLOWUP.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    maropu committed Mar 7, 2019
    Commit: dfde0c6
  2. [SPARK-23433][SPARK-25250][CORE][BRANCH-2.3] Later created TaskSet should learn about the finished partitions
    
    ## What changes were proposed in this pull request?
    
    This is an optional solution for apache#22806.
    
    apache#21131 first implemented the behavior that a task successfully completed by a zombie TaskSetManager also counts as completed for the active TaskSetManager, based on the assumption that an active TaskSetManager always exists for that stage when this happens. But that's not always true: an active TaskSetManager may not have been created yet when a previous task succeeds, and this is why apache#22806 hit the issue.
    
    This PR extends apache#21131's behavior by adding stageIdToFinishedPartitions to TaskSchedulerImpl, which records the finished partitions whenever a task (from a zombie or active TaskSetManager) succeeds. Thus, a later-created active TaskSetManager can also learn about the finished partitions by looking into stageIdToFinishedPartitions, and won't launch any duplicate tasks.
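    
    A hedged sketch of that bookkeeping (types simplified; not the actual TaskSchedulerImpl code):
    
    ```scala
    import scala.collection.mutable
    
    // Finished partitions are recorded per stage in the scheduler itself, so a
    // TaskSetManager created later can mark them completed instead of relaunching them.
    val stageIdToFinishedPartitions = mutable.Map.empty[Int, mutable.BitSet]
    
    def recordFinishedPartition(stageId: Int, partitionId: Int): Unit =
      stageIdToFinishedPartitions.getOrElseUpdate(stageId, new mutable.BitSet) += partitionId
    
    // Called when a new TaskSetManager is created for the stage.
    def seedFromFinishedPartitions(stageId: Int, markCompleted: Int => Unit): Unit =
      stageIdToFinishedPartitions.get(stageId).foreach(_.foreach(markCompleted))
    ```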
    
    ## How was this patch tested?
    
    Added a unit test.
    
    Closes apache#24007 from Ngone51/dev-23433-25250-branch-2.3.
    
    Authored-by: wuyi <[email protected]>
    Signed-off-by: Imran Rashid <[email protected]>
    Ngone51 authored and squito committed Mar 7, 2019
    Commit: a1ca566

Commits on Mar 8, 2019

  1. [SPARK-26604][CORE][BACKPORT-2.4] Clean up channel registration for StreamManager
    
    ## What changes were proposed in this pull request?
    
    This is mostly a clean backport of apache#23521 to branch-2.4
    
    ## How was this patch tested?
    
    I've tested this with a hack in `TransportRequestHandler` to force `ChunkFetchRequest` to get dropped.
    
    Then I made a number of `ExternalShuffleClient.fetchChunk` requests (which issue `OpenBlocks` and then `ChunkFetchRequest`) and closed out of my test harness. A heap dump afterwards reveals that the `StreamState` references are unreachable.
    
    I haven't run this through the unit test suite yet, but am doing that now. I wanted to get this up, as I think folks are waiting for it for 2.4.1.
    
    Closes apache#24013 from abellina/SPARK-26604_cherry_pick_2_4.
    
    Lead-authored-by: Liang-Chi Hsieh <[email protected]>
    Co-authored-by: Alessandro Bellina <[email protected]>
    Signed-off-by: Marcelo Vanzin <[email protected]>
    (cherry picked from commit 216eeec)
    Signed-off-by: Marcelo Vanzin <[email protected]>
    2 people authored and Marcelo Vanzin committed Mar 8, 2019
    Commit: c45f8da

Commits on Mar 9, 2019

  1. [SPARK-27080][SQL] bug fix: mergeWithMetastoreSchema with uniform lower case comparison
    
    When reading Parquet files while merging the metastore schema and the file schema, we should compare field names using uniform case. The current implementation uses lowercase comparison, but with one omission; this patch fixes it.
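    
    A hedged sketch of the uniform lower-case comparison (the helper name and shape are illustrative, not the actual mergeWithMetastoreSchema):
    
    ```scala
    import org.apache.spark.sql.types.StructType
    
    // Index file-schema fields by lower-cased name so every lookup during the
    // merge ignores case; no code path is left doing a case-sensitive comparison.
    def mergeCaseInsensitively(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
      val byLowerName = parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap
      StructType(metastoreSchema.fields.map { msField =>
        byLowerName.get(msField.name.toLowerCase)
          .map(pqField => pqField.copy(name = msField.name)) // keep metastore casing
          .getOrElse(msField)
      })
    }
    ```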
    
    Unit test
    
    Closes apache#24001 from codeborui/mergeSchemaBugFix.
    
    Authored-by: CodeGod <>
    Signed-off-by: Wenchen Fan <[email protected]>
    (cherry picked from commit a29df5f)
    Signed-off-by: Wenchen Fan <[email protected]>
    CodeGod authored and cloud-fan committed Mar 9, 2019
    Commit: b6d5b0a
  2. [SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException
    
    Before a Kafka consumer gets assigned partitions, its offsets will contain 0 partitions. However, runContinuous will still run and launch a Spark job with 0 partitions. In this case, there is a race in which an epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` call will get interrupted unintentionally.
    
    To handle this case, this PR has the following changes:
    
    - Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread).
    - Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly.
    
    I also updated the cleanup code to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the cleanup.
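    
    A self-contained sketch of the cleanup pattern described above (names are illustrative): each step is isolated so one failure cannot skip the rest, and the interrupt flag is cleared afterwards.
    
    ```scala
    // Run every cleanup step even if one throws, then clear the thread's
    // interrupted status so it cannot leak into the next runContinuous call;
    // query termination is decided by checking state, not by the interrupt flag.
    def cleanUpAndClearInterrupt(steps: Seq[() => Unit]): Unit = {
      steps.foreach { step =>
        try step() catch { case e: Exception => println(s"cleanup step failed: $e") }
      }
      Thread.interrupted() // returns and clears the current thread's interrupt flag
    }
    ```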
    
    Jenkins
    
    Closes apache#24034 from zsxwing/SPARK-27111.
    
    Authored-by: Shixiong Zhu <[email protected]>
    Signed-off-by: Shixiong Zhu <[email protected]>
    (cherry picked from commit 6e1c082)
    Signed-off-by: Shixiong Zhu <[email protected]>
    zsxwing committed Mar 9, 2019
    Commit: 4d1d0a4

Commits on Mar 12, 2019

  1. Commit: 21ae9ac
  2. [SPARK-26927][CORE] Ensure executor is active when processing events in dynamic allocation manager.
    
    There is a race condition in the `ExecutorAllocationManager`: the `SparkListenerExecutorRemoved` event can be posted before the `SparkListenerTaskStart` event, which leads to an incorrect `executorIds` set. Then, when some executor idles, real executors will be removed even though the actual executor count equals `minNumExecutors`, due to the incorrect computation of `newExecutorTotal` (which may be greater than `minNumExecutors`). This can ultimately leave zero available executors while a wrong, positive number of executor ids is kept in memory.
    
    What's more, even the `SparkListenerTaskEnd` event cannot release the fake `executorIds`, because a later idle event for a fake executor cannot cause its real removal: it has already been removed and no longer exists in the `executorDataMap` of `CoarseGrainedSchedulerBackend`, so the `onExecutorRemoved` method will never be called again for it.
    
    For details see https://issues.apache.org/jira/browse/SPARK-26927
    
    This PR is to fix this problem.
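    
    A hedged sketch of the guard (listener wiring simplified; the liveness check is passed in as a function for illustration):
    
    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}
    import scala.collection.mutable
    
    // Only track an executor as in use if it is still known to be alive, so a late
    // SparkListenerTaskStart cannot resurrect an already-removed executor id.
    class AllocationListenerSketch(isExecutorActive: String => Boolean) extends SparkListener {
      private val executorIds = mutable.Set.empty[String]
    
      override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
        val execId = taskStart.taskInfo.executorId
        if (isExecutorActive(execId)) executorIds += execId
      }
    }
    ```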
    
    Existing UTs and an added UT.
    
    Closes apache#23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working.
    
    Lead-authored-by: Liupengcheng <[email protected]>
    Co-authored-by: liupengcheng <[email protected]>
    Signed-off-by: Marcelo Vanzin <[email protected]>
    (cherry picked from commit d5cfe08)
    Signed-off-by: Marcelo Vanzin <[email protected]>
    Liupengcheng authored and Marcelo Vanzin committed Mar 12, 2019
    Commit: d0290ea
  3. Commit: 1805d38
  4. [MINOR][CORE] Use https for bintray spark-packages repository

    ## What changes were proposed in this pull request?
    
    This patch changes the URL scheme from http to https for the Bintray spark-packages repository. It looks like we already changed the scheme of the repository URL in pom.xml but missed the one inside the code.
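    
    The code change itself amounts to a one-line scheme switch (the constant name here is assumed for illustration):
    
    ```scala
    // Was: "http://dl.bintray.com/spark-packages/maven"
    val sparkPackagesRepo = "https://dl.bintray.com/spark-packages/maven"
    ```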
    
    ## How was this patch tested?
    
    Manually ran the `--packages` option via `./bin/spark-shell --verbose --packages "RedisLabs:spark-redis:0.3.2"`
    
    ```
    ...
    Ivy Default Cache set to: /Users/jlim/.ivy2/cache
    The jars for the packages stored in: /Users/jlim/.ivy2/jars
    :: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    RedisLabs#spark-redis added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0
    	confs: [default]
    	found RedisLabs#spark-redis;0.3.2 in spark-packages
    	found redis.clients#jedis;2.7.2 in central
    	found org.apache.commons#commons-pool2;2.3 in central
    downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ...
    	[SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms)
    downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ...
    	[SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms)
    downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ...
    	[SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms)
    :: resolution report :: resolve 4586ms :: artifacts dl 1555ms
    	:: modules in use:
    	RedisLabs#spark-redis;0.3.2 from spark-packages in [default]
    	org.apache.commons#commons-pool2;2.3 from central in [default]
    	redis.clients#jedis;2.7.2 from central in [default]
    	---------------------------------------------------------------------
    	|                  |            modules            ||   artifacts   |
    	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    	---------------------------------------------------------------------
    	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
    	---------------------------------------------------------------------
    ```
    
    Closes apache#24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    (cherry picked from commit f57af22)
    Signed-off-by: Sean Owen <[email protected]>
    HeartSaVioR authored and srowen committed Mar 12, 2019
    Commit: 959a7ec

Commits on Mar 19, 2019

  1. [SPARK-27112][CORE] Create a resource ordering between threads to resolve the deadlocks encountered when trying to kill executors either due to dynamic allocation or blacklisting
    
    Closes apache#24072 from pgandhi999/SPARK-27112-2.
    
    Authored-by: pgandhi <pgandhi@verizonmedia.com>
    Signed-off-by: Imran Rashid <irashid@cloudera.com>
    
    ## What changes were proposed in this pull request?
    
    There are two deadlocks resulting from the interplay between three different threads:
    
    - **task-result-getter thread**
    - **spark-dynamic-executor-allocation thread**
    - **dispatcher-event-loop thread (makeOffers())**
    
    The fix enforces an ordering constraint by acquiring the lock on `TaskSchedulerImpl` before acquiring the lock on `CoarseGrainedSchedulerBackend`, in both `makeOffers()` and the `killExecutors()` method. This establishes a consistent resource ordering between the threads and thus fixes the deadlocks.
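    
    A minimal sketch of the resulting lock order (class shapes are illustrative): both entry points take the scheduler lock first, so the threads can no longer acquire the two monitors in opposite orders.
    
    ```scala
    // Outer lock: TaskSchedulerImpl; inner lock: the scheduler backend itself.
    class SchedulerBackendSketch(val scheduler: AnyRef) {
      def makeOffers(): Unit = scheduler.synchronized {
        this.synchronized {
          // ... build resource offers and launch tasks ...
        }
      }
    
      def killExecutors(executorIds: Seq[String]): Unit = scheduler.synchronized {
        this.synchronized {
          // ... mark executors as pending to remove ...
        }
      }
    }
    ```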
    
    ## How was this patch tested?
    
    Manual Tests
    
    Closes apache#24134 from pgandhi999/branch-2.4-SPARK-27112.
    
    Authored-by: pgandhi <[email protected]>
    Signed-off-by: Imran Rashid <[email protected]>
    (cherry picked from commit 95e73b3)
    Signed-off-by: Imran Rashid <[email protected]>
    pgandhi authored and squito committed Mar 19, 2019
    Commit: 7bb2b42
  2. Commit: 09d3530

Commits on Mar 21, 2019

  1. [SPARK-27178][K8S][BRANCH-2.4] adding nss package to fix tests

    See also: apache#24111
    
    While performing some tests on our existing minikube and k8s infrastructure, I noticed that the integration tests were failing. I dug in and discovered the following message buried at the end of the stack trace:
    
    ```
      Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
      	at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
      	at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218)
      	... 81 more
    ```
    After I added the `nss` package to `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile`, everything worked.
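    
    For reference, the shape of the change in the Alpine-based image is a single package added to the install line (this exact line is an assumption, not copied from the patch):
    
    ```
    RUN apk add --no-cache nss
    ```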
    
    This is also impacting current builds. See: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8959/console
    
    I tested locally before pushing, and the build system will test the rest.
    
    Closes apache#24137 from shaneknapp/add-nss-package.
    
    Authored-by: shane knapp <[email protected]>
    Signed-off-by: Marcelo Vanzin <[email protected]>
    shaneknapp authored and markhamstra committed Mar 21, 2019
    Commit: 8bf8526