Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Branch 2.3 merge #264

Merged
merged 15 commits into from
Mar 21, 2019
Merged

Branch 2.3 merge #264

merged 15 commits into from
Mar 21, 2019

Conversation

markhamstra
Copy link

merging upstream bugfixes

srowen and others added 15 commits March 5, 2019 08:28
…d' dependency, not package it

## What changes were proposed in this pull request?

Spark apps do not need to package Spark. In fact it can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency.

Packaging Spark makes the app much bigger by tens of megabytes. It can also bring in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this.

## How was this patch tested?

Doc build

Closes apache#23938 from srowen/Provided.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 3909223)
Signed-off-by: Sean Owen <[email protected]>
  ## What changes were proposed in this pull request?
Before dropping database refresh the tables of that database, so as to refresh all cached entries associated with those tables.
We follow the same when dropping a table.

UT is added

Closes apache#23905 from Udbhav30/SPARK-24669.

Authored-by: Udbhav30 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 9bddf71)
Signed-off-by: Dongjoon Hyun <[email protected]>
…a stage

## What changes were proposed in this pull request?

This is another attempt to fix the more-than-one-active-task-set-managers bug.

apache#17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, then the stage is finished. However, if it's a shuffle stage and it has missing map outputs, DAGScheduler will resubmit it(see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM of a stage and fail.

This fix has a hole: Let's say a stage has 10 partitions and 2 task set managers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10 and it completes. TSM2 finishes tasks for partitions 1-9, and thinks he is still active because he hasn't finished partition 10 yet. However, DAGScheduler gets task completion events for all the 10 partitions and thinks the stage is finished. Then the same problem occurs: DAGScheduler may resubmit the stage and cause more than one actice TSM error.

apache#21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so he can mark himself as zombie after partitions 1-9 are completed.

However, apache#21131 still has a hole: TSM2 may be created after the task from TSM1 is completed. Then TSM2 can't get notified about the task completion, and leads to the more than one active TSM error.

apache#22806 and apache#23871 are created to fix this hole. However the fix is complicated and there are still ongoing discussions.

This PR proposes a simple fix, which can be easy to backport: mark all existing task set managers as zombie when trying to create a new task set manager.

After this PR, apache#21131 is still necessary, to avoid launching unnecessary tasks and fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250 ). apache#22806 and apache#23871 are its followups to fix the hole.

## How was this patch tested?

existing tests.

Closes apache#23927 from cloud-fan/scheduler.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
(cherry picked from commit cb20fbc)
Signed-off-by: Imran Rashid <[email protected]>
…y or not in updateAndGetCompilationStats

## What changes were proposed in this pull request?
`CodeGenerator.updateAndGetCompilationStats` throws an unsupported exception for empty code size statistics. This pr added code to check if it is empty or not.

## How was this patch tested?
Pass Jenkins.

Closes apache#23947 from maropu/SPARK-21871-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
…ould learn about the finished partitions

## What changes were proposed in this pull request?

This is an optional solution for apache#22806 .

apache#21131 firstly implement that a previous successful completed task from zombie TaskSetManager could also succeed the active TaskSetManager, which based on an assumption that an active TaskSetManager always exists for that stage when this happen. But that's not always true as an active TaskSetManager may haven't been created when a previous task succeed, and this is the reason why apache#22806 hit the issue.

This pr extends apache#21131 's behavior by adding stageIdToFinishedPartitions into TaskSchedulerImpl, which recording the finished partition whenever a task(from zombie or active) succeed. Thus, a later created active TaskSetManager could also learn about the finished partition by looking into stageIdToFinishedPartitions and won't launch any duplicate tasks.

## How was this patch tested?

Add.

Closes apache#24007 from Ngone51/dev-23433-25250-branch-2.3.

Authored-by: wuyi <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
…treamManager

## What changes were proposed in this pull request?

This is mostly a clean backport of apache#23521 to branch-2.4

## How was this patch tested?

I've tested this with a hack in `TransportRequestHandler` to force `ChunkFetchRequest` to get dropped.

Then making a number of `ExternalShuffleClient.fetchChunk` requests (which `OpenBlocks` then `ChunkFetchRequest`) and closing out of my test harness. A heap dump later reveals that the `StreamState` references are unreachable.

I haven't run this through the unit test suite, but doing that now. Wanted to get this up as I think folks are waiting for it for 2.4.1

Closes apache#24013 from abellina/SPARK-26604_cherry_pick_2_4.

Lead-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Alessandro Bellina <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit 216eeec)
Signed-off-by: Marcelo Vanzin <[email protected]>
…er case comparison

When reading parquet file with merging metastore schema and file schema, we should compare field names using uniform case. In current implementation, lowercase is used but one omission. And this patch fix it.

Unit test

Closes apache#24001 from codeborui/mergeSchemaBugFix.

Authored-by: CodeGod <>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a29df5f)
Signed-off-by: Wenchen Fan <[email protected]>
…terruptedException

Before a Kafka consumer gets assigned with partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race that epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally.

To handle this case, this PR has the following changes:

- Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread).
- Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly.

I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up.

Jenkins

Closes apache#24034 from zsxwing/SPARK-27111.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
(cherry picked from commit 6e1c082)
Signed-off-by: Shixiong Zhu <[email protected]>
…in dynamic allocation manager.

There is a race condition in the `ExecutorAllocationManager` that the `SparkListenerExecutorRemoved` event is posted before the `SparkListenerTaskStart` event, which will cause the incorrect result of `executorIds`. Then, when some executor idles, the real executors will be removed even actual executor number is equal to `minNumExecutors` due to the incorrect computation of `newExecutorTotal`(may greater than the `minNumExecutors`), thus may finally causing zero available executors but a wrong positive number of executorIds was kept in memory.

What's more, even the `SparkListenerTaskEnd` event can not make the fake `executorIds` released, because later idle event for the fake executors can not cause the real removal of these executors, as they are already removed and they are not exist in the `executorDataMap`  of `CoaseGrainedSchedulerBackend`, so that the `onExecutorRemoved` method will never be called again.

For details see https://issues.apache.org/jira/browse/SPARK-26927

This PR is to fix this problem.

existUT and added UT

Closes apache#23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working.

Lead-authored-by: Liupengcheng <[email protected]>
Co-authored-by: liupengcheng <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit d5cfe08)
Signed-off-by: Marcelo Vanzin <[email protected]>
## What changes were proposed in this pull request?

This patch changes the schema of url from http to https for bintray spark-packages repository. Looks like we already changed the schema of repository url for pom.xml but missed inside the code.

## How was this patch tested?

Manually ran the `--package` via `./bin/spark-shell --verbose  --packages "RedisLabs:spark-redis:0.3.2"`

```
...
Ivy Default Cache set to: /Users/jlim/.ivy2/cache
The jars for the packages stored in: /Users/jlim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
RedisLabs#spark-redis added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0
	confs: [default]
	found RedisLabs#spark-redis;0.3.2 in spark-packages
	found redis.clients#jedis;2.7.2 in central
	found org.apache.commons#commons-pool2;2.3 in central
downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ...
	[SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms)
downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ...
	[SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms)
downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ...
	[SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms)
:: resolution report :: resolve 4586ms :: artifacts dl 1555ms
	:: modules in use:
	RedisLabs#spark-redis;0.3.2 from spark-packages in [default]
	org.apache.commons#commons-pool2;2.3 from central in [default]
	redis.clients#jedis;2.7.2 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
```

Closes apache#24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit f57af22)
Signed-off-by: Sean Owen <[email protected]>
…esolve the deadlocks encountered when trying to kill executors either due to dynamic allocation or blacklisting

Closes apache#24072 from pgandhi999/SPARK-27112-2.

Authored-by: pgandhi <pgandhiverizonmedia.com>
Signed-off-by: Imran Rashid <irashidcloudera.com>

## What changes were proposed in this pull request?

There are two deadlocks as a result of the interplay between three different threads:

**task-result-getter thread**

**spark-dynamic-executor-allocation thread**

**dispatcher-event-loop thread(makeOffers())**

The fix ensures ordering synchronization constraint by acquiring lock on `TaskSchedulerImpl` before acquiring lock on `CoarseGrainedSchedulerBackend` in `makeOffers()` as well as killExecutors() method. This ensures resource ordering between the threads and thus, fixes the deadlocks.

## How was this patch tested?

Manual Tests

Closes apache#24134 from pgandhi999/branch-2.4-SPARK-27112.

Authored-by: pgandhi <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
(cherry picked from commit 95e73b3)
Signed-off-by: Imran Rashid <[email protected]>
see also:  apache#24111

while performing some tests on our existing minikube and k8s infrastructure, i noticed that the integration tests were failing. i dug in and discovered the following message buried at the end of the stacktrace:

```
  Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
  	at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
  	at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218)
  	... 81 more
```
after i added the `nss` package to `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile`, everything worked.

this is also impacting current builds.  see:  https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8959/console

i tested locally before pushing, and the build system will test the rest.

Closes apache#24137 from shaneknapp/add-nss-package.

Authored-by: shane knapp <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
@markhamstra markhamstra merged commit 43345fb into alteryx:csd-2.3 Mar 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants