SKIPME merged Apache branch-1.6 #140

Merged: 36 commits, Jan 8, 2016

Conversation

markhamstra

No description provided.

maropu and others added 30 commits December 28, 2015 21:29
…es in postgresql

If a DataFrame has BYTE types, an exception is thrown:
org.postgresql.util.PSQLException: ERROR: type "byte" does not exist
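
For illustration only, a minimal sketch of the kind of dialect mapping that avoids this error, assuming the `JdbcDialect`/`JdbcType` API and a SMALLINT mapping (the actual change in this commit may differ):

```scala
// Hypothetical sketch: map Spark's ByteType to SMALLINT so PostgreSQL never sees "byte".
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{ByteType, DataType}

object PostgresByteDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("SMALLINT", Types.SMALLINT))  // assumed mapping
    case _        => None
  }
}
```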

Author: Takeshi YAMAMURO <[email protected]>

Closes apache#9350 from maropu/FixBugInPostgreJdbc.

(cherry picked from commit 73862a1)
Signed-off-by: Yin Huai <[email protected]>
…umn as value

`ifelse`, `when`, and `otherwise` are unable to take `Column`-typed S4 objects as values.

For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```

The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid the attempt to vectorize (i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, since `ifelse` in base R is vectorized, but I cannot foresee any scenario where these functions would need to be vectorized in SparkR.

For reference, added test cases which trigger failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
  error in evaluating the argument 'col' in selecting a method for function 'select':
  attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse

1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```

Author: Forest Fang <[email protected]>

Closes apache#10481 from saurfang/spark-12526.

(cherry picked from commit d80cc90)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Current schema inference for local Python collections halts as soon as there are no NullTypes. This differs from the behavior when a sampling ratio of 1.0 is specified on a distributed collection, and could result in incomplete schema information.

Author: Holden Karau <[email protected]>

Closes apache#10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.

(cherry picked from commit d1ca634)
Signed-off-by: Davies Liu <[email protected]>
…ith an unknown app Id

I got an exception when accessing the below REST API with an unknown application Id.
`http://<server-url>:18080/api/v1/applications/xxx/jobs`
Instead of an exception, I expect an error message "no such app: xxx", similar to the error message returned when I access `/api/v1/applications/xxx`. The exception currently thrown is:
```
org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx
	at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
	at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
	at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
	at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
	at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
```

Author: Carson Wang <[email protected]>

Closes apache#10352 from carsonwang/unknownAppFix.

(cherry picked from commit b244297)
Signed-off-by: Marcelo Vanzin <[email protected]>
shivaram

Author: felixcheung <[email protected]>

Closes apache#10408 from felixcheung/rcodecomment.

(cherry picked from commit c3d5056)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…ame to be called value

Author: Xiu Guo <[email protected]>

Closes apache#10515 from xguo27/SPARK-12562.

(cherry picked from commit 84f8492)
Signed-off-by: Reynold Xin <[email protected]>
…sible.

This patch updates the ExecutorRunner's terminate path to use the new Java 8 API
to terminate processes more forcefully if possible. If the executor is unhealthy,
it would previously ignore the destroy() call. Presumably, the new Java API was
added to handle cases like this.

We could update the termination path in the future to use OS-specific commands
for older Java versions.
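
A rough sketch of that escalation using the standard `java.lang.Process` API (illustrative; not the actual ExecutorRunner code, and the grace period is an assumed value):

```scala
import java.util.concurrent.TimeUnit

// Ask the process to exit; if it ignores the request within the grace period,
// force-kill it with the Java 8 destroyForcibly() API.
def terminate(process: Process, graceMillis: Long = 10000L): Unit = {
  process.destroy()
  if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
    process.destroyForcibly()
  }
}
```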

Author: Nong Li <[email protected]>

Closes apache#10438 from nongli/spark-12486-executors.

(cherry picked from commit 8f65939)
Signed-off-by: Andrew Or <[email protected]>
also only allocate required buffer size

Author: Pete Robbins <[email protected]>

Closes apache#10421 from robbinspg/master.

(cherry picked from commit b504b6a)
Signed-off-by: Davies Liu <[email protected]>

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala
Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.

In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.

This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).

If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
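
A minimal sketch of that driver-selection logic (illustrative only; `connectWith` and its parameters are hypothetical names, not from the patch):

```scala
import java.sql.{Connection, Driver, DriverManager}
import java.util.Properties
import scala.collection.JavaConverters._

// Prefer the user-specified driver class if one was given; otherwise resolve the
// driver explicitly via DriverManager.getDriver instead of relying on
// DriverManager.getConnection() to pick one.
def connectWith(url: String, userDriverClass: Option[String], props: Properties): Connection = {
  val driver: Driver = userDriverClass match {
    case Some(cls) =>
      DriverManager.getDrivers.asScala
        .find(_.getClass.getName == cls)
        .getOrElse(sys.error(s"Driver $cls was not registered"))
    case None =>
      DriverManager.getDriver(url)
  }
  driver.connect(url, props)
}
```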

This patch is inspired by a similar patch that I made to the `spark-redshift` library (databricks/spark-redshift#143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).

Author: Josh Rosen <[email protected]>

Closes apache#10519 from JoshRosen/jdbc-driver-precedence.

(cherry picked from commit 6c83d93)
Signed-off-by: Yin Huai <[email protected]>
This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark

Michael suggested fixing the doc.

Please review.

Author: tedyu <[email protected]>

Closes apache#10499 from ted-yu/master.

(cherry picked from commit 40d0396)
Signed-off-by: Michael Armbrust <[email protected]>
…he row length.

The reader was previously not setting the row length, meaning the row was wrong if there were
variable-length columns. This problem does not usually manifest, since the value in the column is
correct and projecting the row fixes the issue.

Author: Nong Li <[email protected]>

Closes apache#10576 from nongli/spark-12589.

(cherry picked from commit 34de24a)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
	sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
checked that the change is in Spark 1.6.0.
shivaram

Author: felixcheung <[email protected]>

Closes apache#10574 from felixcheung/rwritemodedoc.

(cherry picked from commit 8896ec9)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes apache#10516 from marmbrus/datasetCleanup.

(cherry picked from commit 53beddc)
Signed-off-by: Michael Armbrust <[email protected]>
…termining the number of reducers: aggregate operator

change expected partition sizes

Author: Pete Robbins <[email protected]>

Closes apache#10599 from robbinspg/branch-1.6.
This patch added Py4jCallbackConnectionCleaner to clean the leaked sockets of Py4J every 30 seconds. This is a workaround until Py4J fixes the leak issue py4j/py4j#187

Author: Shixiong Zhu <[email protected]>

Closes apache#10579 from zsxwing/SPARK-12617.

(cherry picked from commit 047a31b)
Signed-off-by: Davies Liu <[email protected]>
…erializer is called only once

There is an issue where Py4J's PythonProxyHandler.finalize blocks forever (py4j/py4j#184).

Py4J will create a PythonProxyHandler in Java for "transformer_serializer" when "registerSerializer" is called. If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one; the first one will then be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that the "PythonProxyHandler" on the Java side won't be GCed.

Author: Shixiong Zhu <[email protected]>

Closes apache#10514 from zsxwing/SPARK-12511.

(cherry picked from commit 6cfe341)
Signed-off-by: Davies Liu <[email protected]>
SPARK-12450. Un-persist broadcast variables in KMeans.
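
For context, a sketch of the broadcast-then-unpersist pattern this applies (illustrative code, not the actual KMeans implementation; `assignToCenters` is a hypothetical helper):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Broadcast the current centers for one pass over the data, then un-persist the
// broadcast explicitly so executors can free the memory promptly instead of
// waiting for GC.
def assignToCenters(sc: SparkContext, points: RDD[Array[Double]],
                    centers: Array[Array[Double]]): RDD[Int] = {
  val bcCenters = sc.broadcast(centers)
  val assignments = points.map { p =>
    bcCenters.value.indices.minBy { i =>
      bcCenters.value(i).zip(p).map { case (a, b) => (a - b) * (a - b) }.sum
    }
  }
  assignments.count()                    // materialize while the broadcast is live
  bcCenters.unpersist(blocking = false)  // un-persist the broadcast variable
  assignments
}
```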

Author: RJ Nowling <[email protected]>

Closes apache#10415 from rnowling/spark-12450.

(cherry picked from commit 78015a8)
Signed-off-by: Joseph K. Bradley <[email protected]>
Successfully ran the Kinesis demo on a live, AWS-hosted Kinesis stream against the master and 1.6 branches. For reasons I don't entirely understand, it required a manual merge to 1.5, which I did as shown here: BrianLondon@075c22e

The demo ran successfully on the 1.5 branch as well.

According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the Kinesis regression in 1.5.2.

Author: BrianLondon <[email protected]>

Closes apache#10492 from BrianLondon/remove-only.

(cherry picked from commit ff89975)
Signed-off-by: Sean Owen <[email protected]>
Add ```read.text``` and ```write.text``` for SparkR.
cc sun-rui felixcheung shivaram

Author: Yanbo Liang <[email protected]>

Closes apache#10348 from yanboliang/spark-12393.

(cherry picked from commit d1fea41)
Signed-off-by: Shivaram Venkataraman <[email protected]>
If the initial model passed to GMM is not empty, it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to a `list`.

Author: zero323 <[email protected]>

Closes apache#9986 from zero323/SPARK-12006.

(cherry picked from commit fcd013c)
Signed-off-by: Joseph K. Bradley <[email protected]>
Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext.

Author: Shixiong Zhu <[email protected]>

Closes apache#10621 from zsxwing/SPARK-12617-2.

(cherry picked from commit 1e6648d)
Signed-off-by: Shixiong Zhu <[email protected]>
…lt root path to gain the streaming batch url.

Author: huangzhaowei <[email protected]>

Closes apache#10617 from SaintBacchus/SPARK-12672.
…of default root path to gain the streaming batch url."

This reverts commit 8f0ead3. Will merge apache#10618 instead.
… pyspark

JIRA: https://issues.apache.org/jira/browse/SPARK-12016

We should not directly use Word2VecModel in PySpark. We need to wrap it in a Word2VecModelWrapper when loading it in PySpark.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#10100 from viirya/fix-load-py-wordvecmodel.

(cherry picked from commit b51a4cd)
Signed-off-by: Joseph K. Bradley <[email protected]>
Otherwise the URL will fail to be proxied to the right one in YARN mode. Here is the screenshot:

![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)

Author: jerryshao <[email protected]>

Closes apache#10618 from jerryshao/SPARK-12673.

(cherry picked from commit 174e72c)
Signed-off-by: Shixiong Zhu <[email protected]>
MapPartitionsRDD was keeping a reference to `prev` after a call to
`clearDependencies`, which could lead to a memory leak.
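
Schematically, the fix drops the parent reference when dependencies are cleared (a simplified sketch; class and field names are illustrative, not the actual Spark source):

```scala
// Keeping `prev` after clearDependencies() pins the entire parent lineage in
// memory; dropping the reference lets the parent RDD be garbage collected.
class MapPartitionsLikeRDD(private var prev: AnyRef /* parent RDD */) {
  def clearDependencies(): Unit = {
    prev = null  // previously this reference was retained, causing the leak
  }
}
```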

Author: Guillaume Poulin <[email protected]>

Closes apache#10623 from gpoulin/map_partition_deps.

(cherry picked from commit b673852)
Signed-off-by: Reynold Xin <[email protected]>
…not None"

This reverts commit fcd013c.

Author: Yin Huai <[email protected]>

Closes apache#10632 from yhuai/pythonStyle.

(cherry picked from commit e5cde7a)
Signed-off-by: Yin Huai <[email protected]>
modify 'spark.memory.offHeap.enabled' default value to false
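
With the default now false, off-heap memory must be opted into explicitly; a usage sketch (assuming the companion `spark.memory.offHeap.size` setting is also required; the size shown is illustrative):

```scala
import org.apache.spark.SparkConf

// Off-heap execution memory stays disabled unless explicitly enabled and sized.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")  // illustrative size, must be positive when enabled
```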

Author: zzcclp <[email protected]>

Closes apache#10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.

(cherry picked from commit 84e77a1)
Signed-off-by: Reynold Xin <[email protected]>
zero323 and others added 6 commits January 7, 2016 10:33
If the initial model passed to GMM is not empty, it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to a list.

Author: zero323 <[email protected]>

Closes apache#10644 from zero323/SPARK-12006.

(cherry picked from commit 592f649)
Signed-off-by: Joseph K. Bradley <[email protected]>
…pping splits

https://issues.apache.org/jira/browse/SPARK-12662

cc yhuai

Author: Sameer Agarwal <[email protected]>

Closes apache#10626 from sameeragarwal/randomsplit.

(cherry picked from commit f194d99)
Signed-off-by: Reynold Xin <[email protected]>
There is a bug in the calculation of ```maxSplitSize```.  The ```totalLen``` should be divided by ```minPartitions``` and not by ```files.size```.
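
In other words, the corrected arithmetic looks roughly like this (names are illustrative, not the exact fields in the patch):

```scala
// The split size should be driven by the requested minimum number of partitions,
// not by how many files happen to be in the input.
def maxSplitSize(totalLen: Long, minPartitions: Int): Long =
  math.ceil(totalLen / math.max(minPartitions, 1).toDouble).toLong
```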

Author: Darek Blasiak <[email protected]>

Closes apache#10546 from datafarmer/setminpartitionsbug.

(cherry picked from commit 8346518)
Signed-off-by: Sean Owen <[email protected]>
…owBatching configurations for Streaming

/cc tdas brkyvz

Author: Shixiong Zhu <[email protected]>

Closes apache#10453 from zsxwing/streaming-conf.

(cherry picked from commit c94199e)
Signed-off-by: Tathagata Das <[email protected]>
…branch 1.6)

backport apache#10609 to branch 1.6

Author: Shixiong Zhu <[email protected]>

Closes apache#10656 from zsxwing/SPARK-12591-branch-1.6.
markhamstra added a commit that referenced this pull request Jan 8, 2016
SKIPME merged Apache branch-1.6
markhamstra merged commit 07a9e45 into alteryx:csd-1.6 on Jan 8, 2016