
[SPARK-18686][SparkR][ML] Several cleanup and improvements for spark.logit. #16117

Closed · wants to merge 3 commits

Conversation

@yanboliang (Contributor) commented Dec 2, 2016

What changes were proposed in this pull request?

Several cleanups and improvements for spark.logit:

  • summary should return the coefficients matrix, and should output labels for each class when the model is a multinomial logistic regression model.
  • summary should not return areaUnderROC, roc, pr, ..., since most of them are DataFrames and are less important to R users. Moreover, these metrics currently ignore instance weights (treating them all as 1.0), which will change in a later Spark version. To avoid introducing breaking changes later, we do not expose them for now.
  • SparkR test improvement: compare the training result against native R glmnet (see the sketch after this list).
  • Remove the aggregationDepth argument from spark.logit, since it is an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.
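
The glmnet comparison mentioned above is sketched below. This is a minimal illustration for the binomial case, not the actual test code: the tolerance, the elasticNetParam setting, and the exact regParam/lambda correspondence are assumptions.

```
library(glmnet)
library(testthat)

# Binomial case: keep only two of the three iris classes.
data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
x <- as.matrix(data[, 1:4])
y <- factor(as.character(data$Species))

# alpha = 0 is pure L2 regularization, matching elasticNetParam = 0 in Spark.
r_model <- glmnet(x, y, family = "binomial", alpha = 0, lambda = 0.5)
r_coefs <- as.numeric(as.matrix(coef(r_model)))

df <- suppressWarnings(createDataFrame(data))
spark_model <- spark.logit(df, Species ~ ., regParam = 0.5, elasticNetParam = 0)
spark_coefs <- summary(spark_model)$coefficients[, "Estimate"]

# Expect intercept and feature coefficients to roughly agree; the tolerance
# is an arbitrary illustrative value.
expect_true(all(abs(r_coefs - spark_coefs) < 0.1))
```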

How was this patch tested?

Unit tests.

The summary output after this change:
multinomial logistic regression:

> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165

binomial logistic regression:

> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947

@SparkQA commented Dec 2, 2016

Test build #69556 has finished for PR 16117 at commit e78adcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981 (Contributor) left a comment

In general, it looks good to me. I have some minor comments.


val pipeline = PipelineModel.load(pipelinePath)
-      new LogisticRegressionWrapper(pipeline, features, isLoaded = true)
+      new LogisticRegressionWrapper(pipeline, features, labels)
      }
    }
  }
wangmiao1981 (Contributor) commented on the diff:

new line?

#'
#' # fitted values on training data
#' fitted <- predict(model, training)
#'
#' # save fitted model to input path
#' path <- "path/to/model"
#' write.ml(blr_model, path)
wangmiao1981 (Contributor):

Since you changed blr_model -> model, it should be model here.

yanboliang (Contributor, Author):

Good catch, updated. Thanks.

features <- callJMethod(jobj, "rFeatures")
labels <- callJMethod(jobj, "labels")
coefficients <- callJMethod(jobj, "rCoefficients")
nCol <- length(coefficients) / length(features)
wangmiao1981 (Contributor):

Can we do the nCol calculation and column naming on the Scala side? Then we don't have to call rFeatures and labels on the R side, which makes the logic simpler.

yanboliang (Contributor, Author):

Yeah, we could. The reason I did it this way is that we may later add more model statistics that differ between binomial and multinomial logistic regression, so we need to distinguish them on the R side anyway.

wangmiao1981 (Contributor):

Sounds good.
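
For context, a rough sketch of the R-side reshaping being discussed: the flat coefficient vector returned from the JVM is turned into a named matrix, and nCol distinguishes binomial from multinomial models. Variable names follow the snippet above; this is illustrative, not the exact wrapper code.

```
features <- callJMethod(jobj, "rFeatures")
labels <- callJMethod(jobj, "labels")
coefficients <- callJMethod(jobj, "rCoefficients")
nCol <- length(coefficients) / length(features)

# Reshape the flat coefficient vector into a features x classes matrix.
coefficients <- matrix(unlist(coefficients), ncol = nCol)
rownames(coefficients) <- unlist(features)
if (nCol == 1) {
  # Binomial model: a single column of estimates.
  colnames(coefficients) <- c("Estimate")
} else {
  # Multinomial model: one column per class label.
  colnames(coefficients) <- unlist(labels)
}
```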

@SparkQA commented Dec 4, 2016

Test build #69638 has finished for PR 16117 at commit 6fc9dec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' # summary of binary logistic regression
#' blr_summary <- summary(blr_model)
#' blr_fmeasure <- collect(select(blr_summary$fMeasureByThreshold, "threshold", "F-Measure"))
#' df <- suppressWarnings(createDataFrame(iris))
felixcheung (Member):
We try not to have suppressWarnings in examples. While the warning message is not great (we have a JIRA on columns with _...), IMO it's best not to have it in the example, since it probably raises more questions for a casual reader.
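
One illustrative way to keep the roxygen example warning-free (not necessarily the change made in this PR) is to rename the iris columns up front, so createDataFrame does not warn about replacing "." in column names:

```
# Rename columns before creating the SparkDataFrame to avoid the warning.
iris2 <- iris
colnames(iris2) <- gsub("\\.", "_", colnames(iris2))
df <- createDataFrame(iris2)
model <- spark.logit(df, Species ~ ., regParam = 0.5)
```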

@felixcheung (Member) commented:

LGTM except a minor comment on example. thanks!

@SparkQA commented Dec 6, 2016

Test build #69718 has finished for PR 16117 at commit 914c9ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor, Author) commented:

Since this would involve a breaking change if merged after the 2.1 release, and I think it actually belongs to the scope of QA, I'll merge it into master and branch-2.1. Thanks for all your reviews.

asfgit pushed a commit that referenced this pull request Dec 7, 2016
[SPARK-18686][SparkR][ML] Several cleanup and improvements for spark.logit.

Author: Yanbo Liang <[email protected]>

Closes #16117 from yanboliang/spark-18686.

(cherry picked from commit 90b59d1)
Signed-off-by: Yanbo Liang <[email protected]>
asfgit closed this in 90b59d1 Dec 7, 2016
yanboliang deleted the spark-18686 branch December 7, 2016 08:34
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017