Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide #22675

Closed
wants to merge 6 commits into from

Conversation

WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Oct 9, 2018

What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

How was this patch tested?

Scala:
1

Java:
2

Python:
3

R:
4

@SparkQA
Copy link

SparkQA commented Oct 9, 2018

Test build #97141 has finished for PR 22675 at commit 887f528.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

</div>

<div data-lang="java" markdown="1">
[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why did we put the image source inside of Spark, rather then a separate module? (see also #21742 (comment)). Avro was put as a separate module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mengxr as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually it depends on how important the use case is. For example, CSV was created as an external data source and later merged into Spark. See https://issues.apache.org/jira/browse/SPARK-21866?focusedCommentId=16148268&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16148268.

Copy link
Member

@HyukjinKwon HyukjinKwon Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant (external) Avro was merged into external/... in Apache Spark as a separate module due to the reason above. Image data source is merged into Spark's main code rather then a separate module. I don't object to bring an external into Apache Spark and I don't doubt you guys's judgement - ++1 for bring this in actually.

My point is I was wondering why this exists in Spark's main code whereas the ideal approach is to put them external/....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @cloud-fan and @gatorsmile, am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sympathize with the comment, but I think it makes some sense tucked into ML rather than a standalone module.

@felixcheung
Copy link
Member

we need this for 2.4?

@WeichenXu123
Copy link
Contributor Author

This do not block 2.4 release. But merge before 2.4 is better.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some proofreading but the idea seems OK

---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources "parquat", "csv", "json", "jdbc", we also provide some specific data source for ML.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • data sources -> data sources like
  • Do you want to just say "Parquet, CSV, JSON, and JDBC"? they aren't code identifiers here
  • parquat -> parquet
  • data source -> data sources


## Image data source

This image data source is used to load libsvm data files from directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from directory -> from a directory
Maybe say "This data source loads images in libsvm format from a directory"?

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
implements Spark SQL data source API for loading image data as DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implements Spark -> implements a Spark
as DataFrame -> as a DataFrame
This sentence is repeated three times. Can you move the shared text out of the language-specific code blocks?

</div>

<div data-lang="python" markdown="1">
In scala we implement Spark SQL data source API for loading image data as DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scala -> Scala, but this is about Python.

@SparkQA
Copy link

SparkQA commented Oct 10, 2018

Test build #97184 has finished for PR 22675 at commit cc4c967.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2018

Test build #97185 has finished for PR 22675 at commit 994d09d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON, JDBC -> JSON and JDBC

## Image data source

This image data source is used to load image files from a directory.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
Copy link
Member

@HyukjinKwon HyukjinKwon Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we describe which image we can load? For instance, I think this delegates to ImageIO in Java which allows to read compressed format like PNG or JPG to raw image representation like BMP so that OpenCV can handles them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

## Image data source

This image data source is used to load image files from a directory.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also describe the schema structure and what each field means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about SQL syntax? I think we can use CREATE TABLE tableA USING LOCATION 'data/image.png'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like SQL features and fit all datasources. Put it in spark SQL doc will be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add an example for R as well then? It wouldn't be too difficult to add the equivalent examples. Also, I don't think we will add the equivalent examples in different languages at different pages.

implements Spark SQL data source API for loading image data as DataFrame.

{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").load("data/mllib/images/origin");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do a simple transformation so that how the image datasource can be utilized?

@SparkQA
Copy link

SparkQA commented Oct 18, 2018

Test build #97514 has finished for PR 22675 at commit ba0d888.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
The schema of the `image` column is:
- origin: String (represents the file path of the image)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use SQL types consistently, for instance, StringType, IntegerType

## Image data source

This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. -> ,.

---
layout: global
title: Data sources
displayTitle: Data sources
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be Datasource or Data sources? I am saying this because there looks a mismatch with the menu above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data sources.

---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources like Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really personal preference tho .. like -> such as

## Image data source

This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we consistently make some codes such as StructType as codes like `StructType`

@HyukjinKwon
Copy link
Member

Looks cool otherwise!

@SparkQA
Copy link

SparkQA commented Oct 18, 2018

Test build #97541 has finished for PR 22675 at commit 1b54044.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"some specific data sources for ML"

@imatiach-msft
Copy link
Contributor

@WeichenXu123 this looks good - one recommendation though, in the examples it might be good to pass the param that ignores bad images so that the not-image.txt is not included in the output, it might confuse people who are new to the API - unless you feel strongly about having the not-image.txt output in the examples


**Table of Contents**

* This will become a table of contents (this text will be scraped).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this convention, to have this text here in the table of contents? "* This will become a table of contents (this text will be scraped)."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This keep the same with other ML algo page.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, ok, great

- origin: `StringType` (represents the file path of the image)
- height: `IntegerType` (height of the image)
- width: `IntegerType` (width of the image)
- nChannels: `IntegerType` (number of the image channels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: number of image channels (no "the")

@SparkQA
Copy link

SparkQA commented Oct 23, 2018

Test build #97889 has finished for PR 22675 at commit 8231cb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imatiach-msft
Copy link
Contributor

LGTM, this is great to have!

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from me as well

@cloud-fan
Copy link
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Oct 25, 2018
## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?
Scala:
<img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">

Java:
<img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">

Python:
<img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">

R:
<img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">

Closes #22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6540c2f)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 6540c2f Oct 25, 2018
@WeichenXu123 WeichenXu123 deleted the add_image_source_doc branch October 25, 2018 15:12
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?
Scala:
<img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">

Java:
<img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">

Python:
<img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">

R:
<img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">

Closes apache#22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants