[SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide #22675
Test build #97141 has finished for PR 22675 at commit
</div>

<div data-lang="java" markdown="1">
[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
Out of curiosity, why did we put the image source inside Spark rather than in a separate module? (See also #21742 (comment).) Avro was added as a separate module.
cc @mengxr as well
Usually it depends on how important the use case is. For example, CSV was created as an external data source and later merged into Spark. See https://issues.apache.org/jira/browse/SPARK-21866?focusedCommentId=16148268&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16148268.
I meant that (external) Avro was merged into external/... in Apache Spark as a separate module for the reason above. The image data source was merged into Spark's main code rather than into a separate module. I don't object to bringing an external source into Apache Spark and I don't doubt your judgement; ++1 for bringing this in, actually.
My point is that I was wondering why this lives in Spark's main code when the ideal approach is to put it under external/...
cc @cloud-fan and @gatorsmile, am I missing something?
I sympathize with the comment, but I think it makes some sense tucked into ML rather than a standalone module.
Do we need this for 2.4?
This does not block the 2.4 release, but merging before 2.4 would be better.
Needs some proofreading but the idea seems OK
docs/ml-datasource.md
Outdated
---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources "parquat", "csv", "json", "jdbc", we also provide some specific data source for ML.
- data sources -> data sources like
- Do you want to just say "Parquet, CSV, JSON, and JDBC"? They aren't code identifiers here.
- parquat -> parquet
- data source -> data sources
docs/ml-datasource.md
Outdated

## Image data source

This image data source is used to load libsvm data files from directory.
from directory -> from a directory
Maybe say "This data source loads images in libsvm format from a directory"?
docs/ml-datasource.md
Outdated
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
implements Spark SQL data source API for loading image data as DataFrame.
implements Spark -> implements a Spark
as DataFrame -> as a DataFrame
This sentence is repeated three times. Can you move the shared text out of the language-specific code blocks?
docs/ml-datasource.md
Outdated
</div>

<div data-lang="python" markdown="1">
In scala we implement Spark SQL data source API for loading image data as DataFrame.
scala -> Scala, but this is about Python.
Test build #97184 has finished for PR 22675 at commit
Test build #97185 has finished for PR 22675 at commit
docs/ml-datasource.md
Outdated
---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML.
JSON, JDBC -> JSON and JDBC
docs/ml-datasource.md
Outdated
## Image data source

This image data source is used to load image files from a directory.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
Shall we describe which image formats can be loaded? For instance, I think this delegates to ImageIO in Java, which can read compressed formats like PNG or JPG into a raw image representation like BMP so that OpenCV can handle them.
added.
docs/ml-datasource.md
Outdated
## Image data source

This image data source is used to load image files from a directory.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
I would also describe the schema structure and what each field means.
added.
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
How about SQL syntax? I think we can use CREATE TABLE tableA USING LOCATION 'data/image.png'
This looks like a SQL feature that applies to all data sources. It would be better to put it in the Spark SQL docs.
Shall we add an example for R as well, then? It wouldn't be too difficult to add the equivalent examples. Also, I don't think we should split equivalent examples for different languages across different pages.
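For reference, the SQL route discussed in this thread might look roughly like the sketch below. It assumes the `image` short name resolves in a `USING` clause the same way other built-in sources do, which is not confirmed in this thread. The helper names `image_view_sql` and `register_image_view`, and the view name, are hypothetical.

```python
def image_view_sql(view_name, path):
    """Build a CREATE TEMPORARY VIEW statement backed by the image data source.

    Assumption: the `image` short name can be used after USING, as with
    other built-in data sources.
    """
    return (
        f"CREATE TEMPORARY VIEW {view_name} "
        f"USING image OPTIONS (path '{path}')"
    )


def register_image_view(spark, view_name, path):
    """Run the statement on an existing SparkSession and return the view."""
    spark.sql(image_view_sql(view_name, path))
    # spark.table() looks the view up by name and returns it as a DataFrame.
    return spark.table(view_name)
```

With a live session this would be called as `register_image_view(spark, "images", "data/mllib/images/origin")`. Because nothing here is specific to images beyond the source name, the same pattern applies to any data source.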
docs/ml-datasource.md
Outdated
implements Spark SQL data source API for loading image data as DataFrame.

{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").load("data/mllib/images/origin");
Can we add a simple transformation to show how the image data source can be used?
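A minimal sketch of the kind of transformation being asked for, written against the PySpark DataFrame API. It assumes a DataFrame already loaded with `format("image")` and uses the field names from the schema discussed in this review. The function name `image_dimensions` is hypothetical.

```python
def image_dimensions(images_df):
    """Project the nested `image` struct down to its path and size fields."""
    # Dotted names select individual fields inside the struct column.
    return images_df.select("image.origin", "image.width", "image.height")
```

Calling `.show(truncate=False)` on the result would print one row per loaded image with its path, width, and height.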
Test build #97514 has finished for PR 22675 at commit
docs/ml-datasource.md
Outdated
This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
The schema of the `image` column is:
- origin: String (represents the file path of the image)
I would use SQL types consistently, for instance, StringType, IntegerType
docs/ml-datasource.md
Outdated
## Image data source

This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
. -> ,
---
layout: global
title: Data sources
displayTitle: Data sources
Should it be Datasource or Data sources? I ask because there appears to be a mismatch with the menu above.
Data sources.
docs/ml-datasource.md
Outdated
---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources like Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML.
Really just a personal preference, though: like -> such as
docs/ml-datasource.md
Outdated
## Image data source

This image data source is used to load image files from a directory, it can load compressed image (jpeg, png, etc.) into raw image representation via ImageIO in Java library.
The loaded DataFrame has one StructType column: "image". containing image data stored as image schema.
Shall we consistently format code identifiers such as StructType as code, like `StructType`?
Looks cool otherwise!
Test build #97541 has finished for PR 22675 at commit
docs/ml-datasource.md
Outdated
---

In this section, we introduce how to use data source in ML to load data.
Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML.
"some specific data sources for ML"
@WeichenXu123 this looks good. One recommendation, though: in the examples it might be good to pass the param that ignores bad images, so that not-image.txt is not included in the output. That row might confuse people who are new to the API, unless you feel strongly about keeping the not-image.txt output in the examples.
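As an illustration of that recommendation, a load call that skips unreadable files could look like the sketch below. The thread does not name the parameter. The sketch assumes it is the image source's `dropInvalid` option, and the function name `load_valid_images` is hypothetical.

```python
def load_valid_images(spark, path):
    """Load a directory of images, skipping files that cannot be decoded."""
    return (
        spark.read.format("image")
        # Assumed option name: drops rows for files such as not-image.txt
        # that are not valid images.
        .option("dropInvalid", True)
        .load(path)
    )
```

For example, `load_valid_images(spark, "data/mllib/images/origin")` would return only the rows for files that decode as images.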

**Table of Contents**

* This will become a table of contents (this text will be scraped).
Is this the convention, to have this text here in the table of contents? "* This will become a table of contents (this text will be scraped)."
Yes. This keeps it consistent with the other ML algorithm pages.
ah, ok, great
docs/ml-datasource.md
Outdated
- origin: `StringType` (represents the file path of the image)
- height: `IntegerType` (height of the image)
- width: `IntegerType` (width of the image)
- nChannels: `IntegerType` (number of the image channels)
minor: number of image channels (no "the")
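To make the field list above concrete, the sketch below records the four fields visible in this hunk as plain Python data and derives the dotted column names used to select them. The hunk is cut off after `nChannels`, so any later fields of the schema are left out. `IMAGE_FIELDS` and `image_field_paths` are hypothetical names.

```python
# (field name, Spark SQL type, meaning), as quoted in the reviewed doc text.
IMAGE_FIELDS = [
    ("origin", "StringType", "file path of the image"),
    ("height", "IntegerType", "height of the image"),
    ("width", "IntegerType", "width of the image"),
    ("nChannels", "IntegerType", "number of image channels"),
]


def image_field_paths(column="image"):
    """Return dotted paths such as 'image.origin' for use in DataFrame.select."""
    return [f"{column}.{name}" for name, _sql_type, _meaning in IMAGE_FIELDS]
```

The resulting names can be passed to `select` on a DataFrame loaded from the image source, for example `df.select(*image_field_paths())`.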
1b54044 to 8231cb2
Test build #97889 has finished for PR 22675 at commit
LGTM, this is great to have!
Looks good to me as well.
Thanks, merging to master/2.4!
## What changes were proposed in this pull request?
Spark datasource for image/libsvm user guide
## How was this patch tested?
Scala: <img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">
Java: <img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">
Python: <img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">
R: <img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">
Closes #22675 from WeichenXu123/add_image_source_doc.
Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6540c2f)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
Spark datasource for image/libsvm user guide
## How was this patch tested?
Scala: <img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">
Java: <img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">
Python: <img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">
R: <img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">
Closes apache#22675 from WeichenXu123/add_image_source_doc.
Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Spark datasource for image/libsvm user guide
How was this patch tested?
Scala:
Java:
Python:
R: