
[SPARK-12517] add default RDD name for one created via sc.textFile #10456

Closed
wants to merge 10 commits into from

Conversation

@wyaron (Contributor) commented Dec 23, 2015

The feature was first added at commit: 7b877b2 but was later removed (probably by mistake) at commit: fc8b581.
This change sets the default path of RDDs created via sc.textFile(...) to the path argument.

Here is the symptom:

  • Using spark-1.5.2-bin-hadoop2.6:

scala> sc.textFile("/home/root/.bashrc").name
res5: String = null

scala> sc.binaryFiles("/home/root/.bashrc").name
res6: String = /home/root/.bashrc

  • While using Spark 1.3.1:

scala> sc.textFile("/home/root/.bashrc").name
res0: String = /home/root/.bashrc

scala> sc.binaryFiles("/home/root/.bashrc").name
res1: String = /home/root/.bashrc
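
The fix itself is small: the RDD that textFile returns just needs its name defaulted to the path argument. A minimal sketch of the pattern with toy types (not Spark's actual classes — ToyRDD and the two textFile variants here are hypothetical stand-ins), assuming setName returns the receiver for chaining as it does in Spark:

```scala
// Toy stand-in for an RDD's name bookkeeping (not Spark's actual API).
class ToyRDD(var name: String = null) {
  def setName(n: String): this.type = { name = n; this }
}

// Broken shape (the 1.5.2 symptom): the returned RDD keeps name == null.
def textFileBroken(path: String): ToyRDD = new ToyRDD()

// Fixed shape: default the name to the path argument before returning.
def textFileFixed(path: String): ToyRDD = new ToyRDD().setName(path)

println(textFileBroken("/home/root/.bashrc").name) // null
println(textFileFixed("/home/root/.bashrc").name)  // /home/root/.bashrc
```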

@andrewor14 (Contributor)

@wyaron please file a JIRA and attach it to the title of this PR. See how other patches are opened.

@andrewor14 (Contributor)

The changes here look OK to me.

@andrewor14 (Contributor)

ok to test

@SparkQA commented Dec 24, 2015

Test build #48271 has finished for PR 10456 at commit 86efdad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak (Member) commented Dec 24, 2015

LGTM as well. Could you add testcases for this change?

@wyaron changed the title add default RDD name for one created via sc.textFile → [SPARK-12517] add default RDD name for one created via sc.textFile Dec 24, 2015
This change extends SparkContextSuite to verify that RDDs that are created using file paths have their
default name set to the path.
@SparkQA commented Dec 24, 2015

Test build #48305 has finished for PR 10456 at commit 081a723.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 24, 2015

Test build #48307 has finished for PR 10456 at commit fe3d87d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 24, 2015

Test build #48309 has finished for PR 10456 at commit a2f5f07.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 24, 2015

Test build #48310 has finished for PR 10456 at commit 50cd119.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


var targetPath = mockPath + "textFile"
assert(sc.textFile(targetPath).name == targetPath)

Member (inline review comment):

Could you compare with === instead of ==?
We get better error messages from the === operator.
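
For context on the === suggestion: plain Scala assert(a == b) fails with a bare "assertion failed" and no operand values, whereas ScalaTest's === reports both sides. A quick illustration of the plain-assert behavior (toy example, not the PR's test code; the ScalaTest message shown in the comment is approximate):

```scala
// Plain Scala assert gives no operand values on failure:
val failureMsg =
  try { assert("/tmp/a" == "/tmp/b"); "no failure" }
  catch { case e: AssertionError => e.getMessage }
println(failureMsg) // assertion failed

// With ScalaTest's === the failure would instead read along the lines of:
//   "/tmp/a" did not equal "/tmp/b"
```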

@SparkQA commented Dec 24, 2015

Test build #48312 has finished for PR 10456 at commit 825f3f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -274,6 +274,31 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext {
}
}

test("Default path for file based RDDs is properly set (SPARK-12517)") {
Member (inline review comment):

Indentation is broken.

@SparkQA commented Dec 24, 2015

Test build #48313 has finished for PR 10456 at commit 99b5bde.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wyaron (Contributor, Author) commented Dec 24, 2015

Thanks for the quick CR!
Apparently, the new unit test has just failed for sc.wholeTextFiles(...) :-)
My guess is that it was broken in commit 297048f.
The change called setName on the WholeTextFileRDD and then called map on it, which yields a new RDD without the default name set (see code snippet below).

It seems that setName should be called AFTER the map is invoked, hence on the returned RDD.
Note that wholeTextFiles already provides the file name for each data item in the RDD, so the default path name is less useful... still, for consistency it should be fixed.

Here is the original change:
new WholeTextFileRDD(
this,
classOf[WholeTextFileInputFormat],
- classOf[String],
- classOf[String],
+ classOf[Text],
+ classOf[Text],
updateConf,
- minPartitions).setName(path)
+ minPartitions).setName(path).map(record => (record._1.toString, record._2.toString))
}

I have committed the proposed change which moves setName after the map. I've tested locally and now the name field is valid.

Does it make sense?
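
The ordering issue described above can be demonstrated with a toy model (NamedRDD is a hypothetical stand-in, not actual Spark code), assuming, as in Spark, that map builds a fresh RDD and does not copy the parent's name:

```scala
// Toy model: map() yields a new RDD, and the new RDD does not inherit
// its parent's name, so setName must be applied to the RDD that map
// returns, not to the parent beforehand.
class NamedRDD[T](val data: Seq[T]) {
  var name: String = null
  def setName(n: String): this.type = { name = n; this }
  def map[U](f: T => U): NamedRDD[U] = new NamedRDD(data.map(f)) // name not copied
}

// Buggy order (setName before map): the name is set on the parent, then lost.
val buggy = new NamedRDD(Seq(1, 2)).setName("/tmp/path").map(_ + 1)
println(buggy.name) // null

// Fixed order (setName after map): the returned RDD carries the name.
val fixed = new NamedRDD(Seq(1, 2)).map(_ + 1).setName("/tmp/path")
println(fixed.name) // /tmp/path
```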

@SparkQA commented Dec 24, 2015

Test build #48314 has finished for PR 10456 at commit 08d3fe2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak (Member) commented Dec 25, 2015

I have committed the proposed change which moves setName after the map. I've tested locally and now the name field is valid.

Does it make sense?

Yeah, I think it's a good catch.

@andrewor14 (Contributor)

LGTM

@sarutak (Member) commented Dec 28, 2015

OK, I'll merge this into master and branch-1.6. Thanks @wyaron!

asfgit pushed a commit that referenced this pull request Dec 28, 2015

Author: Yaron Weinsberg <[email protected]>
Author: yaron <[email protected]>

Closes #10456 from wyaron/master.

(cherry picked from commit 73b70f0)
Signed-off-by: Kousuke Saruta <[email protected]>
@asfgit closed this in 73b70f0 Dec 28, 2015