[SPARK-3694] RDD and Task serialization debugging output #3518
Conversation
Can one of the admins verify this patch?
I'm going to let Jenkins test this, but my hunch is that the first run is going to fail due to Scalastyle warnings / errors. I'll comment on a couple of these style points inline. Jenkins, this is ok to test.
 * @return - An output string qualifying success or failure.
 */
private def isSerializable(rdd: RDD[_]): String = {
  SerializationHelper.isSerializable(closureSerializer,rdd)
Put spaces between arguments: closureSerializer, rdd
(see https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)
Test build #24598 has finished for PR 3518 at commit
Test PASSed.
Hi @JoshRosen - just checking in to make sure things are moving on #3638, since it's a blocker to this patch. Please let me know how that's going; it looks to be almost complete. Thanks!
Test build #25420 has started for PR 3518 at commit
Hi @JoshRosen, #3638 has now been merged, and I've resolved the minor merge conflicts and pushed the updates. If you could please review this at your convenience, I'd love to have it merged in as well. Thanks!
Test build #25420 has finished for PR 3518 at commit
Test PASSed.
@@ -827,9 +868,21 @@ class DAGScheduler(
    // might modify state of objects referenced in their closures. This is necessary in Hadoop
    // where the JobConf/Configuration object is not thread-safe.
    var taskBinary: Broadcast[Array[Byte]] = null

    // Check if RDD serialization debugging is enabled
    val debugSerialization: Boolean = sc.getConf.getBoolean("spark.serializer.debug", false)
Rather than having a config option for this, why not just always run this debugging output after we've seen a serialization failure? The performance overhead won't matter much if we do it after a failure only.
Ah, I see - this does that already. So I'd just remove the config option and always print the debugging output if there is a failure when serializing in the TaskSetManager. We usually try not to add config options unless there is a really compelling reason for the feature not to be always enabled.
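The pattern suggested here, of dropping the config flag and paying for the expensive debugging pass only after serialization has already failed, can be sketched in plain Java. Everything below is a hypothetical stand-in (the `serialize` helper mimics the closure serializer; it is not Spark code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

public class FailFastDebug {
    // Stand-in for the closure serializer: plain Java serialization.
    static byte[] serialize(Object task) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(task);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Object task = new Thread(); // deliberately not Serializable
        try {
            serialize(task);
            System.out.println("serialized OK");
        } catch (NotSerializableException e) {
            // Only now run the expensive debugging pass: the common
            // (successful) path pays no overhead, so no config flag is needed.
            System.out.println("Failed to serialize: " + e.getMessage());
            System.out.println("running serialization trace...");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Since `NotSerializableException`'s message is the offending class name, even this bare-bones handler already names the class that broke serialization.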
Hey, I just took a quick pass with some code style suggestions (more coming) and usability suggestions. One thing: would it be possible to track the names of the fields you are traversing? That would make the debugging output more useful. Also, is there a good reason to print the hash code? How would users use that?
Hi Patrick - thanks for the feedback. I would love to print out the names of the fields, but I wasn't able to figure out a way to do that - could you suggest how? I wasn't sure whether printing the hash code was useful or not; Josh included it in his original example of a traversal, so I figured I'd leave it in. I didn't know if there would be a way to look it up after the fact.
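For what it's worth, the field names asked about above can be read with plain Java reflection on the closure's class. A minimal sketch (the `ExampleClosure` class is illustrative, not Spark code):

```java
import java.lang.reflect.Field;

public class FieldNames {
    // Hypothetical closure with one serializable and one problematic field.
    static class ExampleClosure implements java.io.Serializable {
        int threshold = 42;
        Thread worker = new Thread(); // Thread is not Serializable
    }

    public static void main(String[] args) {
        // getDeclaredFields() yields each field's name and declared type,
        // which is exactly what a per-level debug trace could print.
        for (Field f : ExampleClosure.class.getDeclaredFields()) {
            System.out.println(f.getName() + " : " + f.getType().getName());
        }
    }
}
```

Printed names come from the class's declared fields, so they stay meaningful across runs, unlike a hash code, which is per-instance and cannot be looked up after the fact.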
Test build #25620 has started for PR 3518 at commit
Test build #25620 has finished for PR 3518 at commit
Test FAILed.
Test build #25624 has started for PR 3518 at commit
Test build #25624 has finished for PR 3518 at commit
Test PASSed.
Hey @ilganeli - I took a slightly deeper look this time. I still don't totally follow how this all hooks together, but I wonder if it's possible to write a single utility function that is much simpler. It would just do the following:
And it would do something like:
This wouldn't work for custom serializers; it would only work for the Java serializer. However, that's all we support for closures anyway. You can get the name and type of each field using reflection. There doesn't need to be any specific handling of the fact that the objects you are inspecting are RDDs. The ideal output would print the path along with the field name, type, and maybe the toString of each object encountered.
In this example there is an RDD in the lineage that has a field called
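The utility sketched above can be approximated in plain Java: attempt Java serialization at each node of the object graph, and when it fails, recurse through the declared fields to build the path to the offending value. All names here (the class, the methods, the example graph) are hypothetical, not the Spark implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.IdentityHashMap;

public class SerializationTrace {
    static boolean serializable(Object o) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Returns the field path to the first non-serializable value reachable
    // from obj, or null if obj serializes cleanly.
    static String findBadPath(Object obj, String path,
                              IdentityHashMap<Object, Object> seen) {
        if (obj == null || serializable(obj)) return null;
        if (seen.put(obj, obj) != null) return null; // cycle guard
        for (Class<?> c = obj.getClass(); c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers())) continue;
                Object value;
                try {
                    f.setAccessible(true);
                    value = f.get(obj);
                } catch (RuntimeException | IllegalAccessException e) {
                    continue; // JDK-internal fields may be inaccessible
                }
                if (value != null && !serializable(value)) {
                    String deeper = findBadPath(value, path + "." + f.getName(), seen);
                    return deeper != null ? deeper
                        : path + "." + f.getName() + " (" + value.getClass().getName() + ")";
                }
            }
        }
        // No single field is the culprit: the object's own class is.
        return path + " (" + obj.getClass().getName() + ")";
    }

    // Hypothetical closure graph: a Serializable outer object holding a
    // non-serializable inner one.
    static class Inner { Thread t = new Thread(); }
    static class Outer implements Serializable {
        int count = 1;
        Inner inner = new Inner();
    }

    public static void main(String[] args) {
        System.out.println(findBadPath(new Outer(), "outer", new IdentityHashMap<>()));
    }
}
```

For this example graph the reported path starts with `outer.inner.t`, i.e. the field-name path to the non-serializable `Thread`, which is exactly the kind of output the review asks for. As noted, this only models the Java serializer's view of the graph; a custom serializer could succeed where this trace predicts a failure.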
Hey, so it looks like while I was reviewing this patch @rxin actually ran into this and just wrote a fix himself (#4093). That fix is actually even simpler than what I was proposing and almost strictly better (the only downside is that you don't get field names). So I'm guessing we'll go with that one, but thanks for taking a whack at this.
BTW - my apologies for marking this as a starter task; it turned out to be more complicated. We can credit you for having worked on the feature as well.
Hey @pwendell - not a problem. The solutions are similar, but Reynold's has fewer moving parts. I appreciate the recognition. Thanks!
Hi all - in addition to what was explicitly requested in the original JIRA, I also added the ability to trace the serialization of RDDs so that you can see which specific dependency is unserializable. For debugging task serialization, I added a debug log output that shows the file and jar dependencies, but I am unsure whether I can add more functionality there. For the RDD it is possible to attempt to serialize each dependency in turn, which is why I can identify which component fails; for task debugging I did not see a straightforward way to do the same thing. If anyone can suggest an approach here, I would be happy to implement it.
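The per-dependency idea described above, of trying to serialize each piece of the lineage separately so the report can name the failing one, amounts to a loop like the following. The lineage map and its RDD-style names are made up for illustration; they are not Spark's dependency representation:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineageCheck {
    static boolean serializable(Object o) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the objects hanging off each dependency in a lineage.
        Map<String, Object> lineage = new LinkedHashMap<>();
        lineage.put("MappedRDD[2]", "a serializable closure stand-in");
        lineage.put("FilteredRDD[1]", new Thread()); // the bad dependency
        lineage.put("HadoopRDD[0]", Integer.valueOf(42));

        // Try each dependency on its own so the report names the culprit,
        // rather than failing once on the whole graph.
        for (Map.Entry<String, Object> dep : lineage.entrySet()) {
            String status = serializable(dep.getValue()) ? "OK" : "FAILED";
            System.out.println(dep.getKey() + " -> " + status);
        }
    }
}
```

This is also why the same trick is hard for whole tasks: a task is one opaque closure rather than a chain of separately serializable pieces, so there is no natural unit to retry one at a time.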