Skip to content

Commit

Permalink
[SPARK-4806] Streaming doc update for 1.2
Browse files Browse the repository at this point in the history
Important updates to the streaming programming guide
- Make the fault-tolerance properties easier to understand, with information about write ahead logs
- Update the information about deploying the spark streaming app with information about Driver HA
- Update Receiver guide to discuss reliable vs unreliable receivers.

Author: Tathagata Das <[email protected]>
Author: Josh Rosen <[email protected]>
Author: Josh Rosen <[email protected]>

Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:

f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Pythn API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
  • Loading branch information
tdas committed Dec 11, 2014
1 parent 2a5b5fd commit b004150
Show file tree
Hide file tree
Showing 7 changed files with 819 additions and 551 deletions.
13 changes: 7 additions & 6 deletions docs/_layouts/global.html
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@
<!-- Google analytics script -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-32518208-1']);
_gaq.push(['_setAccount', 'UA-32518208-2']);
_gaq.push(['_trackPageview']);

(function() {
Expand Down Expand Up @@ -79,9 +79,9 @@
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/scala/index.html#org.apache.spark.package">Scaladoc</a></li>
<li><a href="api/java/index.html">Javadoc</a></li>
<li><a href="api/python/index.html">Python API</a></li>
<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
</ul>
</li>

Expand All @@ -91,10 +91,11 @@
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
<li><a href="spark-standalone.html">Standalone Mode</a></li>
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
<li class="divider"></li>
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
</ul>
</li>

Expand Down
133 changes: 74 additions & 59 deletions docs/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ title: Spark Configuration
Spark provides three locations to configure the system:

* [Spark properties](#spark-properties) control most application parameters and can be set by using
a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object, or through Java
a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, or through Java
system properties.
* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
the IP address, through the `conf/spark-env.sh` script on each node.
Expand All @@ -23,8 +23,8 @@ application. These properties can be set directly on a
(e.g. master URL and application name), as well as arbitrary key-value pairs through the
`set()` method. For example, we could initialize an application with two threads as follows:

Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
which can help detect bugs that only exist when we run in a distributed context.
Note that we run with local[2], meaning two threads - which represents "minimal" parallelism,
which can help detect bugs that only exist when we run in a distributed context.

{% highlight scala %}
val conf = new SparkConf()
Expand All @@ -35,7 +35,7 @@ val sc = new SparkContext(conf)
{% endhighlight %}

Note that we can have more than 1 thread in local mode, and in cases like spark streaming, we may actually
require one to prevent any sort of starvation issues.
require one to prevent any sort of starvation issues.

## Dynamically Loading Spark Properties
In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
Expand All @@ -48,8 +48,8 @@ val sc = new SparkContext(new SparkConf())

Then, you can supply configuration values at runtime:
{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}

The Spark shell and [`spark-submit`](submitting-applications.html)
Expand Down Expand Up @@ -123,7 +123,7 @@ of the most common options to set are:
<td>
Limit of total size of serialized results of all partitions for each Spark action (e.g. collect).
Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size
is above this limit.
is above this limit.
Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
and memory overhead of objects in JVM). Setting a proper limit can protect the driver from
out-of-memory errors.
Expand Down Expand Up @@ -217,6 +217,45 @@ Apart from these, the following properties are also available, and may be useful
Set a special library path to use when launching executor JVM's.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.strategy</code></td>
<td>(none)</td>
<td>
Set the strategy of rolling of executor logs. By default it is disabled. It can
be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
the maximum file size for rolling.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.time.interval</code></td>
<td>daily</td>
<td>
Set the time interval by which the executor logs will be rolled over.
Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
<td>(none)</td>
<td>
Set the max size of the file by which the executor logs will be rolled over.
Rolling is disabled by default. Value is set in terms of bytes.
See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
<td>(none)</td>
<td>
Sets the number of latest rolling log files that are going to be retained by the system.
Older log files will be deleted. Disabled by default.
</td>
</tr>
<tr>
<td><code>spark.files.userClassPathFirst</code></td>
<td>false</td>
Expand Down Expand Up @@ -250,10 +289,11 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.python.profile.dump</code></td>
<td>(none)</td>
<td>
The directory which is used to dump the profile result before driver exiting.
The directory which is used to dump the profile result before driver exiting.
The results will be dumped as separated file for each RDD. They can be loaded
by ptats.Stats(). If this is specified, the profile result will not be displayed
automatically.
</td>
</tr>
<tr>
<td><code>spark.python.worker.reuse</code></td>
Expand All @@ -269,8 +309,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
<td>(none)</td>
<td>
Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
process. The user can specify multiple of these and to set multiple environment variables.
Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
process. The user can specify multiple of these and to set multiple environment variables.
</td>
</tr>
<tr>
Expand Down Expand Up @@ -475,9 +515,9 @@ Apart from these, the following properties are also available, and may be useful
<td>
The codec used to compress internal data such as RDD partitions, broadcast variables and
shuffle outputs. By default, Spark provides three codecs: <code>lz4</code>, <code>lzf</code>,
and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
e.g.
<code>org.apache.spark.io.LZ4CompressionCodec</code>,
and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
e.g.
<code>org.apache.spark.io.LZ4CompressionCodec</code>,
<code>org.apache.spark.io.LZFCompressionCodec</code>,
and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
</td>
Expand Down Expand Up @@ -945,7 +985,7 @@ Apart from these, the following properties are also available, and may be useful
(resources are executors in yarn mode, CPU cores in standalone mode)
to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
Regardless of whether the minimum ratio of resources has been reached,
the maximum amount of time it will wait before scheduling begins is controlled by config
the maximum amount of time it will wait before scheduling begins is controlled by config
<code>spark.scheduler.maxRegisteredResourcesWaitingTime</code>.
</td>
</tr>
Expand All @@ -954,7 +994,7 @@ Apart from these, the following properties are also available, and may be useful
<td>30000</td>
<td>
Maximum amount of time to wait for resources to register before scheduling begins
(in milliseconds).
(in milliseconds).
</td>
</tr>
<tr>
Expand Down Expand Up @@ -1023,7 +1063,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
Whether Spark acls should are enabled. If enabled, this checks to see if the user has
access permissions to view or modify the job. Note this requires the user to be known,
access permissions to view or modify the job. Note this requires the user to be known,
so if the user comes across as null no checks are done. Filters can be used with the UI
to authenticate and set the user.
</td>
Expand Down Expand Up @@ -1062,17 +1102,31 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.streaming.blockInterval</code></td>
<td>200</td>
<td>
Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced
into blocks of data before storing them in Spark.
Interval (milliseconds) at which data received by Spark Streaming receivers is chunked
into blocks of data before storing them in Spark. Minimum recommended - 50 ms. See the
<a href="streaming-programming-guide.html#level-of-parallelism-in-data-receiving">performance
tuning</a> section in the Spark Streaming programing guide for more details.
</td>
</tr>
<tr>
<td><code>spark.streaming.receiver.maxRate</code></td>
<td>infinite</td>
<td>
Maximum rate (per second) at which each receiver will push data into blocks. Effectively,
each stream will consume at most this number of records per second.
Maximum number records per second at which each receiver will receive data.
Effectively, each stream will consume at most this number of records per second.
Setting this configuration to 0 or a negative number will put no limit on the rate.
See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
in the Spark Streaming programing guide for mode details.
</td>
</tr>
<tr>
<td><code>spark.streaming.receiver.writeAheadLogs.enable</code></td>
<td>false</td>
<td>
Enable write ahead logs for receivers. All the input data received through receivers
will be saved to write ahead logs that will allow it to be recovered after driver failures.
See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
in the Spark Streaming programing guide for more details.
</td>
</tr>
<tr>
Expand All @@ -1086,45 +1140,6 @@ Apart from these, the following properties are also available, and may be useful
higher memory usage in Spark.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.strategy</code></td>
<td>(none)</td>
<td>
Set the strategy of rolling of executor logs. By default it is disabled. It can
be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
the maximum file size for rolling.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.time.interval</code></td>
<td>daily</td>
<td>
Set the time interval by which the executor logs will be rolled over.
Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
<td>(none)</td>
<td>
Set the max size of the file by which the executor logs will be rolled over.
Rolling is disabled by default. Value is set in terms of bytes.
See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
<td>(none)</td>
<td>
Sets the number of latest rolling log files that are going to be retained by the system.
Older log files will be deleted. Disabled by default.
</td>
</tr>
</table>

#### Cluster Managers
Expand Down
Loading

0 comments on commit b004150

Please sign in to comment.