From 9103680509c0d778e68e12594bc381d8eff88291 Mon Sep 17 00:00:00 2001
From: Lilian Lee
Date: Wed, 9 May 2018 16:57:30 +0800
Subject: [PATCH] tispark: update the link and expression (#464)

---
 tispark/tispark-quick-start-guide.md |  4 +-
 tispark/tispark-user-guide.md        | 74 +++++++++++++---------------
 2 files changed, 35 insertions(+), 43 deletions(-)

diff --git a/tispark/tispark-quick-start-guide.md b/tispark/tispark-quick-start-guide.md
index ae3ba4fa7af2f..2f0a775915c4a 100644
--- a/tispark/tispark-quick-start-guide.md
+++ b/tispark/tispark-quick-start-guide.md
@@ -3,9 +3,9 @@ title: TiSpark Quick Start Guide
 category: User Guide
 ---
 
-# Quick Start Guide for the TiDB Connector for Spark
+# TiSpark Quick Start Guide
 
-To make it easy to try [the TiDB Connector for Spark](tispark-user-guide.md), TiDB cluster integrates Spark, TiSpark jar package and TiSpark sample data by default, in both the Pre-GA and master versions installed using TiDB-Ansible.
+To make it easy to [try TiSpark](tispark-user-guide.md), the TiDB cluster integrates Spark, the TiSpark jar package, and TiSpark sample data by default, in both the Pre-GA and master versions installed using TiDB-Ansible.
 
 ## Deployment information
 
diff --git a/tispark/tispark-user-guide.md b/tispark/tispark-user-guide.md
index ca1596be77b26..dd1c6edfce8d9 100644
--- a/tispark/tispark-user-guide.md
+++ b/tispark/tispark-user-guide.md
@@ -1,39 +1,38 @@
 ---
-title: TiDB Connector for Spark User Guide
+title: TiSpark User Guide
 category: user guide
 ---
 
-# TiDB Connector for Spark User Guide
+# TiSpark User Guide
 
-The TiDB Connector for Spark is a thin layer built for running Apache Spark on top of TiDB/TiKV to answer the complex OLAP queries. It takes advantages of both the Spark platform and the distributed TiKV cluster and seamlessly glues to TiDB, the distributed OLTP database, to provide a Hybrid Transactional/Analytical Processing (HTAP) solution to serve as a one-stop solution for both online transactions and analysis.
+[TiSpark](https://github.com/pingcap/tispark) is a thin layer built for running Apache Spark on top of TiDB/TiKV to answer complex OLAP queries. It takes advantage of both the Spark platform and the distributed TiKV cluster, and seamlessly glues to TiDB, the distributed OLTP database, to provide a one-stop Hybrid Transactional/Analytical Processing (HTAP) solution for both online transactions and analysis.
 
-The TiDB Connector for Spark depends on the TiKV cluster and the PD cluster. You also need to set up a Spark cluster. This document provides a brief introduction to how to setup and use the TiDB Connector for Spark. It requires some basic knowledge of Apache Spark. For more information, see [Spark website](https://spark.apache.org/docs/latest/index.html).
+TiSpark depends on the TiKV cluster and the PD cluster. You also need to set up a Spark cluster. This document provides a brief introduction to how to set up and use TiSpark. It requires some basic knowledge of Apache Spark. For more information, see the [Spark website](https://spark.apache.org/docs/latest/index.html).
 
 ## Overview
 
-The TiDB Connector for Spark is an OLAP solution that runs Spark SQL directly on TiKV, the distributed storage engine.
+TiSpark is an OLAP solution that runs Spark SQL directly on TiKV, the distributed storage engine.
 
-![TiDB Connector for Spark architecture](../media/tispark-architecture.png)
+![TiSpark architecture](../media/tispark-architecture.png)
 
-+ TiDB Connector for Spark integrates with Spark Catalyst Engine deeply. It provides precise control of the computing, which allows Spark read data from TiKV efficiently. It also supports index seek, which improves the performance of the point query execution significantly.
++ TiSpark integrates deeply with the Spark Catalyst Engine. It provides precise control of computing, which allows Spark to read data from TiKV efficiently. It also supports index seek, which significantly improves the performance of point query execution.
 + It utilizes several strategies to push down the computing to reduce the size of dataset handling by Spark SQL, which accelerates the query execution. It also uses the TiDB built-in statistical information for the query plan optimization.
-+ From the data integration point of view, TiDB Connector for Spark and TiDB serve as a solution runs both transaction and analysis directly on the same platform without building and maintaining any ETLs. It simplifies the system architecture and reduces the cost of maintenance.
-+ also, you can deploy and utilize tools from the Spark ecosystem for further data processing and manipulation on TiDB. For example, using the TiDB Connector for Spark for data analysis and ETL; retrieving data from TiKV as a machine learning data source; generating reports from the scheduling system and so on.
++ From the data integration point of view, TiSpark and TiDB serve as a solution that runs both transaction and analysis directly on the same platform without building and maintaining any ETLs. It simplifies the system architecture and reduces the cost of maintenance.
++ Also, you can deploy and utilize tools from the Spark ecosystem for further data processing and manipulation on TiDB. For example, using TiSpark for data analysis and ETL, retrieving data from TiKV as a machine learning data source, generating reports from the scheduling system, and so on.
 
 ## Environment setup
 
-+ The current version of the TiDB Connector for Spark supports Spark 2.1. For Spark 2.0 and Spark 2.2, it has not been fully tested yet. It does not support any versions earlier than 2.0.
-+ The TiDB Connector for Spark requires JDK 1.8+ and Scala 2.11 (Spark2.0 + default Scala version).
-+ The TiDB Connector for Spark runs in any Spark mode such as YARN, Mesos, and Standalone.
-
++ The current version of TiSpark supports Spark 2.1. It has not been fully tested with Spark 2.0 or Spark 2.2 yet, and it does not support any version earlier than 2.0.
++ TiSpark requires JDK 1.8+ and Scala 2.11 (the default Scala version for Spark 2.0+).
++ TiSpark runs in any Spark mode such as YARN, Mesos, and Standalone.
 ## Recommended configuration
 
-### Deployment of TiKV and the TiDB Connector for Spark clusters
+### Deployment of TiKV and TiSpark clusters
 
 #### Configuration of the TiKV cluster
 
-For independent deployment of TiKV and the TiDB Connector for Spark, it is recommended to refer to the following recommendations
+For independent deployment of TiKV and TiSpark, refer to the following recommendations:
 
 + Hardware configuration
     - For general purposes, please refer to the TiDB and TiKV hardware configuration [recommendations](https://github.com/pingcap/docs/blob/master/op-guide/recommendation.md#deployment-recommendations).
@@ -67,12 +66,11 @@ For independent deployment of TiKV and the TiDB Connector for Spark, it is recom
     scheduler-worker-pool-size = 4
     ```
 
-#### Configuration of the independent deployment of the Spark cluster and the TiDB Connector for Spark cluster
+#### Configuration of the independent deployment of the Spark cluster and TiSpark cluster
 
-
 See the [Spark official website](https://spark.apache.org/docs/latest/hardware-provisioning.html) for the detail hardware recommendations.
 
-The following is a short overview of the TiDB Connector for Spark configuration.
+The following is a short overview of the TiSpark configuration.
 
 It is recommended to allocate 32G memory for Spark. Please reserve at least 25% of the memory for the operating system and buffer cache.
 
@@ -86,61 +84,57 @@ SPARK_WORKER_MEMORY = 32g
 SPARK_WORKER_CORES = 8
 ```
 
-#### Hybrid deployment configuration for the TiDB Connector for Spark and TiKV cluster
+#### Hybrid deployment configuration for TiSpark and TiKV clusters
 
-For the hybrid deployment of the TiDB Connector for Spark and TiKV, add the TiDB Connector for Spark required resources to the TiKV reserved resources, and allocate 25% of the memory for the system.
+For the hybrid deployment of TiSpark and TiKV, add the resources required by TiSpark to the TiKV reserved resources, and allocate 25% of the memory for the system.
 
-## Deploy the TiDB Connector for Spark
+## Deploy the TiSpark cluster
 
-Download the TiDB Connector for Spark's jar package [here](http://download.pingcap.org/tispark-0.1.0-SNAPSHOT-jar-with-dependencies.jar).
+Download TiSpark's jar package [here](http://download.pingcap.org/tispark-0.1.0-SNAPSHOT-jar-with-dependencies.jar).
 
-### Deploy the TiDB Connector for Spark on the existing Spark cluster
+### Deploy TiSpark on the existing Spark cluster
 
-Running TiDB Connector for Spark on an existing Spark cluster does not require a reboot of the cluster. You can use Spark's `--jars` parameter to introduce the TiDB Connector for Spark as a dependency:
+Running TiSpark on an existing Spark cluster does not require a reboot of the cluster. You can use Spark's `--jars` parameter to introduce TiSpark as a dependency:
 
 ```sh
 spark-shell --jars $PATH/tispark-0.1.0.jar
 ```
 
-If you want to deploy TiDB Connector for Spark as a default component, simply place the TiDB Connector for Spark jar package into the jars path for each node of the Spark cluster and restart the Spark cluster:
+If you want to deploy TiSpark as a default component, simply place the TiSpark jar package into the jars path on each node of the Spark cluster and restart the Spark cluster:
 
 ```sh
 ${SPARK_INSTALL_PATH}/jars
 ```
 
-In this way, you can use either `Spark-Submit` or `Spark-Shell` to use the TiDB Connector for Spark directly.
-
-
-### Deploy TiDB Connector for Spark without the Spark cluster
+In this way, you can use either `Spark-Submit` or `Spark-Shell` to run TiSpark directly.
+### Deploy TiSpark without the Spark cluster
 
 If you do not have a Spark cluster, we recommend using the standalone mode. To use the Spark Standalone model, you can simply place a compiled version of Spark on each node of the cluster. If you encounter problems, see its [official website](https://spark.apache.org/docs/latest/spark-standalone.html). And you are welcome to [file an issue](https://github.com/pingcap/tispark/issues/new) on our GitHub.
-
 #### Download and install
 
 You can download [Apache Spark](https://spark.apache.org/downloads.html)
 
-For the Standalone mode without Hadoop support, use Spark 2.1.x and any version of Pre-build with Apache Hadoop 2.x with Hadoop dependencies. If you need to use the Hadoop cluster, please choose the corresponding Hadoop version. You can also choose to build from the [source code](https://spark.apache.org/docs/2.1.0/building-spark.html) to match the previous version of the official Hadoop 2.6. Please note that the TiDB Connector for Spark currently only supports Spark 2.1.x version.
-
-Suppose you already have a Spark binaries, and the current PATH is `SPARKPATH`, please copy the TiDB Connector for Spark jar package to the `${SPARKPATH}/jars` directory.
+For the Standalone mode without Hadoop support, use Spark 2.1.x and any version of Pre-built with Apache Hadoop 2.x with Hadoop dependencies. If you need to use the Hadoop cluster, please choose the corresponding Hadoop version. You can also choose to build from the [source code](https://spark.apache.org/docs/2.1.0/building-spark.html) to match the previous version of the official Hadoop 2.6. Please note that TiSpark currently only supports Spark 2.1.x.
+
+Suppose you already have the Spark binaries and the current path is `SPARKPATH`. Copy the TiSpark jar package to the `${SPARKPATH}/jars` directory.
 
 #### Start a Master node
 
 Execute the following command on the selected Spark Master node:
-    
+
 ```sh
 cd $SPARKPATH
 
-./sbin/start-master.sh 
+./sbin/start-master.sh
 ```
 
 After the above step is completed, a log file will be printed on the screen. Check the log file to confirm whether the Spark-Master is started successfully. You can open the [http://spark-master-hostname:8080](http://spark-master-hostname:8080) to view the cluster information (if you does not change the Spark-Master default port number). When you start Spark-Slave, you can also use this panel to confirm whether the Slave is joined to the cluster.
 
 #### Start a Slave node
 
-
 Similarly, you can start a Spark-Slave node with the following command:
 
 ```sh
@@ -168,11 +162,9 @@ And stop it like below:
 ./sbin/stop-tithriftserver.sh
 ```
 
-
 ## Demo
 
-Assuming that you have successfully started the TiDB Connector for Spark cluster as described above, here's a quick introduction to how to use Spark SQL for OLAP analysis. Here we use a table named `lineitem` in the `tpch` database as an example.
-
+Assuming that you have successfully started the TiSpark cluster as described above, here's a quick introduction to how to use Spark SQL for OLAP analysis. Here we use a table named `lineitem` in the `tpch` database as an example.
 
 Assuming that your PD node is located at `192.168.1.100`, port `2379`, add the following command to `$SPARK_HOME/conf/spark-defaults.conf`:
 
@@ -250,8 +242,8 @@ TiSpark on PySpark is a Python package build to support the Python language with
 
 Q: What are the pros/cons of independent deployment as opposed to a shared resource with an existing Spark / Hadoop cluster?
 
-A: You can use the existing Spark cluster without a separate deployment, but if the existing cluster is busy, TiDB Connector for Spark will not be able to achieve the desired speed.
+A: You can use the existing Spark cluster without a separate deployment, but if the existing cluster is busy, TiSpark will not be able to achieve the desired speed.
 
 Q: Can I mix Spark with TiKV?
 
-A: If TiDB and TiKV are overloaded and run critical online tasks, consider deploying the TiDB Connector for Spark separately. You also need to consider using different NICs to ensure that OLTP's network resources are not compromised and affect online business. If the online business requirements are not high or the loading is not large enough, you can consider mixing the TiDB Connector for Spark with TiKV deployment.
+A: If TiDB and TiKV are overloaded and run critical online tasks, consider deploying TiSpark separately. You also need to consider using different NICs to ensure that OLTP's network resources are not compromised and the online business is not affected. If the online business requirements are not high or the load is not heavy, you can consider the hybrid deployment of TiSpark and TiKV.
\ No newline at end of file
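For readers following the user-guide changes above, the Demo section that these hunks only touch in passing amounts to pointing Spark at PD and mapping a TiDB database before querying it. The following is a minimal sketch of that flow in `spark-shell` (Scala), assuming the `spark.tispark.pd.addresses` property and the `TiContext`/`tidbMapDatabase` API of the tispark-0.1.0 jar linked in the patch; these names come from TiSpark itself, not from the hunks above.

```scala
// Minimal sketch, not part of this patch: assumes $SPARK_HOME/conf/spark-defaults.conf
// already contains the PD address used in the guide's demo, for example:
//   spark.tispark.pd.addresses 192.168.1.100:2379
// and that spark-shell was launched with --jars .../tispark-0.1.0.jar.
import org.apache.spark.sql.TiContext

val ti = new TiContext(spark)   // wrap the SparkSession that spark-shell provides
ti.tidbMapDatabase("tpch")      // expose the tables of the tpch database to Spark SQL

// Query the lineitem table referenced in the Demo section.
spark.sql("SELECT COUNT(*) FROM lineitem").show()
```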