Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, and Python, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. For more information, see http://spark.apache.org/
You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.
Spark is built using Apache Maven. To build Spark and its example programs, run:
mvn -DskipTests clean package
(You do not need to do this if you downloaded a pre-built package.) More detailed documentation is available from the project site, at “Building Spark”.
The easiest way to start using Spark is through the Scala shell:
./bin/spark-shell
Try the following command, which should return 1000:
scala> sc.parallelize(1 to 1000).count()
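If you want to experiment a bit further, here is a small optional sketch that assumes nothing beyond the SparkContext sc that spark-shell creates for you and the core RDD API; the comments note what each call should return:
scala> val nums = sc.parallelize(1 to 1000)
scala> nums.filter(_ % 2 == 0).count()   // keeps the even numbers, so this should return 500
scala> nums.map(_ * 2).reduce(_ + _)     // doubles each value and sums them: 2 * 500500 = 1001000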
Alternatively, if you prefer Python, you can use the Python shell:
./bin/pyspark
And run the following command, which should also return 1000:
>>> sc.parallelize(range(1000)).count()
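A similar follow-up works in the Python shell; this is only a sketch, again assuming only the SparkContext sc that pyspark provides:
>>> nums = sc.parallelize(range(1000))
>>> nums.filter(lambda x: x % 2 == 0).count()   # 500 of the values 0..999 are even
>>> nums.map(lambda x: x * 2).sum()   # 2 * (0 + 1 + ... + 999) = 999000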
Spark also comes with several sample programs in the examples directory. To run one of them, use:
./bin/run-example <class> [params]
For example, the following will run the Pi example locally:
./bin/run-example SparkPi
You can set the MASTER environment variable when running examples to submit them to a cluster. This can be a mesos:// or spark:// URL, “yarn-cluster” or “yarn-client” to run on YARN, “local” to run locally with one thread, or “local[N]” to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:
MASTER=spark://host:7077 ./bin/run-example SparkPi
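Or, for instance, to run the same example locally with four worker threads instead of submitting it to a cluster (using the local[N] form described above):
MASTER=local[4] ./bin/run-example SparkPi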
Many of the example programs print usage help if no params are given.
Testing first requires building Spark. Once Spark is built, tests can be run using:
./dev/run-tests
Please see the guidance on how to run all automated tests.
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed across Hadoop versions, you must build Spark against the same Hadoop version that your cluster runs.
Please refer to the build documentation at “Specifying the Hadoop Version” for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions. See also “Third Party Hadoop Distributions” for guidance on building a Spark application that works with a particular distribution.
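For example, a build against a specific Hadoop release might look like the following; this is only a sketch, and the exact profiles and version strings supported by your Spark release are listed in “Specifying the Hadoop Version”:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package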
For an overview of how to configure Spark, please refer to the Configuration guide in the online documentation.
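As one small illustration (a sketch only; the available properties and their defaults are documented in the Configuration guide), configuration values such as spark.executor.memory can also be set programmatically through SparkConf before a SparkContext is created:
import org.apache.spark.{SparkConf, SparkContext}
// “MyApp” and the 2g executor memory below are placeholder values, not recommendations.
val conf = new SparkConf().setAppName("MyApp").set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)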