Spark official website homepage

A brief overview of Spark:
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
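
To make the pieces above concrete, here is a minimal Scala sketch (my illustration, not part of the original page) that touches the core RDD API and Spark SQL from one SparkSession; the object name, data, and app name are all placeholders:

import org.apache.spark.sql.SparkSession

object Overview {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("overview-sketch")
      .master("local[2]")          // local mode with 2 threads, for testing
      .getOrCreate()

    // Core API: an RDD transformation and action
    spark.sparkContext
      .parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    // Spark SQL: the same data as a DataFrame queried with SQL
    import spark.implicits._
    Seq("a", "b", "a").toDF("word").createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS n FROM words GROUP BY word").show()

    spark.stop()
  }
}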

Security:
Security in Spark is OFF by default. This could mean you are vulnerable to attack by default. Please see Spark Security (http://spark.apache.org/docs/latest/security.html) before downloading and running Spark.
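
As one hedged illustration of what turning security on can involve (not from the original text), the sketch below sets the shared-secret authentication options that the Spark Security page documents; the secret value is a placeholder:

import org.apache.spark.SparkConf

// spark.authenticate and spark.authenticate.secret are configuration keys
// documented on the Spark Security page; "change-me" is a placeholder.
val conf = new SparkConf()
  .setAppName("secured-app")
  .set("spark.authenticate", "true")             // require shared-secret authentication
  .set("spark.authenticate.secret", "change-me") // the shared secret itself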

Downloading and notes:
Get Spark from the downloads page of the project website (https://spark.apache.org/downloads.html). This documentation is for Spark version 2.4.5. Spark uses Hadoop's client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions. Users can also download a "Hadoop free" binary and run Spark with any Hadoop version by augmenting Spark's classpath (http://spark.apache.org/docs/latest/hadoop-provided.html). Scala and Java users can include Spark in their projects using its Maven coordinates, and in the future Python users will also be able to install Spark from PyPI.
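
For Scala/sbt users, the Maven coordinates mentioned above look roughly like the build.sbt sketch below, assuming Spark 2.4.5 artifacts built against Scala 2.12 (spark-sql is only needed if you use DataFrames/SQL):

// Minimal build.sbt sketch; the exact Scala patch version is an assumption.
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.5", // core RDD API
  "org.apache.spark" %% "spark-sql"  % "2.4.5"  // DataFrames and Spark SQL
)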

If you'd like to build Spark from source, visit Building Spark (http://spark.apache.org/docs/latest/building-spark.html).

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run locally on one machine; all you need is to have Java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.5 uses Scala 2.12; you will need to use a compatible Scala version (2.12.x).

Note that support for Java 7, Python 2.6 and Hadoop versions before 2.6.5 was removed as of Spark 2.2.0. Support for Scala 2.10 was removed as of Spark 2.3.0. Support for Scala 2.11 is deprecated as of Spark 2.4.1 and will be removed in Spark 3.0.

Running the examples and shells:
Spark comes with several sample programs. Scala, Java, Python and R examples are in the examples/src/main directory. To run one of the Java or Scala sample programs, use bin/run-example <class> [params] in the top-level Spark directory. (Behind the scenes, this invokes the more general spark-submit script (http://spark.apache.org/docs/latest/submitting-applications.html) for launching applications.) For example:

./bin/run-example SparkPi 10

You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

./bin/spark-shell --master local[2]

The --master option specifies the master URL for a distributed cluster (http://spark.apache.org/docs/latest/submitting-applications.html#master-urls), or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run the Spark shell with the --help option.
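
The same master URLs can also be set from application code rather than on the command line; a minimal sketch (my example, not from the page) follows. Note that a master set on the SparkConf/SparkSession builder takes precedence over the --master flag of spark-submit:

import org.apache.spark.sql.SparkSession

// "master-demo" is a placeholder application name.
val spark = SparkSession.builder()
  .appName("master-demo")
  .master("local[2]")   // run locally with 2 worker threads
  .getOrCreate()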

Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark:

./bin/pyspark --master local[2]

Example applications are also provided in Python. For example:

./bin/spark-submit examples/src/main/python/pi.py 10

Spark also provides an experimental R API since 1.4 (only DataFrames APIs included). To run Spark interactively in an R interpreter, use bin/sparkR:

./bin/sparkR --master local[2]

Example applications are also provided in R. For example:

./bin/spark-submit examples/src/main/r/dataframe.R

Launching on a cluster:

The Spark cluster mode overview (http://spark.apache.org/docs/latest/cluster-overview.html) explains the key concepts in running on a cluster. Spark can run both by itself, or over several existing cluster managers. It currently provides several options for deployment:

Standalone Deploy Mode (http://spark.apache.org/docs/latest/spark-standalone.html): the simplest way to deploy Spark on a private cluster
Apache Mesos (http://spark.apache.org/docs/latest/running-on-mesos.html)
Hadoop YARN (http://spark.apache.org/docs/latest/running-on-yarn.html)
Kubernetes (http://spark.apache.org/docs/latest/running-on-kubernetes.html)
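
One common pattern worth noting (my sketch under stated assumptions, not from the original page): leave the master out of the application code entirely, so the same jar runs on any of the managers above depending on the --master value given to spark-submit (e.g. spark://..., yarn, k8s://...):

import org.apache.spark.sql.SparkSession

// Deploy-agnostic application: no hard-coded master URL, so the cluster
// manager is chosen at submit time via spark-submit --master.
object DeployAgnosticApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deploy-agnostic") // placeholder name
      .getOrCreate()

    println(spark.sparkContext.master) // prints whichever master was supplied
    spark.stop()
  }
}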

Where to go from here (table of contents):
Programming Guides:

Quick Start (http://spark.apache.org/docs/latest/quick-start.html): a quick introduction to the Spark API; start here!

RDD Programming Guide (http://spark.apache.org/docs/latest/rdd-programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables

Spark SQL, Datasets, and DataFrames (http://spark.apache.org/docs/latest/sql-programming-guide.html): processing structured data with relational queries (newer API than RDDs)

Structured Streaming (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html): processing structured data streams with relational queries (using Datasets and DataFrames, newer API than DStreams)

Spark Streaming (http://spark.apache.org/docs/latest/streaming-programming-guide.html): processing data streams using DStreams (old API)

MLlib (http://spark.apache.org/docs/latest/ml-guide.html): applying machine learning algorithms

GraphX (http://spark.apache.org/docs/latest/graphx-programming-guide.html): processing graphs

API Docs:

Spark Scala API (Scaladoc) (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package)
Spark Java API (Javadoc) (http://spark.apache.org/docs/latest/api/java/index.html)
Spark Python API (Sphinx) (http://spark.apache.org/docs/latest/api/python/index.html)
Spark R API (Roxygen2) (http://spark.apache.org/docs/latest/api/R/index.html)
Spark SQL, Built-in Functions (MkDocs) (http://spark.apache.org/docs/latest/api/sql/index.html)

Deployment Guides:

Cluster Overview (http://spark.apache.org/docs/latest/cluster-overview.html): overview of concepts and components when running on a cluster

Submitting Applications (http://spark.apache.org/docs/latest/submitting-applications.html): packaging and deploying applications

Deployment modes:

Amazon EC2 (https://github.com/amplab/spark-ec2): scripts that let you launch a cluster on EC2 in about 5 minutes

Standalone Deploy Mode (http://spark.apache.org/docs/latest/spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager

Mesos (http://spark.apache.org/docs/latest/running-on-mesos.html): deploy a private cluster using Apache Mesos

YARN (http://spark.apache.org/docs/latest/running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)

Kubernetes (http://spark.apache.org/docs/latest/running-on-kubernetes.html): deploy Spark on top of Kubernetes

Other Documents:

Configuration (http://spark.apache.org/docs/latest/configuration.html): customize Spark via its configuration system

Monitoring (http://spark.apache.org/docs/latest/monitoring.html): track the behavior of your applications

Tuning Guide (http://spark.apache.org/docs/latest/tuning.html): best practices to optimize performance and memory use

Job Scheduling (http://spark.apache.org/docs/latest/job-scheduling.html): scheduling resources across and within Spark applications

Security (http://spark.apache.org/docs/latest/security.html): Spark security support

Hardware Provisioning (http://spark.apache.org/docs/latest/hardware-provisioning.html): recommendations for cluster hardware

Integration with other storage systems:

Cloud Infrastructures (http://spark.apache.org/docs/latest/cloud-integration.html)

OpenStack Swift (http://spark.apache.org/docs/latest/storage-openstack-swift.html)

Building Spark (http://spark.apache.org/docs/latest/building-spark.html): build Spark using the Maven system

Contributing to Spark (https://spark.apache.org/contributing.html)

Third Party Projects (https://spark.apache.org/third-party-projects.html): related third party Spark projects

External Resources:

Spark Homepage (https://spark.apache.org)

Spark Community resources (https://spark.apache.org/community.html), including local meetups

StackOverflow tag apache-spark (https://stackoverflow.com/questions/tagged/apache-spark)

Mailing Lists (https://spark.apache.org/mailing-lists.html): ask questions about Spark here

AMP Camps (http://ampcamp.berkeley.edu): a series of training camps at UC Berkeley that featured talks and exercises about Spark, Spark Streaming, Mesos, and more. Videos, slides and exercises are available online for free.

Code Examples (https://spark.apache.org/examples.html): more examples are also available in the examples subfolder of Spark (Scala, Java, Python, R)
