Interactive Analysis with the Spark Shell
Basics
More on RDD Operations
Caching
Self-Contained Applications
Where to Go from Here
This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python. See the programming guide for a more complete reference.
To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.
Interactive Analysis with the Spark Shell

Basics

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:
./bin/pyspark
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:
>>> textFile = sc.textFile("README.md")
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:
>>> textFile.count()  # Number of items in this RDD
126

>>> textFile.first()  # First item in this RDD
u'# Apache Spark'
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
We can chain together transformations and actions:
>>> textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?
15
More on RDD Operations
RDD actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
15
This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Python anonymous functions (lambdas), but we can also pass any top-level Python function we want. For example, we’ll define a max function to make this code easier to understand:
>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...

>>> textFile.map(lambda line: len(line.split())).reduce(max)
15
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the collect action:
>>> wordCounts.collect()
[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
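If the full list of pairs is long, it can be easier to look at only the most frequent words. The line below is a small sketch, not part of the original tutorial, that uses the takeOrdered action; negating the count in the key function sorts in descending order:

>>> wordCounts.takeOrdered(10, key=lambda pair: -pair[1])  # Sketch: ten most frequent words, assuming wordCounts from above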
Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:
>>> linesWithSpark.cache()

>>> linesWithSpark.count()
19

>>> linesWithSpark.count()
19
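If the cached data is no longer needed, it can be dropped from memory again. This is a minimal sketch, not part of the original walkthrough, using the RDD unpersist method:

>>> linesWithSpark.unpersist()  # Sketch: release the cached copy when it is no longer needed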
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/pyspark to a cluster, as described in the programming guide.
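For example, assuming a standalone cluster whose master URL is spark://host:7077 (a placeholder address, not part of this tutorial), the shell could be pointed at it with the --master option:

# Placeholder master URL; replace with your cluster's address
./bin/pyspark --master spark://host:7077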
Self-Contained Applications

Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python.
Now we will show how to write an application using the Python API (PySpark). As an example, we’ll create a simple Spark application, SimpleApp.py:
"""SimpleApp.py"""
from
pyspark
import
SparkContext
logFile
=
"YOUR_SPARK_HOME/README.md"
# Should be some file on your system
sc
=
SparkContext(
"local"
,
"Simple App"
)
logData
=
sc
.
textFile(logFile)
.
cache()
numAs
=
logData
.
filter(
lambda
s:
'a'
in
s)
.
count()
numBs
=
logData
.
filter(
lambda
s:
'b'
in
s)
.
count()
(
"Lines with a:
%i
, lines with b:
%i
"
%
(numAs, numBs))
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkContext to create RDDs. We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference. For applications that use custom classes or third-party libraries, we can also add code dependencies to spark-submit through its --py-files argument by packaging them into a .zip file (see spark-submit --help for details). SimpleApp is simple enough that we do not need to specify any code dependencies.
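For instance, if SimpleApp.py relied on helper modules packaged into a hypothetical deps.zip archive, the dependency could be shipped with the job roughly like this:

# Hypothetical deps.zip holding extra Python modules needed by SimpleApp.py
$ YOUR_SPARK_HOME/bin/spark-submit \
  --py-files deps.zip \
  SimpleApp.py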
We can run this application using the bin/spark-submit script:
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
...
Lines with a: 46, Lines with b: 23
Congratulations on running your first Spark application!
Where to Go from Here

For an in-depth overview of the API, start with the Spark programming guide, or see the “Programming Guides” menu for other components.
For running applications on a cluster, head to the deployment overview.
Finally, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run them as follows:
# For Scala and Java, use run-example:
./bin/run-example SparkPi
# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py
# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R