Spark English-Chinese Parallel Translation (PySpark Quick Start for Beginners, Chinese Edition) - Chinese Guide and Tutorial (Python Edition) - 20161115

[Source: http://spark.apache.org/docs/latest/quick-start.html]

[Translator: 李文]

Quick Start

  • Interactive Analysis with the Spark Shell

    • Basics

    • More on RDD Operations

    • Caching

  • Self-Contained Applications

  • Where to Go from Here

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python. See the programming guide for a more complete reference.

To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.

Interactive Analysis with the Spark Shell

Basics

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:

./bin/pyspark

Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:

>>> textFile = sc.textFile("README.md")
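
As a side note beyond the original text, the shell has already created sc, a SparkContext, for you; besides reading files, an RDD can also be built from an existing Python collection with sc.parallelize. A minimal sketch:

>>> data = sc.parallelize([1, 2, 3, 4, 5])  # distribute a local list as an RDD
>>> data.count()
5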

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:

>>> textFile.count() # Number of items in this RDD
126

>>> textFile.first() # First item in this RDD
u'# Apache Spark'

Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.

>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

We can chain together transformations and actions:

>>> textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?
15

More on RDD Operations

RDD actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:

>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
15

This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Python anonymous functions (lambdas), but we can also pass any top-level Python function we want. For example, we’ll define a max function to make this code easier to understand:

>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...

>>> textFile.map(lambda line: len(line.split())).reduce(max)
15

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the collect action:

>>> wordCounts.collect()
[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
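
Note that collect brings the entire result back to the driver, which can be expensive for large RDDs. As an aside not in the original guide, an action such as takeOrdered can fetch just the most frequent words instead; a minimal sketch (the cutoff of 3 is arbitrary):

>>> wordCounts.takeOrdered(3, key=lambda pair: -pair[1])  # the three highest counts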

Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:

>>> linesWithSpark.cache()

>>> linesWithSpark.count()
19

>>> linesWithSpark.count()
19
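
Beyond the original text: a cached RDD occupies memory until it is evicted or explicitly released. If you want to free that space in the shell session, unpersist removes the RDD from the cache:

>>> linesWithSpark.unpersist()  # drop the cached data; later actions recompute it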

It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/pyspark to a cluster, as described in the programming guide.
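
Purely for illustration (the master URL below is a placeholder, not part of the original guide), pointing the shell at a standalone cluster looks roughly like this:

./bin/pyspark --master spark://<master-host>:7077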

Self-Contained Applications

Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python.

Now we will show how to write an application using the Python API (PySpark).

As an example, we’ll create a simple Spark application, SimpleApp.py:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkContext to create RDDs. We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference. For applications that use custom classes or third-party libraries, we can also add code dependencies to spark-submit through its --py-files argument by packaging them into a .zip file (see spark-submit --help for details). SimpleApp is simple enough that we do not need to specify any code dependencies.

We can run this application using the bin/spark-submit script:

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  SimpleApp.py
...
Lines with a: 46, Lines with b: 23
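
If the application did have code dependencies, they could be packaged and passed along at submit time. Purely for illustration (deps.zip is a hypothetical archive name, not from the original guide):

# Hypothetical: ship packaged dependencies with the application via --py-files
$ YOUR_SPARK_HOME/bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  SimpleApp.py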

Where to Go from Here

Congratulations on running your first Spark application!

  • For an in-depth overview of the API, start with the Spark programming guide, or see the “Programming Guides” menu for other components.

  • For running applications on a cluster, head to the deployment overview.

  • Finally, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run them as follows:

# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R

 
