In "Hadoop pseudo-distributed configuration and Python testing on CentOS" we already installed Hadoop; now let's install Spark.
Since Spark is written in Scala, we install Scala first.
$ wget -P ~/download/ https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
$ cd ~/download/
$ tar zxf scala-2.11.8.tgz
$ mv scala-2.11.8 /opt/scala
$ echo 'export SCALA_HOME=/opt/scala' >> /etc/bashrc
The Apache site currently only offers 2.x releases of Spark; we pick spark-2.4.2 to install.
$ wget -P ~/download/ http://mirror.bit.edu.cn/apache/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.6.tgz
$ tar zxf spark-2.4.2-bin-hadoop2.6.tgz -C /opt
$ cd /opt/
$ mv spark-2.4.2-bin-hadoop2.6/ spark
$ echo 'export SPARK_HOME=/opt/spark' >> /etc/bashrc
$ echo 'alias spark-shell=$SPARK_HOME/bin/spark-shell' >> /etc/bashrc
$ source /etc/bashrc
If we start spark-shell at this point, it may fail with an error:
$ spark-shell
# Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 52.0
This is because spark-2.4.2 needs a newer Java: class file major version 52 corresponds to Java 8, so we upgrade Java to 1.8.0.
$ yum -y install java-1.8.0-openjdk*
$ ls -lrt /etc/alternatives/java
# lrwxrwxrwx 1 root root 73 May 1 20:57 /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java
Edit /etc/bashrc with vi and point JAVA_HOME at the new JDK:
$ vi /etc/bashrc
...
# export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.221-2.6.18.0.el7_6.x86_64/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/
...
After saving the file, source it again and spark-shell starts successfully.
$ source /etc/bashrc
$ spark-shell
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
# spark-shell already created a SparkContext named sc, so stop it before building our own
scala> sc.stop()
scala> val conf = new SparkConf().setMaster("local[*]").setAppName("App Name")
scala> val sc = new SparkContext(conf)
# Read in the input file(s)
scala> val input = sc.textFile("/root/p*.txt")
# Split each line into words
scala> val words = input.flatMap(line => line.split(" "))
# Count word frequencies
scala> val count = words.map((_, 1)).reduceByKey(_+_)
# Print the results
scala> count.collect().foreach(println)
# (university,1)
# (priority,1)
# (next,2)
# (hence,,1)
# (low-priced.,1)
# (its,9)
# (others,1)
# (customized,1)
# (extraordinary,1)
# (have,6)
# ...
# Remember to stop the SparkContext
scala> sc.stop()
Next, write a spark_test.py file:
import os
import sys
# Make the pyspark package importable without installing it via pip
sys.path.append("/opt/spark/python")
sys.path.append("/opt/spark/python/lib/py4j-0.10.7-src.zip")
from pyspark import SparkContext, SparkConf
# Run Spark locally, using all available cores
conf = SparkConf().setMaster("local[*]").setAppName("App Name")
sc = SparkContext(conf=conf)
# Same word count as the Scala version above
text = sc.textFile("/root/p*.txt")
words = text.flatMap(lambda line: line.split(" "))
count = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
for each in count.collect():
    print(each[0], each[1])
sc.stop()
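The py4j archive name is tied to the Spark release (0.10.7 ships with Spark 2.4.x), so the hard-coded sys.path entries break whenever Spark is upgraded. A more portable variant, sketched below under the assumption that SPARK_HOME is exported as above, derives both paths from the environment and globs for the py4j zip:
import os
import sys
import glob
# Locate the Spark installation from the environment instead of hard-coding /opt/spark
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
sys.path.append(os.path.join(spark_home, "python"))
# The bundled py4j version changes between Spark releases, so glob for it rather than pinning 0.10.7
sys.path.extend(glob.glob(os.path.join(spark_home, "python/lib/py4j-*-src.zip")))
from pyspark import SparkContext, SparkConf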
Configure pyspark:
$ echo 'export PATH=$SPARK_HOME/bin:$PATH' >> /etc/bashrc
$ echo 'export PYSPARK_PYTHON=/usr/bin/python3.6' >> /etc/bashrc
$ echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6' >> /etc/bashrc
$ source /etc/bashrc
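To confirm that pyspark now starts with the interpreter we configured, we can open the interactive shell (on PATH via $SPARK_HOME/bin) and check it; the path and version below are what this particular setup is expected to report:
$ pyspark
>>> import sys
>>> print(sys.executable)
# /usr/bin/python3.6
>>> sc.version
# '2.4.2'
>>> exit()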
Run the test. The counts match the Scala run, but the output order differs:
$ python spark_test.py
# 19/05/02 15:56:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
# Setting default log level to "WARN".
# To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
# police 7
# with 12
# of 56
# cards 6
# murdered 2
# 48
# Shocking 1
# details 1
# from 7
# university 1
# kept 1
# ...
$
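The order differs because reduceByKey shuffles the data, so collect() returns the pairs in whatever order the partitions happen to produce them. If you want stable output, sort the RDD before collecting; for example, the lines below (a sketch to place before sc.stop() in spark_test.py; top and freq are just illustrative names) print the ten most frequent words first:
# Sort by count in descending order so the most frequent words come first
top = count.sortBy(lambda pair: pair[1], ascending=False)
for word, freq in top.take(10):
    print(word, freq)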