Installing and Using Spark on Linux

1 Spark Overview

2 Installing Spark

Experiment environment:
Host operating system: Windows 10
Virtualization software: VMware Workstation
Guest operating system: Ubuntu 20.04 LTS

2.1 Installing Scala

# Download the deb package
hadoop@hadoop1:~$ wget https://downloads.lightbend.com/scala/2.13.4/scala-2.13.4.deb
# Install it
hadoop@hadoop1:~$ sudo dpkg --install scala-2.13.4.deb

2.2 Installing Spark

# Download and unpack Spark
hadoop@hadoop1:~$ wget https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
hadoop@hadoop1:~$ tar -zxvf spark-3.0.3-bin-hadoop3.2.tgz
hadoop@hadoop1:~$ mv spark-3.0.3-bin-hadoop3.2 spark-3.0.3

# Add environment variables
hadoop@hadoop1:~$ sudo vi /etc/profile.d/spark.sh
Add:
export SPARK_HOME=/home/hadoop/spark-3.0.3
export PATH=$PATH:$SPARK_HOME/bin

hadoop@hadoop1:~$ source /etc/profile.d/spark.sh

# Configure Spark
hadoop@hadoop1:~$ cd spark-3.0.3/conf/

# Configure the slaves file
hadoop@hadoop1:~$ cp slaves.template slaves
hadoop@hadoop1:~$ vi slaves
Change the contents to the following (for a multi-node cluster, add one worker hostname per line):
hadoop1

# Configure spark-env.sh
hadoop@hadoop1:~$ cp spark-env.sh.template spark-env.sh
hadoop@hadoop1:~$ vi spark-env.sh
Add:
export JAVA_HOME=/home/hadoop/jdk1.8.0_321/
export HADOOP_HOME=/home/hadoop/hadoop-3.2.3/
export HADOOP_CONF_DIR=/home/hadoop/hadoop-3.2.3/etc/hadoop
export SCALA_HOME=/home/hadoop/scala-2.13.4/
export SPARK_MASTER_HOST=hadoop1
export SPARK_PID_DIR=/home/hadoop/spark-3.0.3/data
export SPARK_LOCAL_DIRS=/home/hadoop/spark-3.0.3
export SPARK_EXECUTOR_MEMORY=512M
export SPARK_WORKER_MEMORY=4G

# Configure spark-defaults.conf
hadoop@hadoop1:~$ cp spark-defaults.conf.template spark-defaults.conf
hadoop@hadoop1:~$ vi spark-defaults.conf
Add:
spark.master                     spark://hadoop1:7077

# Start the services
hadoop@hadoop1:~$ $SPARK_HOME/sbin/start-all.sh
hadoop@hadoop1:~$ jps
You should see:
35568 Master  (master node)
35733 Worker  (worker node)

# The Spark web UI is now reachable at http://192.168.17.100:8080/

# Stop the services
hadoop@hadoop1:~$ $SPARK_HOME/sbin/stop-all.sh

2.3 Running Test Programs

# Run the example program
hadoop@hadoop1:~$ $SPARK_HOME/bin/run-example SparkPi 10
# Check the result in the web UI: http://hadoop1:8080/cluster

# Start spark-shell
hadoop@hadoop1:~$ $SPARK_HOME/bin/spark-shell

scala> val textFile = sc.textFile("file:///home/hadoop/spark-3.0.3/README.md")
scala> textFile.count()
The output should look like:
res3: Long = 108
which means the run succeeded.

# Exit spark-shell
scala> :quit

2.4 Installing Jupyter Notebook and Python 3

Install Jupyter on Linux:

# Update the package index
hadoop@hadoop1:~$ sudo apt update

# Install Python 3 and pip
hadoop@hadoop1:~$ sudo apt install python3-pip python3-dev

# Create a virtual environment for Jupyter
hadoop@hadoop1:~$ sudo -H pip3 install --upgrade pip
hadoop@hadoop1:~$ sudo -H pip3 install virtualenv
hadoop@hadoop1:~$ mkdir ~/my_project_dir
hadoop@hadoop1:~$ cd ~/my_project_dir
hadoop@hadoop1:~/my_project_dir$ virtualenv my_project_env

# Activate the virtual environment
hadoop@hadoop1:~/my_project_dir$ source my_project_env/bin/activate

# Install Jupyter
(my_project_env)hadoop@hadoop1:~/my_project_dir$ pip install jupyter

# Set a login password
(my_project_env)hadoop@hadoop1:~/my_project_dir$ jupyter notebook --generate-config
(my_project_env)hadoop@hadoop1:~/my_project_dir$ jupyter notebook password

# Run Jupyter
(my_project_env)hadoop@hadoop1:~/my_project_dir$ jupyter notebook

Then use PuTTY on Windows to reach the Jupyter Notebook over an SSH tunnel:
Open PuTTY and enter the server address:

(Screenshot 1: PuTTY session configuration with the server address)

In the left panel go to SSH -> Tunnels and enter the port forwarding settings (forward a local port such as 8000 to the remote Jupyter port, 8888 by default):

(Screenshot 2: PuTTY SSH tunnel settings)

Click Add, then click Open.
You can then open localhost:8000 in a browser on Windows to reach the Jupyter server running on Linux.

Reference: How To Set Up Jupyter Notebook with Python 3 on Ubuntu 20.04

If installing pip fails with an error, the following commands resolve it:

sudo apt-get remove python-pip-whl
sudo apt -f install
sudo apt update && sudo apt dist-upgrade
sudo apt install python3-pip

2.5 Installing PySpark in the Virtual Environment

First activate the virtual environment, then run:

(my_project_env)hadoop@hadoop1:~/my_project_dir$ pip install pyspark==3.0.3
# The PySpark version must match the installed Spark version; otherwise PySpark will not run properly under Jupyter

(my_project_env)hadoop@hadoop1:~/my_project_dir$ pyspark
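
To confirm that the two versions actually line up, here is a minimal sanity check (a sketch to run inside the virtual environment; the app name "version-check" is just illustrative):

# Compare the installed PySpark wheel against the Spark runtime it launches
import pyspark
from pyspark import SparkContext

print(pyspark.__version__)          # expect 3.0.3

sc = SparkContext('local[*]', 'version-check')
print(sc.version)                   # should also report 3.0.3
sc.stop()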

Reference: Installation

3 Using Spark

3.1 Using the Python Spark Shell

(my_project_env)hadoop@hadoop1:~/my_project_dir$ pyspark

>>> # The PySpark shell already provides a SparkContext as `sc`; creating another one would fail
>>> txt = sc.textFile('file:///YourFileDirectory/input/try1.txt')
>>> print(txt.count())
>>> as_lines = txt.filter(lambda line: 'as' in line.lower())
>>> print(as_lines.count())
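
As a further illustration of RDD transformations in the shell, here is a small word-count sketch over the same file (assuming try1.txt exists at the path above):

>>> # Split lines into words, then count occurrences of each word
>>> words = txt.flatMap(lambda line: line.split())
>>> counts = words.map(lambda w: (w.lower(), 1)).reduceByKey(lambda a, b: a + b)
>>> # Show the five most frequent words
>>> counts.takeOrdered(5, key=lambda kv: -kv[1])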

3.2 Using spark-submit

Save the following Python code as try1.py:

from pyspark import SparkContext

# Run locally, using all available cores
sc = SparkContext('local[*]')
sc.setLogLevel('WARN')

# Count all lines, then the lines containing "as" (case-insensitive)
txt = sc.textFile('file:///YourFileDirectory/input/try1.txt')
print(txt.count())

as_lines = txt.filter(lambda line: 'as' in line.lower())
print(as_lines.count())

Then run it as a batch job with spark-submit:

(my_project_env)hadoop@hadoop1:~/my_project_dir$ spark-submit try1.py
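
Note that try1.py pins the master to local[*], so spark-submit runs it in a single local process rather than on the standalone cluster set up in 2.2. A hedged variant that targets the cluster instead is sketched below (the file name try1_cluster.py is illustrative); on this single-node setup the local input file is still visible to the only worker:

from pyspark import SparkContext

# Connect to the standalone master started with start-all.sh
sc = SparkContext('spark://hadoop1:7077', 'try1-cluster')
sc.setLogLevel('WARN')

txt = sc.textFile('file:///YourFileDirectory/input/try1.txt')
print(txt.count())

sc.stop()

It is submitted the same way:

(my_project_env)hadoop@hadoop1:~/my_project_dir$ spark-submit try1_cluster.py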

3.3 Using Jupyter Notebook

Following the steps in 2.4, connect to Jupyter Notebook remotely and paste the Python code into an .ipynb file to run it; the sketch below shows how to create the Spark entry point inside a notebook.
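
Unlike the pyspark shell, a fresh notebook kernel does not come with a ready-made sc, so the notebook has to create its own Spark entry point. A minimal sketch, assuming PySpark 3.0.3 is installed in the same virtual environment the notebook kernel uses (the app name is illustrative):

# Create a SparkSession; 'local[*]' keeps everything on this machine.
# To use the standalone cluster from 2.2 instead, pass 'spark://hadoop1:7077' to master().
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')
         .appName('notebook-test')
         .getOrCreate())
sc = spark.sparkContext

txt = sc.textFile('file:///home/hadoop/spark-3.0.3/README.md')
print(txt.count())

spark.stop()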
