Creating a virtual environment with Anaconda

While working on a recent project I needed to submit PySpark jobs to the company's Spark cluster. Since I had no privileges on the cluster nodes, I decided to build the PySpark runtime as an Anaconda virtual environment. The whole process consists of five steps: (1) install Anaconda; (2) create the Python virtual environment; (3) install the Python dependencies; (4) package the virtual environment; (5) submit the job.

1. Install Anaconda
# download and run the Anaconda2 installer
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda2-2019.03-Linux-x86_64.sh
bash Anaconda2-2019.03-Linux-x86_64.sh

# append the PATH line below to /etc/profile, then reload it
vi /etc/profile
export PATH=$PATH:/root/anaconda2/bin
source /etc/profile

2. Create the Python virtual environment
# create an environment named nn_pyspark based on Python 2
conda create --name nn_pyspark python=2
# list the packages installed in the environment
conda list -n nn_pyspark

# delete the virtual environment (if needed)
# conda remove -n nn_pyspark --all

3. Install the Python dependencies
conda install -n nn_pyspark numpy
conda install -n nn_pyspark pandas
# install paddlepaddle from the paddle channel
conda install -n nn_pyspark paddlepaddle==2.0.0rc1 -c paddle
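Before packaging, it can be worth confirming that the environment's own interpreter imports all of the dependencies. The sketch below is a minimal check under a couple of assumptions: check_env.py is a file name chosen here, and /root/anaconda2/envs/nn_pyspark is the default location of an environment created as above. Run it with that interpreter, e.g. /root/anaconda2/envs/nn_pyspark/bin/python check_env.py:

# check_env.py -- minimal sanity check, run with the environment's interpreter
import sys

import numpy
import pandas
import paddle  # the paddlepaddle package is imported as `paddle`

print("python : %s" % sys.version.split()[0])
print("numpy  : %s" % numpy.__version__)
print("pandas : %s" % pandas.__version__)
print("paddle : %s" % paddle.__version__)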

4. Package the Python virtual environment
# package the environment directory; the archive must contain nn_pyspark/ at its top level
zip -r nn_pyspark.zip nn_pyspark/
# upload the archive to an HDFS directory
/home/work/kevin_new/spark-client/bin/spark-class org.apache.hadoop.fs.FsShell -put -f \
    /home/work/kevin_new/nn_pyspark.zip hdfs://xxxxx:9902/user/dmp4shareone/tmp/pyspark
# verify the upload
/home/work/kevin_new/spark-client/bin/spark-class org.apache.hadoop.fs.FsShell -ls hdfs://xxxxx:9902/user/dmp4shareone/tmp/pyspark
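Because the spark-submit command in the next step references ./nn_pyspark/nn_pyspark/bin/python, the archive has to contain nn_pyspark/bin/python at its top level. The snippet below (check_zip.py is an assumed name, run from the directory holding the archive) is one quick way to verify that before uploading:

# check_zip.py -- confirm the archive layout matches the paths used in spark-submit
import zipfile

with zipfile.ZipFile("nn_pyspark.zip") as zf:
    names = zf.namelist()
    print("entries in archive            : %d" % len(names))
    print("nn_pyspark/bin/python present : %s" % ("nn_pyspark/bin/python" in names))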

5. Submit the job
# point the driver at the environment's Python
export PATH=/home/work/kevin_new/nn_pyspark/bin:$PATH
# the spark.yarn.appMasterEnv.* and spark.executorEnv.* settings point the
# executors at the Python environment inside the extracted archive; the final
# line is the job argument (the model path) passed to nn_pyspark.py
/home/work/kevin_new/spark-client/bin/spark-submit \
--deploy-mode client \
--master yarn \
--driver-memory 2g \
--num-executors 4 \
--executor-memory 2g \
--executor-cores 2 \
--archives hdfs://xxxxx:9902/user/dmp4shareone/tmp/pyspark/nn_pyspark.zip#nn_pyspark \
--conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./nn_pyspark/nn_pyspark/bin/python" \
--conf "spark.executorEnv.PYSPARK_PYTHON=./nn_pyspark/nn_pyspark/bin/python" \
--conf "spark.executorEnv.PYTHONHOME=./nn_pyspark/nn_pyspark" \
--conf "spark.executorEnv.PYTHONPATH=./nn_pyspark/nn_pyspark/bin" \
/home/work/kevin_new/py_spark_program/nn_pyspark.py \
./nn_pyspark/nn_pyspark/nn_model/bfcmodel
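The contents of nn_pyspark.py are not shown in this post; the sketch below is only a hypothetical skeleton illustrating how such a driver script receives the model-path argument and runs work on executors that use the shipped environment (the app name and the predict logic are placeholders, not the actual model code):

# nn_pyspark.py -- hypothetical skeleton, not the actual model code
import sys

from pyspark import SparkConf, SparkContext


def main():
    # first job argument: the model directory inside the extracted archive,
    # e.g. ./nn_pyspark/nn_pyspark/nn_model/bfcmodel
    model_path = sys.argv[1]
    print("model path: %s" % model_path)

    sc = SparkContext(conf=SparkConf().setAppName("nn_pyspark"))

    def predict(partition):
        # runs on the executors with the interpreter selected by
        # spark.executorEnv.PYSPARK_PYTHON, i.e. the packaged conda environment
        import numpy as np
        for row in partition:
            yield float(np.sum(row))

    demo = sc.parallelize([[1.0, 2.0], [3.0, 4.0]], 2)
    print(demo.mapPartitions(predict).collect())
    sc.stop()


if __name__ == "__main__":
    main()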

Note: if conda times out while installing dependencies such as pandas or numpy, you can point conda at a mirror. The commands are as follows:

# add the TUNA mirror channel for Anaconda
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
# the mirror URL in TUNA's help page is wrapped in quotes, which must be removed
# show channel URLs when searching
conda config --set show_channel_urls yes

Reference: https://www.jianshu.com/p/d3321f7866ce

 
