Setting up Python 3.5 + Jupyter with SparkR, Scala, and PySpark in an Anaconda environment

Multi-user JupyterHub + Kubernetes authentication: https://my.oschina.net/u/2306127/blog/1837196

https://ojerk.cn/Ubuntu%E4%B8%8B%E5%A4%9A%E7%94%A8%E6%88%B7%E7%89%88jupyterhub%E9%83%A8%E7%BD%B2/

https://repo.continuum.io/archive/

Ubuntu 16.04

curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh

Python 3.5.2 (the version bundled with this installer)

 

We can now verify the integrity of the installer by checking its SHA-256 checksum, using the sha256sum command with the filename of the script:

sha256sum Anaconda3-4.2.0-Linux-x86_64.sh

You’ll receive output that looks similar to this:

73b51715a12b6382dd4df3dd1905b531bd6792d4aa7273b2377a0436d45f0e78  Anaconda3-4.2.0-Linux-x86_64.sh

You should check the output against the hashes listed on the Anaconda with Python 3 on 64-bit Linux page for your Anaconda version. As long as your output matches the hash shown in the sha256 row, you're good to go.

Now we can run the script:

bash Anaconda3-4.2.0-Linux-x86_64.sh

Anaconda3 will now be installed into this location:

/opt/anaconda3 
[/opt/anaconda3] >>> 

export PATH=/opt/anaconda3/bin:$PATH

conda create -n jupyter_py352_env python=3.5.2

Installing JupyterHub in single-server, multi-user mode

First install Python 3 or later.

source activate jupyter_env35

 

Run the following commands:

 

sudo apt-get install gcc

sudo apt-get install openssl

sudo apt-get install libssl-dev

CentOS:

sudo yum install openssl-devel


# Install Python 3 and pip (JupyterHub depends on Python 3)
sudo apt-get install python3-pip

# Install the latest npm / nodejs-legacy
sudo apt-get install npm nodejs-legacy

1.2 Installing Node.js

Install nodejs and npm:

apt install nodejs-legacy
apt install npm

Upgrade (recommended; the versions installed by apt are quite old and cause many errors):

npm install -g n  # install n
n stable          # upgrade nodejs
npm install -g npm
# Install the hub and the proxy
conda install jupyterhub
npm install -g configurable-http-proxy
# needed if running the notebook servers locally
pip install jupyter-conda
conda install notebook

 

Check whether the installation succeeded:

jupyterhub -h

configurable-http-proxy -h

 

jupyterhub --no-ssl

Add users for login:

Add users and groups

  • groupadd jupyter_usergroup
  • sudo useradd -c "jupyter user test" -g jupyter_usergroup -d /home/jupyter_user2 -m jupyter_user2
  • ls /home/
  • useradd jupyter_user2

    passwd jupyter_user2    # then enter the password, e.g. 123456

    Example: add user jupyter_user2 to the jupyter_usergroup group:
    usermod -g jupyter_usergroup jupyter_user2

  • List all users in the jupyterhub group:
 GID=`grep 'jupyterhub' /etc/group|awk -F':' '{print $3}'`
 awk -F":" '{print $1"\t"$4}' /etc/passwd |grep $GID
  • passwd <username>    # change a user's password (as root)

 

 

Run the following command to generate a configuration file:

 

jupyterhub --generate-config

 

Edit the configuration file:


## The number of threads to allocate for encryption
#c.CryptKeeper.n_threads =8
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000
c.PAMAuthenticator.encoding = 'utf-8'
c.LocalAuthenticator.create_system_users = True
c.LocalAuthenticator.group_whitelist = {'jupyterhub'}
c.Authenticator.whitelist = {'ubuntu','jupyter_user1','jupyter_user2','test01'}
c.JupyterHub.admin_users = {'ubuntu'}
c.JupyterHub.statsd_prefix = 'jupyterhub'

These settings add the user whitelist and the admin user.

Log in at serverip:8000

Install JupyterLab

Reference: https://blog.csdn.net/ds19991999/article/details/83663349

 

(ljt_env) ubuntu@node1:~/ljt_test$ python

Python 3.5.4 |Anaconda, Inc.| (default, Feb 19 2018, 10:59:04)

[GCC 7.2.0] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> from notebook.auth import passwd

>>> passwd()

Enter password:

Verify password:

'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'

vim /home/ubuntu/.jupyter/jupyter_notebook_config.py

Append the following at the end of the file:

c.NotebookApp.allow_root = True

c.NotebookApp.ip = '0.0.0.0'

c.NotebookApp.notebook_dir = u'/home/ubuntu/ljt_test/jupyterhubHome'

c.NotebookApp.open_browser = False

c.NotebookApp.password = u'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'

c.NotebookApp.port = 8000

 

 

(ljt_env) ubuntu@node1:~/ljt_test$ jupyter-lab --version

0.34.9

 

Starting JupyterHub

1. Activate the Python 3.5 virtual environment

 cd /home/ubuntu/ljt_test

source activate ljt_env

 

2. Check the JupyterHub service

ps -ef|grep jupyterhub

lsof -i:8000

kill -9 45654

tail -fn200 /home/ubuntu/ljt_test/jupyterhub.log

3. If the JupyterHub service is not running, start it with one of the following:

nohup jupyterhub --no-ssl > jupyterhub.log &

nohup jupyterhub -f /etc/jupyterhub_config.py --no-ssl > jupyterhub.log &

nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --ssl-key /home/ubuntu/ljt_test/mykey.key --ssl-cert /home/ubuntu/ljt_test/mycert.pem  > jupyterhub.log &

4. Test access

Visit the server's IP and port in a browser to test access.

5. User management

Users in the whitelist are created automatically but have no password; a password must be set before they can log in.

Add a new user: sudo useradd jupyter_user2 -d /home/ubuntu/jupyter_user2 -m

Add the user to the group: sudo adduser jupyter_user2 jupyterhub

Change a user's password: echo crxis:crxis | chpasswd

Problem:

PAM Authentication failed ([email protected]): [PAM Error 7] Authentication failure

Authenticator options (a sample configuration sketch follows the list):

 
  • PAMAuthenticator: the default, built-in authenticator
  • OAuthenticator: OAuth + JupyterHub authenticator (OAuthenticator)
  • LdapAuthenticator: a simple LDAP authenticator plugin for JupyterHub
  • kdcAuthenticator: a Kerberos authenticator plugin for JupyterHub
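To switch away from the default PAMAuthenticator, the authenticator class is set in jupyterhub_config.py. The snippet below is only a rough sketch for the LDAP case: it assumes the jupyterhub-ldapauthenticator package is installed (pip install jupyterhub-ldapauthenticator), and ldap.example.com and the bind DN template are placeholder values, not part of this setup.

# jupyterhub_config.py -- LDAP sketch; the hostname and bind DN below are hypothetical
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = 'ldap.example.com'
c.LDAPAuthenticator.bind_dn_template = ['uid={username},ou=people,dc=example,dc=com']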

https://blog.chairco.me/posts/2018/06/how%20to%20build%20a%20jupytre-hub%20for%20team.html

Note: on my server the default login user is already the administrator, so the whole process below was done as root. If the current user is not an administrator, all sorts of problems can appear (my guess is that they are authentication related; the blog post above records the PAM authentication configuration in detail and is worth trying).

https://jupyterhub.readthedocs.io/en/latest/quickstart.html

https://cloud.tencent.com/developer/article/1349526

https://cloud.tencent.com/developer/article/1349531

 

  • jupyterhub -f ./jupyterhub_config.py --ssl-key ./mykey.key --ssl-cert ./mycert.pem 

References:

https://www.digitalocean.com/community/tutorials/how-to-install-the-anaconda-python-distribution-on-ubuntu-16-04

https://github.com/jupyterhub/jupyterhub/wiki/Installation-of-Jupyterhub-on-remote-server

 

https://www.cnblogs.com/crxis/p/9078278.html

 

https://blog.csdn.net/papageno_xue/article/details/79710708

https://49.4.6.110:8000/

Setting up JupyterLab on a cloud server:

 

https://blog.csdn.net/ds19991999/article/details/83663349

Notes on configuring JupyterHub as a system service on a server:

https://blog.huzicheng.com/2018/01/04/jupyter-as-a-service/

Configuring JupyterHub:

https://www.brothereye.cn/ubuntu/431/

Making PySpark work inside JupyterHub

conda install -c conda-forge pyspark==2.2.0 

vim /etc/profile 

export JAVA_HOME=/usr/java/jdk1.8.0_181

export JRE_HOME=$JAVA_HOME/jre

export CLASSPATH=.:${JAVA_HOME}/lib

export PATH=$PATH:$HADOOP_HOME/bin

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export PATH=${SPARK_HOME}/bin:$PATH

export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

source /etc/profile
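Before pointing JupyterHub at Spark, it is worth running a quick smoke test from a notebook cell. The cell below is a minimal sketch: it assumes only the conda-installed pyspark package and the variables from /etc/profile above, and it uses a local[*] master purely to verify the installation (switch to yarn once HADOOP_CONF_DIR points at the cluster configuration).

# Minimal PySpark smoke test for a notebook cell
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")      # use "yarn" once HADOOP_CONF_DIR is configured
         .appName("jupyterhub-pyspark-smoke-test")
         .getOrCreate())

print(spark.range(10).count())    # expect 10
spark.stop()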

 

How to integrate JupyterHub with the existing Cloudera cluster

https://github.com/jupyterhub/jupyterhub/issues/2116

 

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
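With these two variables exported, running the plain pyspark command starts a Jupyter notebook server as the Spark driver instead of the usual interactive PySpark shell.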

 

vim /home/ubuntu/anaconda3/share/jupyter/kernels/python3/kernel.json

{
	"argv": ["/home/ljt/anaconda3/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
	"display_name": "Python 3.6+Pyspark2.4.3",
	"language": "python",
	"env": {
		"HADOOP_CONF_DIR": "/mnt/e/hadoop/3.1.1/conf",
		"PYSPARK_PYTHON": "/home/ljt/anaconda3/bin/python",
		"SPARK_HOME": "/mnt/f/spark/2.4.3",
		"WRAPPED_SPARK_HOME": "/etc/spark",
		"PYTHONPATH": "/mnt/f/spark/2.4.3/python/:/mnt/f/spark/2.4.3/python/lib/py4j*src.zip",
		"PYTHONSTARTUP": "/mnt/f/spark/2.4.3/python/pyspark/shell.py",
		"PYSPARK_SUBMIT_ARGS": "--master yarn  --deploy-mode client pyspark-shell" 
	}
}
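Because PYTHONSTARTUP in this kernel spec points at pyspark's shell.py, a notebook started with the "Python 3.6+Pyspark2.4.3" kernel should come up with spark and sc already defined (assuming the paths in the spec are valid on the machine). A quick first-cell check, sketched below:

# Run in a notebook using the kernel above; `sc` and `spark` are created by
# pyspark/shell.py at kernel startup, so they are not imported here.
print(sc.version)                            # e.g. 2.4.3
print(sc.parallelize(range(100)).sum())      # expect 4950
spark.sql("SELECT 1 AS ok").show()           # exercises the SQL path on YARN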

jupyterhub -f jupyterhub_config.py

nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --no-ssl > jupyterhub.log &

nohup jupyter lab --config=/home/ubuntu/.jupyter/jupyter_notebook_config.py   > jupyter.log &

Scala

https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b

Step 1: install the package

pip install spylon-kernel

Step 2: create a kernel spec

This will let us select the Scala kernel in the notebook.

python -m spylon_kernel install

Step 3: start the Jupyter notebook

jupyter notebook

In the notebook, select New -> spylon-kernel. This will start our Scala kernel.

Step 4: test the notebook

Let's write some Scala code:

val x = 2
val y = 3
x+y

R kernel

conda install -c r r-irkernel 

https://anaconda.org/chdoig/jupyter-and-conda-for-r/notebook

conda install -c r r-essentials

conda create -n my-r-env -c r r-essentials

https://github.com/rstudio/sparklyr

http://spark.apache.org/docs/2.2.0/sparkr.html

https://www.cnblogs.com/lalabola/p/7423902.html

 

export LD_LIBRARY_PATH="/usr/java/jdk1.8.0_181/jre/lib/amd64/server"

rJava

https://github.com/hannarud/r-best-practices/wiki/Installing-RJava-(Ubuntu)

The Java installation needs to be under /usr/lib/jvm.

Installing and configuring SparkR on YARN

https://www.jianshu.com/p/9a144c508059

conda create -p /home/ubuntu/anaconda3/envs/r_env --copy -y -q r-essentials -c r

 

 

sparkR --executor-memory 2g --total-executor-cores 10 --master spark://node1.sc.com:7077

 

http://cleverowl.uk/2016/10/15/installing-jupyter-with-the-pyspark-and-r-kernels-for-spark-development/

https://spark.rstudio.com/

https://spark.rstudio.com/examples/cloudera-aws/

 

Sys.setenv(SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark')

.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))

 

library(SparkR)

 

sc <- sparkR.init(master='yarn-client', sparkPackages="com.databricks:spark-csv_2.11:1.5.0")

sqlContext <- sparkRSQL.init(sc)

 

df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")

head(df)

 

 

configurable-http-proxy --ip **.*.6.110 --port 8000 --log-level=debug
openssl rand -hex 32
0b042f8d651fb8126537d1ec98507b093653d1ffe4b909f053616062184b1db3
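The configurable-http-proxy command above starts the proxy by hand, and openssl rand -hex 32 produces a token that the hub and the proxy can share. A rough sketch of the matching jupyterhub_config.py entries follows; the option names are those used by JupyterHub 0.8+ with its default ConfigurableHTTPProxy class, so verify them against the documentation of the version actually installed.

# jupyterhub_config.py -- sketch for pairing the hub with an externally started proxy.
# The token must equal the CONFIGPROXY_AUTH_TOKEN exported before launching
# configurable-http-proxy; the value here is just the example generated above.
c.ConfigurableHTTPProxy.should_start = False
c.ConfigurableHTTPProxy.auth_token = '0b042f8d651fb8126537d1ec98507b093653d1ffe4b909f053616062184b1db3'
c.ConfigurableHTTPProxy.api_url = 'http://127.0.0.1:8001'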

 

 

sparklyr:

https://spark.rstudio.com/guides/connections/#yarn

https://www.r-bloggers.com/sparklyr-a-test-drive-on-yarn/

https://blog.csdn.net/The_One_is_all/article/details/75307270

# .R script showing capabilities of sparklyr R package
# Prerequisites before running this R script: 
# Ubuntu 16.04.3 LTS 64-bit, r-base (version 3.4.1 or newer), 
# RStudio 64-bit version, libssl-dev, libcurl4-openssl-dev, libxml2-dev
install.packages("httr")
install.packages("xml2")
# New features in sparklyr 0.6:
# https://blog.rstudio.com/2017/07/31/sparklyr-0-6/
install.packages("sparklyr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
library(sparklyr)
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(100)
# sparklyr cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/sparklyr.pdf
# dplyr+tidyr: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
# sparklyr currently (2017-08-19) only supports Apache Spark version 2.2.0 or older
# Install Spark locally:
sc_version <- "2.2.0"
spark_install(sc_version)
config <- spark_config()
# number of CPU cores to use:
config$spark.executor.cores <- 6
# amount of RAM to use for Apache Spark executors:
config$spark.executor.memory <- "4G"
# Connect to local version:
sc <- spark_connect (master = "local",
 config = config, version = sc_version)
# Copy data to Spark memory:
import_iris <- sdf_copy_to(sc, iris, "spark_iris", overwrite = TRUE) 
# partition data:
partition_iris <- sdf_partition(import_iris,training=0.5, testing=0.5) 
# Create a hive metadata for each partition:
sdf_register(partition_iris,c("spark_iris_training","spark_iris_test")) 
# Create reference to training data in Spark table
tidy_iris <- tbl(sc,"spark_iris_training") %>% select(Species, Petal_Length, Petal_Width) 
# Spark ML Decision Tree Model
model_iris <- tidy_iris %>% ml_decision_tree(response="Species", features=c("Petal_Length","Petal_Width")) 
# Create reference to test data in Spark table
test_iris <- tbl(sc,"spark_iris_test") 
# Bring predictions data back into R memory for plotting:
pred_iris <- sdf_predict(model_iris, test_iris) %>% collect
pred_iris %>%
 inner_join(data.frame(prediction=0:2,
 lab=model_iris$model.parameters$labels)) %>%
 ggplot(aes(Petal_Length, Petal_Width, col=lab)) +
 geom_point() 
spark_disconnect(sc)
