Setting up Python 3.5 + Jupyter with SparkR, Scala and PySpark in an Anaconda environment
Multi-user JupyterHub + Kubernetes authentication: https://my.oschina.net/u/2306127/blog/1837196
https://ojerk.cn/Ubuntu%E4%B8%8B%E5%A4%9A%E7%94%A8%E6%88%B7%E7%89%88jupyterhub%E9%83%A8%E7%BD%B2/
https://repo.continuum.io/archive/
Ubuntu 16.04
curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
python3.5.2
We can now verify the data integrity of the installer with cryptographic hash verification through the SHA-256 checksum. We’ll use the sha256sum command along with the filename of the script:
sha256sum Anaconda3-4.2.0-Linux-x86_64.sh
You’ll receive output that looks similar to this:
73b51715a12b6382dd4df3dd1905b531bd6792d4aa7273b2377a0436d45f0e78 Anaconda3-4.2.0-Linux-x86_64.sh
You should check the output against the hashes available at the Anaconda with Python 3 on 64-bit Linux page for your appropriate Anaconda version. As long as your output matches the hash displayed in the sha256 row then you’re good to go.
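The same check can be done from Python when sha256sum is not available. This is a minimal sketch using only the standard library; the function names are made up for illustration, and the expected hash is whatever the Anaconda hash page lists for your installer.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large installers are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the computed digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("Anaconda3-4.2.0-Linux-x86_64.sh", "73b51715a12b6382dd4df3dd1905b531bd6792d4aa7273b2377a0436d45f0e78") should return True for an intact download of that installer.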
Now we can run the script:
bash Anaconda3-4.2.0-Linux-x86_64.sh
Anaconda3 will now be installed into this location:
/opt/anaconda3
[/opt/anaconda3] >>>
export PATH=/opt/anaconda3/bin:$PATH
conda create -n jupyter_py352_env python=3.5.2
First install Python 3 or later.
source activate jupyter_py352_env
Run the following commands:
sudo apt-get install gcc
sudo apt-get install openssl
sudo apt-get install libssl-dev
On CentOS:
sudo yum install openssl-devel
# Installing Python3 (dependency of jupyterhub is on python3)
$ sudo apt-get install python3-pip
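# Install the latest npm/nodejs-legacy
sudo apt-get install npm nodejs-legacy
Installing nodejs and npm:
apt install nodejs-legacy
apt install npm
Update (recommended: the versions installed by apt are too old and cause many errors):
npm install -g n # install the n version manager
n stable # update nodejs to the latest stable release
npm install -g npm
# Install the hub and the proxy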
conda install jupyterhub
npm install -g configurable-http-proxy
# needed if running the notebook servers locally
pip install jupyter-conda
conda install notebook
Check that the installation succeeded:
jupyterhub -h
configurable-http-proxy -h
jupyterhub --no-ssl
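Add a user for logging in:
Add the user and group
useradd jupyter_user2
passwd jupyter_user2 # prompts for the new password (e.g. 123456); passwd does not take the password as an argument
Example: add a user to a group (here jupyter_user2 to jupyter_usergroup):
usermod -aG jupyter_usergroup jupyter_user2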
GID=`grep 'jupyterhub' /etc/group|awk -F':' '{print $3}'`
awk -F":" '{print $1"\t"$4}' /etc/passwd |grep $GID
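The grep/awk pair above looks up the group's GID and then lists the accounts that use it as their primary group. A rough Python equivalent is sketched below; it works on the text of /etc/group and /etc/passwd, also includes the supplementary members listed in the group entry, and matches the group name exactly rather than by substring. The function name is hypothetical.

```python
def group_members(group_name, group_text, passwd_text):
    """Return the set of users that belong to `group_name`.

    `group_text` and `passwd_text` are the contents of /etc/group
    (name:password:GID:user_list) and /etc/passwd (name:password:UID:GID:...).
    """
    gid = None
    members = set()
    for line in group_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) >= 4 and fields[0] == group_name:
            gid = fields[2]
            # Supplementary members are a comma-separated list in the 4th field.
            members.update(user for user in fields[3].split(",") if user)
    if gid is None:
        return members  # group not found
    for line in passwd_text.splitlines():
        fields = line.strip().split(":")
        # Users whose primary group (4th field of /etc/passwd) is this GID.
        if len(fields) >= 4 and fields[3] == gid:
            members.add(fields[0])
    return members
```

On a real host it could be called as group_members("jupyterhub", open("/etc/group").read(), open("/etc/passwd").read()).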
Run the following command to generate the config file:
jupyterhub --generate-config
Edit the config file:
## The number of threads to allocate for encryption
#c.CryptKeeper.n_threads =8
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000
c.PAMAuthenticator.encoding = 'utf-8'
c.LocalAuthenticator.create_system_users = True
c.LocalAuthenticator.group_whitelist = {'jupyterhub'}
c.Authenticator.whitelist = {'ubuntu','jupyter_user1','jupyter_user2','test01'}
c.JupyterHub.admin_users = {'ubuntu'}
c.JupyterHub.statsd_prefix = 'jupyterhub'
Add the whitelist and admin users
Log in at serverip:8000
Reference: https://blog.csdn.net/ds19991999/article/details/83663349
(ljt_env) ubuntu@node1:~/ljt_test$ python
Python 3.5.4 |Anaconda, Inc.| (default, Feb 19 2018, 10:59:04)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from notebook.auth import passwd
>>> passwd()
Enter password:
Verify password:
'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'
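The string returned by passwd() has the form algorithm:salt:hexdigest. The sketch below assumes the legacy layout in which the digest is the hash of the password followed by the salt; that is my understanding of how the notebook package builds and checks sha1 hashes, not a replacement for notebook.auth. The function names are made up for illustration.

```python
import hashlib
import hmac


def make_hash(passphrase, salt, algorithm="sha1"):
    """Build an `algorithm:salt:hexdigest` string (assumed legacy notebook format)."""
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))


def check_hash(hashed, passphrase):
    """Return True if `passphrase` matches the stored `algorithm:salt:hexdigest` string."""
    try:
        algorithm, salt, digest = hashed.split(":", 2)
        # A stray leading space makes the algorithm name unknown and raises ValueError.
        h = hashlib.new(algorithm)
    except ValueError:
        return False
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return hmac.compare_digest(h.hexdigest(), digest)
```

Because the string is split on ":" without trimming, it has to be stored exactly as printed, with no spaces inside the quotes.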
vim /home/ubuntu/.jupyter/jupyter_notebook_config.py
Append the following at the end of the file:
c.NotebookApp.allow_root = True
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.notebook_dir = u'/home/ubuntu/ljt_test/jupyterhubHome'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'
c.NotebookApp.port = 8000
(ljt_env) ubuntu@node1:~/ljt_test$ jupyter-lab --version
0.34.9
Starting JupyterHub
1. Activate the Python 3.5 virtual environment
cd /home/ubuntu/ljt_test
source activate ljt_env
2. Check the JupyterHub service
ps -ef|grep jupyterhub
lsof -i:8000
kill -9 45654 # replace 45654 with the PID reported by the commands above
tail -fn200 /home/ubuntu/ljt_test/jupyterhub.log
3. If the JupyterHub service is not running, start it
nohup jupyterhub --no-ssl > jupyterhub.log &
nohup jupyterhub -f /etc/jupyterhub_config.py --no-ssl > jupyterhub.log &
nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --ssl-key /home/ubuntu/ljt_test/mykey.key --ssl-cert /home/ubuntu/ljt_test/mycert.pem > jupyterhub.log &
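4. Test access
Open IP:port in a browser to test access.
5. User management
Users on the whitelist are added automatically but have no password; a password must be set before they can log in.
Add a new user: sudo useradd jupyter_user2 -d /home/ubuntu/jupyter_user2 -m
Add the user to a group: sudo adduser jupyter_user2 jupyterhub
Change a user's password: echo crxis:crxis | chpasswd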
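The access test in step 4 can also be scripted. The sketch below only checks that something accepts TCP connections on the given host and port; it does not confirm that the listener is JupyterHub. The function name is hypothetical, and the host and port are whatever the hub was configured with (port 8000 in these notes).

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, is_port_open("127.0.0.1", 8000) run on the server itself should return True once the hub is up.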
Problem:
PAM Authentication failed (<username>@<remote-ip>): [PAM Error 7] Authentication failure
Authenticators
PAMAuthenticator | Default, built-in authenticator |
OAuthenticator | OAuth + JupyterHub authenticator = OAuthenticator |
LdapAuthenticator | Simple LDAP authenticator plugin for JupyterHub |
kdcAuthenticator | Kerberos authenticator plugin for JupyterHub |
https://blog.chairco.me/posts/2018/06/how%20to%20build%20a%20jupytre-hub%20for%20team.html
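Note: on my server the login account is the administrator by default, so everything below was installed as root. If the current user is not an administrator, all sorts of problems show up (my guess is that they are authentication-related; the blog post above documents the PAM authentication setup in detail and is worth trying).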
https://jupyterhub.readthedocs.io/en/latest/quickstart.html
https://cloud.tencent.com/developer/article/1349526
https://cloud.tencent.com/developer/article/1349531
References:
https://www.digitalocean.com/community/tutorials/how-to-install-the-anaconda-python-distribution-on-ubuntu-16-04
https://github.com/jupyterhub/jupyterhub/wiki/Installation-of-Jupyterhub-on-remote-server
https://www.cnblogs.com/crxis/p/9078278.html
https://blog.csdn.net/papageno_xue/article/details/79710708
https://49.4.6.110:8000/
Setting up JupyterLab on a cloud server
https://blog.csdn.net/ds19991999/article/details/83663349
Notes on configuring JupyterHub as a system service on a server
https://blog.huzicheng.com/2018/01/04/jupyter-as-a-service/
https://www.brothereye.cn/ubuntu/431/
conda install -c conda-forge pyspark==2.2.0
vim /etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_181
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:${JAVA_HOME}/lib
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export PATH=${SPARK_HOME}/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
source /etc/profile
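The PYTHONPATH entry hard-codes the py4j version, which differs between Spark releases. As an alternative, the two paths can be derived from SPARK_HOME at runtime. This is a sketch with a made-up function name; it assumes the usual layout with python/ and python/lib/py4j-*-src.zip under SPARK_HOME.

```python
import glob
import os


def pyspark_sys_paths(spark_home):
    """Return the directories and zip files that make `import pyspark` work.

    Assumes the layout <spark_home>/python and <spark_home>/python/lib/py4j-*-src.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    # The py4j version depends on the Spark release, so locate the zip by pattern.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips
```

In a notebook this could be used as sys.path[:0] = pyspark_sys_paths(os.environ["SPARK_HOME"]) before importing pyspark.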
How to integrate JupyterHub with the existing Cloudera cluster
https://github.com/jupyterhub/jupyterhub/issues/2116
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
vim /home/ubuntu/anaconda3/share/jupyter/kernels/python3/kernel.json
{
  "argv": ["/home/ljt/anaconda3/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "Python 3.6+Pyspark2.4.3",
  "language": "python",
  "env": {
    "HADOOP_CONF_DIR": "/mnt/e/hadoop/3.1.1/conf",
    "PYSPARK_PYTHON": "/home/ljt/anaconda3/bin/python",
    "SPARK_HOME": "/mnt/f/spark/2.4.3",
    "WRAPPED_SPARK_HOME": "/etc/spark",
    "PYTHONPATH": "/mnt/f/spark/2.4.3/python/:/mnt/f/spark/2.4.3/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/mnt/f/spark/2.4.3/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell"
  }
}
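A kernel spec like the one above can also be generated instead of edited by hand, which keeps the repeated SPARK_HOME paths consistent. This is a sketch only: the function name and parameters are invented for illustration, the caller supplies the py4j zip file name that ships with their Spark, and the HADOOP_CONF_DIR and WRAPPED_SPARK_HOME entries are left out for brevity.

```python
import json


def pyspark_kernel_spec(python_bin, spark_home, py4j_zip, display_name, master="yarn"):
    """Build a kernel.json dictionary for a PySpark-enabled Python kernel."""
    return {
        "argv": [python_bin, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
        "env": {
            "PYSPARK_PYTHON": python_bin,
            "SPARK_HOME": spark_home,
            # Environment variables are not glob-expanded, so the zip name must be exact.
            "PYTHONPATH": "{0}/python/:{0}/python/lib/{1}".format(spark_home, py4j_zip),
            "PYTHONSTARTUP": "{0}/python/pyspark/shell.py".format(spark_home),
            "PYSPARK_SUBMIT_ARGS": "--master {0} --deploy-mode client pyspark-shell".format(master),
        },
    }


def render_kernel_json(spec):
    """Serialize the spec as indented JSON, ready to write to kernel.json."""
    return json.dumps(spec, indent=2, sort_keys=True)
```

The rendered text would be written to kernel.json inside the kernel's directory, for example the python3 directory edited above.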
jupyterhub -f jupyterhub_config.py
nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --no-ssl > jupyterhub.log &
nohup jupyter lab --config=/home/ubuntu/.jupyter/jupyter_notebook_config.py > jupyter.log &
https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b
Step1: install the package
pip install spylon-kernel
Step2: create a kernel spec
This will allow us to select the scala kernel in the notebook.
python -m spylon_kernel install
Step3: start the jupyter notebook
jupyter notebook
And in the notebook we select New -> spylon-kernel. This will start our scala kernel.
Step4: testing the notebook
Let’s write some scala code:
val x = 2
val y = 3
x+y
conda install -c r r-irkernel
https://anaconda.org/chdoig/jupyter-and-conda-for-r/notebook
conda install -c r r-essentials
conda create -n my-r-env -c r r-essentials
https://github.com/rstudio/sparklyr
http://spark.apache.org/docs/2.2.0/sparkr.html
https://www.cnblogs.com/lalabola/p/7423902.html
export LD_LIBRARY_PATH="/usr/java/jdk1.8.0_181/jre/lib/amd64/server"
rJava
https://github.com/hannarud/r-best-practices/wiki/Installing-RJava-(Ubuntu)
The Java installation needs to be under /usr/lib/jvm
https://www.jianshu.com/p/9a144c508059
conda create -p /home/ubuntu/anaconda3/envs/r_env --copy -y -q r-essentials -c r
sparkR --executor-memory 2g --total-executor-cores 10 --master spark://node1.sc.com:7077
http://cleverowl.uk/2016/10/15/installing-jupyter-with-the-pyspark-and-r-kernels-for-spark-development/
https://spark.rstudio.com/
https://spark.rstudio.com/examples/cloudera-aws/
Sys.setenv(SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
library(SparkR)
sc <- sparkR.init(master='yarn-client', sparkPackages="com.databricks:spark-csv_2.11:1.5.0")
sqlContext <- sparkRSQL.init(sc)
df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
head(df)
configurable-http-proxy --ip **.*.6.110 --port 8000 --log-level=debug
openssl rand -hex 32
0b042f8d651fb8126537d1ec98507b093653d1ffe4b909f053616062184b1db3
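The same kind of value can be produced from Python. The notes do not say what this token is for; a 32-byte hex string is the format commonly used for a proxy auth token or a cookie secret, which is an assumption here. The function name is made up; os.urandom with bytes.hex() is used because it is available on Python 3.5.

```python
import os


def random_hex_token(n_bytes=32):
    """Return `n_bytes` of OS-provided randomness as a lowercase hex string."""
    # Equivalent in spirit to `openssl rand -hex 32`: 32 bytes give 64 hex characters.
    return os.urandom(n_bytes).hex()
```

Each call returns a fresh value, so the token should be generated once and stored rather than regenerated on every start.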
sparklyr:
https://spark.rstudio.com/guides/connections/#yarn
https://www.r-bloggers.com/sparklyr-a-test-drive-on-yarn/
https://blog.csdn.net/The_One_is_all/article/details/75307270
# .R script showing capabilities of sparklyr R package
# Prerequisites before running this R script:
# Ubuntu 16.04.3 LTS 64-bit, r-base (version 3.4.1 or newer),
# RStudio 64-bit version, libssl-dev, libcurl4-openssl-dev, libxml2-dev
install.packages("httr")
install.packages("xml2")
# New features in sparklyr 0.6:
# https://blog.rstudio.com/2017/07/31/sparklyr-0-6/
install.packages("sparklyr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
library(sparklyr)
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(100)
# sparklyr cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/sparklyr.pdf
# dplyr+tidyr: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
# sparklyr currently (2017-08-19) only supports Apache Spark version 2.2.0 or older
# Install Spark locally:
sc_version <- "2.2.0"
spark_install(sc_version)
config <- spark_config()
# number of CPU cores to use:
config$spark.executor.cores <- 6
# amount of RAM to use for Apache Spark executors:
config$spark.executor.memory <- "4G"
# Connect to local version:
sc <- spark_connect(master = "local",
                    config = config, version = sc_version)
# Copy data to Spark memory:
import_iris <- sdf_copy_to(sc, iris, "spark_iris", overwrite = TRUE)
# partition data:
partition_iris <- sdf_partition(import_iris,training=0.5, testing=0.5)
# Create a hive metadata for each partition:
sdf_register(partition_iris,c("spark_iris_training","spark_iris_test"))
# Create reference to training data in Spark table
tidy_iris <- tbl(sc,"spark_iris_training") %>% select(Species, Petal_Length, Petal_Width)
# Spark ML Decision Tree Model
model_iris <- tidy_iris %>% ml_decision_tree(response="Species", features=c("Petal_Length","Petal_Width"))
# Create reference to test data in Spark table
test_iris <- tbl(sc,"spark_iris_test")
# Bring predictions data back into R memory for plotting:
pred_iris <- sdf_predict(model_iris, test_iris) %>% collect
pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
                        lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()
spark_disconnect(sc)