Installing Kubeflow
Base environment:
OS: CentOS 7.6
Kubernetes: 1.14
Resource requirements:
Kubernetes >= 1.11
CPU >= 4 cores
Storage >= 50 GB
Memory >= 12 GB
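Before starting, it can help to confirm the node actually meets these minimums. A rough pre-flight check (a sketch; thresholds copied from the list above):

```shell
#!/bin/bash
# Rough check of the current node against the minimums listed above.
cpus=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
echo "CPUs: ${cpus}"
echo "Memory: ${mem_gb} GB"
[ "${cpus}" -ge 4 ] || echo "WARNING: fewer than 4 CPUs"
[ "${mem_gb}" -ge 12 ] || echo "WARNING: less than 12 GB of memory"
```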
Install kfctl
wget https://github.com/kubeflow/kubeflow/releases/download/v0.6.2/kfctl_v0.6.2_linux.tar.gz
tar -zxvf kfctl_v0.6.2_linux.tar.gz
mv kfctl /k8s/kubernetes/bin
Install ksonnet (ks)
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz
tar -xaf ks_0.13.1_linux_amd64.tar.gz
mv ks_0.13.1_linux_amd64/ks /k8s/kubernetes/bin/ks
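Assuming /k8s/kubernetes/bin is already on PATH, a quick check that both binaries are resolvable (a sketch):

```shell
#!/bin/bash
# Verify kfctl and ks are reachable on PATH; print where each resolves to.
for bin in kfctl ks; do
  if command -v "${bin}" >/dev/null 2>&1; then
    echo "${bin}: $(command -v "${bin}")"
  else
    echo "${bin}: not found - check that /k8s/kubernetes/bin is in PATH"
  fi
done
```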
Configure Kubeflow
export KFAPP=kfapp
export CONFIG="https://raw.githubusercontent.com/kubeflow/kubeflow/v0.6-branch/bootstrap/config/kfctl_k8s_istio.0.6.2.yaml"
kfctl init ${KFAPP} --config=${CONFIG} -V
cd ${KFAPP}
kfctl generate all -V
Note: the manifest referenced by the CONFIG environment variable installs Istio and wires the Kubeflow components up to Istio.
The default image registry is gcr.io, whose servers are outside China and cannot be reached, so the images fail to pull. Microsoft's mirror registry (gcr.azk8s.cn) provides the Kubeflow images, so I collected every image that would otherwise come from gcr.io, pulled them from the mirror, and pushed them to my own registry.
Set up a local image registry on server 192.168.2.10:
sudo docker run -d -p 5000:5000 --restart always --name registry registry:2
Edit the Docker configuration on every machine:
cat /etc/docker/daemon.json
{
  "insecure-registries": [
    "192.168.2.10:5000"
  ]
}
Restart Docker:
sudo systemctl restart docker
A script to pull the images automatically and push them to the local registry:
#!/bin/bash
# Pull each image from the Azure China mirror, retag it for the local
# registry at 192.168.2.10:5000, push it, then remove the mirror tag.
images=(
gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c
gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta
gcr.io/ml-pipeline/api-server:0.1.23
gcr.io/kubeflow-images-public/ingress-setup:latest
gcr.io/kubeflow-images-public/centraldashboard:v20190823-v0.6.0-rc.0-69-gcb7dab59
gcr.io/kubeflow-images-public/jupyter-web-app:9419d4d
gcr.io/kubeflow-images-public/katib/v1alpha2/katib-controller:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/katib-manager-rest:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-bayesianoptimization:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-grid:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-hyperband:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-nasrl:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-random:v0.6.0-rc.0
gcr.io/kubeflow-images-public/katib/v1alpha2/katib-ui:v0.6.0-rc.0
gcr.io/kubeflow-images-public/metadata:v0.1.8
gcr.io/kubeflow-images-public/metadata-frontend:v0.1.8
gcr.io/ml-pipeline/persistenceagent:0.1.23
gcr.io/ml-pipeline/scheduledworkflow:0.1.23
gcr.io/ml-pipeline/frontend:0.1.23
gcr.io/ml-pipeline/viewer-crd-controller:0.1.23
gcr.io/kubeflow-images-public/notebook-controller:v20190603-v0-175-geeca4530-e3b0c4
gcr.io/kubeflow-images-public/profile-controller:v20190619-v0-219-gbd3daa8c-dirty-1ced0e
gcr.io/kubeflow-images-public/kfam:v20190612-v0-170-ga06cdb79-dirty-a33ee4
gcr.io/kubeflow-images-public/pytorch-operator:v1.0.0-rc.0
gcr.io/google_containers/spartakus-amd64:v1.1.0
gcr.io/kubeflow-images-public/tf_operator:v0.6.0.rc0
)
gcr="gcr.io"
mirror="gcr.azk8s.cn"          # Azure China mirror of gcr.io
registry="192.168.2.10:5000"   # local registry
for image in "${images[@]}"; do
  mirror_image=${image/${gcr}/${mirror}}     # e.g. gcr.azk8s.cn/ml-pipeline/...
  local_image=${image/${gcr}/${registry}}    # e.g. 192.168.2.10:5000/ml-pipeline/...
  sudo docker pull "${mirror_image}"
  sudo docker tag "${mirror_image}" "${local_image}"
  sudo docker push "${local_image}"
  sudo docker rmi "${mirror_image}"          # drop the mirror tag once pushed
done
Once the images are ready, update the manifests: search each component's deployment.yaml for images hosted on gcr.io and change the registry to 192.168.2.10:5000. Pay attention to the tags as well; the following images differ in version from the ones I downloaded:
gcr.io/kubeflow-images-public/jupyter-web-app:v0.5.0
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-bayesianoptimization:v0.1.2-alpha-289-g14dad8b
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-hyperband:v0.1.2-alpha-289-g14dad8b
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-nasrl:v0.1.2-alpha-289-g14dad8b
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-random:v0.1.2-alpha-289-g14dad8b
gcr.io/kubeflow-images-public/katib/v1alpha2/suggestion-grid:v0.1.2-alpha-289-g14dad8b
gcr.io/kubeflow-images-public/kubernetes-sigs/application
gcr.io/kubeflow-images-public/katib/v1alpha2/katib-controller:v0.1.2-alpha-289-g14dad8b
gcr.io/kubeflow-images-public/katib/v1alpha2/katib-ui:v0.1.2-alpha-289-g14dad8b
gcr.io/kubeflow-images-public/notebook-controller:v20190614-v0-160-g386f2749-e3b0c4
At first I simply retagged the gcr.azk8s.cn images back to gcr.io, but several images still failed to pull, so I switched to the local-registry approach. The root cause turned out to be the version differences listed above.
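The registry substitution itself can be scripted with sed. A sketch, demonstrated on a throwaway file so the effect is visible (a real run would target the deployment.yaml files under the kfapp directory; the exact path depends on how kfctl generated your layout):

```shell
#!/bin/bash
# Rewrite gcr.io image references to the local registry, shown on a temp file.
tmp=$(mktemp)
echo '        image: gcr.io/ml-pipeline/api-server:0.1.23' > "${tmp}"
sed -i 's|gcr\.io|192.168.2.10:5000|g' "${tmp}"
cat "${tmp}"   # the image line now points at 192.168.2.10:5000/ml-pipeline/...
rm -f "${tmp}"
```

Even with a bulk substitution like this, review each tag by hand afterwards, given the version mismatches noted above.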
After updating the images in all components, create the PVs Kubeflow needs:
vim katib-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-pv2
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/data3/katib"
vim metadata-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: metadata-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/data3/metadata"
vim minmo-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: minmo-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/data3/minmo"
vim mysql-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: "/data3/mysql"
The PVs above all use hostPath volumes on the local filesystem; you could also set up NFS as the storage backend.
kubectl apply -f katib-pv.yaml
kubectl apply -f metadata-pv.yaml
kubectl apply -f minmo-pv.yaml
kubectl apply -f mysql-pv.yaml
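If you do go with NFS, a PV of the same shape only swaps the hostPath block for an nfs block. A sketch, where the server address and export path are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-pv2
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  nfs:
    server: 192.168.2.10   # assumed NFS server
    path: /exports/katib   # assumed export path
```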
Create the namespace kubeflow-anonymous, because the installation fails if this default Kubeflow namespace does not already exist (note that dots are not valid in namespace names):
kubectl create ns kubeflow-anonymous
Next, install Kubeflow:
kfctl apply all -V
When it finishes, check pod status:
kubectl -n kubeflow get pod
NAME READY STATUS RESTARTS AGE
admission-webhook-bootstrap-stateful-set-0 1/1 Running 0 24h
admission-webhook-deployment-75bb567b88-vm4f8 1/1 Running 0 24h
application-controller-stateful-set-0 1/1 Running 0 24h
argo-ui-5dcf5d8b4f-482p8 1/1 Running 0 2d
centraldashboard-cf4874ddc-dkdqg 1/1 Running 0 24h
jupyter-web-app-deployment-58ddb666cf-48v2h 1/1 Running 0 24h
katib-controller-847bf97f4c-xtfw8 1/1 Running 1 2d
katib-db-8598468fd8-vd2gh 1/1 Running 0 24h
katib-manager-84fb8cd98f-zwstg 1/1 Running 1 2d
katib-manager-rest-778857c989-n9wkv 1/1 Running 0 24h
katib-suggestion-bayesianoptimization-7797c8fc95-7kbzh 1/1 Running 0 2d
katib-suggestion-grid-b465d99bc-jzvnd 1/1 Running 0 2d
katib-suggestion-hyperband-8d49fdc49-mwhd5 1/1 Running 0 24h
katib-suggestion-nasrl-5b96f5db8f-pbwpb 1/1 Running 0 2d
katib-suggestion-random-6fbd7d697f-lldhl 1/1 Running 0 2d
katib-ui-594cb5f779-frl98 1/1 Running 0 2d
metacontroller-0 1/1 Running 0 2d
metadata-db-5dd459cc-r529k 1/1 Running 0 2d
metadata-deployment-7f698d8f8d-8m4sp 1/1 Running 0 24h
metadata-deployment-7f698d8f8d-m2t4h 1/1 Running 0 2d
metadata-deployment-7f698d8f8d-t5sw9 1/1 Running 0 24h
metadata-ui-7b85b56578-rd9nj 1/1 Running 0 2d
minio-758b769d67-55gbg 1/1 Running 0 2d
ml-pipeline-7bdb8985fc-jglwq 1/1 Running 0 2d
ml-pipeline-persistenceagent-867856dfbb-8ltn2 1/1 Running 0 2d
ml-pipeline-scheduledworkflow-788465ccd8-ftx5c 1/1 Running 0 24h
ml-pipeline-ui-7c8875b796-f7ks6 1/1 Running 0 24h
ml-pipeline-viewer-controller-deployment-7b664cb7d4-hpd4w 1/1 Running 0 2d
mysql-657f87857d-lfnkj 1/1 Running 0 2d
notebook-controller-deployment-5b7975d5bf-99x9d 1/1 Running 0 24h
profiles-deployment-7bbb9586d9-lrpnf 2/2 Running 0 2d
pytorch-operator-7547865bd5-fcqzl 1/1 Running 0 2d
seldon-operator-controller-manager-0 1/1 Running 1 2d
spartakus-volunteer-7df8bfcc5c-d5xfl 1/1 Running 0 24h
tensorboard-6544748d94-vwcfj 1/1 Running 0 24h
tf-job-dashboard-cfd947d4b-8289s 1/1 Running 0 2d
tf-job-operator-657cbb8d9c-wjvnk 1/1 Running 0 2d
workflow-controller-db644d554-6s5jg 1/1 Running 0 24h
Check the Istio gateway deployment:
kubectl -n istio-system get pod
NAME READY STATUS RESTARTS AGE
grafana-67c69bb567-fhnp9 1/1 Running 0 2d
istio-citadel-67697b6697-njcfw 1/1 Running 0 24h
istio-egressgateway-7dbbb87698-vn67l 1/1 Running 0 24h
istio-galley-7474d97954-7wmzk 1/1 Running 0 24h
istio-grafana-post-install-1.1.6-5425v 0/1 Completed 0 2d
istio-ingressgateway-565b894b5f-9m4c8 1/1 Running 0 2d
istio-pilot-6dd5b8f74c-2cl24 2/2 Running 0 24h
istio-policy-7f8bb87857-kpwxv 2/2 Running 2 2d
istio-sidecar-injector-fd5875568-cbb87 1/1 Running 0 2d
istio-telemetry-8759dc6b7-qr4p8 2/2 Running 1 2d
istio-tracing-5d8f57c8ff-m8tvc 1/1 Running 0 2d
kiali-d4d886dd7-6smvx 1/1 Running 0 24h
prometheus-d8d46c5b5-6285p 1/1 Running 0 2d
Use port-forward to reach the Kubeflow UI:
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
In a new terminal:
curl http://localhost:8080/
curl: (52) Empty reply from server
The request fails, and the port-forward session logs this error:
E1014 12:10:30.503595 26696 portforward.go:400] an error occurred forwarding 8080 -> 80: error forwarding port 80 to pod 2620eb8aad30d22cd611496c628c04e934e3ad1b33ae8ab640d92e34c712b26f, uid : unable to do port forwarding: socat not found.
This is because socat is not installed. Download socat (on CentOS you can alternatively install the prebuilt package with sudo yum install socat):
wget http://www.dest-unreach.org/socat/download/socat-1.7.3.3.tar.gz
tar -zxvf socat-1.7.3.3.tar.gz
cd socat-1.7.3.3
Building from source fails with the following error:
./configure
checking which defines needed for makedepend...
checking for a BSD-compatible install... /usr/bin/install -c
checking for gcc... no
checking for cc... no
checking for cl.exe... no
configure: error: in `/home/kube/socat-1.7.3.3':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details
The cause is that gcc is not installed:
sudo yum install gcc
Then configure and build again:
./configure
sudo make && sudo make install
Rerun port-forward:
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
Access the UI:
curl http://localhost:8080/
Kubeflow Central Dashboard
The installation and deployment are now complete; the remaining question is how to expose the UI properly through Istio.
Removing Kubeflow
kfctl delete all -V
kubectl delete ns istio-system
The Kubeflow 0.6.2 setup is now fully done, but the learning has only just begun. If you are interested in Kubeflow, you are welcome to join QQ group 526855734.