Installing Kubeflow is fairly straightforward when the machines can reach the public internet, and much harder when they cannot. Below we first install with internet access, and will later attempt an installation inside a LAN (corporate network). The machines here are Tencent Cloud servers: two spot instances in the Hong Kong region running CentOS 7.6, each with 8 cores, 16 GB RAM, a 300 GB root disk, and 1 Mbps bandwidth-billed networking; together the two machines cost ¥1.23 per hour. The security group opens all ports, and login uses a custom password.
First, install Docker on both machines:
yum -y install yum-utils && \
yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo && \
yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm && \
yum install docker-ce -y
systemctl --now enable docker
yum install -y yum-utils device-mapper-persistent-data lvm2 git
If network problems prevent installing via yum, download a tar.gz from https://download.docker.com/linux/static/stable/x86_64/ instead, extract it, copy everything under the docker/ directory into /usr/bin, and create the unit file /etc/systemd/system/docker.service with the following content:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/bin/dockerd --selinux-enabled=false
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
Run systemctl daemon-reload && systemctl restart docker to finish the Docker installation.
PS: By default Docker stores image data under /var; if that filesystem runs low on space, images may get removed automatically (kubelet garbage-collects unused images under disk pressure). You can pass --graph=/root/images to dockerd to customize where image data is stored.
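Note that --graph is deprecated in newer Docker releases in favor of --data-root; the same thing can be set in /etc/docker/daemon.json (the /root/images path is only an example):
{ "data-root": "/root/images" }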
While using Kubeflow we need a private image registry to hold the images generated along the way. Here we deploy Harbor as that registry (a plain Docker registry would also work); this only needs to be done on one machine.
Install docker-compose:
wget https://github.com/docker/compose/releases/download/1.29.2/docker-compose-Linux-x86_64
mv docker-compose-Linux-x86_64 docker-compose
chmod +x docker-compose
mv docker-compose /usr/bin/
Run docker-compose version to verify the installation.
wget https://github.com/goharbor/harbor/releases/download/v1.9.2/harbor-offline-installer-v1.9.2.tgz
tar -zxvf harbor-offline-installer-v1.9.2.tgz
cd harbor
Edit harbor.yml (since 1.8, Harbor names its config file harbor.yml), changing hostname and port, then run the installer:
vim harbor.yml
./install.sh
Now http://43.128.14.116:86/ opens the Harbor UI (43.128.14.116 is the server's public IP, 172.19.0.14 its private IP). The default account is admin with password Harbor12345. After logging in, create a public project named kubeflow to serve as our private repository.
Edit the Docker config file /etc/docker/daemon.json, add the following content, then restart Docker:
{ "insecure-registries": ["172.19.0.14:86"] }
docker login -u admin -p Harbor12345 172.19.0.14:86
Harbor is now up; if it ever needs a restart, run docker-compose up -d from the harbor directory.
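As a quick sanity check (the image and tag below are arbitrary examples; any small image works), push something into the kubeflow project:
docker pull hello-world
docker tag hello-world 172.19.0.14:86/kubeflow/hello-world:v1
docker push 172.19.0.14:86/kubeflow/hello-world:v1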
We use kubeadm to install Kubernetes. Kubeflow 1.3 requires k8s 1.15 at minimum and recommends 1.17+; here we install version 1.19.
sudo setenforce 0 # disable SELinux
sudo swapoff -a # disable swap
# Enable IPVS
modprobe br_netfilter
cat > /etc/sysconfig/modules/ipvs.modules << EOF
#!/bin/bash
modprobe -- ip_vs
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- ip_vs_sh
modprobe -- nf_conntrack_ipv4
EOF
chmod 755 /etc/sysconfig/modules/ipvs.modules && bash /etc/sysconfig/modules/ipvs.modules && lsmod | grep -e ip_vs
# Update the yum repo: create /etc/yum.repos.d/kubernetes.repo (on all nodes)
cat > /etc/yum.repos.d/kubernetes.repo << EOF
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF
yum makecache fast -y
# Adjust iptables bridge settings
echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
echo 1 > /proc/sys/net/ipv4/ip_forward
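These echo settings do not survive a reboot; to persist them, a standard approach is to drop them into /etc/sysctl.d:
cat > /etc/sysctl.d/k8s.conf << EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system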
# Add the following environment variables to /etc/profile
export KUBECONFIG=/etc/kubernetes/admin.conf
export GODEBUG=x509ignoreCN=0
source /etc/profile
yum install -y kubelet-1.19.1 kubeadm-1.19.1 kubectl-1.19.1
systemctl enable kubelet && systemctl restart kubelet
Pick one machine as the k8s cluster master and initialize it:
kubeadm config print init-defaults > kubeadm-config.yaml
Edit kubeadm-config.yaml: change advertiseAddress to this machine's private IP, and add the following under the networking section
podSubnet: 10.244.0.0/16
kubeadm init --config=kubeadm-config.yaml --upload-certs |tee kubeadm-init.log
On the other machine, run the kubeadm join command to join the cluster; then configure the k8s cluster.
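The exact join command, with a fresh token and CA cert hash, is printed at the end of kubeadm init (and saved in kubeadm-init.log above); it has this general shape (placeholders, not real values):
kubeadm join 172.19.0.14:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>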
# Configure flannel
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl apply -f kube-flannel.yml
# Restart the kube-proxy pods
kubectl get pod -n kube-system | grep kube-proxy |awk '{system("kubectl delete pod "$1" -n kube-system")}'
# Allow the master to schedule workloads like a regular node (vm-0-14-centos is the master's hostname; substitute your own)
kubectl taint node vm-0-14-centos node-role.kubernetes.io/master-
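Confirm that both nodes have registered and reach Ready state (node names will match your machines' hostnames):
kubectl get nodes -o wide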
Next, modify the k8s static pod manifests in /etc/kubernetes/manifests/.
In kube-apiserver.yaml, add the following flags (they enable service account token projection, which Kubeflow 1.3 components rely on):
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
- --service-account-issuer=kubernetes.default.svc
In kube-controller-manager.yaml and kube-scheduler.yaml, comment out the - --port=0 line.
Restart kubelet: systemctl restart kubelet
Before installing Kubeflow, set up NFS and create PVs: components such as MySQL and Katib request PVCs during installation, and without pre-created PVs those pods will sit in Pending state.
Pick one machine as the NFS server and run yum install -y nfs-utils rpcbind
# Create the shared directories
mkdir -p /root/nfs-kubeflow/v{1,2,3,4,5}
# Configure NFS
vim /etc/exports
# Add the following line: /root/nfs-kubeflow *(insecure,rw,no_root_squash,no_all_squash,sync)
# Apply the export configuration
exportfs -r
# Start the rpcbind and nfs services
service rpcbind start
service nfs start
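To confirm the share is exported correctly (172.19.0.14 is the NFS server's private IP used throughout this post):
showmount -e 172.19.0.14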
Create pv.yaml with the following content (be sure to replace path and server with your own values):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv001
  labels:
    name: pv001
spec:
  nfs:
    path: /root/nfs-kubeflow/v1
    server: 172.19.0.14
  accessModes: ["ReadWriteMany", "ReadWriteOnce"]
  capacity:
    storage: 15Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv002
  labels:
    name: pv002
spec:
  nfs:
    path: /root/nfs-kubeflow/v2
    server: 172.19.0.14
  accessModes: ["ReadWriteMany", "ReadWriteOnce"]
  capacity:
    storage: 25Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv003
  labels:
    name: pv003
spec:
  nfs:
    path: /root/nfs-kubeflow/v3
    server: 172.19.0.14
  accessModes: ["ReadWriteMany", "ReadWriteOnce"]
  capacity:
    storage: 25Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv004
  labels:
    name: pv004
spec:
  nfs:
    path: /root/nfs-kubeflow/v4
    server: 172.19.0.14
  accessModes: ["ReadWriteMany", "ReadWriteOnce"]
  capacity:
    storage: 25Gi
Run kubectl apply -f pv.yaml to create the PVs.
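Verify that the four volumes show up with status Available:
kubectl get pv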
wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.3.0.zip
unzip v1.3.0.zip
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
mv kustomize_3.2.0_linux_amd64 kustomize
chmod +x kustomize
mv kustomize /usr/bin/
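Run kustomize version to verify the install; note that the Kubeflow 1.3 manifests are built against kustomize 3.2.0 specifically, and newer 4.x releases are known to break the build.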
cd manifests-1.3.0/
# Install Istio
kustomize build common/istio-1-9-0/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-9-0/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-9-0/istio-install/base | kubectl apply -f -
# Install cert-manager
kustomize build common/cert-manager/cert-manager-kube-system-resources/base | kubectl apply -f -
kustomize build common/cert-manager/cert-manager-crds/base | kubectl apply -f -
kustomize build common/cert-manager/cert-manager/overlays/self-signed | kubectl apply -f -
# Install Dex
kustomize build common/dex/overlays/istio | kubectl apply -f -
# Install the OIDC AuthService
kustomize build common/oidc-authservice/base | kubectl apply -f -
# Install Knative Serving
kustomize build common/knative/knative-serving-crds/base | kubectl apply -f -
kustomize build common/knative/knative-serving-install/base | kubectl apply -f -
kustomize build common/istio-1-9-0/cluster-local-gateway/base | kubectl apply -f -
# Install Knative Eventing
kustomize build common/knative/knative-eventing-crds/base | kubectl apply -f -
kustomize build common/knative/knative-eventing-install/base | kubectl apply -f -
# Create the kubeflow namespace
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
# Create the Kubeflow roles
kustomize build common/kubeflow-roles/base | kubectl apply -f -
# Create the Kubeflow Istio resources
kustomize build common/istio-1-9-0/kubeflow-istio-resources/base | kubectl apply -f -
# Install Pipelines (multi-user)
kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl apply -f -
# Install KFServing
kustomize build apps/kfserving/upstream/overlays/kubeflow | kubectl apply -f -
# Install Katib
kustomize build apps/katib/upstream/installs/katib-with-kubeflow | kubectl apply -f -
# Install the Kubeflow central dashboard
kustomize build apps/centraldashboard/upstream/overlays/istio | kubectl apply -f -
# Install the admission webhook
kustomize build apps/admission-webhook/upstream/overlays/cert-manager | kubectl apply -f -
# Install the notebook controller
kustomize build apps/jupyter/notebook-controller/upstream/overlays/kubeflow | kubectl apply -f -
# Install the Jupyter web app
kustomize build apps/jupyter/jupyter-web-app/upstream/overlays/istio | kubectl apply -f -
# Install profiles + KFAM
kustomize build apps/profiles/upstream/overlays/kubeflow | kubectl apply -f -
# Install the volumes web app
kustomize build apps/volumes-web-app/upstream/overlays/istio | kubectl apply -f -
# Install TensorBoard
kustomize build apps/tensorboard/tensorboards-web-app/upstream/overlays/istio | kubectl apply -f -
kustomize build apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow | kubectl apply -f -
# Install the training operators
kustomize build apps/tf-training/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/pytorch-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/mpi-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/mxnet-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/xgboost-job/upstream/overlays/kubeflow | kubectl apply -f -
# Create the default user namespace
kustomize build common/user-namespace/base | kubectl apply -f -
Wait until every pod reaches Running state.
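Progress can be watched with:
kubectl get pods -A
Re-run it (or add -w) until no pod is left Pending or ContainerCreating; the first startup pulls a large number of images and can take quite a while.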
To expose the dashboard on a fixed NodePort (30000), create kubeflow-ui-nodeport.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: istio-ingressgateway
    install.operator.istio.io/owning-resource: unknown
    istio: ingressgateway
    istio.io/rev: default
    operator.istio.io/component: IngressGateways
    release: istio
  name: istio-ingressgateway
  namespace: istio-system
spec:
  ports:
  - name: status-port
    port: 15021
    protocol: TCP
    targetPort: 15021
  - name: http2
    port: 80
    protocol: TCP
    targetPort: 8080
    nodePort: 30000
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8443
  - name: tcp
    port: 31400
    protocol: TCP
    targetPort: 31400
  - name: tls
    port: 15443
    protocol: TCP
    targetPort: 15443
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  type: NodePort
Run kubectl apply -f kubeflow-ui-nodeport.yaml
That completes the Kubeflow 1.3 deployment; it is reachable at http://ip:30000. The default account is [email protected] with password 12341234.
By default, creating a notebook server will fail with an error, because no default StorageClass is configured.
Create storage-nfs.yaml, making sure to replace the NFS_SERVER and NFS_PATH values:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-client-provisioner
---
kind: Deployment
apiVersion: apps/v1  # extensions/v1beta1 no longer exists on k8s 1.19
metadata:
  name: nfs-provisioner
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: nfs-provisioner
  template:
    metadata:
      labels:
        app: nfs-provisioner
    spec:
      serviceAccountName: nfs-client-provisioner
      containers:
      - name: nfs-provisioner
        image: registry.cn-hangzhou.aliyuncs.com/open-ali/nfs-client-provisioner
        volumeMounts:
        - name: nfs-client-root
          mountPath: /persistentvolumes
        env:
        - name: PROVISIONER_NAME
          value: kubeflow/nfs
        - name: NFS_SERVER
          value: 172.19.0.14
        - name: NFS_PATH
          value: /root/nfs-kubeflow/v5
      volumes:
      - name: nfs-client-root
        nfs:
          server: 172.19.0.14
          path: /root/nfs-kubeflow/v5
Create storage-rbac.yaml (the namespace in the ClusterRoleBinding subject must match the namespace the ServiceAccount above was created in):
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: run-nfs-client-provisioner
subjects:
- kind: ServiceAccount
  name: nfs-client-provisioner
  namespace: kubeflow-user-example-com
roleRef:
  kind: ClusterRole
  name: nfs-client-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: nfs-client-provisioner-runner
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["list", "watch", "create", "update", "patch"]
Create storage-class.yaml (the provisioner must match the PROVISIONER_NAME set above):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kubeflow-nfs-storage
provisioner: kubeflow/nfs
Then run the following commands in order:
kubectl apply -f storage-nfs.yaml
kubectl apply -f storage-rbac.yaml
kubectl apply -f storage-class.yaml
kubectl patch storageclass kubeflow-nfs-storage -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
After refreshing the page, the error is gone.
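You can double-check that the new class is marked as default:
kubectl get storageclass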
If the service is exposed to anything other than localhost via NodePort / LoadBalancer / Ingress, HTTPS is mandatory; over plain HTTP the main page opens, but notebooks, SSH connections, and so on all fail. Create the following script, create_self-signed-cert.sh, to generate a self-signed SSL certificate:
#!/bin/bash -e
help ()
{
    echo ' ================================================================ '
    echo ' --ssl-domain: the main domain for the SSL cert; defaults to www.rancher.local and can be ignored when the server is accessed by IP;'
    echo ' --ssl-trusted-ip: SSL certs normally only trust domain names; to access the server by IP, add trusted IPs to the cert, comma-separated;'
    echo ' --ssl-trusted-domain: to allow access via additional domains, add trusted domains (SSL_TRUSTED_DOMAIN), comma-separated;'
    echo ' --ssl-size: SSL key size in bits, default 2048;'
    echo ' --ssl-cn: two-letter country code, default CN;'
    echo ' usage example:'
    echo ' ./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \ '
    echo ' --ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650'
    echo ' ================================================================'
}
case "$1" in
-h|--help) help; exit;;
esac
if [[ $1 == '' ]];then
help;
exit;
fi
CMDOPTS="$*"
for OPTS in $CMDOPTS;
do
key=$(echo ${OPTS} | awk -F"=" '{print $1}' )
value=$(echo ${OPTS} | awk -F"=" '{print $2}' )
case "$key" in
--ssl-domain) SSL_DOMAIN=$value ;;
--ssl-trusted-ip) SSL_TRUSTED_IP=$value ;;
--ssl-trusted-domain) SSL_TRUSTED_DOMAIN=$value ;;
--ssl-size) SSL_SIZE=$value ;;
--ssl-date) SSL_DATE=$value ;;
--ca-date) CA_DATE=$value ;;
--ssl-cn) CN=$value ;;
esac
done
# CA settings
CA_DATE=${CA_DATE:-3650}
CA_KEY=${CA_KEY:-cakey.pem}
CA_CERT=${CA_CERT:-cacerts.pem}
CA_DOMAIN=cattle-ca
# SSL settings
SSL_CONFIG=${SSL_CONFIG:-$PWD/openssl.cnf}
SSL_DOMAIN=${SSL_DOMAIN:-'www.rancher.local'}
SSL_DATE=${SSL_DATE:-3650}
SSL_SIZE=${SSL_SIZE:-2048}
## two-letter country code, default CN
CN=${CN:-CN}
SSL_KEY=$SSL_DOMAIN.key
SSL_CSR=$SSL_DOMAIN.csr
SSL_CERT=$SSL_DOMAIN.crt
echo -e "\033[32m ---------------------------- \033[0m"
echo -e "\033[32m | Generating SSL cert | \033[0m"
echo -e "\033[32m ---------------------------- \033[0m"
if [[ -e ./${CA_KEY} ]]; then
    echo -e "\033[32m ====> 1. Existing CA key found; backing up ${CA_KEY} to ${CA_KEY}-bak and recreating \033[0m"
    mv ${CA_KEY} "${CA_KEY}"-bak
    openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
else
    echo -e "\033[32m ====> 1. Generating new CA key ${CA_KEY} \033[0m"
    openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
fi
if [[ -e ./${CA_CERT} ]]; then
    echo -e "\033[32m ====> 2. Existing CA cert found; backing up ${CA_CERT} to ${CA_CERT}-bak and recreating \033[0m"
    mv ${CA_CERT} "${CA_CERT}"-bak
    openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
else
    echo -e "\033[32m ====> 2. Generating new CA cert ${CA_CERT} \033[0m"
    openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
fi
echo -e "\033[32m ====> 3. Generating openssl config ${SSL_CONFIG} \033[0m"
cat > ${SSL_CONFIG} <<EOM
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOM
if [[ -n ${SSL_TRUSTED_IP} || -n ${SSL_TRUSTED_DOMAIN} ]]; then
    cat >> ${SSL_CONFIG} <<EOM
subjectAltName = @alt_names
[alt_names]
EOM
    IFS=","
    dns=(${SSL_TRUSTED_DOMAIN})
    dns+=(${SSL_DOMAIN})
    for i in "${!dns[@]}"; do
        echo DNS.$((i+1)) = ${dns[$i]} >> ${SSL_CONFIG}
    done
    if [[ -n ${SSL_TRUSTED_IP} ]]; then
        ip=(${SSL_TRUSTED_IP})
        for i in "${!ip[@]}"; do
            echo IP.$((i+1)) = ${ip[$i]} >> ${SSL_CONFIG}
        done
    fi
fi
echo -e "\033[32m ====> 4. 生成服务SSL KEY ${SSL_KEY} \033[0m"
openssl genrsa -out ${SSL_KEY} ${SSL_SIZE}
echo -e "\033[32m ====> 5. 生成服务SSL CSR ${SSL_CSR} \033[0m"
openssl req -sha256 -new -key ${SSL_KEY} -out ${SSL_CSR} -subj "/C=${CN}/CN=${SSL_DOMAIN}" -config ${SSL_CONFIG}
echo -e "\033[32m ====> 6. 生成服务SSL CERT ${SSL_CERT} \033[0m"
openssl x509 -sha256 -req -in ${SSL_CSR} -CA ${CA_CERT} \
-CAkey ${CA_KEY} -CAcreateserial -out ${SSL_CERT} \
-days ${SSL_DATE} -extensions v3_req \
-extfile ${SSL_CONFIG}
echo -e "\033[32m ====> 7. 证书制作完成 \033[0m"
echo
echo -e "\033[32m ====> 8. 以YAML格式输出结果 \033[0m"
echo "----------------------------------------------------------"
echo "ca_key: |"
cat $CA_KEY | sed 's/^/ /'
echo
echo "ca_cert: |"
cat $CA_CERT | sed 's/^/ /'
echo
echo "ssl_key: |"
cat $SSL_KEY | sed 's/^/ /'
echo
echo "ssl_csr: |"
cat $SSL_CSR | sed 's/^/ /'
echo
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo
echo -e "\033[32m ====> 9. 附加CA证书到Cert文件 \033[0m"
cat ${CA_CERT} >> ${SSL_CERT}
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo
echo -e "\033[32m ====> 10. 重命名服务证书 \033[0m"
echo "cp ${SSL_DOMAIN}.key tls.key"
cp ${SSL_DOMAIN}.key tls.key
echo "cp ${SSL_DOMAIN}.crt tls.crt"
cp ${SSL_DOMAIN}.crt tls.crt
chmod +x create_self-signed-cert.sh
./create_self-signed-cert.sh --ssl-domain=kubeflow.cn
The generated key and cert land in the working directory (assumed here to be /root/ssl). Create a TLS secret from them:
kubectl create --namespace istio-system secret tls kf-tls-cert --key /root/ssl/kubeflow.cn.key --cert /root/ssl/kubeflow.cn.crt
kubectl edit cm config-domain --namespace knative-serving
# Under data, add: kubeflow.cn: ""
Create kubeflow-https.yaml:
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kubeflow-gateway
  namespace: kubeflow
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
  - hosts:
    - '*'
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kf-tls-cert
Run kubectl apply -f kubeflow-https.yaml
Run kubectl -n istio-system get service istio-ingressgateway to look up the HTTPS NodePort; here it is 32143.
The dashboard can now be reached over HTTPS at https://ip:32143.
After the Kubeflow installation, only the [email protected] account can log in by default. Below we set up multiple accounts; once finished, each account has its own namespace and profile.
First create the users' Profiles. The following YAML creates two Profiles, with namespaces test1 and test2:
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: test1
spec:
  owner:
    kind: User
    name: [email protected]
---
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: test2
spec:
  owner:
    kind: User
    name: [email protected]
Run kubectl get configmap dex -n auth -o jsonpath='{.data.config\.yaml}' > dex-yaml.yaml to dump Dex's user configuration into dex-yaml.yaml (the [email protected] entry in it was created at deployment time). Edit the file and add sections for test1 and test2, as sketched below.
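A sketch of the entries to add under staticPasswords (field names follow Dex's static password connector; hash is a bcrypt hash of the password and userID just needs to be unique — one way to generate a hash is htpasswd -bnBC 10 "" 12341234 | tr -d ':\n'):
staticPasswords:
- email: [email protected]
  hash: <bcrypt hash of the password>
  username: test1
  userID: test1
- email: [email protected]
  hash: <bcrypt hash of the password>
  username: test2
  userID: test2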
Run kubectl create configmap dex --from-file=config.yaml=dex-yaml.yaml -n auth --dry-run=client -o yaml | kubectl apply -f - to apply the new users.
Run kubectl rollout restart deployment dex -n auth to restart Dex. That completes the multi-user setup.