Setting up Kubeflow (with internet access)

System requirements

k8s: 1.23+   # Kubeflow requires at least 1.22
centos: 8    # if the kernel version is below 3.10, some containers may fail to run
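
The two minimum versions above can be checked before going further. The following is a minimal sketch, assuming GNU `sort -V` is available (it is on CentOS 8); `version_ge` is a helper name made up for this example, not part of any tool used here.

```shell
# version_ge A B: succeeds when version A >= version B.
# Relies on GNU sort -V (version sort); hypothetical helper for this post.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Kernel release with the distro suffix stripped, e.g. 4.18.0-348.el8 -> 4.18.0
kernel_version="$(uname -r | cut -d- -f1)"

if version_ge "$kernel_version" "3.10"; then
    echo "kernel $kernel_version: OK"
else
    echo "kernel $kernel_version: older than 3.10, some containers may not run"
fi
```

The same helper works for the Kubernetes requirement, for example `version_ge 1.23 1.22`.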

1. Install the NVIDIA driver

wget https://cn.download.nvidia.cn/tesla/535.54.03/NVIDIA-Linux-x86_64-535.54.03.run
chmod +x NVIDIA-Linux-x86_64-535.54.03.run
sh NVIDIA-Linux-x86_64-535.54.03.run

2. Install k8s

yum install conntrack socat git -y
curl -sfL https://get-kk.kubesphere.io | VERSION=v3.0.7 sh -
chmod +x ./kk
./kk create cluster --with-kubesphere v3.3.2

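For an all-in-one install, the master node has to be made schedulable. The command is as follows (replace vm-0-14-centos with your own node name):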
kubectl taint nodes vm-0-14-centos node-role.kubernetes.io/master-

3. Install nvidia-docker

distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo \
    | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum-config-manager --enable libnvidia-container-experimental
yum clean expire-cache
yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker

Edit /etc/docker/daemon.json to match the following. The "default-runtime" line is the one to add:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
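
Before restarting Docker it can help to confirm that the edit actually landed in the file. This is a rough sketch that only greps for the key; it does not validate the JSON. `has_nvidia_default_runtime` and the `DAEMON_JSON` variable are names invented for this example.

```shell
# has_nvidia_default_runtime FILE: succeeds when FILE sets
# "default-runtime" to "nvidia". A plain text match, not a JSON parser.
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# DAEMON_JSON is a hypothetical override so the check can be pointed at a copy.
config="${DAEMON_JSON:-/etc/docker/daemon.json}"

if [ -f "$config" ] && has_nvidia_default_runtime "$config"; then
    echo "$config: default runtime is nvidia"
else
    echo "$config: default-runtime is not set to nvidia (or the file is missing)"
fi
```

Once Docker is running again, `docker info | grep -i 'default runtime'` should report the same value.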

Restart Docker and test

systemctl daemon-reload 
systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

4. Install k8s-device-plugin

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
kubectl label nodes vm-0-14-centos accelerator=t4
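
To see whether the device plugin and the accelerator=t4 label work together, one option is a throwaway pod that requests a GPU and runs nvidia-smi. The sketch below only writes the manifest; the pod name gpu-smoke-test, the output path, and the `write_gpu_test_pod` helper are assumptions made for this example, while the image is the same one used in the Docker test above.

```shell
# write_gpu_test_pod FILE: writes a one-shot pod manifest that requests
# one GPU and is pinned to nodes labelled accelerator=t4.
write_gpu_test_pod() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    accelerator: t4           # the label applied to the node above
  containers:
    - name: cuda
      image: nvidia/cuda:11.0.3-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # resource advertised by the device plugin
EOF
}

out="${TMPDIR:-/tmp}/gpu-smoke-test.yaml"
write_gpu_test_pod "$out"
echo "wrote $out; apply with: kubectl apply -f $out"
```

If the pod is scheduled and completes, `kubectl logs gpu-smoke-test` should show the usual nvidia-smi table. A pod stuck in Pending usually means the node is not advertising nvidia.com/gpu yet.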

5. Install Kubeflow

git clone https://github.com/kubeflow/manifests.git
curl -s "https://raw.githubusercontent.com/\
kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash
mv kustomize /usr/bin
cd manifests
while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
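
The loop above only ends once every resource has been applied, but the pods can still take a while to come up afterwards. Here is a small sketch for counting pods that are not yet healthy; `count_not_ready` is a helper invented for this example, and it assumes the column layout of `kubectl get pods -A --no-headers` (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# count_not_ready: reads `kubectl get pods -A --no-headers` output on stdin
# and prints how many pods are neither Running nor Completed.
# The STATUS value is the 4th whitespace-separated column.
count_not_ready() {
    awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

if command -v kubectl >/dev/null 2>&1; then
    kubectl get pods -A --no-headers 2>/dev/null | count_not_ready
else
    echo "kubectl not found; skipping pod check"
fi
```

Rerunning this every minute or so until it prints 0 is a simple way to tell when the deployment has settled.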
