Reference:
Install Kubeflow v1.3
Note: to install locally, you only need to install MicroK8s and enable the Kubeflow add-on.
This guide lists the steps required to install Kubeflow on any conformant Kubernetes (including AKS, EKS, GKE, OpenShift, and any kubeadm-deployed cluster), provided you can access it via kubectl.
On Linux, install juju via snap with the following command:
snap install juju --classic
Alternatively, on macOS: brew install juju
or download the Windows installer.
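To confirm the client is installed, you can check its version:
juju version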
To operate workloads in your Kubernetes cluster with Juju, you must add the cluster to Juju's list of clouds via the add-k8s command.
If your Kubernetes config file is in the standard location (~/.kube/config on Linux) and you have only one cluster, you can simply run:
juju add-k8s myk8s
Note: follow the earlier post on installing Portainer to obtain the config file and install OpenEBS.
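As a sketch of the kubeconfig half of that note (assuming a Charmed Kubernetes deployment as in the earlier posts; the unit name kubernetes-master/0 is an assumption, check juju status for yours):
mkdir -p ~/.kube
juju scp kubernetes-master/0:config ~/.kube/config  # copy the cluster's kubeconfig to the Juju client machine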
If your kubectl config file contains multiple clusters, you can specify the appropriate one by name:
juju add-k8s myk8s --cluster-name=foo
Finally, to use a different config file, you can set the KUBECONFIG environment variable to point to the relevant file. For example:
KUBECONFIG=path/to/file juju add-k8s myk8s
See the Juju documentation for more details.
Juju uses a controller to run workloads on a Kubernetes cluster. You can create a controller with the bootstrap command:
juju bootstrap myk8s my-controller
This command will create a couple of pods under the my-controller namespace. You can view your controllers with the juju controllers command.
You can read more about controllers in the Juju documentation.
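To see those pods directly with kubectl (Juju typically prefixes the namespace with controller-; adjust if yours differs):
kubectl get pods -n controller-my-controller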
A model in Juju is a blank canvas onto which your operators are deployed, and it maps 1:1 to a Kubernetes namespace.
You can create a model and give it a name, e.g. kubeflow, with the add-model command; this also creates a Kubernetes namespace of the same name:
juju add-model kubeflow
You can list your models with the juju models command.
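Since the model maps 1:1 to a namespace, you can confirm it exists on the cluster:
kubectl get namespace kubeflow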
Requirements:
The minimum resources needed to deploy kubeflow are 50 GB of disk space, 14 GB of RAM, and 2 CPUs available to a Linux machine or VM.
If you have fewer resources, deploy kubeflow-lite or kubeflow-edge instead.
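A quick sanity check of the available resources on the target machine, as a sketch:
nproc    # CPU count, need at least 2
free -g  # RAM in GB, need at least 14 for the full bundle
df -h /  # free disk, need at least 50 GB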
Once you have a model, you can simply juju deploy any of the provided Kubeflow bundles into your cluster, prefixing the name with cs:.
For example, for the Kubeflow lite bundle, run:
juju deploy cs:kubeflow-lite
Congratulations, Kubeflow is installing!
You can watch your Kubeflow deployment come up with:
watch -c juju status --color
The final step to enable access to the Kubeflow dashboard is to provide the dashboard's public URL to dex-auth and oidc-gatekeeper via the following commands:
juju config dex-auth public-url=http://<URL>
juju config oidc-gatekeeper public-url=http://<URL>
where <URL> is the hostname that the Kubeflow dashboard responds to. For example, in a typical MicroK8s installation this URL is http://10.64.140.43.nip.io. Note that when you set up DNS, you should use the resolvable address used by istio-ingressgateway.
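If you are unsure of the right hostname, the external address can usually be read off the ingress gateway's service; the service name and namespace below are assumptions based on the charm name:
kubectl get svc -n kubeflow istio-ingressgateway   # look up the ingress address
# then, using e.g. the 10.64.140.43 address from above:
juju config dex-auth public-url=http://10.64.140.43.nip.io
juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io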
Currently, to set up Kubeflow and Istio correctly with RBAC enabled, you need to grant the istio-ingressgateway operator access to Kubernetes resources. The following command creates the appropriate role:
kubectl patch role -n kubeflow istio-ingressgateway-operator -p '{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"name":"istio-ingressgateway-operator"},"rules":[{"apiGroups":["*"],"resources":["*"],"verbs":["*"]}]}'
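You can verify that the patch took effect with:
kubectl get role -n kubeflow istio-ingressgateway-operator -o yaml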
Having problems?
If you run into any difficulties following these instructions, please open an issue [here](https://github.com/juju-solutions/bundle-kubeflow/issues).
Here is how it actually went:
Since the Kubernetes config file was in the standard location (~/.kube/config on Linux) and there was only one cluster, the cluster was added as a cloud and a controller was bootstrapped with:
juju add-k8s myk8s
juju bootstrap myk8s my-controller --debug
Unfortunately, the following error occurred:
ERROR juju.cmd.juju.commands bootstrap.go:883 failed to bootstrap model: creating controller stack: creating statefulset for controller: timed out waiting for controller pod: pending: -
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:884 (error details: [{/build/snapcraft-juju-35d6cf/parts/juju/src/cmd/juju/commands/bootstrap.go:983: failed to bootstrap model} {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:667: } {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:298: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/k8s.go:493: creating controller stack} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:502: creating statefulset for controller} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:917: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1051: timed out waiting for controller pod} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1008: pending: - }])
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:1634 cleaning up after failed bootstrap
The suspicion was that Kubernetes had been deployed from the standard Charmed Kubernetes #679 bundle, whose kubernetes-worker machines are configured with cores=4 mem=4G root-disk=16G, and that the disk was too small, causing the failure. So three new worker nodes with 100 GB disks were deployed instead.
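Before rebuilding, the suspicion can be checked directly (assuming the Charmed Kubernetes model from the earlier posts):
juju run --unit kubernetes-worker/0 'df -h /'  # free disk on the worker
kubectl describe nodes | grep -i diskpressure  # the kubelet's own verdict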
Remedy:
1 First, deploy three VMs on MAAS with "cores=4 mem=4G root-disk=100G" # 8 GB of RAM is advisable instead, since the Portainer setup from the earlier post puts fairly high memory demands on the cluster.
2 Following the earlier post "Deploying k8s with juju+maas on Ubuntu 20.04, part 9: scaling nodes":
2.1 Pause the node kubernetes-worker/0 with the 16 GB disk:
juju run-action kubernetes-worker/0 pause --wait
2.2 Remove this unit:
juju remove-unit kubernetes-worker/0
2.3 Add a 100 GB unit:
juju add-unit kubernetes-worker
2.4 Repeat the steps above twice (a scripted sketch follows).
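A scripted sketch of the rotation; the unit numbers are assumptions, so check juju status kubernetes-worker for the real names, and let each new unit settle before removing the next:
for unit in kubernetes-worker/1 kubernetes-worker/2; do
  juju run-action $unit pause --wait  # drain the old worker
  juju remove-unit $unit              # remove it from the cluster
  juju add-unit kubernetes-worker     # add a replacement with the new 100 GB profile
done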
3 Bootstrap the controller again:
juju bootstrap myk8s my-controller --debug
Success.
juju add-model kubeflow
juju deploy kubeflow --debug
After the deployment has run for a while, checking the status will show many errors. Don't worry: they are caused by images failing to download over the congested international link. The link is usually idle between roughly 2:00 and 7:00 AM, so it is advisable to schedule the installation with the at command; that way it installs fairly smoothly and you avoid re-running juju deploy kubeflow --debug over and over.
at 3:00 # schedule a job for 3:00 AM
> juju deploy kubeflow --debug
Ctrl+D # save the job
at -l # list the queued jobs (at -c <job> prints a job's contents)
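A non-interactive equivalent, if you prefer not to type into the at prompt:
echo 'juju deploy kubeflow --debug' | at 3:00
atq  # same as at -l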
In the afternoon the status looked roughly like this:
juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow my-controller myk8s 2.9.14 unsupported 14:00:40+08:00
App Version Status Scale Charm Store Channel Rev OS Address Message
admission-webhook res:oci-image@1abb127 active 1 admission-webhook charmstore stable 10 kubernetes 10.152.183.253
argo-controller res:oci-image@c1746ae waiting 1 argo-controller charmstore stable 51 kubernetes
dex-auth res:oci-image@af9c1b3 active 1 dex-auth charmstore stable 60 kubernetes 10.152.183.133
istio-ingressgateway waiting 1 istio-ingressgateway charmstore stable 20 kubernetes Waiting for istio-pilot relation data
istio-pilot res:oci-image@e3e03b3 waiting 1 istio-pilot charmstore stable 20 kubernetes 10.152.183.223
jupyter-controller res:oci-image@8c7be42 active 1 jupyter-controller charmstore stable 56 kubernetes
jupyter-ui res:oci-image@af3b8ce active 1 jupyter-ui charmstore stable 10 kubernetes 10.152.183.134
kfp-api res:oci-image@8e60840 waiting 1 kfp-api charmstore stable 12 kubernetes 10.152.183.121
kfp-db mariadb/server:10.3 active 1 mariadb-k8s charmstore stable 35 kubernetes 10.152.183.137
kfp-persistence res:oci-image@9338d08 waiting 1 kfp-persistence charmstore stable 9 kubernetes
kfp-schedwf res:oci-image@4ab6488 waiting 1 kfp-schedwf charmstore stable 9 kubernetes
kfp-ui res:oci-image@04a4348 waiting 1 kfp-ui charmstore stable 12 kubernetes 10.152.183.153
kfp-viewer res:oci-image@bae62bf active 1 kfp-viewer charmstore stable 9 kubernetes
kfp-viz res:oci-image@c90a581 waiting 1 kfp-viz charmstore stable 8 kubernetes 10.152.183.233
kubeflow-dashboard res:oci-image@126c9a9 waiting 1 kubeflow-dashboard charmstore stable 56 kubernetes 10.152.183.32
kubeflow-profiles res:profile-image@582b8eb active 1 kubeflow-profiles charmstore stable 52 kubernetes 10.152.183.182
kubeflow-volumes res:oci-image@a325e90 active 1 kubeflow-volumes charmstore stable 0 kubernetes 10.152.183.164
minio res:oci-image@4707912 waiting 1 minio charmstore stable 55 kubernetes 10.152.183.215
mlmd res:oci-image@78eb66d active 1 mlmd charmstore stable 5 kubernetes 10.152.183.46
oidc-gatekeeper res:oci-image@9bb01f7 active 1 oidc-gatekeeper charmstore stable 54 kubernetes 10.152.183.183
pytorch-operator res:oci-image@08c3373 waiting 1 pytorch-operator charmstore stable 53 kubernetes
seldon-controller-manager res:oci-image@82fd029 active 1 seldon-core charmstore stable 50 kubernetes 10.152.183.113
tfjob-operator res:oci-image@3fabaf3 active 1 tfjob-operator charmstore stable 1 kubernetes
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.1.44.32 443/TCP
argo-controller/0* error idle 10.1.20.90 OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/argo-charmers/argo-controller/oci-image@sha256:c1746aec607fac57e7e5006329b58c7a566f042c5bf0cf3cbae192adc5b06bb5": failed commit on ref "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64": "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64" failed size validation: 3920502 != 24609777: failed precondition
dex-auth/0* active idle 10.1.20.39 5556/TCP
istio-ingressgateway/0* waiting idle Waiting for istio-pilot relation data
istio-pilot/0* error idle 10.1.20.62 8080/TCP,15010/TCP,15012/TCP,15017/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/istio-charmers/istio-pilot/oci-image@sha256:e3e03b31cebfc4c73d4788b83af3339685673970a5c3bf3167db399d39696ed8": failed commit on ref "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e": "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e" failed size validation: 3806023 != 29905420: failed precondition
jupyter-controller/0* active idle 10.1.20.48
jupyter-ui/0* active idle 10.1.20.50 5000/TCP
kfp-api/0* error idle 10.1.20.93 8888/TCP,8887/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-api/oci-image@sha256:8e608409f50a332e787923dda2ea4eb5c9f0839a4c9ff3f77d535efa03eac9e9": failed commit on ref "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446": "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446" failed size validation: 5028131 != 45380216: failed precondition
kfp-db/0* active idle 10.1.20.63 3306/TCP ready
kfp-persistence/0* error idle 10.1.20.91 crash loop backoff: back-off 5m0s restarting failed container=ml-pipeline-persistenceagent pod=kfp-persistence-864dc895d5-xshwz_kubeflow(384f9b66-dcc5-45c4-9f23-83592dfbc228)
kfp-schedwf/0* error idle 10.1.20.57 OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-schedwf/oci-image@sha256:4ab648890dad76ea51fdfb432d95992136127340b832e3d345207b839c6db23e": failed commit on ref "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c": "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c" failed size validation: 3804543 != 21611777: failed precondition
kfp-ui/0* error idle 10.1.20.92 3000/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-ui/oci-image@sha256:04a4348d6b2ec8142cc0a1dd45f738b719fef7cca5c2585ec5b935d43eab1aa8": failed commit on ref "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f": "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f" failed size validation: 4635725 != 28057227: failed precondition
kfp-viewer/0* active idle 10.1.20.65
kfp-viz/0* error idle 10.1.20.68 8888/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-viz/oci-image@sha256:c90a5818043da47448c4230953b265a66877bd143e4bdd991f762cf47e2a16d6": failed commit on ref "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf": "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf" failed size validation: 3811546 != 3978030: failed precondition
kubeflow-dashboard/0* error idle 10.1.20.95 8082/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kubeflow-dashboard/oci-image@sha256:126c9a9f0b56c9eaa614cc24f1989f9aa2d47e9cfdce70373f5ce0937a7820e2": failed commit on ref "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231": "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231" failed size validation: 4150633 != 29259154: failed precondition
kubeflow-profiles/0* active idle 10.1.20.71 8080/TCP,8081/TCP
kubeflow-volumes/0* active idle 10.1.20.72 5000/TCP
minio/0* error idle 10.1.20.77 9000/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/minio-charmers/minio/oci-image@sha256:4707912566436c2c1faeedb8c085a8d40b99cdf4bb0e2414295a8936e573866e": failed commit on ref "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561": "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561" failed size validation: 3872649 != 28593534: failed precondition
mlmd/0* active idle 10.1.20.84 8080/TCP
oidc-gatekeeper/0* active idle 10.1.20.94 8080/TCP
pytorch-operator/0* error idle 10.1.20.87 8443/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/pytorch-charmers/pytorch-operator/oci-image@sha256:08c3373247c853e804d74041366a3b161334d25b953e233776884ffab9012fc4": failed commit on ref "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6": "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6" failed size validation: 3775605 != 17524427: failed precondition
seldon-controller-manager/0* active idle 10.1.20.88 8080/TCP,4443/TCP
tfjob-operator/0* active idle 10.1.20.89 8443/TCP
Querying again after 8 AM showed something like this:
juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow my-controller myk8s 2.9.14 unsupported 09:48:06+08:00
App Version Status Scale Charm Store Channel Rev OS Address Message
admission-webhook res:oci-image@1abb127 active 1 admission-webhook charmstore stable 10 kubernetes 10.152.183.157
argo-controller res:oci-image@c1746ae active 1 argo-controller charmstore stable 51 kubernetes
dex-auth res:oci-image@af9c1b3 active 1 dex-auth charmstore stable 60 kubernetes 10.152.183.6
istio-ingressgateway waiting 1 istio-ingressgateway charmstore stable 20 kubernetes Waiting for Istio Pilot information
istio-pilot res:oci-image@e3e03b3 active 1 istio-pilot charmstore stable 20 kubernetes 10.152.183.223
jupyter-controller res:oci-image@8c7be42 active 1 jupyter-controller charmstore stable 56 kubernetes
jupyter-ui res:oci-image@af3b8ce active 1 jupyter-ui charmstore stable 10 kubernetes 10.152.183.214
kfp-api res:oci-image@8e60840 active 1 kfp-api charmstore stable 12 kubernetes 10.152.183.174
kfp-db mariadb/server:10.3 active 1 mariadb-k8s charmstore stable 35 kubernetes 10.152.183.129
kfp-persistence res:oci-image@9338d08 active 1 kfp-persistence charmstore stable 9 kubernetes
kfp-schedwf res:oci-image@4ab6488 active 1 kfp-schedwf charmstore stable 9 kubernetes
kfp-ui res:oci-image@04a4348 active 1 kfp-ui charmstore stable 12 kubernetes 10.152.183.30
kfp-viewer res:oci-image@bae62bf active 1 kfp-viewer charmstore stable 9 kubernetes
kfp-viz res:oci-image@c90a581 active 1 kfp-viz charmstore stable 8 kubernetes 10.152.183.34
kubeflow-dashboard res:oci-image@126c9a9 active 1 kubeflow-dashboard charmstore stable 56 kubernetes 10.152.183.59
kubeflow-profiles res:profile-image@582b8eb active 1 kubeflow-profiles charmstore stable 52 kubernetes 10.152.183.48
kubeflow-volumes res:oci-image@a325e90 active 1 kubeflow-volumes charmstore stable 0 kubernetes 10.152.183.209
minio res:oci-image@4707912 active 1 minio charmstore stable 55 kubernetes 10.152.183.247
mlmd res:oci-image@78eb66d active 1 mlmd charmstore stable 5 kubernetes 10.152.183.167
oidc-gatekeeper res:oci-image@9bb01f7 active 1 oidc-gatekeeper charmstore stable 54 kubernetes 10.152.183.4
pytorch-operator res:oci-image@08c3373 active 1 pytorch-operator charmstore stable 53 kubernetes
seldon-controller-manager res:oci-image@82fd029 active 1 seldon-core charmstore stable 50 kubernetes 10.152.183.215
tfjob-operator res:oci-image@3fabaf3 active 1 tfjob-operator charmstore stable 1 kubernetes
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.1.29.18 443/TCP
argo-controller/0* active idle 10.1.29.67
dex-auth/0* active idle 10.1.29.45 5556/TCP
istio-ingressgateway/0* waiting idle Waiting for Istio Pilot information
istio-pilot/0* active idle 10.1.29.68 8080/TCP,15010/TCP,15012/TCP,15017/TCP
jupyter-controller/0* active idle 10.1.73.23
jupyter-ui/0* active idle 10.1.29.50 5000/TCP
kfp-api/0* active idle 10.1.29.71 8888/TCP,8887/TCP
kfp-db/0* active idle 10.1.29.66 3306/TCP ready
kfp-persistence/0* active idle 10.1.29.70
kfp-schedwf/0* active idle 10.1.29.47
kfp-ui/0* active idle 10.1.29.69 3000/TCP
kfp-viewer/0* active idle 10.1.29.24
kfp-viz/0* active idle 10.1.29.51 8888/TCP
kubeflow-dashboard/0* active idle 10.1.29.72 8082/TCP
kubeflow-profiles/0* active idle 10.1.29.56 8080/TCP,8081/TCP
kubeflow-volumes/0* active idle 10.1.29.44 5000/TCP
minio/0* active idle 10.1.29.55 9000/TCP
mlmd/0* active idle 10.1.29.41 8080/TCP
oidc-gatekeeper/0* active idle 10.1.29.73 8080/TCP
pytorch-operator/0* active idle 10.1.29.46 8443/TCP
seldon-controller-manager/0* active idle 10.1.29.48 8080/TCP,4443/TCP
tfjob-operator/0* active idle 10.1.29.49 8443/TCP