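Note: This article has been verified on OpenShift 4.14 + RHODS 2.50.
Note: Before starting, install the OpenShift AI environment by following the article "OpenShift AI - Deploy an OpenShift AI Environment and Run AI/ML Applications (Video)".
Note: Unless stated otherwise, the blog posts about OpenShift AI do not require a GPU.
RHOAI uses a StatefulSet to run the Jupyter environment found under Applications > Enabled. The StatefulSet and all related resources run in the rhods-notebooks project.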
$ oc get all -n rhods-notebooks
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME                     READY   STATUS    RESTARTS   AGE
pod/jupyter-nb-user1-0   2/2     Running   0          14m

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/jupyter-nb-user1       ClusterIP   172.31.145.240   <none>        80/TCP    14m
service/jupyter-nb-user1-tls   ClusterIP   172.31.157.129   <none>        443/TCP   14m

NAME                                READY   AGE
statefulset.apps/jupyter-nb-user1   1/1     14m

NAME                                        HOST/PORT                                                                        PATH   SERVICES               PORT          TERMINATION          WILDCARD
route.route.openshift.io/jupyter-nb-user1   jupyter-nb-user1-rhods-notebooks.apps.cluster-4gwc7.dynamic.redhatworkshops.io          jupyter-nb-user1-tls   oauth-proxy   reencrypt/Redirect   None
$ oc get pvc -n rhods-notebooks
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
jupyterhub-nb-user1-pvc   Bound    pvc-379c22b3-ddc8-4c7c-b6ae-59311f2a1383   20Gi       RWO            ocs-external-storagecluster-ceph-rbd   41m
$ oc get statefulset -n rhods-notebooks
NAME READY AGE
jupyter-nb-user1 0/0 16m
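The ready and desired replica counts shown above can also be read programmatically. The following is a minimal sketch, assuming the `oc` CLI is installed and logged in to the cluster. The helper names `summarize_statefulsets` and `fetch_statefulsets` are hypothetical and not part of RHOAI.

```python
import json
import subprocess


def summarize_statefulsets(sts_list: dict) -> dict:
    """Map each StatefulSet name to a (ready, desired) replica pair."""
    summary = {}
    for item in sts_list.get("items", []):
        name = item["metadata"]["name"]
        # The API server defaults spec.replicas to 1 when it is unset.
        desired = item.get("spec", {}).get("replicas", 1)
        # status.readyReplicas is omitted when no pod is ready.
        ready = item.get("status", {}).get("readyReplicas", 0)
        summary[name] = (ready, desired)
    return summary


def fetch_statefulsets(namespace: str = "rhods-notebooks") -> dict:
    """Return the parsed output of `oc get statefulset -o json`."""
    result = subprocess.run(
        ["oc", "get", "statefulset", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `summarize_statefulsets(fetch_statefulsets())` against the cluster above would be expected to return one entry per notebook server, where a desired count of 0 means the StatefulSet currently has no pod.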
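RHOAI runs Jupyter notebooks in containers. The RHOAI console lists the container images provided by default: Minimal Python, Standard Data Science, CUDA, PyTorch, TensorFlow, HabanaAI, and TrustyAI. Users can also use their own custom container images.
Note: See the detailed documentation for the packages preinstalled in each image.
When creating a Jupyter notebook server, you must select the image to use. These images are accessed through ImageStreams in the redhat-ods-applications project.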
$ oc get is -n redhat-ods-applications
NAME                                IMAGE REPOSITORY                                                                                             TAGS                UPDATED
habana-notebook                     image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/habana-notebook                     2023.2              9 minutes ago
minimal-gpu                         image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu                         1.2,2023.1,2023.2   9 minutes ago
odh-trustyai-notebook               image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/odh-trustyai-notebook               2023.1,2023.2       9 minutes ago
pytorch                             image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/pytorch                             1.2,2023.1,2023.2   9 minutes ago
s2i-generic-data-science-notebook   image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/s2i-generic-data-science-notebook   1.2,2023.1,2023.2   9 minutes ago
s2i-minimal-notebook                image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/s2i-minimal-notebook                1.2,2023.1,2023.2   9 minutes ago
tensorflow                          image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/tensorflow                          1.2,2023.1,2023.2   9 minutes ago
$ oc get is -n redhat-ods-applications tensorflow -o jsonpath='{.spec.tags[2]}' | jq
{
  "annotations": {
    "opendatahub.io/notebook-build-commit": "d0ce8b0",
    "opendatahub.io/notebook-python-dependencies": "[{\"name\":\"TensorFlow\",\"version\":\"2.13\"},{\"name\":\"Tensorboard\",\"version\":\"2.13\"}, {\"name\":\"Boto3\",\"version\":\"1.28\"},{\"name\":\"Kafka-Python\",\"version\":\"2.0\"},{\"name\":\"Kfp-tekton\",\"version\":\"1.5\"},{\"name\":\"Matplotlib\",\"version\":\"3.6\"},{\"name\":\"Numpy\",\"version\":\"1.24\"},{\"name\":\"Pandas\",\"version\":\"1.5\"},{\"name\":\"Scikit-learn\",\"version\":\"1.3\"},{\"name\":\"Scipy\",\"version\":\"1.11\"},{\"name\":\"Elyra\",\"version\":\"3.15\"},{\"name\":\"PyMongo\",\"version\":\"4.5\"},{\"name\":\"Pyodbc\",\"version\":\"4.0\"}, {\"name\":\"Codeflare-SDK\",\"version\":\"0.12\"}, {\"name\":\"Sklearn-onnx\",\"version\":\"1.15\"}, {\"name\":\"Psycopg\",\"version\":\"3.1\"}, {\"name\":\"MySQL Connector/Python\",\"version\":\"8.0\"}]",
    "opendatahub.io/notebook-software": "[{\"name\":\"CUDA\",\"version\":\"11.8\"},{\"name\":\"Python\",\"version\":\"v3.9\"},{\"name\":\"TensorFlow\",\"version\":\"2.13\"}]",
    "opendatahub.io/workbench-image-recommended": "true",
    "openshift.io/imported-from": "quay.io/modh/cuda-notebooks"
  },
  "from": {
    "kind": "DockerImage",
    "name": "quay.io/modh/cuda-notebooks@sha256:59d571d0d245c050eb9f79de5c3c40517a575d8fdfb41385a324ee45a42b597b"
  },
  "generation": 2,
  "importPolicy": {
    "importMode": "Legacy"
  },
  "name": "2023.2",
  "referencePolicy": {
    "type": "Source"
  }
}
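The `opendatahub.io/notebook-python-dependencies` annotation in the output above is a JSON document stored as a string, so it has to be decoded a second time after the tag itself is parsed. The following minimal sketch shows one way to do that. The function name `python_dependencies` is hypothetical, and `sample_tag` is a shortened copy of the tag above that keeps only three of its dependencies.

```python
import json


def python_dependencies(tag: dict) -> dict:
    """Return {package name: version} from an ImageStream tag's annotations."""
    raw = tag.get("annotations", {}).get(
        "opendatahub.io/notebook-python-dependencies", "[]"
    )
    # The annotation value is JSON encoded as a string, hence json.loads here.
    return {entry["name"]: entry["version"] for entry in json.loads(raw)}


# Shortened copy of the tag shown above (only three dependencies kept).
sample_tag = {
    "annotations": {
        "opendatahub.io/notebook-python-dependencies": (
            '[{"name":"TensorFlow","version":"2.13"},'
            '{"name":"Boto3","version":"1.28"},'
            '{"name":"Numpy","version":"1.24"}]'
        )
    },
    "name": "2023.2",
}

deps = python_dependencies(sample_tag)
print(deps["TensorFlow"])  # prints 2.13
```

The same function works on the full tag if the `jq` output is loaded with `json.load`.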
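In addition to the notebook images that OpenShift AI provides by default, users can build their own custom notebook images.
In the RHOAI console, every authorized user can create and run one Notebook Server environment under Applications. All users' Notebook Servers run as StatefulSets in the rhods-notebooks project. However, running all of a user's AI/ML workloads in this single Notebook Server is often not enough, so users can create multiple projects under Data Science Projects to manage different AI/ML workloads.
When training an AI model, a simple project with a relatively small amount of data can upload its files directly into the workbench's notebook environment. More complex projects involve larger datasets, and storing that data directly in the Jupyter code repository is not recommended because it overloads the repository. A better approach is to keep the data in a dedicated storage system and download it to the workbench when needed. Likewise, exporting a trained model can produce large files, so those should also be kept in a dedicated storage system.
An OpenShift AI data connection is a set of configuration values for accessing an external storage system.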
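Inside a workbench, a data connection is typically exposed as environment variables. The sketch below assumes the variable names `AWS_S3_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_S3_BUCKET`, and `AWS_DEFAULT_REGION`. These names are not stated in this article, so verify them in your own workbench. The `DataConnection` class and both helper functions are hypothetical, and `download` additionally assumes the `boto3` package is installed in the notebook image.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DataConnection:
    endpoint: str
    access_key: str
    secret_key: str
    bucket: str
    region: str


def load_data_connection(env=os.environ) -> DataConnection:
    """Build a DataConnection from environment variables.

    The variable names are an assumption, not taken from this article.
    """
    required = (
        "AWS_S3_ENDPOINT",
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_S3_BUCKET",
    )
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing data connection variables: {', '.join(missing)}")
    return DataConnection(
        endpoint=env["AWS_S3_ENDPOINT"],
        access_key=env["AWS_ACCESS_KEY_ID"],
        secret_key=env["AWS_SECRET_ACCESS_KEY"],
        bucket=env["AWS_S3_BUCKET"],
        # Fall back to a placeholder region when none is configured.
        region=env.get("AWS_DEFAULT_REGION", "us-east-1"),
    )


def download(conn: DataConnection, key: str, dest: str) -> None:
    """Download one object from the bucket to a local path in the workbench."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    s3 = boto3.client(
        "s3",
        endpoint_url=conn.endpoint,
        aws_access_key_id=conn.access_key,
        aws_secret_access_key=conn.secret_key,
        region_name=conn.region,
    )
    s3.download_file(conn.bucket, key, dest)
```

In a notebook cell, `download(load_data_connection(), "data/train.csv", "train.csv")` would fetch a dataset only when it is needed, where the object key `data/train.csv` is a made-up example. Keeping the credentials in the environment avoids writing them into the notebook itself.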
This article uses MinIO as the example object store, created with the following resources.
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: minio-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  volumeMode: Filesystem
---
kind: Secret
apiVersion: v1
metadata:
  name: minio-secret
stringData:
  # change the username and password to your own values.
  # ensure that the user is at least 3 characters long and the password at least 8
  minio_root_user: minio
  minio_root_password: minio123
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: minio
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: minio-pvc
      containers:
        - resources:
            limits:
              cpu: 250m
              memory: 1Gi
            requests:
              cpu: 20m
              memory: 100Mi
          readinessProbe:
            tcpSocket:
              port: 9000
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: minio
          livenessProbe:
            tcpSocket:
              port: 9000
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: MINIO_ROOT_USER
              valueFrom:
                secretKeyRef:
                  name: minio-secret
                  key: minio_root_user
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-secret
                  key: minio_root_password
          ports:
            - containerPort: 9000
              protocol: TCP
            - containerPort: 9090
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: data
              mountPath: /data
              subPath: minio
          terminationMessagePolicy: File
          image: >-
            quay.io/minio/minio:RELEASE.2023-06-19T19-52-50Z
          args:
            - server
            - /data
            - --console-address
            - :9090
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: Recreate
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
---
kind: Service
apiVersion: v1
metadata:
  name: minio-service
spec:
  ipFamilies:
    - IPv4
  ports:
    - name: api
      protocol: TCP
      port: 9000
      targetPort: 9000
    - name: ui
      protocol: TCP
      port: 9090
      targetPort: 9090
  internalTrafficPolicy: Cluster
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    app: minio
---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: minio-api
spec:
  to:
    kind: Service
    name: minio-service
    weight: 100
  port:
    targetPort: api
  wildcardPolicy: None
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: minio-ui
spec:
  to:
    kind: Service
    name: minio-service
    weight: 100
  port:
    targetPort: ui
  wildcardPolicy: None
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
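Before applying the manifests, it can help to check the credentials against the length rules stated in the Secret's comment, and afterwards to probe MinIO through the minio-api Route. The following is a minimal sketch of both steps. The function names are hypothetical, the Route host must be supplied by the reader (for example from `oc get route minio-api`), and the probe assumes MinIO's `/minio/health/live` liveness endpoint is reachable through that Route.

```python
import urllib.request


def credential_problems(user: str, password: str) -> list:
    """Return violations of the rules in the Secret's comment (empty if none)."""
    problems = []
    if len(user) < 3:
        problems.append("user must be at least 3 characters long")
    if len(password) < 8:
        problems.append("password must be at least 8 characters long")
    return problems


def health_url(route_host: str) -> str:
    """URL of the MinIO liveness endpoint behind the edge-terminated Route."""
    return f"https://{route_host}/minio/health/live"


def is_alive(route_host: str, timeout: float = 5.0) -> bool:
    """Return True if the liveness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(route_host), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, TLS failures, and HTTP error responses.
        return False
```

The sample values `minio` and `minio123` from the Secret pass the length check, but as the comment in the manifest says, they should be replaced with your own values.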
https://ai-on-openshift.io/odh-rhods/configuration/
https://github.com/opendatahub-io/notebooks/blob/main/docs/workbenches.md