kubeflow 创建tensorflow过程

 

online deployable ,install k8s

代码

Kubeflow有三个核心组件

TFJob Operator 和 Controller:

作为Kubernetes的扩展,来简化分布式TensorFlow工作负载的部署。 通过Operator,Kubeflow能够自动化的配置 master服务器,工作服务器和参数化服务器配置。 TFJob可用来部署工作负载。

OPeratpor

$ kubectl describe deploy tf-job-operator-v1alpha2 -n kubeflow
Name:                   tf-job-operator-v1alpha2
Namespace:              kubeflow
CreationTimestamp:      Mon, 03 Dec 2018 12:51:55 +0000
Labels:                 app.kubernetes.io/deploy-manager=ksonnet
                        ksonnet.io/component=tf-job-operator
Annotations:            deployment.kubernetes.io/revision: 1
                        ksonnet.io/managed:
                          {"pristine":"H4sIAAAAAAAA/5yRTY8TMQyG7/wMn+djB24jcajY5dalAgkJrarKk3q6oUkcJZ4BtMp/R95+IdFy4ObE9vO+tl8Ao/1KKVsO0AP9FAoa53buBhLsoIK9DVvo4Z6i4...
Selector:               name=tf-job-operator
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:           name=tf-job-operator
  Service Account:  tf-job-operator
  Containers:
   tf-job-operator:
    Image:      gcr.io/kubeflow-images-public/tf_operator:v0.3.0
    Port:       
    Host Port:  
    Command:
      /opt/kubeflow/tf-operator.v2
      --alsologtostderr
      -v=1
    Environment:
      MY_POD_NAMESPACE:   (v1:metadata.namespace)
      MY_POD_NAME:        (v1:metadata.name)
    Mounts:
      /etc/config from config-volume (rw)
  Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tf-job-operator-config
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  
NewReplicaSet:   tf-job-operator-v1alpha2-644c5f7db7 (1/1 replicas created)
Events:          
View Code

 

$ kubectl -n kubeflow describe pod tf-job-operator
Name:               tf-job-operator-v1alpha2-644c5f7db7-dsttx
Namespace:          kubeflow
Priority:           0
PriorityClassName:  
Node:               kubeflow-1/10.0.0.43
Start Time:         Mon, 03 Dec 2018 12:51:58 +0000
Labels:             name=tf-job-operator
                    pod-template-hash=644c5f7db7
Annotations:        
Status:             Pending
IP:
Controlled By:      ReplicaSet/tf-job-operator-v1alpha2-644c5f7db7
Containers:
  tf-job-operator:
    Container ID:
    Image:         gcr.io/kubeflow-images-public/tf_operator:v0.3.0
    Image ID:
    Port:          
    Host Port:     
    Command:
      /opt/kubeflow/tf-operator.v2
      --alsologtostderr
      -v=1
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      MY_POD_NAMESPACE:  kubeflow (v1:metadata.namespace)
      MY_POD_NAME:       tf-job-operator-v1alpha2-644c5f7db7-dsttx (v1:metadata.name)
    Mounts:
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from tf-job-operator-token-fr42l (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tf-job-operator-config
    Optional:  false
  tf-job-operator-token-fr42l:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tf-job-operator-token-fr42l
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                      From                 Message
  ----     ------                  ----                     ----                 -------
  Normal   SandboxChanged          2d9h (x1330 over 2d17h)  kubelet, kubeflow-1  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  2d9h (x1335 over 2d17h)  kubelet, kubeflow-1  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9c0524a3122c673a81d2652d6755e46c721f57d67990a20f636abf806dab9c0e" network for pod "tf-job-operator-v1alpha2-644c5f7db7-dsttx": NetworkPlugin cni failed to set up pod "tf-job-operator-v1alpha2-644c5f7db7-dsttx_kubeflow" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Warning  FailedMount             24m                      kubelet, kubeflow-1  MountVolume.SetUp failed for volume "config-volume" : couldn't propagate object cache: timed out waiting for the condition
  Warning  FailedCreatePodSandBox  24m                      kubelet, kubeflow-1  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "76b89d8178a3247f0c906d9eb541b73c847be4f863d5da15efd3a578487cee7c" network for pod "tf-job-operator-v1alpha2-644c5f7db7-dsttx": NetworkPlugin cni failed to set up pod "tf-job-operator-v1alpha2-644c5f7db7-dsttx_kubeflow" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
View Code

TF-job 

$ kubectl -n kubeflow describe deploy tf-job-dashboard
Name:                   tf-job-dashboard
Namespace:              kubeflow
CreationTimestamp:      Mon, 03 Dec 2018 12:51:56 +0000
Labels:                 app.kubernetes.io/deploy-manager=ksonnet
                        ksonnet.io/component=tf-job-operator
Annotations:            deployment.kubernetes.io/revision: 1
                        ksonnet.io/managed:
                          {"pristine":"H4sIAAAAAAAA/3yQzY4TMRCE7zxGn+dvxSXyLUD2AiwRCDisoqjH07NrxnZbds8AivzuyCE/HMjeym53les7AAbzjWIy7EEB/RLyRad2uetJ8A4qmIwfQME7CpZ/O...
Selector:               name=tf-job-dashboard
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:           name=tf-job-dashboard
  Service Account:  tf-job-dashboard
  Containers:
   tf-job-dashboard:
    Image:      gcr.io/kubeflow-images-public/tf_operator:v0.3.0
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /opt/tensorflow_k8s/dashboard/backend
    Environment:
      KUBEFLOW_NAMESPACE:   (v1:metadata.namespace)
    Mounts:                
  Volumes:                 
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  
NewReplicaSet:   tf-job-dashboard-7499d5cbcf (1/1 replicas created)
Events:          
View Code

 

$ kubectl -n kubeflow describe pod tf-job-dashboard
Name:               tf-job-dashboard-7499d5cbcf-hzhcf
Namespace:          kubeflow
Priority:           0
PriorityClassName:  
Node:               kubeflow-1/10.0.0.43
Start Time:         Mon, 03 Dec 2018 12:51:58 +0000
Labels:             name=tf-job-dashboard
                    pod-template-hash=7499d5cbcf
Annotations:        
Status:             Pending
IP:
Controlled By:      ReplicaSet/tf-job-dashboard-7499d5cbcf
Containers:
  tf-job-dashboard:
    Container ID:
    Image:         gcr.io/kubeflow-images-public/tf_operator:v0.3.0
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /opt/tensorflow_k8s/dashboard/backend
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      KUBEFLOW_NAMESPACE:  kubeflow (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from tf-job-dashboard-token-452x9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tf-job-dashboard-token-452x9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tf-job-dashboard-token-452x9
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                       From                 Message
  ----     ------                  ----                      ----                 -------
  Warning  FailedCreatePodSandBox  32m (x1084 over 3h31m)    kubelet, kubeflow-1  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b4e475c7cf75d4086f91547e3a88632ca34c494c5e454601711a7588c8a0a22c" network for pod "tf-job-dashboard-7499d5cbcf-hzhcf": NetworkPlugin cni failed to set up pod "tf-job-dashboard-7499d5cbcf-hzhcf_kubeflow" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Normal   SandboxChanged          2m29s (x1255 over 3h32m)  kubelet, kubeflow-1  Pod sandbox changed, it will be killed and re-created.
View Code

 

TF Hub:

JupyterHub的运行实例,通过该实例来使用Jupyter notebook。

$ kubectl -n kubeflow get pod tf-hub-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-12-03T12:51:35Z
  generateName: tf-hub-
  labels:
    app: tf-hub
    controller-revision-hash: tf-hub-755f7566bd
    statefulset.kubernetes.io/pod-name: tf-hub-0
  name: tf-hub-0
  namespace: kubeflow
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: tf-hub
    uid: 2a91c29b-f6fa-11e8-8183-fa163ed7118a
  resourceVersion: "319408"
  selfLink: /api/v1/namespaces/kubeflow/pods/tf-hub-0
  uid: 2a953b58-f6fa-11e8-8183-fa163ed7118a
spec:
  containers:
  - command:
    - jupyterhub
    - -f
    - /etc/config/jupyterhub_config.py
    env:
    - name: NOTEBOOK_PVC_MOUNT
      value: /home/jovyan
    - name: CLOUD_NAME
      value: "null"
    - name: REGISTRY
      value: gcr.io
    - name: REPO_NAME
      value: kubeflow-images-public
    - name: KF_AUTHENTICATOR
      value: "null"
    - name: DEFAULT_JUPYTERLAB
      value: "false"
    - name: KF_PVC_LIST
      value: "null"
    image: gcr.io/kubeflow/jupyterhub-k8s:v20180531-3bb991b1
    imagePullPolicy: IfNotPresent
    name: tf-hub
    ports:
    - containerPort: 8000
      protocol: TCP
    - containerPort: 8081
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/config
      name: config-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: jupyter-hub-token-hs8h8
      readOnly: true
  dnsPolicy: ClusterFirst
  hostname: tf-hub-0
  nodeName: kubeflow-1
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: jupyter-hub
  serviceAccountName: jupyter-hub
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: jupyterhub-config
    name: config-volume
  - name: jupyter-hub-token-hs8h8
    secret:
      defaultMode: 420
      secretName: jupyter-hub-token-hs8h8
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-12-03T12:51:35Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-12-03T12:51:35Z
    message: 'containers with unready status: [tf-hub]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2018-12-03T12:51:35Z
    message: 'containers with unready status: [tf-hub]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2018-12-03T12:51:35Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: gcr.io/kubeflow/jupyterhub-k8s:v20180531-3bb991b1
    imageID: ""
    lastState: {}
    name: tf-hub
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: ContainerCreating
  hostIP: 10.0.0.43
  phase: Pending
  qosClass: BestEffort
  startTime: 2018-12-03T12:51:35Z
View Code

 

$ kubectl -n kubeflow get svc tf-hub-0 -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/2SOsU4DQQxEez5j6gscHdo/oEFISDSIwtk4ypI7e7X2BqHT/jsyETSUfprxvA1Uyys3KypIuNxjwrnIAQkv3C4lMyas7HQgJ6QNJKJOXlQsztp0ZT9xt9uid5YbVUaCt84YExba8/ITpFqDH3envo8NUxH2KGVdqwqLI+Gj1y/nFpkxQWjlv9JuxpVYpRz43Pd8XPQzklY5x0peujm3x2ckPKmEfNXmhvS2/b67CgRGepjnebxPMF44u7Z/pmOMm28AAAD//wEAAP//d9+FryQBAAA="}'
    prometheus.io/scrape: "true"
  creationTimestamp: 2018-12-03T12:51:34Z
  labels:
    app: tf-hub
    app.kubernetes.io/deploy-manager: ksonnet
    ksonnet.io/component: jupyterhub
  name: tf-hub-0
  namespace: kubeflow
  resourceVersion: "319385"
  selfLink: /api/v1/namespaces/kubeflow/services/tf-hub-0
  uid: 2a5a66bd-f6fa-11e8-8183-fa163ed7118a
spec:
  clusterIP: None
  ports:
  - name: hub
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: tf-hub
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
View Code

 

$ kubectl -n kubeflow get svc tf-hub-lb -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind:  Mapping
      name: tf-hub-lb-hub-mapping
      prefix: /hub/
      rewrite: /hub/
      timeout_ms: 300000
      service: tf-hub-lb.kubeflow
      use_websocket: true
      ---
      apiVersion: ambassador/v0
      kind:  Mapping
      name: tf-hub-lb-user-mapping
      prefix: /user/
      rewrite: /user/
      timeout_ms: 300000
      service: tf-hub-lb.kubeflow
      use_websocket: true
    ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/6yRP4/bMAzF934Kg7Mdu+hSaO3UoUCAAl3qQ0A5tKOLTQkilVwQ+LsflL93wW13WgQ+kY8/4R0Bg/tHUZxnMLD7DiVsHa/BwF+KO9cRlDCR4hoVwRwBmb2iOs+Sy4EUJ4siuPZx4Xzdee7dAAaqqmr5bm6Ke1+9a1rOW0xR/MEQHA8tM05kCu2rTbLVaE/XdH0MkXr3Yop6k2zdcqR9dEq3Wt1EPulqElP8aPJpWc74bywX22SpH/2+5SS02pMV321JTaExUcufJU5C8QPkLL9jvghfAQ1zCSNaGk9hYAhg4DaaoxTPTHoOZgqeiRUMPKdwUIqbZLNB/sfDXJYkYJf1K0BulUBd3hR8VAHz/3gdzlblSQbzsylBMQ6ky0vdNPNTCUIjderjI2o21kPINr/GJErx9xLm+dsrAAAA//8BAAD//weR4mmcAgAA"}'
  creationTimestamp: 2018-12-03T12:51:34Z
  labels:
    app: tf-hub-lb
    app.kubernetes.io/deploy-manager: ksonnet
    ksonnet.io/component: jupyterhub
  name: tf-hub-lb
  namespace: kubeflow
  resourceVersion: "319392"
  selfLink: /api/v1/namespaces/kubeflow/services/tf-hub-lb
  uid: 2a712a94-f6fa-11e8-8183-fa163ed7118a
spec:
  clusterIP: 10.102.20.216
  ports:
  - name: hub
    port: 80
    protocol: TCP
    targetPort: 8000
  selector:
    app: tf-hub
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
View Code

模型服务器:

部署经过训练过的TensorFlow模型,供客户访问并用于预测。

这三个模型将用于部署不同的工作负载。

kubeflow 创建tensorflow过程_第1张图片

 

1. 部署Kubeflow

export GITHUB_TOKEN=99510f2ccf40e496d1e97dbec9f31cb16770b884

export KUBEFLOW_VERSION=0.2.5 curl https://raw.githubusercontent.com/kubeflow/kubeflow/v${KUBEFLOW_VERSION}/scripts/deploy.sh | bash

kubeflow 创建tensorflow过程_第2张图片

 

 

katacoda.yaml

kubeflow 创建tensorflow过程_第3张图片

 

2. 部署Kubeflow 需要的LoadBalancer 和 Persistent Volume

kubeflow/katacoda.yaml

kubeflow 创建tensorflow过程_第4张图片

 

kubectl apply -f ~/kubeflow/katacoda.yaml

kubeflow 创建tensorflow过程_第5张图片

 

3. 部署TensorFlow作业(TFJob)

TfJob提供了一个Kubeflow的custom resource,可以在Kubernetes上轻松运行分布式或非分布式TensorFlow作业。 TFJob控制器为master,parameter服务器和worker  提供了一个YAML规范(specification ),来运行分布式计算。

CRD(Custom Resource Definition)提供了创建和管理TFJob的能力, CRD与Kubernetes内置资源方式相同。 部署后,CRD可以配置TensorFlow Job,允许用户专注于机器学习而不是基础设施。

 

4. 创建一个 TFJob Deployment Definition

部署TensorFlow workload,Kubeflow需要定义TFJob。

可以查看:

cat example.yaml

 kubeflow 创建tensorflow过程_第6张图片

 

其定义了三个组件:

master:每个job必须有一个master。 master将协调workers之间的训练操作执行。

worker:一个job可以有0到N名workers。 每个worker进程都运行相同的model,为PS(Parameter Server)提供处理参数。

PS:一个job可以有0到N个Parameter Server。 Parameter Server可以跨多台计算机扩展model。

https://www.tensorflow.org/deploy/distributed

 

5. 部署TFJob

运行

kubectl apply -f example.yaml

 

通过部署TFJob,Kubernetes跨可用的节点进行调度, 执行负载。 作为部署的一部分,Kubeflow使用请求的设置来配置TensorFlow,以允许不同的组件进行通信。

6. 查看TFJob(用户资源) 的进度和结果

 kubectl get tfjob

TensorFlow作业完成后,master 服务会被标记为成功。

master 负责协调执行和汇总结果.

查看 已完成的pods:

kubectl get pods | grep Completed

 kubeflow 创建tensorflow过程_第7张图片

 

本示例,结果输出到STDOUT, 通过一下命令可以看到master,worker和PS负载的执行结果

kubectl logs $(kubectl get pods | grep Completed | tr -s ' ' | cut -d ' ' -f 1)

 kubeflow 创建tensorflow过程_第8张图片

 

7.  Deploy JupyterHub

 通过前面的Kubeflow,JupyterHub已经部署到Kubernetes的集群上。可以通过一下命令找到Load Balancer IP地址:

kubectl get svc

kubeflow 创建tensorflow过程_第9张图片

 

kubeflow 创建tensorflow过程_第10张图片

kubeflow 创建tensorflow过程_第11张图片

 


Jupyter Notebook 通过JupyterHub运行。Jupyter Notebook是经典的数据科学工具,用于在浏览器中运行记录 脚本和代码片段。

打开 JupyterHub:

用户名为 admin 密码为空。

进入之后,选择一个tensorflow image, gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-cpu:v0.2.1 ,

点击“Spawn” 来启动server。这个会生成一个名为jupyter-admin的Kubernetes Pod来管理server。

kubectl get pods jupyter-admin

 

8. 使用 Jupyter Notebook

kubeflow 创建tensorflow过程_第12张图片

运行代码

from __future__ import print_function

import tensorflow as tf

hello = tf.constant('Hello TensorFlow!')
s = tf.Session()
print(s.run(hello))

其他代码片段如下:

for job_name in cluster_spec.keys():
  for i in range(len(cluster_spec[job_name])):
    d = "/job:{0}/task:{1}".format(job_name, i)
    with tf.device(d):
      a = tf.constant(range(width * height), shape=[height, width])
      b = tf.constant(range(width * height), shape=[height, width])
      c = tf.multiply(a, b) results.append(c)

 https://github.com/tensorflow/k8s/tree/master/examples/tf_sample

9. 部署训练好的Model Server

训练完成之后, model就可以用来执行新数据的预测。 Kubeflow tf-serving提供一个模板,用来服务TensorFlow model。

通过基于模型参数的定义, 并使用Ksonnet来定制和部署。

环境变量如下:

MODEL_COMPONENT=model-server
MODEL_NAME=inception
MODEL_PATH=/serving/inception-export

使用Ksonnet,扩展Kubeflow服务组件来匹配model的要求。

cd ~/kubeflow_ks_app
ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME}
ks param set ${MODEL_COMPONENT} modelPath $MODEL_PATH

ks param set ${MODEL_COMPONENT} modelServerImage katacoda/tensorflow_serving

通过一下命令来查看定义的参数:

ks param list

kubeflow 创建tensorflow过程_第13张图片

 

在Kubernete集群中部署这个model 服务

ks apply default -c ${MODEL_COMPONENT}

kubeflow 创建tensorflow过程_第14张图片

 

查看运行的pod状态:

kubeflow 创建tensorflow过程_第15张图片

 

kubectl get pods

 kubeflow 创建tensorflow过程_第16张图片

 

10. 图像分类

该例子使用预先训练的Inception V3模型。 这是基于ImageNet数据集上训练出来的架构。 ML任务是图像分类,Kubernetes处理model server及其客户端。

要使用已发布的模型,您需要设置客户端。 这与其他job实现方式相同。 用于部署客户端的YAML文件:

cat ~/model-client-job.yaml

kubeflow 创建tensorflow过程_第17张图片

 

使用以下命令部署:

kubectl apply -f ~/model-client-job.yaml

查看 model-client-job 的运行状态:

kubectl get pods

kubeflow 创建tensorflow过程_第18张图片

 

查看Katacoda log的分类状态.

kubectl logs $(kubectl get pods | grep Completed | tail -n1 | tr -s ' ' | cut -d ' ' -f 1)

kubeflow 创建tensorflow过程_第19张图片

 

https://github.com/kubeflow/kubeflow/tree/master/components/k8s-model-server

 REF:

 k8s install

master-isolation

Installing Calico for policy and networking (recommended)

Controlling ingress and egress traffic with network policy

Installing Calico for policy and flannel for networking

 各个组件介绍:

spartakus

这个是汇报各种信息的
including the operating system version, kubelet version, container runtime version, as well as CPU and memory capacity.

argo

Argo is an open source container-native workflow engine for getting work done on Kubernetes. Argo is implemented as a Kubernetes CRD (Custom Resource Definition).

 AMBASSADOR

大使是 Open Source Kubernetes-Native API Gateway built on the Envoy Proxy

katib

Hyperparameter Tuning on Kubernetes. This project is inspired by Google vizier. Katib is a scalable and flexible hyperparameter tuning framework and is tightly integrated with kubernetes. Also it does not depend on a specific Deep Learning framework e.g. TensorFlow, MXNet, and PyTorch).

modeldb 

modelDB 是用来管理ML的model的。

ModelDB is an end-to-end system to manage machine learning models. It ingests models and associated metadata as models are being trained, stores model data in a structured format, and surfaces it through a web-frontend for rich querying. ModelDB can be used with any ML environment via the ModelDB Light API. ModelDB native clients can be used for advanced support in spark.ml and scikit-learn.

转载于:https://www.cnblogs.com/shaohef/p/10033604.html

你可能感兴趣的:(网络,人工智能,开发工具)