Setting Up a Distributed TensorFlow Platform on a Kubernetes Cluster

1. Environment

  • CentOS 7 Linux virtual machines on VirtualBox
  • docker-ce 17.03
  • Kubernetes 1.13
  • Kubernetes networking: Flannel
  • NFS
  • TensorFlow 1.5.0

Kubernetes itself is assumed to be already set up, so its installation is not covered here.

This environment was built purely for personal study and practice, and has plenty of rough edges.

1.1 Hostname-to-IP mapping

Hostname     IP
k8s-master   172.20.10.2
k8s-node01   172.20.10.3
k8s-node02   172.20.10.4

2. Deploying TensorFlow

2.1 NFS

Distributed TensorFlow needs a directory that all of its nodes can access in common, so I deployed an NFS server on the master node. In the TensorFlow YAML manifests, each pod's volume is mounted on this NFS share, so every TensorFlow node works in the same directory. Without such a shared volume, the TensorFlow pods deployed through Kubernetes would fail to start.

k8s-master:

yum install nfs-utils rpcbind -y
mkdir -p /data/nfs
vim /etc/exports
/data/nfs 172.20.10.0/24(rw,no_root_squash,no_all_squash,sync)
/bin/systemctl start rpcbind.service
/bin/systemctl start nfs.service
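Before pointing pods at the share, it is worth verifying from a worker node that the export is actually reachable. A minimal check might look like this (assuming nfs-utils is installed on the nodes and /mnt is free):

```shell
# On k8s-node01 / k8s-node02: install the NFS client tools
yum install -y nfs-utils

# Confirm the export on the master is visible from this node
showmount -e 172.20.10.2

# Optionally mount it once by hand to verify read/write access
mount -t nfs 172.20.10.2:/data/nfs /mnt
touch /mnt/test && rm /mnt/test
umount /mnt
```

If showmount does not list /data/nfs, check the firewall and the exports file before moving on to the YAML manifests.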

2.2 TensorFlow

The distributed TensorFlow deployed here consists of one ps node and two worker nodes. The ps node is responsible for launching session.run() and driving the iterative training; worker0 reads and preprocesses the data, and worker1 initializes the parameters the TensorFlow cluster needs. The ps node's configuration goes in tf-ps.yaml, and the worker configuration goes in tf-worker.yaml.

tf-ps.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-ps
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tensorflow-ps
        role: ps
    spec:
      containers:
      - name: ps
        image: tensorflow/tensorflow:1.5.0
        ports:
        - containerPort: 2222
        - containerPort: 8888
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 500Mi
        volumeMounts:
        - mountPath: /notebooks
          readOnly: false
          name: nfs
      volumes:
      - name: nfs
        nfs:
          server: 172.20.10.2
          path: "/data/nfs"
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-ps-service
  labels:
    name: tensorflow-ps
    role: service
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8888
    nodePort: 30001
    name: tensorflow
  - port: 2222
    targetPort: 2222
    name: tf-ps
  selector:
    name: tensorflow-ps

tf-worker.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-worker
spec:
  replicas: 2
  template:
    metadata:
      labels:
        name: tensorflow-worker
        role: worker
    spec:
      containers:
      - name: worker
        image: tensorflow/tensorflow:1.5.0
        ports:
        - containerPort: 2222
        resources:
          limits:
            cpu: 2
            memory: 1Gi
          requests:
            cpu: 1
            memory: 500Mi
        volumeMounts:
        - mountPath: /notebooks
          readOnly: false
          name: nfs
      volumes:
      - name: nfs
        nfs:
          server: 172.20.10.2
          path: "/data/nfs"
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-wk-service
  labels:
    name: tensorflow-worker
spec:
  ports:
  - port: 2222
    targetPort: 2222
  selector:
    name: tensorflow-worker

After deploying TensorFlow with kubectl apply -f tf-ps.yaml and kubectl apply -f tf-worker.yaml:
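A few kubectl commands (a sketch; your actual pod names and IPs will differ) can confirm that everything came up:

```shell
# Apply both manifests
kubectl apply -f tf-ps.yaml
kubectl apply -f tf-worker.yaml

# The ps and worker pods should reach Running state
kubectl get pods -o wide -l role=ps
kubectl get pods -o wide -l role=worker

# The services expose port 2222, and the ps service adds NodePort 30001 for Jupyter
kubectl get svc tensorflow-ps-service tensorflow-wk-service
```

The -l selectors use the role labels defined in the manifests above, so they pick out exactly the ps and worker pods.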

TensorFlow.png

3. A Simple Usage Test

Entering the master's IP plus the NodePort in a browser, e.g. 172.20.10.2:30001, brings up the Jupyter interface.

jupyter.png

At this point Jupyter asks for a token. Running kubectl logs on the tensorflow-ps pod, or checking the tensorflow-ps logs in the Dashboard, reveals the token value.
token.png

Entering the token opens the Jupyter interface, where you can create notebooks and run machine-learning experiments.
notebook.png

Before testing distributed TensorFlow, the distributed environment itself has to be brought up. The IP addresses of the three TensorFlow nodes can be obtained by running kubectl describe svc on the corresponding services. The code that sets up the distributed TensorFlow environment is as follows:

import tensorflow as tf

tf.app.flags.DEFINE_string("ps_hosts", "10.244.3.4:2222", "ps hosts")
tf.app.flags.DEFINE_string("worker_hosts", "10.244.0.8:2222,10.244.4.4:2222", "worker hosts")
tf.app.flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS

def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    # Describe the cluster: one ps task and two worker tasks
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    # Start the server for this particular task
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
    # Block forever, serving requests from the other cluster members
    server.join()

if __name__ == "__main__":
    tf.app.run()
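The ClusterSpec above is just a mapping from job names to lists of task addresses; each (job_name, task_index) pair picks out the one address that server listens on. The following plain-Python sketch (no TensorFlow required) shows how the flag strings translate into that mapping:

```python
# Build the same mapping that tf.train.ClusterSpec receives,
# using the addresses from the script above.
ps_hosts = "10.244.3.4:2222".split(",")
worker_hosts = "10.244.0.8:2222,10.244.4.4:2222".split(",")
cluster = {"ps": ps_hosts, "worker": worker_hosts}

def address_of(job_name, task_index):
    """Return the host:port that a given server task listens on."""
    return cluster[job_name][task_index]

print(address_of("ps", 0))      # 10.244.3.4:2222 (the single ps task)
print(address_of("worker", 1))  # 10.244.4.4:2222 (the second worker task)
```

This is why the same script can be run on all three nodes: only the --job_name and --task_index flags change, while the cluster description stays identical everywhere.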

Then use kubectl exec to enter the three TensorFlow nodes and run python distributed.py --job_name=ps --task_index=0, python distributed.py --job_name=worker --task_index=0, and python distributed.py --job_name=worker --task_index=1 respectively. Once all three commands succeed, the distributed TensorFlow environment is up, as shown in Figure 5-22.



Create a new Jupyter notebook and test with the same kind of code as before, except that now the ps node launches session.run() and drives the iterative training, worker0 reads and preprocesses the data, and worker1 initializes the parameters the TensorFlow cluster needs.
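As a rough sketch of that division of labor (the function below is hypothetical, not code from the original notebook), each process can decide its role from the same two flags it was started with:

```python
def task_role(job_name, task_index):
    """Map a (job_name, task_index) pair to the role described above."""
    if job_name == "ps":
        return "launch session.run() and drive the training loop"
    if job_name == "worker" and task_index == 0:
        return "read and preprocess the input data"
    if job_name == "worker" and task_index == 1:
        return "initialize the cluster's shared parameters"
    raise ValueError("unknown task: %s:%d" % (job_name, task_index))

for job, idx in [("ps", 0), ("worker", 0), ("worker", 1)]:
    print("%s:%d -> %s" % (job, idx, task_role(job, idx)))
```

In a real notebook the branches would call into TensorFlow (input pipeline, variable initialization, training loop), but the dispatch-by-flags pattern is the same.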


test.png
