Installing and Deploying the OpenPAI AI Platform

1. Environment preparation

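Prepare three Ubuntu 18 virtual machines to act as manager, master, and node1.
Hardware configuration:
manager: 4 processors, 8 GB memory
master: 16 cores, 43.3 GB memory
node1: 16 cores, 17.3 GB memory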
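(1) Hostname configuration
On each virtual machine, run vi /etc/hostname and set the hostname to manager, master, or node1 respectively.
(2) IP configuration
On every node, open the hosts file with vi /etc/hosts and add the following entries: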
192.168.18.150 manager
192.168.18.151 master
192.168.18.152 node1
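
Editing the hosts file by hand on three machines is easy to get wrong, so the step can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name add_host_entry is hypothetical, and it only appends a mapping when the hostname is not already present in the file.

```shell
# Append "ip name" to a hosts file unless that hostname is already mapped.
# add_host_entry is a hypothetical helper, introduced only for illustration.
add_host_entry() {
    local file="$1" ip="$2" name="$3"
    # Look for the name as an exact field (column 2 or later) on non-comment lines.
    if awk -v n="$name" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" 2>/dev/null; then
        return 0    # already present, nothing to do
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example usage (run as root on every node):
#   add_host_entry /etc/hosts 192.168.18.150 manager
#   add_host_entry /etc/hosts 192.168.18.151 master
#   add_host_entry /etc/hosts 192.168.18.152 node1
```

Because the helper checks before appending, running it again after a partial setup does not create duplicate lines.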
(3) SSH installation and passwordless login
Install and start the SSH server on every node:
apt-get install openssh-server
service ssh start
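Open the SSH daemon configuration file with vim /etc/ssh/sshd_config and set the following option:
PermitRootLogin yes
Restart the service afterwards (service ssh restart) so that the change takes effect.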

To configure passwordless login, run the following commands on each node:
ssh-keygen -t rsa
ssh-copy-id node1
ssh-copy-id master
ssh-copy-id manager
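
The same key distribution can be written as a loop, which also makes it easy to verify the result afterwards. This is only a sketch: the names NODES, copy_keys, and check_login are hypothetical, and copy_keys defaults to a dry run that prints the commands instead of executing them.

```shell
# Nodes that should receive the local public key (same order as the commands above).
NODES="node1 master manager"

# Print (DRY_RUN=1, the default) or execute (DRY_RUN=0) ssh-copy-id for every node.
copy_keys() {
    local node
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh-copy-id ${node}"
        else
            ssh-copy-id "${node}"
        fi
    done
}

# Return success only if key-based login works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example usage:
#   DRY_RUN=0 copy_keys
#   for n in $NODES; do check_login "$n" || echo "passwordless login to $n failed"; done
```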

(4) Time synchronization: install NTP
apt install ntp
(5) Docker installation
For installing Docker itself, see:
https://www.cnblogs.com/wt7018/p/11880666.html
On every node, open daemon.json with vi /etc/docker/daemon.json and add the following content:
{"debug": true, "registry-mirrors": ["http://192.168.18.151:30500"], "insecure-registries": ["http://192.168.18.151:30500"]}
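
Since the same file is needed on every node, it can be generated from the registry address rather than typed by hand. The sketch below is an illustration under assumptions: render_daemon_json is a hypothetical helper, and it overwrites the whole file, so it only suits nodes that have no other settings in daemon.json.

```shell
# Print a daemon.json body for a registry at http://<host>:<port>.
# render_daemon_json is a hypothetical helper, introduced only for illustration.
render_daemon_json() {
    local registry="http://$1:$2"
    printf '{"debug": true, "registry-mirrors": ["%s"], "insecure-registries": ["%s"]}\n' \
        "$registry" "$registry"
}

# Example usage (as root); Docker rereads the file only after a restart:
#   render_daemon_json 192.168.18.151 30500 > /etc/docker/daemon.json
#   service docker restart
```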

2. Installing Kubernetes and OpenPAI

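(1) Edit the configuration files
First change into the deployment directory: cd /home/pai
Edit the configuration file with vi config/config.yaml and change the Kubernetes download locations to mirrors inside China: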

user: root
password: root
docker_image_tag: v1.8.0

gcr_image_repo: "registry.cn-hangzhou.aliyuncs.com"
kube_image_repo: "registry.cn-hangzhou.aliyuncs.com/google_containers"

openpai_kubespray_extra_var:
  pod_infra_image_repo: "registry.cn-hangzhou.aliyuncs.com/google_containers/pause-{{ image_arch }}"
  dnsautoscaler_image_repo: "docker.io/mirrorgooglecontainers/cluster-proportional-autoscaler-{{ image_arch }}"
  tiller_image_repo: "registry.cn-hangzhou.aliyuncs.com/google_containers/kubernetes-helm/tiller"
  registry_proxy_image_repo: "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-registry-proxy"
  metrics_server_image_repo: "registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server-amd64"
  addon_resizer_image_repo: "registry.cn-hangzhou.aliyuncs.com/google_containers/addon-resizer"
  dashboard_image_repo: "registry.cn-hangzhou.aliyuncs.com/google_containers/kubernetes-dashboard-{{ image_arch }}"

Edit the layout file with vi config/layout.yaml:

machine-sku:
  master-machine: # define a machine sku
    # the resource requirements for all the machines of this sku
    # We use the same memory format as Kubernetes, e.g. Gi, Mi
    # Reference: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory
    mem: 28Gi
    cpu:
      # the number of CPU vcores
      vcore: 16
  cpu-machine:
    computing-device:
      # For `type`, please follow the same format specified in device plugin.
      # For example, `nvidia.com/gpu` is for NVIDIA GPU, `amd.com/gpu` is for AMD GPU,
      # and `enflame.com/dtu` is for Enflame DTU.
      # Reference: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
      type: nvidia.com/cpu
      model: K80
      count: 4
    mem: 16Gi
    cpu:
      vcore: 16

machine-list:
  - hostname: master # name of the machine, **do not** use upper case alphabet letters for hostname
    hostip: 192.168.18.151
    machine-type: master-machine # only one master-machine supported
    pai-master: "true"
  - hostname: node1
    hostip: 192.168.18.152
    machine-type: cpu-machine
    pai-worker: "true"
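
A mismatch between the machine-list in layout.yaml and the hosts entries from section 1 is an easy mistake to make, so it can help to list what the layout actually declares before starting the installation. The sketch below assumes the simple layout shown above, where each "- hostname:" line is followed by a "hostip:" line; layout_hosts is a hypothetical helper and not a real YAML parser.

```shell
# Print "hostip hostname" for every machine-list entry in a layout.yaml.
# layout_hosts is a hypothetical helper; it relies on the line layout shown above.
layout_hosts() {
    awk '
        { sub(/#.*/, "") }                                   # drop comments
        $1 == "-" && $2 == "hostname:" { name = $3 }         # remember the hostname
        $1 == "hostip:" && name != "" { print $2, name; name = "" }
    ' "$1"
}

# Example usage: compare each entry with what the node itself resolves.
#   layout_hosts config/layout.yaml | while read -r ip name; do
#       [ "$(getent hosts "$name" | awk '{print $1}')" = "$ip" ] || echo "mismatch: $name"
#   done
```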

(2) Kubernetes installation commands
cd /home/pai/contrib/kubespray
/bin/bash quick-start-kubespray.sh -v
(3) OpenPAI installation command
/bin/bash quick-start-service.sh
When the installation succeeds, it prints the following:

OpenPAI is successfully deployed, please check the following information:
Kubernetes cluster config :     ~/pai-deploy/kube/config
OpenPAI cluster config    :     ~/pai-deploy/cluster-cfg
OpenPAI cluster ID        :     pai
Default username          :     admin
Default password          :     admin-password

You can go to http://192.168.18.151, then use the default username and password to log in.

3. Problems and solutions

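Retrying a failed Kubernetes cluster deployment runs git clone again
To skip the repeated clone, open the script below and comment out the following two lines:
vi /home/pai/contrib/kubespray/script/environment.sh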

#sudo rm -rf ${HOME}/pai-deploy/kubespray
#git clone -b release-2.11 https://github.com/kubernetes-sigs/kubespray.git ${HOME}/pai-deploy/kubespray
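
The same change can be applied with sed instead of an editor. This is a sketch under the assumption that the uncommented originals look exactly like the two lines above without the leading #; comment_out_clone is a hypothetical helper name. It keeps a .bak copy of the script and leaves lines that are already commented untouched.

```shell
# Comment out the "rm -rf" and "git clone" lines for kubespray in the given script.
# comment_out_clone is a hypothetical helper; the patterns are assumptions based on
# the two lines shown above.
comment_out_clone() {
    local file="$1"
    # The patterns are anchored at the line start, so lines that already begin
    # with "#" do not match and are not commented a second time.
    sed -E -i.bak \
        -e '/^[[:space:]]*sudo rm -rf .*pai-deploy\/kubespray/ s/^/#/' \
        -e '/^[[:space:]]*git clone .*kubespray\.git/ s/^/#/' \
        "$file"
}

# Example usage:
#   comment_out_clone /home/pai/contrib/kubespray/script/environment.sh
```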

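Files that cannot be downloaded during deployment
If cni-plugins-linux-amd64-v0.8.1.tgz and calicoctl-linux-amd64 fail to download during deployment, download them separately and upload them to the path printed in the error message (on both the master and node1 nodes).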
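
Uploading the two files to both nodes could look roughly like the sketch below. The destination directory is whatever path the error message reports, so it is passed in as an argument rather than assumed; upload_cmds is a hypothetical helper that only prints the scp commands so they can be reviewed first.

```shell
# Print one scp command per (node, file) pair.
# Usage: upload_cmds <destination-dir-from-error-message> <file>...
# upload_cmds is a hypothetical helper, introduced only for illustration.
upload_cmds() {
    local dest="$1"
    shift
    local node file
    for node in master node1; do
        for file in "$@"; do
            echo "scp ${file} root@${node}:${dest}/"
        done
    done
}

# Example usage (replace DEST with the path from the error output):
#   upload_cmds "$DEST" cni-plugins-linux-amd64-v0.8.1.tgz calicoctl-linux-amd64
# After reviewing the printed commands, append "| sh" to run them.
```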
