Kubernetes is an open-source system for managing containerized applications across multiple hosts in a cloud environment. Its goal is to make deploying containerized applications simple and powerful, and it provides mechanisms for application deployment, scheduling, updating, and maintenance.
Preparation
Docker: Kubernetes here uses Docker as its container runtime, so install Docker first.
Disable swap: run sudo swapoff -a to turn swap off for the current session; to disable it permanently, edit /etc/fstab (e.g. sudo vim /etc/fstab) and comment out the swap line.
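The two steps above can be sketched as follows (the sed pattern assumes GNU sed and an /etc/fstab swap entry containing the word swap surrounded by whitespace):

```shell
# Turn swap off for the current session
sudo swapoff -a
# Comment out swap entries in /etc/fstab so the change survives reboots
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab
# Verify: the Swap row should read 0B
free -h
```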
Shadowsocks: some of the resources needed later (Docker images) are hosted on Google's infrastructure, so you need to set up an HTTP proxy.
Shadowsocks normally exposes a SOCKS5 proxy, so we also need an HTTP-to-SOCKS5 bridge; here we use Privoxy.
First, install Privoxy:
sudo apt-get install privoxy
Configure Privoxy: open /etc/privoxy/config and append the following after the last line:
forward-socks5 / 127.0.0.1:1080 .
listen-address 127.0.0.1:8008
This forwards all requests to the local SOCKS5 proxy on port 1080, with Privoxy itself listening on port 8008.
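Instead of editing the file by hand, the two lines can be appended non-interactively, for example:

```shell
# Append the forwarding rules to Privoxy's config (same content as above)
printf '%s\n' \
  'forward-socks5 / 127.0.0.1:1080 .' \
  'listen-address 127.0.0.1:8008' \
  | sudo tee -a /etc/privoxy/config
```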
Then restart Privoxy:
sudo service privoxy restart
After that you can set
export http_proxy=http://127.0.0.1:8008
export https_proxy=http://127.0.0.1:8008
to reach those external resources. Test it with curl https://google.com; if the proxy is configured correctly you will get a response. Some tools do not honor these environment variables, so we also build proxychains-ng as a fallback:
git clone https://github.com/rofl0r/proxychains-ng.git
cd proxychains-ng
./configure --prefix=/usr --sysconfdir=/etc
make
sudo make install
sudo make install-config   # installs the proxychains.conf configuration file
To use it, prefix the command you want to proxy with proxychains4, e.g.:
proxychains4 wget https://google.com
The next step is to download and add the signing key for the Kubernetes packages:
proxychains4 curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
Configure the Kubernetes apt source. Note that sudo echo ... >> would fail here, because the redirection runs in your unprivileged shell; use tee instead:
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
Install kubeadm, kubelet, and the remaining dependencies:
sudo proxychains4 apt-get update
sudo proxychains4 apt-get install -y kubelet kubeadm kubectl kubernetes-cni
Initialize the cluster with kubeadm init
Open a terminal and set the HTTP proxy first; proxychains4 does not help here:
export http_proxy=http://127.0.0.1:8008
export https_proxy=http://127.0.0.1:8008
export no_proxy=192.168.1.118 # your machine's IP address
Docker also needs a proxy, because the images live on Google's registries. Note that there are two kinds of proxy settings, one for the Docker client and one for the Docker daemon; don't mix them up. Here we configure the daemon (it is the daemon that pulls images), pointing it at the Privoxy address above:
# Create a systemd drop-in directory for the docker service
sudo mkdir -p /etc/systemd/system/docker.service.d
# Create /etc/systemd/system/docker.service.d/http-proxy.conf with the following content
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:8008/"
# Create /etc/systemd/system/docker.service.d/https-proxy.conf with the following content
[Service]
Environment="HTTPS_PROXY=http://127.0.0.1:8008/"
# Flush the changes
sudo systemctl daemon-reload
# Restart the docker service
sudo systemctl restart docker
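To confirm the daemon actually picked up the drop-in, you can inspect the environment systemd passes to it; the output should contain the HTTP_PROXY/HTTPS_PROXY entries set above:

```shell
# Show the environment systemd passes to the docker service
sudo systemctl show --property=Environment docker
```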
Run kubeadm init. Decide which Pod network plugin you will use before running it; here we pick Calico:
kubeadm init --pod-network-cidr=172.16.0.0/16
To follow the logs during the whole process, use
journalctl -xeu kubelet
It may not succeed on the first try. Once everything is set up correctly, you can re-run kubeadm init with a flag that skips all preflight checks:
kubeadm init --pod-network-cidr=172.16.0.0/16 --ignore-preflight-errors=all
On successful initialization you will see something like:
Your Kubernetes master has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
......
Follow the prompt and run:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
List all pods with kubectl get pods --all-namespaces:
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   coredns-78fcdf6894-5h7tl               0/1     Pending   0          1h
kube-system   coredns-78fcdf6894-z7vcj               0/1     Pending   0          1h
kube-system   etcd-salamanderpc                      1/1     Running   0          1h
kube-system   kube-apiserver-salamanderpc            1/1     Running   1          1h
kube-system   kube-controller-manager-salamanderpc   1/1     Running   1          1h
kube-system   kube-proxy-brgdx                       1/1     Running   0          1h
kube-system   kube-scheduler-salamanderpc            1/1     Running   1          1h
The coredns pods are still Pending; that is expected, because we have not yet installed a Pod network plugin. Here we install Calico.
First, install the etcd instance:
kubectl apply -f \
https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/hosted/etcd.yaml
Output:
daemonset "calico-etcd" created
service "calico-etcd" created
Install Calico's RBAC roles:
kubectl apply -f \
https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/rbac.yaml
Output:
clusterrole.rbac.authorization.k8s.io "calico-kube-controllers" created
clusterrolebinding.rbac.authorization.k8s.io "calico-kube-controllers" created
clusterrole.rbac.authorization.k8s.io "calico-node" created
clusterrolebinding.rbac.authorization.k8s.io "calico-node" created
Then install Calico itself:
kubectl apply -f \
https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/hosted/calico.yaml
Output:
configmap "calico-config" created
secret "calico-etcd-secrets" created
daemonset.extensions "calico-node" created
serviceaccount "calico-node" created
deployment.extensions "calico-kube-controllers" created
serviceaccount "calico-kube-controllers" created
Wait until every pod reaches Running:
watch kubectl get pods --all-namespaces
This takes a minute or two:
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-etcd-l9zrs                          1/1     Running   0          1m
kube-system   calico-kube-controllers-65945f849d-kpndn   1/1     Running   0          1m
kube-system   calico-node-5bb4d                          2/2     Running   0          1m
kube-system   coredns-78fcdf6894-5pjcn                   1/1     Running   0          3m
kube-system   coredns-78fcdf6894-f5wtd                   1/1     Running   0          3m
kube-system   etcd-salamanderpc                          1/1     Running   0          2m
kube-system   kube-apiserver-salamanderpc                1/1     Running   0          2m
kube-system   kube-controller-manager-salamanderpc       1/1     Running   0          2m
kube-system   kube-proxy-f6kxr                           1/1     Running   0          3m
kube-system   kube-scheduler-salamanderpc                1/1     Running   0          2m
Deploy a service
This is a single-node cluster; normally you would join worker nodes to run the actual workloads. For testing, we can remove the master taint so pods schedule onto the master (never do this in production):
kubectl taint nodes --all node-role.kubernetes.io/master-
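You can verify the taint is gone; on a fresh single-node cluster, Taints should now show none:

```shell
# Show the taints on every node
kubectl describe nodes | grep -i taints
```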
Create a Deployment manifest, nginx_deployment.yaml, with the following content:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2 # tells deployment to run 2 pods matching the template
  template: # create pods using pod definition in this template
    metadata:
      # unlike pod-nginx.yaml, the name is not included in the meta data as a unique name is
      # generated from the deployment name
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.0
        ports:
        - containerPort: 80
Deployment is the newer object for managing Pods; compared with the Replication Controller it offers more complete functionality and is simpler and more convenient to use.
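For example, scaling and rolling updates become single commands against the Deployment defined above (nginx:1.14.2 here is just an illustrative newer tag):

```shell
# Scale from 2 to 3 replicas
kubectl scale deployment nginx-deployment --replicas=3
# Roll out a new image version, then watch the rollout finish
kubectl set image deployment nginx-deployment nginx=nginx:1.14.2
kubectl rollout status deployment nginx-deployment
```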
Then create the Deployment:
kubectl create -f nginx_deployment.yaml
This creates two pods, each exposing container port 80.
Check the Deployment:
kubectl get deployment
Check the created pods (there are two):
kubectl get pods
Output:
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-67594d6bf6-bwnlz   1/1     Running   0          39m
nginx-deployment-67594d6bf6-frrdx   1/1     Running   0          39m
To make the pods reachable from outside, we need to define a Service:
kind: Service
apiVersion: v1
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 9898
    targetPort: 80
Create the Service:
kubectl create -f nginx-service.yaml
This Service exposes port 9898.
Check the created Service:
kubectl get svc
Output:
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes      ClusterIP   10.96.0.1       <none>        443/TCP    ...
nginx-service   ClusterIP   10.101.10.236   <none>        9898/TCP   ...
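From the host you can now curl the ClusterIP on port 9898 (a sketch; the IP is read from the Service rather than hard-coded):

```shell
# Fetch the nginx welcome page through the Service
CLUSTER_IP=$(kubectl get svc nginx-service -o jsonpath='{.spec.clusterIP}')
curl -s "http://$CLUSTER_IP:9898" | head -n 4
```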
FFDL
Project: https://github.com/IBM/FfDL
Install Helm, then initialize it and give Tiller a service account with cluster-admin:
helm init
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
kubectl config set-context $(kubectl config current-context) --namespace=ivdai
export VM_TYPE=none
export PUBLIC_IP=   # set this to your node's public/host IP
export NAMESPACE=default
Create an NFS share for the PersistentVolume
# Create the shared directory
sudo mkdir -p /data-nfs
# Install NFS kernel server
sudo apt update
sudo apt install -y nfs-kernel-server
# Update /etc/exports
echo "/data-nfs *(rw,no_root_squash,no_subtree_check)" | sudo tee -a /etc/exports
# Restart NFS kernel server
sudo service nfs-kernel-server restart
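Before wiring the share into a PersistentVolume, you can check that the export is active:

```shell
# List the directories this host exports over NFS
showmount -e localhost
```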
Create test_pv.yaml with the following content:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
  labels:
    type: dlaas-static-volume
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteMany
  nfs:
    path: /data-nfs
    server: 192.168.8.110
kubectl create -f test_pv.yaml
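Check that the volume was registered; it should show STATUS Available until a claim binds it:

```shell
# Inspect the PersistentVolume defined above
kubectl get pv pv0001
```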
Install the FfDL Helm chart from the repository root:
helm install .
kubectl config set-context $(kubectl config current-context) --namespace=$NAMESPACE
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# alertmanager-7cf6b988b9-h9q6q 1/1 Running 0 5h
# etcd0 1/1 Running 0 5h
# ffdl-lcm-65bc97bcfd-qqkfc 1/1 Running 0 5h
# ffdl-restapi-8777444f6-7jfcf 1/1 Running 0 5h
# ffdl-trainer-768d7d6b9-4k8ql 1/1 Running 0 5h
# ffdl-trainingdata-866c8f48f5-ng27z 1/1 Running 0 5h
# ffdl-ui-5bf86cc7f5-zsqv5 1/1 Running 0 5h
# mongo-0 1/1 Running 0 5h
# prometheus-5f85fd7695-6dpt8 2/2 Running 0 5h
# pushgateway-7dd8f7c86d-gzr2g 2/2 Running 0 5h
# storage-0 1/1 Running 0 5h
node_ip=$PUBLIC_IP
grafana_port=$(kubectl get service grafana -o jsonpath='{.spec.ports[0].nodePort}')
ui_port=$(kubectl get service ffdl-ui -o jsonpath='{.spec.ports[0].nodePort}')
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
s3_port=$(kubectl get service minio -o jsonpath='{.spec.ports[0].nodePort}')
echo "Monitoring dashboard: http://$node_ip:$grafana_port/ (login: admin/admin)"
echo "Web UI: http://$node_ip:$ui_port/#/login?endpoint=$node_ip:$restapi_port&username=test-user"
Using FfDL Local S3 Based Object Storage
node_ip=$PUBLIC_IP
s3_port=$(kubectl get service minio -o jsonpath='{.spec.ports[0].nodePort}')
s3_url=http://$node_ip:$s3_port
export AWS_ACCESS_KEY_ID=admin; export AWS_SECRET_ACCESS_KEY=password; export AWS_DEFAULT_REGION=us-east-1;
s3cmd="aws --endpoint-url=$s3_url s3"
$s3cmd mb s3://trainingdata
$s3cmd mb s3://trainedmodel
$s3cmd mb s3://mnist_lmdb_data
$s3cmd mb s3://dlaas-trained-models
mkdir -p tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
$s3cmd cp tmp/$file s3://trainingdata/$file
done
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
if [ "$(uname)" = "Darwin" ]; then
  sed -i '' "s/s3.default.svc.cluster.local/$node_ip:$s3_port/" etc/examples/tf-model/manifest.yml
else
  sed -i "s/s3.default.svc.cluster.local/$node_ip:$s3_port/" etc/examples/tf-model/manifest.yml
fi
CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
Using Cloud Object Storage
export AWS_ACCESS_KEY_ID=mos
export AWS_SECRET_ACCESS_KEY=mos
s3_url=http://120.79.11.211:8080
s3cmd="aws --endpoint-url=$s3_url s3"
trainingDataBucket=
trainingResultBucket=
$s3cmd mb s3://$trainingDataBucket
$s3cmd mb s3://$trainingResultBucket
mkdir -p tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
$s3cmd cp tmp/$file s3://$trainingDataBucket/$file
done
if [ "$(uname)" = "Darwin" ]; then
  sed -i '' "s/tf_training_data/$trainingDataBucket/" etc/examples/tf-model/manifest.yml
  sed -i '' "s/tf_trained_model/$trainingResultBucket/" etc/examples/tf-model/manifest.yml
  sed -i '' "s/s3.default.svc.cluster.local/$node_ip:$s3_port/" etc/examples/tf-model/manifest.yml
  sed -i '' "s/user_name: test/user_name: $AWS_ACCESS_KEY_ID/" etc/examples/tf-model/manifest.yml
  sed -i '' "s/password: test/password: $AWS_SECRET_ACCESS_KEY/" etc/examples/tf-model/manifest.yml
else
  sed -i "s/tf_training_data/$trainingDataBucket/" etc/examples/tf-model/manifest.yml
  sed -i "s/tf_trained_model/$trainingResultBucket/" etc/examples/tf-model/manifest.yml
  sed -i "s/s3.default.svc.cluster.local/$node_ip:$s3_port/" etc/examples/tf-model/manifest.yml
  sed -i "s/user_name: test/user_name: $AWS_ACCESS_KEY_ID/" etc/examples/tf-model/manifest.yml
  sed -i "s/password: test/password: $AWS_SECRET_ACCESS_KEY/" etc/examples/tf-model/manifest.yml
fi
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
# Obtain the correct CLI for your machine and run the training job with our default TensorFlow model
CLI_CMD=cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/tf-model/manifest.yml etc/examples/tf-model
To rebuild and redeploy the REST API component:
docker build -q -t docker.io/ffdl/ffdl-restapi:user-lyh .
(cd ./restapi/ && (test ! -e main.go || CGO_ENABLED=0 GOOS=linux go build -ldflags "-s -w" -a -installsuffix cgo -o bin/main))
kubectl set image deploy ffdl-restapi ffdl-restapi-container=ffdl/ffdl-restapi:v0.1.1