tiup单机扩展多pd报错

配置环境:

os:CentOS Linux release 7.4.1708 (Core)

tiup: tiup cluster v0.6.0

tidb: v3.0.9

拓扑结构:

Starting component `cluster`: /root/.tiup/components/cluster/v0.6.0/cluster display test_cluster
TiDB Cluster: test_cluster
TiDB Version: v3.0.9
ID                   Role        Host           Ports        Status     Data Dir                                       Deploy Dir
--                   ----        ----           -----        ------     --------                                       ----------
172.21.141.61:3000   grafana     172.21.141.61  3000         Up         -                                              /apps/tidb/test_cluster/deploy/grafana-3000
172.21.141.61:2379   pd          172.21.141.61  2379/2380    Healthy    i/apps/tidb/test_cluster/data/pd-2379          /apps/tidb/test_cluster/deploy/pd-2379
172.21.141.61:2381   pd          172.21.141.61  2381/2382    Healthy|L  i/apps/tidb/test_cluster/data/pd-2381          /apps/tidb/test_cluster/deploy/pd-2381
172.21.141.61:2383   pd          172.21.141.61  2383/2384    Healthy    i/apps/tidb/test_cluster/data/pd-2383          /apps/tidb/test_cluster/deploy/pd-2383
172.21.141.61:9090   prometheus  172.21.141.61  9090         Up         i/apps/tidb/test_cluster/data/prometheus-9090  /apps/tidb/test_cluster/deploy/prometheus-9090
172.21.141.61:4000   tidb        172.21.141.61  4000/10080   Up         -                                              /apps/tidb/test_cluster/deploy/tidb-4000
172.21.141.61:20160  tikv        172.21.141.61  20160/20180  Up         i/apps/tidb/test_cluster/data/tikv-20160       /apps/tidb/test_cluster/deploy/tikv-20160
172.21.141.61:20161  tikv        172.21.141.61  20161/20181  Up         i/apps/tidb/test_cluster/data/tikv-20161       /apps/tidb/test_cluster/deploy/tikv-20161
172.21.141.61:20162  tikv        172.21.141.61  20162/20182  Up         i/apps/tidb/test_cluster/data/tikv-20162       /apps/tidb/test_cluster/deploy/tikv-20162
172.21.141.61:20163  tikv        172.21.141.61  20163/20183  Up         i/apps/tidb/test_cluster/data/tikv-20163       /apps/tidb/test_cluster/deploy/tikv-20163

scale-out配置文件

[root@dbatest05 tiup]# cat scale_out_pd.yaml
# # Global variables are applied to all deployments and used as the default value of
# # the deployments if a specific deployment value is missing.
global:
 user: "tidb"
 ssh_port: 22
 deploy_dir: "/apps/tidb/test_cluster/deploy"
 data_dir: "i/apps/tidb/test_cluster/data"

# # Monitored variables are applied to all the machines.
monitored:
 node_exporter_port: 9100
 blackbox_exporter_port: 9115

server_configs:
 tidb:
   log.slow-threshold: 300
 tikv:
   readpool.storage.use-unified-pool: false
   readpool.coprocessor.use-unified-pool: true
 pd:
   replication.enable-placement-rules: true
 tiflash:
   logger.level: "info"

pd_servers:
 - host: 172.21.141.61
   client_port: 2385
   peer_port: 2386

执行命令

tiup cluster scale-out test_cluster test_cluster_scale_out_pd.yaml -y

部署日志

tiup cluster scale-in test_cluster -N 172.21.141.61:2385
Starting component `cluster`: /root/.tiup/components/cluster/v0.6.0/cluster scale-in test_cluster -N 172.21.141.61:2385
This operation will delete the 172.21.141.61:2385 nodes in `test_cluster` and all their data.
Do you want to continue? [y/N]: y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/test_cluster/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/test_cluster/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles:[] Nodes:[172.21.141.61:2385] Force:false Timeout:300}
Stopping component pd
	Stopping instance 172.21.141.61
	Stop pd 172.21.141.61:2385 success
Destroying component pd
Destroying instance 172.21.141.61
Deleting paths on 172.21.141.61: i/apps/tidb/test_cluster/data/pd-2385 /apps/tidb/test_cluster/deploy/pd-2385 /apps/tidb/test_cluster/deploy/pd-2385/log /etc/systemd/system/pd-2385.service
Destroy 172.21.141.61 success
+ [ Serial ] - UpdateMeta: cluster=test_cluster, deleted=`'172.21.141.61:2385'`
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/tikv-20160.service, deploy_dir=/apps/tidb/test_cluster/deploy/tikv-20160, data_dir=i/apps/tidb/test_cluster/data/tikv-20160, log_dir=/apps/tidb/test_cluster/deploy/tikv-20160/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/tikv-20161.service, deploy_dir=/apps/tidb/test_cluster/deploy/tikv-20161, data_dir=i/apps/tidb/test_cluster/data/tikv-20161, log_dir=/apps/tidb/test_cluster/deploy/tikv-20161/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/pd-2381.service, deploy_dir=/apps/tidb/test_cluster/deploy/pd-2381, data_dir=i/apps/tidb/test_cluster/data/pd-2381, log_dir=/apps/tidb/test_cluster/deploy/pd-2381/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/pd-2383.service, deploy_dir=/apps/tidb/test_cluster/deploy/pd-2383, data_dir=i/apps/tidb/test_cluster/data/pd-2383, log_dir=/apps/tidb/test_cluster/deploy/pd-2383/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/pd-2379.service, deploy_dir=/apps/tidb/test_cluster/deploy/pd-2379, data_dir=i/apps/tidb/test_cluster/data/pd-2379, log_dir=/apps/tidb/test_cluster/deploy/pd-2379/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/grafana-3000.service, deploy_dir=/apps/tidb/test_cluster/deploy/grafana-3000, data_dir=, log_dir=/apps/tidb/test_cluster/deploy/grafana-3000/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/tidb-4000.service, deploy_dir=/apps/tidb/test_cluster/deploy/tidb-4000, data_dir=, log_dir=/apps/tidb/test_cluster/deploy/tidb-4000/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/tikv-20162.service, deploy_dir=/apps/tidb/test_cluster/deploy/tikv-20162, data_dir=i/apps/tidb/test_cluster/data/tikv-20162, log_dir=/apps/tidb/test_cluster/deploy/tikv-20162/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/tikv-20163.service, deploy_dir=/apps/tidb/test_cluster/deploy/tikv-20163, data_dir=i/apps/tidb/test_cluster/data/tikv-20163, log_dir=/apps/tidb/test_cluster/deploy/tikv-20163/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/tidb-4001.service, deploy_dir=/apps/tidb/test_cluster/deploy/tidb-4001, data_dir=, log_dir=/apps/tidb/test_cluster/deploy/tidb-4001/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
+ [ Serial ] - InitConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, path=/root/.tiup/storage/cluster/clusters/test_cluster/config/prometheus-9090.service, deploy_dir=/apps/tidb/test_cluster/deploy/prometheus-9090, data_dir=i/apps/tidb/test_cluster/data/prometheus-9090, log_dir=/apps/tidb/test_cluster/deploy/prometheus-9090/log, cache_dir=/root/.tiup/storage/cluster/clusters/test_cluster/config
Scaled cluster `test_cluster` in successfully
[root@dbatest05 tiup]#
[root@dbatest05 tiup]#
[root@dbatest05 tiup]# tiup cluster scale-out test_cluster ./scale_out_pd.yaml -y
Starting component `cluster`: /root/.tiup/components/cluster/v0.6.0/cluster scale-out test_cluster ./scale_out_pd.yaml -y
+ [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/test_cluster/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/test_cluster/ssh/id_rsa.pub
  - Download pd:v3.0.9 ... Done

  - Download blackbox_exporter:v0.12.0 ... Done
+ [ Serial ] - UserSSH: user=tidb, host=172.21.141.61
+ [ Serial ] - Mkdir: host=172.21.141.61, directories='/apps/tidb/test_cluster/deploy/pd-2385','i/apps/tidb/test_cluster/data/pd-2385','/apps/tidb/test_cluster/deploy/pd-2385/log','/apps/tidb/test_cluster/deploy/pd-2385/bin','/apps/tidb/test_cluster/deploy/pd-2385/conf','/apps/tidb/test_cluster/deploy/pd-2385/scripts'
+ [ Serial ] - CopyComponent: component=pd, version=v3.0.9, remote=172.21.141.61:/apps/tidb/test_cluster/deploy/pd-2385
+ [ Serial ] - ScaleConfig: cluster=test_cluster, user=tidb, host=172.21.141.61, service=pd-2385.service, deploy_dir=/apps/tidb/test_cluster/deploy/pd-2385, data_dir=i/apps/tidb/test_cluster/data/pd-2385, log_dir=/apps/tidb/test_cluster/deploy/pd-2385/log, cache_dir=
script path: /root/.tiup/storage/cluster/clusters/test_cluster/config/run_pd_172.21.141.61_2385.sh
script path: /root/.tiup/components/cluster/v0.6.0/templates/scripts/run_pd_scale.sh.tpl
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [ Serial ] - ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false Timeout:0}
Starting component pd
	Starting instance pd 172.21.141.61:2383
	Starting instance pd 172.21.141.61:2379
	Starting instance pd 172.21.141.61:2381
	Start pd 172.21.141.61:2383 success
	Start pd 172.21.141.61:2381 success
	Start pd 172.21.141.61:2379 success
Starting component node_exporter
	Starting instance 172.21.141.61
	Start 172.21.141.61 success
Starting component blackbox_exporter
	Starting instance 172.21.141.61
	Start 172.21.141.61 success
Starting component tikv
	Starting instance tikv 172.21.141.61:20163
	Starting instance tikv 172.21.141.61:20160
	Starting instance tikv 172.21.141.61:20161
	Starting instance tikv 172.21.141.61:20162
	Start tikv 172.21.141.61:20161 success
	Start tikv 172.21.141.61:20163 success
	Start tikv 172.21.141.61:20162 success
	Start tikv 172.21.141.61:20160 success
Starting component tidb
	Starting instance tidb 172.21.141.61:4001
	Starting instance tidb 172.21.141.61:4000
	Start tidb 172.21.141.61:4001 success
	Start tidb 172.21.141.61:4000 success
Starting component prometheus
	Starting instance prometheus 172.21.141.61:9090
	Start prometheus 172.21.141.61:9090 success
Starting component grafana
	Starting instance grafana 172.21.141.61:3000
	Start grafana 172.21.141.61:3000 success
Checking service state of pd
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:29:52 CST; 11min ago
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:29:50 CST; 11min ago
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:29:48 CST; 11min ago
Checking service state of tikv
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:29:57 CST; 11min ago
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:30:58 CST; 10min ago
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:29:59 CST; 11min ago
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:30:01 CST; 11min ago
Checking service state of tidb
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:31:01 CST; 10min ago
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:30:59 CST; 10min ago
Checking service state of prometheus
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:31:02 CST; 10min ago
Checking service state of grafana
	172.21.141.61	   Active: active (running) since Fri 2020-05-15 11:31:02 CST; 10min ago
+ [Parallel] - UserSSH: user=tidb, host=172.21.141.61
+ [ Serial ] - save meta
+ [ Serial ] - ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false Timeout:0}
Starting component pd
	Starting instance pd 172.21.141.61:2385
	pd 172.21.141.61:2385 failed to start: timed out waiting for port 2385 to be started after 1m0s, please check the log of the instance

Error: failed to start: failed to start pd: 	pd 172.21.141.61:2385 failed to start: timed out waiting for port 2385 to be started after 1m0s, please check the log of the instance: timed out waiting for port 2385 to be started after 1m0s

Verbose debug logs has been written to /apps/tiup/logs/tiup-cluster-debug-2020-05-15-11-42-42.log.
Error: run `/root/.tiup/components/cluster/v0.6.0/cluster` (wd:/root/.tiup/data/Rz1bA32) failed: exit status 1

报错信息

2020-05-15T11:42:42.409+0800    ERROR           pd 172.21.141.61:2385 failed to start: timed out waiting for port 2385 to be started after 1m0s, please check the log of the instance
2020-05-15T11:42:42.409+0800    DEBUG   TaskFinish      {"task": "ClusterOperate: operation=StartOperation, options={Roles:[] Nodes:[] Force:false Timeout:0}", "error": "failed to start: failed to start pd: \tpd 172.21.141.61:2385 failed to start: timed out waiting for port 2385 to be started after 1m0s, please check the log of the instance: timed out waiting for port 2385 to be started after 1m0s", "errorVerbose": "timed out waiting for port 2385 to be started after 1m0s\ngithub.com/pingcap-incubator/tiup-cluster/pkg/module.(*WaitFor).Execute\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/module/wait_for.go:89\ngithub.com/pingcap-incubator/tiup-cluster/pkg/meta.PortStarted\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/meta/logic.go:106\ngithub.com/pingcap-incubator/tiup-cluster/pkg/meta.(*instance).Ready\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/meta/logic.go:135\ngithub.com/pingcap-incubator/tiup-cluster/pkg/operation.startInstance\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/operation/action.go:421\ngithub.com/pingcap-incubator/tiup-cluster/pkg/operation.StartComponent.func1\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/operation/action.go:454\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\n\tpd 172.21.141.61:2385 failed to start: timed out waiting for port 2385 to be started after 1m0s, please check the log of the instance\nfailed to start pd\nfailed to start"}
2020-05-15T11:42:42.409+0800    INFO    Execute command finished        {"code": 1, "error": "failed to start: failed to start pd: \tpd 172.21.141.61:2385 failed to start: timed out waiting for port 2385 to be started after 1m0s, please check the log of the instance: timed out waiting for port 2385 to be started after 1m0s", "errorVerbose": "timed out waiting for port 2385 to be started after 1m0s\ngithub.com/pingcap-incubator/tiup-cluster/pkg/module.(*WaitFor).Execute\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/module/wait_for.go:89\ngithub.com/pingcap-incubator/tiup-cluster/pkg/meta.PortStarted\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/meta/logic.go:106\ngithub.com/pingcap-incubator/tiup-cluster/pkg/meta.(*instance).Ready\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/meta/logic.go:135\ngithub.com/pingcap-incubator/tiup-cluster/pkg/operation.startInstance\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/operation/action.go:421\ngithub.com/pingcap-incubator/tiup-cluster/pkg/operation.StartComponent.func1\n\t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/operation/action.go:454\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\n\tpd 172.21.141.61:2385 failed to start: timed out waiting for port 2385 to be started after 1m0s, please check the log of the instance\nfailed to start pd\nfailed to start"}

问题分析

查看部署路径/apps/tidb/test_cluster/deploy/pd-2385已经包含,pd无法启动起来。排查问题发现/apps/tidb/test_cluster/deploy/pd-2385/scripts中的run_pd.sh脚本有问题

tiup单机扩展多pd报错_第1张图片

问题解决

1)将该name改为pd-172.21.141.61-2385,执行脚本/apps/tidb/test_cluster/deploy/pd-2385/scripts/run_pd.sh 即可成功启动。

2)tiup cluster edit-config test_cluster 手工添加pd-2385的配置信息

- host: 172.21.141.61
  ssh_port: 22
  name: pd-172.21.141.61-2385
  client_port: 2385
  peer_port: 2386
  deploy_dir: /apps/tidb/test_cluster/deploy/pd-2385
  data_dir: i/apps/tidb/test_cluster/data/pd-2385

3) reload cluster

后续

在单机扩展tidb和tikv的时候没有遇到该问题。

考虑是脚本在生成pd name的时候出现问题。

在scale-out脚本中手工指定pd-name同样会报错,run_pd.sh中的name怀疑是按照pd同步数据端口来自动生成,会造成名称重复,应该是tiup本身bug。

 

 

你可能感兴趣的:(tidb)