The Definitive Guide to ChaosBlade

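If you are new to chaos experiments or to ChaosBlade, read the following three articles first.

1. Chaos Engineering with Alibaba's ChaosBlade

2. How to Simulate Saturating the CPU and I/O

3. In Depth | A 20,000-Character Walkthrough of ChaosBlade, Alibaba's Chaos Testing Tool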

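I. Introduction

ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments. It builds on nearly ten years of fault testing and drill practice at Alibaba, incorporates the best ideas and practices from the group's business units, and provides a rich set of fault scenarios to help distributed systems improve their fault tolerance and recoverability.

ChaosBlade can be compiled and run directly, and its CLI command hints make chaos experiments simpler to run. The currently supported drill scenarios include operating-system faults (CPU, disk, process, network), Java application faults (Dubbo, MySQL, Servlet, and delays or exceptions injected into custom class methods), and killing containers or Pods.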

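II. What Problems Can ChaosBlade Solve?

1) Measuring the fault tolerance of microservices

By simulating call latency, service unavailability, fully loaded machine resources, and similar faults, you can check whether the failing node or instance is automatically isolated and taken offline, whether traffic is scheduled correctly, and whether contingency plans work, while observing whether overall system QPS or RT is affected. From there, gradually widen the set of faulty nodes to verify that rate limiting, degradation, and circuit breaking in upstream services are effective. Keep adding faulty nodes until requests to the service time out; that point gives an estimate of the system's fault-tolerance red line and a measure of its fault-tolerance capability.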

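2) Verifying that container orchestration is configured sensibly

By simulating killed service Pods, killed nodes, and increased Pod resource load, you can observe service availability and verify whether the replica configuration, the resource limits, and the containers deployed in each Pod are reasonable.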

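3) Testing the robustness of the PaaS layer

Simulate load on upper-layer resources to verify that the scheduling system is effective; simulate unavailability of the distributed storage it depends on to verify the system's fault tolerance; simulate unavailable scheduling nodes to test whether scheduled tasks migrate automatically to available nodes; simulate primary or standby node failures to test whether failover works correctly.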

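4) Verifying the timeliness of monitoring and alerting

By injecting faults into the system, verify whether monitoring metrics are accurate, monitoring dimensions are complete, alert thresholds are reasonable, alerts fire quickly, alert recipients are correct, and notification channels work, thereby improving the accuracy and timeliness of monitoring and alerting.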

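5) Building the emergency-response ability to locate and resolve problems

Through surprise fault drills that inject faults into the system at random, assess how well the people involved respond to problems and whether the escalation and handling processes are reasonable. Training through real incidents in this way builds the team's ability to locate and resolve problems.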

III. ChaosBlade Usage Guide

(1) Get the latest ChaosBlade release package. The currently supported platforms are linux/amd64 and darwin/amd64; download the package for your platform.

wget https://github.com/chaosblade-io/chaosblade/releases/download/v0.3.0/chaosblade-0.3.0.linux-amd64.tar.gz

Once the download finishes, just extract the archive; no compilation is needed. The extracted directory looks like this:

├── bin
│   ├── chaos_burncpu
│   ├── chaos_burnio
│   ├── chaos_changedns
│   ├── chaos_delaynetwork
│   ├── chaos_dropnetwork
│   ├── chaos_filldisk
│   ├── chaos_killprocess
│   ├── chaos_lossnetwork
│   ├── chaos_stopprocess
│   ├── cplus-chaosblade.spec.yaml
│   ├── jvm.spec.yaml
│   └── tools.jar
├── blade
└── lib
    ├── cplus
    │   ├── chaosblade-exec-cplus.jar
    │   └── script
    │       ├── shell_break_and_return_attach.sh
    │       ├── shell_break_and_return.sh
    │       ├── shell_check_process_duplicate.sh
    │       ├── shell_check_process_id.sh
    │       ├── shell_initialization.sh
    │       ├── shell_modify_variable_attch.sh
    │       ├── shell_modify_variable.sh
    │       ├── shell_remove_process.sh
    │       ├── shell_response_delay_attach.sh
    │       └── shell_response_delay.sh
    └── sandbox
        ├── bin
        │   └── sandbox.sh
        ├── cfg
        │   ├── sandbox-logback.xml
        │   ├── sandbox.properties
        │   └── version
        ├── example
        │   └── sandbox-debug-module.jar
        ├── install-local.sh
        ├── lib
        │   ├── sandbox-agent.jar
        │   ├── sandbox-core.jar
        │   └── sandbox-spy.jar
        ├── module
        │   ├── chaosblade-java-agent-0.2.0.jar
        │   └── sandbox-mgr-module.jar
        └── provider
            └── sandbox-mgr-provider.jar
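
The download-and-extract step described above can be sketched as two small shell functions. This is a minimal sketch, not an official installer: the function names chaosblade_url and install_chaosblade are made up for illustration, and the URL pattern is generalized from the single v0.3.0 linux-amd64 link quoted above, so other versions or platforms following the same pattern is an assumption.

```shell
#!/bin/sh
# Build the release URL for a version and platform.
# Assumption: every release follows the pattern of the v0.3.0 link in this article.
chaosblade_url() {
    version="$1"     # e.g. 0.3.0
    platform="$2"    # e.g. linux-amd64
    echo "https://github.com/chaosblade-io/chaosblade/releases/download/v${version}/chaosblade-${version}.${platform}.tar.gz"
}

# Download the archive into the current directory and unpack it.
install_chaosblade() {
    url="$(chaosblade_url "$1" "$2")"
    archive="${url##*/}"              # file name part of the URL
    wget "$url" && tar -xzf "$archive"
}

# Usage (nothing is downloaded until you call it):
#   install_chaosblade 0.3.0 linux-amd64
```

Keeping the URL construction separate from the download makes it easy to check the link before anything touches the network.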


Here blade is the executable, that is, the chaosblade CLI used to run chaos experiments. Run ./blade help to see the supported commands.

The blade commands are:

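    prepare: short form p. Prepares for a chaos experiment; for example, drilling a Java application requires attaching the java agent. To drill an application named business, run blade p jvm --process business on the target host. If the attach succeeds, the command returns a uid that is used to query the status or to revoke the attachment.

    revoke: short form r. Revokes an earlier experiment preparation, for example detaching the java agent. The command is blade revoke UID.

    create: short form c. Creates a chaos experiment, that is, performs the fault injection. The command is blade create [TARGET] [ACTION] [FLAGS]. For example, to delay Dubbo consumer calls to the xxx.xxx.Service interface by 3 s, run blade create dubbo delay --consumer --time 3000 --service xxx.xxx.Service. If the injection succeeds, the command returns the experiment's uid, which is used to query the status and to destroy the experiment.

    destroy: short form d. Destroys an earlier chaos experiment, for example the Dubbo delay experiment above. The command is blade destroy UID.

    status: short form s. Queries the status of a preparation or an experiment. The command is blade status UID or blade status --type create.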

Help for any of the commands above is available through blade help [COMMAND].
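
The create / destroy pairing described above lends itself to a small wrapper that captures the uid and cleans up afterwards. The following is a sketch under stated assumptions: extract_result and run_experiment are hypothetical helper names, the BLADE path must be adjusted to wherever the archive was unpacked, and the JSON is parsed with sed only because blade's reply is a single short line of the form shown later in this article.

```shell
#!/bin/sh
# Path to the blade binary (assumption: adjust to your extraction directory).
BLADE="${BLADE:-./blade}"

# Pull the "result" field out of blade's one-line JSON reply, e.g.
# {"code":200,"success":true,"result":"d9e3879cb68416a2"}
# Prints nothing when the reply has no "result" field (as in error replies).
extract_result() {
    sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Create an experiment, let it run for a while, then destroy it.
# Usage: run_experiment SECONDS TARGET ACTION [FLAGS...]
run_experiment() {
    duration="$1"; shift
    reply="$("$BLADE" create "$@")" || { echo "create failed: $reply" >&2; return 1; }
    uid="$(printf '%s\n' "$reply" | extract_result)"
    [ -n "$uid" ] || { echo "no uid in reply: $reply" >&2; return 1; }
    echo "experiment $uid started"
    sleep "$duration"
    "$BLADE" destroy "$uid"
}

# Usage (not run automatically):
#   run_experiment 60 dubbo delay --consumer --time 3000 --service xxx.xxx.Service
```

Destroying the experiment from the same function that created it means the uid never has to be copied by hand.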

(2) Which experiments can blade run?

To see which experiments blade supports, run blade create -h:

Create a chaos engineering experiment
 
Usage:
  blade create [command]
 
Aliases:
  create, c
 
Examples:
create dubbo delay --time 3000 --offset 100 --service com.example.Service --consumer
 
Available Commands:
  cplus       c++ experiment
  cpu         Cpu experiment
  disk        Disk experiment
  docker      Execute a docker experiment
  druid       Druid experiment
  dubbo       dubbo experiment
  http        http experiment
  jvm         method
  k8s         Kubernetes experiment
  mysql       mysql experiment
  network     Network experiment
  process     Process experiment
  rocketmq    Rocketmq experiment,can make message send or pull delay and exception
  script      Script chaos experiment
  servlet     java servlet experiment
 
Flags:
  -h, --help   help for create
 
Global Flags:
  -d, --debug   Set client to DEBUG mode
 
Use "blade create [command] --help" for more information about a command.

 

(3) Usage example

This example demonstrates a fault that drives CPU usage to 100%, using the blade create cpu fullload command. The usage of blade create cpu is as follows:

hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade create cpu -h
Cpu experiment, for example full load
 
Usage:
  blade create cpu [flags]
  blade create cpu [command]
 
Examples:
cpu fullload
 
Available Commands:
  fullload    cpu fullload
 
Flags:
      --cpu-count string   Cpu count
      --cpu-list string    CPUs in which to allow burning (0-3 or 1,3)
  -h, --help               help for cpu
 
Global Flags:
  -d, --debug   Set client to DEBUG mode
 
Use "blade create cpu [command] --help" for more information about a command.

Run the experiment:

hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade create cpu fullload
{"code":200,"success":true,"result":"d9e3879cb68416a2"}

Note the value d9e3879cb68416a2 in the result field above; it is the experiment UID you will need to stop the experiment (./blade destroy UID).

Use iostat -c 1 1000 to check CPU usage (the %idle column):

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              98.75    0.00    1.25    0.00    0.00    0.00

CPU usage can also be checked with sar, top, and similar commands.

The fault is now in effect. Next, stop the experiment by running:

hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade destroy d9e3879cb68416a2
{"code":200,"success":true,"result":"command: cpu fullload --debug false --help false"}

Check the CPU again; the load has returned to normal:

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.25    0.00    0.50    2.00    0.00   97.25

This completes one CPU full-load fault drill. The other commands are left for the reader to try.
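
The before-and-after check with iostat can also be scripted rather than read by eye. The sketch below assumes the iostat -c layout shown above, where %idle is the last column of the line following the avg-cpu header; cpu_idle and cpu_is_saturated are hypothetical helper names, and the 5% threshold in the usage line is an arbitrary choice.

```shell
#!/bin/sh
# Print the %idle value from the most recent avg-cpu report on stdin.
# Assumes %idle is the last column, as in the iostat output above.
cpu_idle() {
    awk '/avg-cpu/ { getline; idle = $NF } END { if (idle != "") print idle }'
}

# Succeed when idle CPU is at or below THRESHOLD percent.
# Usage: iostat -c 1 5 | cpu_is_saturated THRESHOLD
cpu_is_saturated() {
    threshold="$1"
    idle="$(cpu_idle)"
    [ -n "$idle" ] && awk -v i="$idle" -v t="$threshold" 'BEGIN { exit !(i + 0 <= t + 0) }'
}

# Usage (not run automatically):
#   iostat -c 1 5 | cpu_is_saturated 5 && echo "cpu fullload is in effect"
```

Because only the last report is used, the first iostat sample (which covers the time since boot) does not skew the result when several samples are requested.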

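(4) Viewing execution history

If you have forgotten the uid and so cannot destroy an experiment, use the following command to view the history: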

[dev@hua1-dev ~]$ ./chaosblade/blade  status --type create

{
	"code": 200,
	"success": true,
	"result": [
		{
			"Uid": "77d533cdb8b61d07",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false",
			"Status": "Error",
			"Error": "{\"code\":604,\"success\":false,\"error\":\"mount: only root can do that\\n exit status 1\"} exit status 1",
			"CreateTime": "2019-12-02T11:01:27.059036062+08:00",
			"UpdateTime": "2019-12-02T11:01:27.112947513+08:00"
		},
		{
			"Uid": "fc3ff5dbcc3d8287",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T11:01:37.653829453+08:00",
			"UpdateTime": "2019-12-02T11:02:41.072792303+08:00"
		},
		{
			"Uid": "ff941d81a1bfc583",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T11:02:44.433088896+08:00",
			"UpdateTime": "2019-12-02T11:02:44.473296047+08:00"
		},
		{
			"Uid": "b1a2a18a9d7d2209",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false --timeout 120",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T11:03:19.26821251+08:00",
			"UpdateTime": "2019-12-02T11:03:19.306998571+08:00"
		},
		{
			"Uid": "ade8fd251c14c3ac",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--debug false --help false --mem-percent 20 --timeout 60",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T11:44:12.402871862+08:00",
			"UpdateTime": "2019-12-02T11:44:24.764721576+08:00"
		},
		{
			"Uid": "42f18ab2a9f647df",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--timeout 60 --debug false --help false --mem-percent 50",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T11:44:39.03004387+08:00",
			"UpdateTime": "2019-12-02T11:44:46.068383049+08:00"
		},
		{
			"Uid": "c4bd47a436c32f8a",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 4 --debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T17:57:21.848821251+08:00",
			"UpdateTime": "2019-12-02T17:57:22.954051573+08:00"
		},
		{
			"Uid": "b3c530b53ce081e7",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T17:58:09.649468604+08:00",
			"UpdateTime": "2019-12-02T17:58:10.751782523+08:00"
		},
		{
			"Uid": "a8606090e6f3bf4d",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T18:14:25.552265716+08:00",
			"UpdateTime": "2019-12-02T18:14:26.57991441+08:00"
		},
		{
			"Uid": "78d3d6004851c58e",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 4 --debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T18:15:38.276979861+08:00",
			"UpdateTime": "2019-12-02T18:15:39.361869497+08:00"
		},
		{
			"Uid": "d1fe9d0df56ffd38",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-02T18:38:55.754252838+08:00",
			"UpdateTime": "2019-12-02T18:38:56.875084906+08:00"
		},
		{
			"Uid": "44e3083833a1d74a",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 4 --debug false --help false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-02T18:47:57.880120218+08:00",
			"UpdateTime": "2019-12-02T18:53:37.707679493+08:00"
		},
		{
			"Uid": "bda3f35a7ca8ea16",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 4",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T14:45:18.783440839+08:00",
			"UpdateTime": "2019-12-03T14:57:18.823532704+08:00"
		},
		{
			"Uid": "99a137ba58396e60",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T14:59:01.13288346+08:00",
			"UpdateTime": "2019-12-03T15:00:20.17680665+08:00"
		},
		{
			"Uid": "bcb4c1b17d445f55",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--size 20000 --debug false --help false",
			"Status": "Error",
			"Error": "dd: \ufffd\ufffd\ufffd诖\ufffd\ufffd\ufffd\"/chaos_filldisk.log.dat\": 权\ufffd薏\ufffd\ufffd\ufffd\n exit status 1 exit status 1",
			"CreateTime": "2019-12-03T15:00:46.651558522+08:00",
			"UpdateTime": "2019-12-03T15:00:46.712107819+08:00"
		},
		{
			"Uid": "0f60263c7b830b58",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:00:52.562074748+08:00",
			"UpdateTime": "2019-12-03T15:03:45.571227777+08:00"
		},
		{
			"Uid": "07486dcb6b8e1804",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:03:59.885515096+08:00",
			"UpdateTime": "2019-12-03T15:05:44.380778022+08:00"
		},
		{
			"Uid": "5f2c5c0353470b66",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:09:05.204692014+08:00",
			"UpdateTime": "2019-12-03T15:11:25.112595984+08:00"
		},
		{
			"Uid": "33019d022f93a58e",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:12:05.644988155+08:00",
			"UpdateTime": "2019-12-03T15:15:46.775748998+08:00"
		},
		{
			"Uid": "ae888993f31e9aeb",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T15:16:17.115550065+08:00",
			"UpdateTime": "2019-12-03T15:20:50.007686126+08:00"
		},
		{
			"Uid": "d24dd9239902eb6f",
			"Command": "network",
			"SubCommand": "delay",
			"Flag": "--time 3 --debug false --help false --interface eth0 --local-port 6396 --remote-port 6396",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T16:17:19.098079716+08:00",
			"UpdateTime": "2019-12-03T16:19:00.02772809+08:00"
		},
		{
			"Uid": "6aa70b124dce79f3",
			"Command": "network",
			"SubCommand": "delay",
			"Flag": "--help false --interface eth0 --local-port 6396 --remote-port 6396 --time 3000 --debug false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-03T16:21:33.527410693+08:00",
			"UpdateTime": "2019-12-03T16:54:47.535580171+08:00"
		},
		{
			"Uid": "5d838cdd1584c7f0",
			"Command": "mem",
			"SubCommand": "load",
			"Flag": "--timeout 2 --debug false --help false --mem-percent 2",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-04T10:03:56.301575979+08:00",
			"UpdateTime": "2019-12-04T10:03:58.440572078+08:00"
		},
		{
			"Uid": "d87befe08c312ffe",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--debug false --help false --cpu-count 2",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-04T11:51:39.511920791+08:00",
			"UpdateTime": "2019-12-04T11:51:47.120728389+08:00"
		},
		{
			"Uid": "a510f7c62d4ddfef",
			"Command": "disk",
			"SubCommand": "fill",
			"Flag": "--debug false --help false --size 20000",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-04T16:59:11.557139226+08:00",
			"UpdateTime": "2019-12-04T17:01:11.52823494+08:00"
		},
		{
			"Uid": "a28f911f2f90e441",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--cpu-count 2 --debug false --help false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-05T16:01:17.435926509+08:00",
			"UpdateTime": "2019-12-05T16:01:26.333131804+08:00"
		},
		{
			"Uid": "4778c8d168727f7a",
			"Command": "cpu",
			"SubCommand": "fullload",
			"Flag": "--help false --cpu-count 2 --debug false",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-05T16:11:24.526883118+08:00",
			"UpdateTime": "2019-12-05T16:11:32.166626915+08:00"
		},
		{
			"Uid": "1247625ff900e8b5",
			"Command": "disk",
			"SubCommand": "burn",
			"Flag": "--write false --debug false --help false --read true --size 20",
			"Status": "Destroyed",
			"Error": "",
			"CreateTime": "2019-12-05T16:42:52.76683927+08:00",
			"UpdateTime": "2019-12-05T16:42:58.872850884+08:00"
		},
		{
			"Uid": "6817b73edd7d1c36",
			"Command": "network",
			"SubCommand": "delay",
			"Flag": "--interface eth0 --time 2000 --debug false --help false",
			"Status": "Success",
			"Error": "",
			"CreateTime": "2019-12-05T16:43:06.855722578+08:00",
			"UpdateTime": "2019-12-05T16:43:06.880694868+08:00"
		}
	]
}
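
With a history this long, picking out the experiments that were never destroyed is easier with a filter. The sketch below relies on the pretty-printed layout shown above (one field per line, with "Uid" appearing before "Status" in each record); uids_with_status is a hypothetical helper name, and a real JSON parser would be more robust where one is available. Treating a "Success" record as still active is an inference from this sample output, where destroyed experiments show "Destroyed"; an experiment created with --timeout may already have ended even though it still reads "Success".

```shell
#!/bin/sh
# Print the Uid of every record whose Status equals the given value.
# Splitting on double quotes turns   "Uid": "77d533cdb8b61d07",
# into fields where $2 is the key and $4 is the value.
uids_with_status() {
    awk -F'"' -v want="$1" '
        $2 == "Uid"                  { uid = $4 }
        $2 == "Status" && $4 == want { print uid }
    '
}

# Usage (not run automatically): destroy everything still marked Success.
#   ./blade status --type create | uids_with_status Success |
#       while read -r uid; do ./blade destroy "$uid"; done
```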

 

More: A Complete Walkthrough of chaosblade Commands

 

IV. The ChaosBlade Model

[Figure 1: the ChaosBlade model]

V. How a ChaosBlade Command Is Executed

[Figure 2: the ChaosBlade command execution flow]


————————————————
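Copyright notice: This is an original article by CSDN blogger "朱小厮", licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/u013256816/article/details/99917021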

https://chaosblade-io.gitbook.io/chaosblade-help-zh-cn/
