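If you are new to chaos experiments or to ChaosBlade, read the following three articles first.
1. Chaos Engineering: Alibaba's Chaosblade
2. How do you simulate maxing out CPU and IO?
3. Deep dive | A 20,000-character walkthrough of ChaosBlade, Alibaba's chaos testing tool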
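ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments. It builds on nearly ten years of fault testing and drill practice at Alibaba, incorporates the best ideas and practices from the group's business units, and provides a rich set of fault scenarios to help distributed systems improve their fault tolerance and recoverability.
Chaosblade can be compiled and run directly, and its CLI command hints make chaos experiments easier to execute. Currently supported drill scenarios include operating-system faults (CPU, disk, process, network), Java application faults (Dubbo, MySQL, Servlet, and delays or exceptions injected into custom class methods), and killing containers or Pods.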
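1) Measure the fault tolerance of microservices
By simulating call latency, service unavailability, saturated machine resources, and so on, check whether the faulty node or instance is automatically isolated and taken offline, whether traffic is scheduled correctly, and whether contingency plans work, while observing whether the system's overall QPS or RT is affected. On that basis, slowly widen the set of faulty nodes to verify that rate limiting, degradation, and circuit breaking in upstream services are effective. Keep adding faulty nodes until service requests time out; this gives an estimate of the system's fault-tolerance red line and a measure of its fault tolerance.
2) Verify that the container orchestration configuration is reasonable
By simulating killed service Pods, killed nodes, and increased Pod resource load, observe service availability and verify that the replica configuration, the resource limits, and the containers deployed in each Pod are reasonable.
3) Test the robustness of the PaaS layer
Simulate upper-layer resource load to verify the effectiveness of the scheduling system; simulate unavailability of the distributed storage it depends on to verify the system's fault tolerance; simulate an unavailable scheduling node to test whether scheduled tasks migrate automatically to available nodes; simulate primary or standby node failure to test whether failover works.
4) Verify the timeliness of monitoring and alerting
By injecting faults into the system, verify that monitoring metrics are accurate, monitoring dimensions are complete, alert thresholds are reasonable, alerts fire quickly, alert recipients are correct, and notification channels work, improving the accuracy and timeliness of monitoring and alerting.
5) Build emergency response skills for locating and resolving problems
Through surprise drills that inject faults into the system at random, assess how the people involved respond to problems and whether the escalation and handling processes are reasonable. This is learning by doing: it trains people to locate and resolve problems.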
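The ramp-up described in scenario 1 (widen the set of faulty nodes step by step until requests time out) can be sketched as a small loop. This is only an illustration: is_healthy is a hypothetical probe that you would implement yourself, for example by injecting a fault into n nodes and checking QPS/RT. It is not part of ChaosBlade.

```python
from typing import Callable


def find_tolerance_limit(total_nodes: int,
                         is_healthy: Callable[[int], bool]) -> int:
    """Return the largest number of faulty nodes the system tolerated.

    is_healthy(n) is a hypothetical probe supplied by the caller: it should
    inject faults into n nodes, observe the system, and return False once
    service requests start timing out.
    """
    tolerated = 0
    for faulty in range(1, total_nodes + 1):
        if not is_healthy(faulty):
            break  # requests time out here: the red line was crossed
        tolerated = faulty
    return tolerated
```

The returned value is the estimated red line; a result of 0 means the system did not tolerate even a single faulty node.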
wget https://github.com/chaosblade-io/chaosblade/releases/download/v0.3.0/chaosblade-0.3.0.linux-amd64.tar.gz
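If you prefer to script this step, the following is a minimal Python sketch of the same download plus the extraction. It assumes that other releases follow the same URL pattern as the v0.3.0 link above, which this article does not confirm; release_url and download_and_extract are illustrative names.

```python
import tarfile
import urllib.request

# Assumption: other versions use the same URL layout as the v0.3.0 link above.
URL_TEMPLATE = ("https://github.com/chaosblade-io/chaosblade/releases/"
                "download/v{v}/chaosblade-{v}.linux-amd64.tar.gz")


def release_url(version: str) -> str:
    """Build the release tarball URL for a given version string."""
    return URL_TEMPLATE.format(v=version)


def download_and_extract(version: str, dest: str = ".") -> None:
    """Download the tarball to a temporary file and unpack it into dest."""
    archive, _headers = urllib.request.urlretrieve(release_url(version))
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

Calling download_and_extract("0.3.0") would fetch the same file as the wget command.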
Once the download finishes, just extract the archive; no compilation is needed. The extracted directory looks like this:
├── bin
│   ├── chaos_burncpu
│   ├── chaos_burnio
│   ├── chaos_changedns
│   ├── chaos_delaynetwork
│   ├── chaos_dropnetwork
│   ├── chaos_filldisk
│   ├── chaos_killprocess
│   ├── chaos_lossnetwork
│   ├── chaos_stopprocess
│   ├── cplus-chaosblade.spec.yaml
│   ├── jvm.spec.yaml
│   └── tools.jar
├── blade
└── lib
    ├── cplus
    │   ├── chaosblade-exec-cplus.jar
    │   └── script
    │       ├── shell_break_and_return_attach.sh
    │       ├── shell_break_and_return.sh
    │       ├── shell_check_process_duplicate.sh
    │       ├── shell_check_process_id.sh
    │       ├── shell_initialization.sh
    │       ├── shell_modify_variable_attch.sh
    │       ├── shell_modify_variable.sh
    │       ├── shell_remove_process.sh
    │       ├── shell_response_delay_attach.sh
    │       └── shell_response_delay.sh
    └── sandbox
        ├── bin
        │   └── sandbox.sh
        ├── cfg
        │   ├── sandbox-logback.xml
        │   ├── sandbox.properties
        │   └── version
        ├── example
        │   └── sandbox-debug-module.jar
        ├── install-local.sh
        ├── lib
        │   ├── sandbox-agent.jar
        │   ├── sandbox-core.jar
        │   └── sandbox-spy.jar
        ├── module
        │   ├── chaosblade-java-agent-0.2.0.jar
        │   └── sandbox-mgr-module.jar
        └── provider
            └── sandbox-mgr-provider.jar
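Here blade is the executable: the chaosblade CLI, the tool that runs chaos experiments. Run ./blade help to see which commands are supported.
The blade commands are:
prepare: alias p. Preparation before a chaos experiment. For example, to drill a Java application you need to attach the java agent. To drill an application named business, run blade p jvm --process business on the target host. If the attach succeeds, it returns a uid, which is used to query status or to revoke the attach.
revoke: alias r. Undoes an earlier experiment preparation, for example detaching the java agent. The command is blade revoke UID.
create: alias c. Creates a chaos experiment, that is, performs fault injection. The command is blade create [TARGET] [ACTION] [FLAGS]. For example, to make a Dubbo consumer's calls to the xxx.xxx.Service interface delay by 3s, run blade create dubbo delay --consumer --time 3000 --service xxx.xxx.Service. If the injection succeeds, it returns the experiment's uid, which is used to query status and to destroy the experiment.
destroy: alias d. Destroys an earlier chaos experiment. For example, to destroy the Dubbo delay experiment above, the command is blade destroy UID.
status: alias s. Queries the status of a preparation or an experiment. The command is blade status UID or blade status --type create.
Help for each of the commands above is available via blade help [COMMAND].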
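When driving blade from a script, it helps to build the argument lists in one place. The sketch below only assembles commands in the shapes listed above (create [TARGET] [ACTION] [FLAGS], destroy UID, status UID); it does not run them. The BLADE path and the helper names are assumptions made for illustration.

```python
from typing import List

BLADE = "./blade"  # assumed location of the extracted executable


def create_cmd(target: str, action: str, **flags) -> List[str]:
    """Build `blade create TARGET ACTION [FLAGS]` as an argument list.

    Keyword names become flags (cpu_count -> --cpu-count). A value of True
    produces a bare switch such as --consumer; any other value is appended
    after the flag as a string.
    """
    cmd = [BLADE, "create", target, action]
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)
        else:
            cmd += [flag, str(value)]
    return cmd


def destroy_cmd(uid: str) -> List[str]:
    """Build `blade destroy UID`."""
    return [BLADE, "destroy", uid]


def status_cmd(uid: str) -> List[str]:
    """Build `blade status UID`."""
    return [BLADE, "status", uid]
```

For example, create_cmd("dubbo", "delay", consumer=True, time=3000, service="xxx.xxx.Service") reproduces the Dubbo delay command shown above, and the resulting list can be passed to subprocess.run.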
To see which experiments blade can run, execute blade create -h:
Create a chaos engineering experiment
Usage:
blade create [command]
Aliases:
create, c
Examples:
create dubbo delay --time 3000 --offset 100 --service com.example.Service --consumer
Available Commands:
cplus c++ experiment
cpu Cpu experiment
disk Disk experiment
docker Execute a docker experiment
druid Druid experiment
dubbo dubbo experiment
http http experiment
jvm method
k8s Kubernetes experiment
mysql mysql experiment
network Network experiment
process Process experiment
rocketmq Rocketmq experiment,can make message send or pull delay and exception
script Script chaos experiment
servlet java servlet experiment
Flags:
-h, --help help for create
Global Flags:
-d, --debug Set client to DEBUG mode
Use "blade create [command] --help" for more information about a command.
Let's demonstrate a 100% CPU usage fault, using the blade create cpu fullload command. The usage of blade create cpu is:
hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade create cpu -h
Cpu experiment, for example full load
Usage:
blade create cpu [flags]
blade create cpu [command]
Examples:
cpu fullload
Available Commands:
fullload cpu fullload
Flags:
--cpu-count string Cpu count
--cpu-list string CPUs in which to allow burning (0-3 or 1,3)
-h, --help help for cpu
Global Flags:
-d, --debug Set client to DEBUG mode
Use "blade create cpu [command] --help" for more information about a command.
Run the experiment:
hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade create cpu fullload
{"code":200,"success":true,"result":"d9e3879cb68416a2"}
Note the d9e3879cb68416a2 in result above: it is the experiment's uid, and you will need it to stop the experiment (./blade destroy UID).
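Because the uid is needed later, a script should capture it as soon as the create command returns. The following sketch parses the JSON shape shown above (code, success, result) and raises when success is false. The error field name is taken from the failed-response format that blade records in its status history; parse_uid itself is an illustrative helper, not part of the tool.

```python
import json


def parse_uid(create_output: str) -> str:
    """Extract the experiment uid from the JSON printed by `blade create`."""
    resp = json.loads(create_output)
    if not resp.get("success"):
        # Failed responses carry an "error" message instead of a "result".
        raise RuntimeError("blade create failed: %s" % resp.get("error", resp))
    return resp["result"]
```

Storing the returned uid (in a variable or a file) avoids having to look it up in the history later.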
Use iostat -c 1 1000 to check CPU usage (see the %idle column):
avg-cpu: %user %nice %system %iowait %steal %idle
98.75 0.00 1.25 0.00 0.00 0.00
You can also check CPU usage with sar, top, and similar commands.
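To check the result automatically instead of reading it by eye, the two avg-cpu lines printed by iostat can be turned into a busy percentage (100 minus %idle). This is a small sketch that assumes the header and value lines have the same layout as the sample output above; cpu_busy_percent is an illustrative name.

```python
def cpu_busy_percent(header: str, values: str) -> float:
    """Compute CPU busy % from an iostat avg-cpu header line and value line."""
    columns = header.replace("avg-cpu:", "").split()
    numbers = [float(v) for v in values.split()]
    stats = dict(zip(columns, numbers))  # e.g. {"%user": 98.75, ..., "%idle": 0.0}
    return round(100.0 - stats["%idle"], 2)
```

A drill script could assert that this value is close to 100 while the fullload experiment is active and low again after it is destroyed.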
The fault has taken effect. Next, stop the experiment:
hidden@hidden:~/chaos/chaosblade-0.2.0$ ./blade destroy d9e3879cb68416a2
{"code":200,"success":true,"result":"command: cpu fullload --debug false --help false"}
Check the CPU again; the load is back to normal:
avg-cpu: %user %nice %system %iowait %steal %idle
0.25 0.00 0.50 2.00 0.00 97.25
That completes one CPU full-load fault drill. The other commands are left for the reader to try.
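The create, observe, destroy cycle above can be wrapped so that the experiment is destroyed even if the observation step fails. The sketch below is one possible way to do that with a context manager. run_blade assumes the executable is at ./blade and that it prints a single JSON document, as in the outputs shown above; both helper names are illustrative.

```python
import json
import subprocess
from contextlib import contextmanager


def run_blade(args, blade="./blade"):
    """Run blade with the given arguments and parse its JSON output."""
    proc = subprocess.run([blade, *args], capture_output=True, text=True)
    return json.loads(proc.stdout)


@contextmanager
def experiment(create_args, run=run_blade):
    """Create an experiment, yield its uid, and always destroy it afterwards."""
    resp = run(["create", *create_args])
    if not resp.get("success"):
        raise RuntimeError("create failed: %s" % resp.get("error", resp))
    uid = resp["result"]
    try:
        yield uid
    finally:
        run(["destroy", uid])  # runs even if the body raised
```

Usage would look like `with experiment(["cpu", "fullload"]) as uid: ...`, with the checks (iostat, sar, top) placed inside the with block. The run parameter exists so the wrapper can be exercised with a stub instead of the real CLI.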
If you forget the uid and therefore cannot restore the system, use the following command to view the experiment history:
[dev@hua1-dev ~]$ ./chaosblade/blade status --type create
{
"code": 200,
"success": true,
"result": [
{
"Uid": "77d533cdb8b61d07",
"Command": "mem",
"SubCommand": "load",
"Flag": "--debug false --help false",
"Status": "Error",
"Error": "{\"code\":604,\"success\":false,\"error\":\"mount: only root can do that\\n exit status 1\"} exit status 1",
"CreateTime": "2019-12-02T11:01:27.059036062+08:00",
"UpdateTime": "2019-12-02T11:01:27.112947513+08:00"
},
{
"Uid": "fc3ff5dbcc3d8287",
"Command": "mem",
"SubCommand": "load",
"Flag": "--debug false --help false",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-02T11:01:37.653829453+08:00",
"UpdateTime": "2019-12-02T11:02:41.072792303+08:00"
},
{
"Uid": "ff941d81a1bfc583",
"Command": "mem",
"SubCommand": "load",
"Flag": "--debug false --help false",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-02T11:02:44.433088896+08:00",
"UpdateTime": "2019-12-02T11:02:44.473296047+08:00"
},
{
"Uid": "b1a2a18a9d7d2209",
"Command": "mem",
"SubCommand": "load",
"Flag": "--debug false --help false --timeout 120",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-02T11:03:19.26821251+08:00",
"UpdateTime": "2019-12-02T11:03:19.306998571+08:00"
},
{
"Uid": "ade8fd251c14c3ac",
"Command": "mem",
"SubCommand": "load",
"Flag": "--debug false --help false --mem-percent 20 --timeout 60",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-02T11:44:12.402871862+08:00",
"UpdateTime": "2019-12-02T11:44:24.764721576+08:00"
},
{
"Uid": "42f18ab2a9f647df",
"Command": "mem",
"SubCommand": "load",
"Flag": "--timeout 60 --debug false --help false --mem-percent 50",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-02T11:44:39.03004387+08:00",
"UpdateTime": "2019-12-02T11:44:46.068383049+08:00"
},
{
"Uid": "c4bd47a436c32f8a",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--cpu-count 4 --debug false --help false",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-02T17:57:21.848821251+08:00",
"UpdateTime": "2019-12-02T17:57:22.954051573+08:00"
},
{
"Uid": "b3c530b53ce081e7",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--debug false --help false --cpu-count 4",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-02T17:58:09.649468604+08:00",
"UpdateTime": "2019-12-02T17:58:10.751782523+08:00"
},
{
"Uid": "a8606090e6f3bf4d",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--debug false --help false --cpu-count 4",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-02T18:14:25.552265716+08:00",
"UpdateTime": "2019-12-02T18:14:26.57991441+08:00"
},
{
"Uid": "78d3d6004851c58e",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--cpu-count 4 --debug false --help false",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-02T18:15:38.276979861+08:00",
"UpdateTime": "2019-12-02T18:15:39.361869497+08:00"
},
{
"Uid": "d1fe9d0df56ffd38",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--debug false --help false --cpu-count 4",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-02T18:38:55.754252838+08:00",
"UpdateTime": "2019-12-02T18:38:56.875084906+08:00"
},
{
"Uid": "44e3083833a1d74a",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--cpu-count 4 --debug false --help false",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-02T18:47:57.880120218+08:00",
"UpdateTime": "2019-12-02T18:53:37.707679493+08:00"
},
{
"Uid": "bda3f35a7ca8ea16",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--debug false --help false --cpu-count 4",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T14:45:18.783440839+08:00",
"UpdateTime": "2019-12-03T14:57:18.823532704+08:00"
},
{
"Uid": "99a137ba58396e60",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--debug false --help false --size 20000",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T14:59:01.13288346+08:00",
"UpdateTime": "2019-12-03T15:00:20.17680665+08:00"
},
{
"Uid": "bcb4c1b17d445f55",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--size 20000 --debug false --help false",
"Status": "Error",
"Error": "dd: \ufffd\ufffd\ufffd诖\ufffd\ufffd\ufffd\"/chaos_filldisk.log.dat\": 权\ufffd薏\ufffd\ufffd\ufffd\n exit status 1 exit status 1",
"CreateTime": "2019-12-03T15:00:46.651558522+08:00",
"UpdateTime": "2019-12-03T15:00:46.712107819+08:00"
},
{
"Uid": "0f60263c7b830b58",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--debug false --help false --size 20000",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T15:00:52.562074748+08:00",
"UpdateTime": "2019-12-03T15:03:45.571227777+08:00"
},
{
"Uid": "07486dcb6b8e1804",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--debug false --help false --size 20000",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T15:03:59.885515096+08:00",
"UpdateTime": "2019-12-03T15:05:44.380778022+08:00"
},
{
"Uid": "5f2c5c0353470b66",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--debug false --help false --size 20000",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T15:09:05.204692014+08:00",
"UpdateTime": "2019-12-03T15:11:25.112595984+08:00"
},
{
"Uid": "33019d022f93a58e",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--debug false --help false --size 20000",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T15:12:05.644988155+08:00",
"UpdateTime": "2019-12-03T15:15:46.775748998+08:00"
},
{
"Uid": "ae888993f31e9aeb",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--debug false --help false --size 20000",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T15:16:17.115550065+08:00",
"UpdateTime": "2019-12-03T15:20:50.007686126+08:00"
},
{
"Uid": "d24dd9239902eb6f",
"Command": "network",
"SubCommand": "delay",
"Flag": "--time 3 --debug false --help false --interface eth0 --local-port 6396 --remote-port 6396",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T16:17:19.098079716+08:00",
"UpdateTime": "2019-12-03T16:19:00.02772809+08:00"
},
{
"Uid": "6aa70b124dce79f3",
"Command": "network",
"SubCommand": "delay",
"Flag": "--help false --interface eth0 --local-port 6396 --remote-port 6396 --time 3000 --debug false",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-03T16:21:33.527410693+08:00",
"UpdateTime": "2019-12-03T16:54:47.535580171+08:00"
},
{
"Uid": "5d838cdd1584c7f0",
"Command": "mem",
"SubCommand": "load",
"Flag": "--timeout 2 --debug false --help false --mem-percent 2",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-04T10:03:56.301575979+08:00",
"UpdateTime": "2019-12-04T10:03:58.440572078+08:00"
},
{
"Uid": "d87befe08c312ffe",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--debug false --help false --cpu-count 2",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-04T11:51:39.511920791+08:00",
"UpdateTime": "2019-12-04T11:51:47.120728389+08:00"
},
{
"Uid": "a510f7c62d4ddfef",
"Command": "disk",
"SubCommand": "fill",
"Flag": "--debug false --help false --size 20000",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-04T16:59:11.557139226+08:00",
"UpdateTime": "2019-12-04T17:01:11.52823494+08:00"
},
{
"Uid": "a28f911f2f90e441",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--cpu-count 2 --debug false --help false",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-05T16:01:17.435926509+08:00",
"UpdateTime": "2019-12-05T16:01:26.333131804+08:00"
},
{
"Uid": "4778c8d168727f7a",
"Command": "cpu",
"SubCommand": "fullload",
"Flag": "--help false --cpu-count 2 --debug false",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-05T16:11:24.526883118+08:00",
"UpdateTime": "2019-12-05T16:11:32.166626915+08:00"
},
{
"Uid": "1247625ff900e8b5",
"Command": "disk",
"SubCommand": "burn",
"Flag": "--write false --debug false --help false --read true --size 20",
"Status": "Destroyed",
"Error": "",
"CreateTime": "2019-12-05T16:42:52.76683927+08:00",
"UpdateTime": "2019-12-05T16:42:58.872850884+08:00"
},
{
"Uid": "6817b73edd7d1c36",
"Command": "network",
"SubCommand": "delay",
"Flag": "--interface eth0 --time 2000 --debug false --help false",
"Status": "Success",
"Error": "",
"CreateTime": "2019-12-05T16:43:06.855722578+08:00",
"UpdateTime": "2019-12-05T16:43:06.880694868+08:00"
}
]
}
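With a long history like this, it is easier to filter the output than to scan it. The sketch below collects the uids of entries whose Status is "Success". It assumes, as the listing suggests, that "Success" marks experiments that were created and not yet destroyed, while "Destroyed" and "Error" entries need no cleanup. active_uids is an illustrative helper.

```python
import json


def active_uids(status_output: str):
    """Return uids of experiments whose Status is "Success".

    Assumption: "Success" means the experiment was created and has not been
    destroyed yet, so it is a candidate for `blade destroy UID`.
    """
    resp = json.loads(status_output)
    return [entry["Uid"]
            for entry in resp.get("result", [])
            if entry.get("Status") == "Success"]
```

Each returned uid can then be passed to ./blade destroy.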
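----------------
Copyright notice: this is an original article by CSDN blogger 「朱小厮」, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.
Original link: https://blog.csdn.net/u013256816/article/details/99917021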
https://chaosblade-io.gitbook.io/chaosblade-help-zh-cn/