Service Mesh Istio from Getting Started to Giving Up (3): Istio Resilience

Table of Contents

  • Request Timeouts
  • Retry
  • Circuit Breaking
  • Fault Injection
  • Mirroring

Request Timeouts

  • A timeout limits the blast radius of a service failure and is one of the basic resilience protections. To simulate it, the demo application needs a few changes:
    • Route requests for the reviews service to its v2 version, because only v2 calls the ratings service
    • Then inject a 2s delay into the ratings service
    • Finally add a 0.5s timeout to the route for the reviews service

1. Route all requests for the reviews service to v2; after this, the page only shows black rating stars

kubectl apply -f - <
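
The VirtualService applied here is presumably the standard one from the Istio Request Timeouts task, which sends all reviews traffic to the v2 subset (a sketch, not the original manifest):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2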

2. Inject a 2s delay into the ratings service; refreshing the /productpage page now takes about two seconds before any content appears

kubectl apply -f - <
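
A sketch of the delay fault presumably applied here, following the form used in the official task: 100% of calls to ratings get a fixed 2s delay before being forwarded to v1:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 2s
    route:
    - destination:
        host: ratings
        subset: v1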

3. Add a 0.5s timeout to the reviews route; refreshing the page now shows that the reviews service is unavailable

kubectl apply -f - <
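
The timeout presumably goes on the reviews route itself, as in the official task; a sketch:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
    timeout: 0.5s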

Retry

  • Retries help avoid request failures caused by transient network jitter
  • To simulate this, we need to:
    • Remove the timeout configuration from the reviews service
    • Add a 5s delay to the ratings service and, at the same time, configure retries with a 1s per-try timeout

1. Remove the timeout configuration from the reviews service

kubectl apply -f - <
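
Removing the timeout simply means re-applying the reviews VirtualService without the timeout field; a sketch:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2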

2. Add a 5s delay to the ratings service and configure retries with a 1s per-try timeout

kubectl apply -f - <
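
The exact manifest is not shown; one way to express "a 5s delay plus retries with a 1s per-try timeout" in a single ratings VirtualService would be roughly the following (the number of retry attempts is an assumption; the log below only proves that at least one retry happened):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 5s
    retries:
      attempts: 2        # assumed value, not given in the text
      perTryTimeout: 1s
    route:
    - destination:
        host: ratings
        subset: v1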

3. Tail the log of the ratings service sidecar: a single page load produces two requests, which proves the retry configuration took effect

kubectl logs -f ratings-v1-5d4f4b45bf-5sjw4 -c istio-proxy

[2020-09-07T13:20:14.753Z] "GET /ratings/0 HTTP/1.1" 200 - "-" "-" 0 48 4 2 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "43419be2-8404-9410-b30c-451860d84689" "ratings:9080" "127.0.0.1:9080" inbound|9080|http|ratings.default.svc.cluster.local 127.0.0.1:47484 10.244.1.42:9080 10.244.1.39:40632 outbound_.9080_.v1_.ratings.default.svc.cluster.local default
[2020-09-07T13:20:17.781Z] "GET /ratings/0 HTTP/1.1" 200 - "-" "-" 0 48 2 2 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15" "43419be2-8404-9410-b30c-451860d84689" "ratings:9080" "127.0.0.1:9080" inbound|9080|http|ratings.default.svc.cluster.local 127.0.0.1:47484 10.244.1.42:9080 10.244.1.39:40634 outbound_.9080_.v1_.ratings.default.svc.cluster.local default

Circuit Breaking

  • Istio can configure circuit breaking for connections, requests, and outlier detection. It is one of the important means of keeping microservices resilient: it prevents a cascading failure in one microservice from bringing down the whole system

  • To demonstrate circuit breaking, we need to:

    • Deploy an httpbin service
    • Add a circuit-breaking configuration for the httpbin service
    • Deploy the load-testing tool fortio to trigger the circuit breaker

1. Deploy an httpbin service

root@kube1:~/istio/istio-1.5.1# kubectl apply -f samples/httpbin/httpbin.yaml
serviceaccount/httpbin created
service/httpbin created
deployment.apps/httpbin created

2. Add a circuit-breaking configuration for this service

kubectl apply -f - <
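
Judging from the kubectl describe output below, the DestinationRule applied here is the standard one from the Istio circuit-breaking task; a sketch:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
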
  • Connection pool settings
    • maxConnections: the maximum number of TCP connections is 1
    • maxRequestsPerConnection: the maximum number of HTTP requests per TCP connection is 1
    • http1MaxPendingRequests: the maximum number of pending HTTP requests is 1
  • Outlier detection
    • consecutiveErrors: the number of consecutive errors that trips the circuit breaker is 1
    • interval: the time between ejection analysis sweeps is 1s
    • baseEjectionTime: the base ejection time is 3m (the default is 30s). The actual ejection time is this value multiplied by the number of times the instance has already been ejected, so the more often an instance fails, the longer it stays ejected
    • maxEjectionPercent: the maximum percentage of service instances that can be ejected; here it is 100%

3. Verify that the circuit-breaking configuration took effect

root@kube1:~/istio/istio-1.5.1# kubectl describe dr httpbin
Name:         httpbin
Namespace:    default
Labels:       
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"networking.istio.io/v1alpha3","kind":"DestinationRule","metadata":{"annotations":{},"name":"httpbin","namespace":"default"}...
API Version:  networking.istio.io/v1beta1
Kind:         DestinationRule
Metadata:
  Creation Timestamp:  2020-09-07T13:45:57Z
  Generation:          1
  Resource Version:    4347907
  Self Link:           /apis/networking.istio.io/v1beta1/namespaces/default/destinationrules/httpbin
  UID:                 1c270ce5-be8e-437a-81ba-77797db148ea
Spec:
  Host:  httpbin
  Traffic Policy:
    Connection Pool:
      Http:
        http1MaxPendingRequests:      1
        Max Requests Per Connection:  1
      Tcp:
        Max Connections:  1
    Outlier Detection:
      Base Ejection Time:    3m
      Consecutive Errors:    1
      Interval:              1s
      Max Ejection Percent:  100
Events:                      

4. Deploy the load-testing tool fortio

root@kube1:~/istio/istio-1.5.1# kubectl apply -f samples/httpbin/sample-client/fortio-deploy.yaml
service/fortio created
deployment.apps/fortio-deploy created

5. Exec into the fortio pod and use the fortio client to send a single request to httpbin; the response shows the call succeeds

root@kube1:~/istio/istio-1.5.1# export FORTIO_POD=$(kubectl get pods -lapp=fortio -o 'jsonpath={.items[0].metadata.name}')

root@kube1:~/istio/istio-1.5.1# kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio curl -quiet http://httpbin:8000/g
HTTP/1.1 200 OK
server: envoy
date: Mon, 07 Sep 2020 14:02:24 GMT
content-type: application/json
content-length: 586
access-control-allow-origin: *
access-control-allow-credentials: true
x-envoy-upstream-service-time: 17

{
  "args": {},
  "headers": {
    "Content-Length": "0",
    "Host": "httpbin:8000",
    "User-Agent": "fortio.org/fortio-1.6.8",
    "X-B3-Parentspanid": "d964f49dc1381ad1",
    "X-B3-Sampled": "1",
    "X-B3-Spanid": "f251aa8ed27b094a",
    "X-B3-Traceid": "684683095322c746d964f49dc1381ad1",
    "X-Forwarded-Client-Cert": "By=spiffe://cluster.local/ns/default/sa/httpbin;Hash=01cc58646a677f556d03ba08ae4bb4bf77784d1e6874e46d4965b4ef9a52a5da;Subject=\"\";URI=spiffe://cluster.local/ns/default/sa/default"
  },
  "origin": "127.0.0.1",
  "url": "http://httpbin:8000/get"
}

6. The circuit-breaking configuration above sets maxConnections: 1 and http1MaxPendingRequests: 1, which means that as soon as there is more than one concurrent connection and pending request, the extra requests are short-circuited. The command below therefore uses 2 concurrent connections (-c 2) and sends 20 requests in total (-n 20)

kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 2 -qps 0 -n 20 -loglevel Warning http://httpbin:8000/get

14:05:39 I logger.go:115> Log level is now 3 Warning (was 2 Info)
Fortio 1.6.8 running at 0 queries per second, 8->8 procs, for 20 calls: http://httpbin:8000/get
Starting at max qps with 2 thread(s) [gomax 8] for exactly 20 calls (10 per thread + 0)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:05:39 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
Ended after 811.898072ms : 20 calls. qps=24.634
Aggregated Function Time : count 20 avg 0.057826079 +/- 0.06969 min 0.000387521 max 0.221833788 sum 1.15652158
# range, mid point, percentile, count
>= 0.000387521 <= 0.001 , 0.000693761 , 25.00, 5
> 0.002 <= 0.003 , 0.0025 , 30.00, 1
> 0.003 <= 0.004 , 0.0035 , 35.00, 1
> 0.008 <= 0.009 , 0.0085 , 40.00, 1
> 0.016 <= 0.018 , 0.017 , 45.00, 1
> 0.025 <= 0.03 , 0.0275 , 55.00, 2
> 0.03 <= 0.035 , 0.0325 , 60.00, 1
> 0.045 <= 0.05 , 0.0475 , 70.00, 2
> 0.09 <= 0.1 , 0.095 , 75.00, 1
> 0.12 <= 0.14 , 0.13 , 80.00, 1
> 0.14 <= 0.16 , 0.15 , 85.00, 1
> 0.16 <= 0.18 , 0.17 , 95.00, 2
> 0.2 <= 0.221834 , 0.210917 , 100.00, 1
# target 50% 0.0275
# target 75% 0.1
# target 90% 0.17
# target 99% 0.217467
# target 99.9% 0.221397
Sockets used: 9 (for perfect keepalive, would be 2)
Jitter: false
Code 200 : 12 (60.0 %)
Code 503 : 8 (40.0 %)
Response Header Sizes : count 20 avg 138.85 +/- 113.4 min 0 max 232 sum 2777
Response Body/Total Sizes : count 20 avg 586.85 +/- 282.4 min 241 max 818 sum 11737
All done 20 calls (plus 0 warmup) 57.826 ms avg, 24.6 qps

The statistics above show that 60% of the requests got through, while 40% were rejected with 503 by the circuit breaker:

Code 200 : 12 (60.0 %)
Code 503 : 8 (40.0 %)

7. Raise the number of concurrent connections to 3 and test again

kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 3 -qps 0 -n 30 -loglevel Warning http://httpbin:8000/get


14:07:32 I logger.go:115> Log level is now 3 Warning (was 2 Info)
Fortio 1.6.8 running at 0 queries per second, 8->8 procs, for 30 calls: http://httpbin:8000/get
Starting at max qps with 3 thread(s) [gomax 8] for exactly 30 calls (10 per thread + 0)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
14:07:32 W http_client.go:698> Parsed non ok code 503 (HTTP/1.1 503)
Ended after 28.717189ms : 30 calls. qps=1044.7
Aggregated Function Time : count 30 avg 0.0024530259 +/- 0.005584 min 0.00031694 max 0.022741939 sum 0.073590778
# range, mid point, percentile, count
>= 0.00031694 <= 0.001 , 0.00065847 , 86.67, 26
> 0.001 <= 0.002 , 0.0015 , 90.00, 1
> 0.012 <= 0.014 , 0.013 , 93.33, 1
> 0.02 <= 0.0227419 , 0.021371 , 100.00, 2
# target 50% 0.000699454
# target 75% 0.000904372
# target 90% 0.002
# target 99% 0.0223306
# target 99.9% 0.0227008
Sockets used: 28 (for perfect keepalive, would be 3)
Jitter: false
Code 200 : 2 (6.7 %)
Code 503 : 28 (93.3 %)
Response Header Sizes : count 30 avg 15.4 +/- 57.62 min 0 max 231 sum 462
Response Body/Total Sizes : count 30 avg 210.13333 +/- 165.5 min 153 max 817 sum 6304
All done 30 calls (plus 0 warmup) 2.453 ms avg, 1044.7 qps

The success rate drops further and the failure rate rises further:

Code 200 : 2 (6.7 %)
Code 503 : 28 (93.3 %)

8. The command below shows aggregated statistics about what was short-circuited; httpbin.default.svc.cluster.local.upstream_rq_pending_overflow is the number of calls rejected by the circuit breaker, here 11

kubectl exec "$FORTIO_POD" -c istio-proxy -- pilot-agent request GET stats | grep httpbin | grep pending

cluster.outbound|8000||httpbin.default.svc.cluster.local.circuit_breakers.default.rq_pending_open: 0
cluster.outbound|8000||httpbin.default.svc.cluster.local.circuit_breakers.high.rq_pending_open: 0
cluster.outbound|8000||httpbin.default.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|8000||httpbin.default.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|8000||httpbin.default.svc.cluster.local.upstream_rq_pending_overflow: 11
cluster.outbound|8000||httpbin.default.svc.cluster.local.upstream_rq_pending_total: 16

Fault Injection

  • If we can proactively simulate failure scenarios, observe how the system behaves, and optimize accordingly, then when a real failure happens we already know what to expect and can handle it calmly
  • When demonstrating timeouts and retries we already injected a delay fault into the ratings service, so that is not repeated here
  • Next we simulate an abort fault

1. First restore the normal routing configuration, then add a routing rule that sends the user jason to the v2 version of the reviews service and every other user to v1

kubectl apply -f samples/bookinfo/networking/virtual-service-all-v1.yaml
kubectl apply -f samples/bookinfo/networking/virtual-service-reviews-test-v2.yaml

2. Inject an abort fault into the ratings service (the version called by reviews v2) that returns an HTTP 500 error for 100% of jason's requests

kubectl apply -f samples/bookinfo/networking/virtual-service-ratings-test-abort.yaml

3. Then log in as the user jason and refresh the page; you will see the message "Ratings service is currently unavailable"

To summarize, two kinds of fault can be injected: delays (delay) and aborts (abort). Their configuration looks roughly like this:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
...
spec:
  hosts:
  - ratings
  http:
  - match:                # the faults only apply to the test user jason
    - headers:
        end-user:
          exact: jason
    fault:
      delay:              # delay fault: hold 50% of matched requests for 7s
        fixedDelay: 7s
        percentage:
          value: 50
      abort:              # abort fault: fail 50% of matched requests with HTTP 500
        httpStatus: 500
        percentage:
          value: 50
    route:
    - destination:
        host: ratings
        subset: v1
  - route:                # everyone else gets the normal v1 route
    - destination:
        host: ratings
        subset: v1

Mirroring

  • Traffic mirroring copies live traffic to another version of a service; it is mainly used to reproduce problems that cannot be reproduced in a test environment

1. First deploy two versions of the httpbin service, v1 and v2 (both Deployment manifests are sketched after the list below)

  • httpbin-v1:
cat <
  • httpbin-v2:
cat <
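
Both Deployments presumably follow the official mirroring task; httpbin-v1 looks roughly like this, and httpbin-v2 is identical apart from the name and the version: v2 label:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin-v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: httpbin
      version: v1
  template:
    metadata:
      labels:
        app: httpbin
        version: v1
    spec:
      containers:
      - name: httpbin
        image: docker.io/kennethreitz/httpbin
        imagePullPolicy: IfNotPresent
        command: ["gunicorn", "--access-logfile", "-", "-b", "0.0.0.0:80", "httpbin:app"]
        ports:
        - containerPort: 80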

2. Create a Kubernetes Service that makes both deployments reachable

kubectl create -f - <
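
The Service presumably matches the official task: it selects both versions via the app: httpbin label and maps port 8000 to container port 80 (which is why requests later go to http://httpbin:8000):

apiVersion: v1
kind: Service
metadata:
  name: httpbin
  labels:
    app: httpbin
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 80
  selector:
    app: httpbin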

3. Define a route for httpbin that sends all traffic to v1

kubectl apply -f - <
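
This is presumably the pair of resources from the official task: a VirtualService that sends 100% of traffic to the v1 subset, plus a DestinationRule that defines the v1 and v2 subsets; a sketch:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
      weight: 100
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2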

4. Send a test request from the sleep pod; the headers in the response show that the request succeeded

root@kube1:~/istio/istio-1.5.1# export SLEEP_POD=$(kubectl get pod -l app=sleep -o jsonpath={.items..metadata.name})
root@kube1:~/istio/istio-1.5.1# kubectl exec "${SLEEP_POD}" -c sleep -- curl -s http://httpbin:8000/headers
{
  "headers": {
    "Accept": "*/*",
    "Content-Length": "0",
    "Host": "httpbin:8000",
    "User-Agent": "curl/7.69.1",
    "X-B3-Parentspanid": "917d1f30c9d691cd",
    "X-B3-Sampled": "1",
    "X-B3-Spanid": "1a7987c2578d6687",
    "X-B3-Traceid": "107cc7e533b09efb917d1f30c9d691cd",
    "X-Forwarded-Client-Cert": "By=spiffe://cluster.local/ns/default/sa/default;Hash=4c1074a241f191c1d6732cccb583a46dcfbf882813693a6211ec5b417c59d865;Subject=\"\";URI=spiffe://cluster.local/ns/default/sa/sleep"
  }
}

5. To observe the mirrored traffic that v2 will receive later, open two terminals and tail the request logs of the v1 and v2 pods

root@kube1:~/istio/istio-1.5.1# kubectl logs -f httpbin-v1-d879b9568-qtffc -c httpbin
[2020-09-08 01:52:10 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2020-09-08 01:52:10 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
[2020-09-08 01:52:10 +0000] [1] [INFO] Using worker: sync
[2020-09-08 01:52:10 +0000] [8] [INFO] Booting worker with pid: 8
127.0.0.1 - - [08/Sep/2020:02:05:08 +0000] "GET /headers HTTP/1.1" 200 516 "-" "curl/7.69.1"

root@kube1:~/istio/istio-1.5.1# kubectl logs -f httpbin-v2-69bcdd6f7c-c78ld -c httpbin
[2020-09-08 01:52:16 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2020-09-08 01:52:16 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
[2020-09-08 01:52:16 +0000] [1] [INFO] Using worker: sync
[2020-09-08 01:52:16 +0000] [9] [INFO] Booting worker with pid: 9

6. Add the traffic-mirroring configuration for httpbin-v2

kubectl apply -f - <
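
A sketch of the updated httpbin VirtualService, assuming the form used in the Istio 1.5 mirroring task: live traffic still goes only to v1, while a copy of 100% of the requests is mirrored to the v2 subset:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin
  http:
  - route:
    - destination:
        host: httpbin
        subset: v1
      weight: 100
    mirror:
      host: httpbin
      subset: v2
    mirror_percent: 100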

7. Send another request to v1

root@kube1:~# kubectl exec sleep-6bdb595bcb-xktfl  -c sleep -- curl -s http://httpbin:8000/headers
{
  "headers": {
    "Accept": "*/*",
    "Content-Length": "0",
    "Host": "httpbin:8000",
    "User-Agent": "curl/7.69.1",
    "X-B3-Parentspanid": "cbe3361fec2e1656",
    "X-B3-Sampled": "1",
    "X-B3-Spanid": "d8ae3fbe573ce08d",
    "X-B3-Traceid": "c7b6073fc8111651cbe3361fec2e1656",
    "X-Forwarded-Client-Cert": "By=spiffe://cluster.local/ns/default/sa/httpbin;Hash=4c1074a241f191c1d6732cccb583a46dcfbf882813693a6211ec5b417c59d865;Subject=\"\";URI=spiffe://cluster.local/ns/default/sa/sleep"
  }
}

Then the corresponding log line appears in both the v1 and v2 terminals: v1 served the live request and v2 received a mirrored copy

root@kube1:~# kubectl logs -f httpbin-v1-d879b9568-qtffc -c httpbin
[2020-09-08 01:52:10 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2020-09-08 01:52:10 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
[2020-09-08 01:52:10 +0000] [1] [INFO] Using worker: sync
[2020-09-08 01:52:10 +0000] [8] [INFO] Booting worker with pid: 8



127.0.0.1 - - [08/Sep/2020:02:22:47 +0000] "GET /headers HTTP/1.1" 200 516 "-" "curl/7.69.1"
root@kube1:~/istio/istio-1.5.1# kubectl logs -f httpbin-v2-69bcdd6f7c-c78ld -c httpbin
[2020-09-08 01:52:16 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2020-09-08 01:52:16 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
[2020-09-08 01:52:16 +0000] [1] [INFO] Using worker: sync
[2020-09-08 01:52:16 +0000] [9] [INFO] Booting worker with pid: 9



127.0.0.1 - - [08/Sep/2020:02:22:47 +0000] "GET /headers HTTP/1.1" 200 556 "-" "curl/7.69.1"
