Typically, in deployments where the originating cluster and the upstream cluster span multiple zones, Envoy performs zone aware routing.
Zone aware routing is used to send traffic to the local zone of the upstream cluster as much as possible, while roughly ensuring that traffic is still balanced across all of the upstream's endpoints; it depends on the following prerequisites:
- Neither the originating cluster (client side) nor the upstream cluster (server side) is in panic mode
- Zone aware routing is enabled
- The originating cluster and the upstream cluster have the same number of zones
- The upstream cluster has enough hosts to handle all of the request traffic
Whether Envoy routes traffic to the local zone or across zones depends on the percentage of healthy hosts in the local zone of the originating cluster and of the upstream cluster:
- If the upstream cluster's local-zone percentage is greater than or equal to the originating cluster's, Envoy routes requests to the local zone of the upstream cluster
- If the originating cluster's local-zone percentage is greater than the upstream cluster's, Envoy calculates the percentage of requests that can be routed directly to the local zone of the upstream cluster, and the remaining requests are routed to other zones
Zone aware routing is supported only for priority 0.
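Zone aware routing is enabled per cluster via common_lb_config; a minimal sketch, assuming a cluster named webcluster1 (the two values shown are Envoy's defaults, and the local zone itself comes from the node's locality settings):
clusters:
- name: webcluster1
  common_lb_config:
    zone_aware_lb_config:
      routing_enabled:
        value: 100.0        # percentage of requests subject to zone aware routing (default 100)
      min_cluster_size: 6   # minimum upstream cluster size required before zone aware routing kicks in (default 6)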
In multi-tier service invocation scenarios, when one upstream service cannot respond to requests because of a network failure or overload, it can easily trigger large-scale cascading failures across the multiple tiers of callers above it, eventually making the whole system unavailable; this is the service avalanche effect.
In microservice applications built on a service mesh, long call chains spanning multiple tiers are not uncommon.
Circuit breaking: when an upstream service (the callee, i.e. the service provider) becomes too slow or even fails under excessive load, the downstream service (the service consumer) temporarily cuts off its calls to that upstream, sacrificing a local part of the system to protect the upstream and even the system as a whole.
Open: when the failure metrics detected within a fixed time window reach the specified threshold, the circuit breaker opens
Half Open: after being open for a period of time, the breaker automatically switches to half open, and the result of the next request determines the subsequent state transition
Closed: after some time the upstream service may become available again; the downstream then closes the breaker and resumes sending requests to it
In summary, circuit breaking is a traffic management pattern commonly used in distributed applications; it protects an application from upstream service failures, latency spikes, and other network anomalies.
Envoy enforces circuit breaking limits at the network level, so there is no need to configure and code circuit breaking into each application independently.
Envoy supports several types of fully distributed circuit breaking; when a configured threshold is reached, the corresponding circuit breaker overflows:
1. Cluster maximum connections: the maximum number of connections Envoy will establish to the upstream cluster; effectively applies only to HTTP/1.1, because HTTP/2 multiplexes requests over a single connection
1. Cluster maximum requests: the maximum number of requests that can be outstanding to all hosts in the cluster at any given time; applies only to HTTP/2
1. Cluster maximum pending requests: the maximum length of the queue of requests waiting for a connection while the connection pool is full
1. Cluster maximum active retries: the maximum number of retries that can be outstanding to all hosts in the cluster at any given time
1. Cluster maximum concurrent connection pools: the maximum number of connection pools that can be instantiated concurrently
Each circuit breaker is configured and tracked per cluster and per routing priority, and each can have its own settings.
In Istio, circuit breaking is defined through the connection pool (connection pool management) and faulty-instance isolation (outlier detection), while Envoy's circuit breakers usually correspond only to the connection pool part of Istio.
Commonly used connection pool metrics:
Commonly used circuit breaker metrics (in the Istio context):
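In an Istio DestinationRule these settings correspond to the connectionPool and outlierDetection sections; a minimal sketch for illustration (the host name webservice1 and all values below are assumptions, not taken from the examples in this section):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: webservice1
spec:
  host: webservice1
  trafficPolicy:
    connectionPool:                # maps to Envoy's circuit_breakers thresholds
      tcp:
        maxConnections: 1          # -> max_connections
      http:
        http1MaxPendingRequests: 1 # -> max_pending_requests
        http2MaxRequests: 1024     # -> max_requests
        maxRetries: 3              # -> max_retries
    outlierDetection:              # maps to Envoy's outlier_detection
      consecutive5xxErrors: 3
      interval: 1s
      baseEjectionTime: 10s
      maxEjectionPercent: 30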
---
clusters:
- name: ...
...
  connect_timeout: ... # TCP connection timeout, i.e. the host network connection timeout; a reasonable value helps keep one slow upstream call from slowing down the entire chain;
  max_requests_per_connection: ... # maximum number of requests a single connection may carry; both the HTTP/1.1 and the HTTP/2 connection pools are bound by this setting; unset means unlimited, and 1 disables keep-alive;
  ...
  circuit_breakers: {...} # circuit-breaking related configuration, optional;
    thresholds: [] # list of metrics and thresholds that apply to a given routing priority;
    - priority: ... # the routing priority the current breaker applies to;
      max_connections: ... # maximum number of concurrent connections to the upstream cluster, HTTP/1 only, default 1024; connections beyond this number are short-circuited;
      max_pending_requests: ... # maximum number of requests allowed to be pending, default 1024; requests beyond this number are short-circuited;
      max_requests: ... # maximum number of concurrent requests Envoy may dispatch to the upstream cluster, default 1024; HTTP/2 only;
      max_retries: ... # maximum number of concurrent retries to the upstream cluster (assuming a retry_policy is configured), default 3;
      track_remaining: ... # when true, statistics are published showing the number of resources remaining before the breaker opens; default false;
      max_connection_pools: ... # maximum number of connection pools that may be open concurrently per cluster; unlimited by default;
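As a concrete illustration of the skeleton above (the values are illustrative and mirror the webcluster1 example further below):
circuit_breakers:
  thresholds:
  - priority: DEFAULT      # applies to normal-priority routes
    max_connections: 1
    max_pending_requests: 1
    max_retries: 3
    track_remaining: true  # publish circuit_breakers.default.remaining_* gauges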
Testing with fortio:
fortio load -c 2 -qps 0 -n 20 -loglevel Warning URL
-c 2: number of concurrent connections
-qps 0: no limit on queries per second
-n 20: send 20 requests in total
-loglevel Warning: set the log level to Warning
A total of 11 containers:
version: '3'
services:
front-envoy:
image: envoyproxy/envoy-alpine:v1.21.5
volumes:
- ./front-envoy.yaml:/etc/envoy/envoy.yaml
networks:
- envoymesh
expose:
# Expose ports 80 (for general traffic) and 9901 (for the admin server)
- "80"
- "9901"
webserver01-sidecar:
image: envoyproxy/envoy-alpine:v1.21.5
volumes:
- ./envoy-sidecar-proxy.yaml:/etc/envoy/envoy.yaml
hostname: red
networks:
envoymesh:
ipv4_address: 172.31.35.11
aliases:
- webservice1
- red
webserver01:
image: ikubernetes/demoapp:v1.0
environment:
- PORT=8080
- HOST=127.0.0.1
network_mode: "service:webserver01-sidecar"
depends_on:
- webserver01-sidecar
webserver02-sidecar:
image: envoyproxy/envoy-alpine:v1.21.5
volumes:
- ./envoy-sidecar-proxy.yaml:/etc/envoy/envoy.yaml
hostname: blue
networks:
envoymesh:
ipv4_address: 172.31.35.12
aliases:
- webservice1
- blue
webserver02:
image: ikubernetes/demoapp:v1.0
environment:
- PORT=8080
- HOST=127.0.0.1
network_mode: "service:webserver02-sidecar"
depends_on:
- webserver02-sidecar
webserver03-sidecar:
image: envoyproxy/envoy-alpine:v1.21.5
volumes:
- ./envoy-sidecar-proxy.yaml:/etc/envoy/envoy.yaml
hostname: green
networks:
envoymesh:
ipv4_address: 172.31.35.13
aliases:
- webservice1
- green
webserver03:
image: ikubernetes/demoapp:v1.0
environment:
- PORT=8080
- HOST=127.0.0.1
network_mode: "service:webserver03-sidecar"
depends_on:
- webserver03-sidecar
webserver04-sidecar:
image: envoyproxy/envoy-alpine:v1.21.5
volumes:
- ./envoy-sidecar-proxy.yaml:/etc/envoy/envoy.yaml
hostname: gray
networks:
envoymesh:
ipv4_address: 172.31.35.14
aliases:
- webservice2
- gray
webserver04:
image: ikubernetes/demoapp:v1.0
environment:
- PORT=8080
- HOST=127.0.0.1
network_mode: "service:webserver04-sidecar"
depends_on:
- webserver04-sidecar
webserver05-sidecar:
image: envoyproxy/envoy-alpine:v1.21.5
volumes:
- ./envoy-sidecar-proxy.yaml:/etc/envoy/envoy.yaml
hostname: black
networks:
envoymesh:
ipv4_address: 172.31.35.15
aliases:
- webservice2
- black
webserver05:
image: ikubernetes/demoapp:v1.0
environment:
- PORT=8080
- HOST=127.0.0.1
network_mode: "service:webserver05-sidecar"
depends_on:
- webserver05-sidecar
networks:
envoymesh:
driver: bridge
ipam:
config:
- subnet: 172.31.35.0/24
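All sidecar containers above mount envoy-sidecar-proxy.yaml; a minimal sketch of what such a sidecar configuration typically looks like, assuming it listens on port 80 and simply forwards everything to the local demoapp at 127.0.0.1:8080 (matching the PORT and HOST environment variables above):
admin:
  access_log_path: "/dev/null"
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_http
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: auto
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local_service }
          http_filters:
          - name: envoy.filters.http.router
  clusters:
  - name: local_service
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 8080 }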
Circuit breaking: maximum concurrent connections 1, maximum pending requests 1, maximum concurrent retries 3
Outlier detection: interval 1s, 3 consecutive 5xx errors, 3 consecutive gateway failures, base ejection time 10s, enforcement percentage for consecutive gateway failures 100%, at most 30% of the hosts in the cluster may be ejected, and success-rate-based detection requires at least 2 hosts in the cluster
admin:
access_log_path: "/dev/null"
address:
socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
listeners:
- address:
socket_address: { address: 0.0.0.0, port_value: 80 }
name: listener_http
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
codec_type: auto
stat_prefix: ingress_http
route_config:
name: local_route
virtual_hosts:
- name: backend
domains:
- "*"
routes:
- match:
prefix: "/livez"
route:
cluster: webcluster2
- match:
prefix: "/"
route:
cluster: webcluster1
http_filters:
- name: envoy.filters.http.router
clusters:
- name: webcluster1
connect_timeout: 0.25s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: webcluster1
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: webservice1
port_value: 80
    circuit_breakers:
      thresholds:
      - max_connections: 1
        max_pending_requests: 1
        max_retries: 3
- name: webcluster2
connect_timeout: 0.25s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: webcluster2
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: webservice2
port_value: 80
outlier_detection:
interval: "1s"
consecutive_5xx: "3"
consecutive_gateway_failure: "3"
base_ejection_time: "10s"
enforcing_consecutive_gateway_failure: "100"
max_ejection_percent: "30"
success_rate_minimum_hosts: "2"
# docker-compose up
## Test script
# cat send-requests.sh
#!/bin/bash
if [ $# -ne 2 ]
then
echo "USAGE: $0 "
exit 1;
fi
URL=$1
COUNT=$2
c=1
#interval="0.2"
while [[ ${c} -le ${COUNT} ]];
do
#echo "Sending GET request: ${URL}"
curl -o /dev/null -w '%{http_code}\n' -s ${URL} &
(( c++ ))
# sleep $interval
done
wait
Test results:
/ (webcluster1) returns 503 roughly 5.83% of the time (three backends: .11, .12, .13)
/livez (webcluster2) returns 503 roughly 8.83% of the time (two backends: .14, .15)
# ./send-requests.sh http://172.31.35.2/ 600 > scale.txt
# wc -l scale.txt
600 scale.txt
# grep 200 scale.txt|wc -l
565
# grep 5 scale.txt|wc -l
35
# echo 'scale=4;35/600*100'|bc
5.8300
# ./send-requests.sh http://172.31.35.2/livez 600 > livez.txt
# grep 503 livez.txt|wc -l
53
# echo 'scale=4;53/600*100'|bc
8.8300
If host .13 is now put into the FAIL state, the result is almost the same as before:
root@k8s-node-1:~# curl -XPOST -d 'livez=FAIL' http://172.31.35.13/livez
root@k8s-node-1:~# curl http://172.31.35.13/livez
FAILroot@k8s-node-1:~# ^C
root@k8s-node-1:~# curl http://172.31.35.13/livez
FAIL
## At this point
# ./send-requests.sh http://172.31.35.2/ 600 > scale.txt
# grep 503 scale.txt|wc -l
44
# echo 'scale=4;30/600*100'|bc
5.0000
After setting host .15 to FAIL, the number of 503 responses is nearly three times the previous value (142 vs. 53):
root@k8s-node-1:~# curl -XPOST -d 'livez=FAIL' http://172.31.35.15/livez
root@k8s-node-1:~# curl http://172.31.35.15/livez
FAILroot@k8s-node-1:~#
# ./send-requests.sh http://172.31.35.2/livez 600 > livez.txt
# grep 5 livez.txt|wc -l
142
# echo 'scale=4;142/600*100'|bc
23.6600