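vswitchd is a user-space daemon whose core job is to run the ofproto logic. OVS is implemented according to the OpenFlow switch specification. Take layer-2 forwarding as an example: a traditional switch (including the Linux bridge implementation) looks up the CAM table to find the port that corresponds to the destination MAC, whereas Open vSwitch takes the incoming skb and checks whether a matching flow exists. If one does, this skb is not the first packet of the flow, and the output port can be found in flow->action. Note that the idea behind SDN is that every packet must map to a flow, and the flow determines the packet's actions. Traditional actions amount to forwarding, accepting, or dropping; SDN defines more of them: modifying the skb contents, changing the packet's path, cloning the packet and sending the copies along different paths, and so on.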
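If the skb has no matching flow, it is the first packet of a flow, and a flow has to be created for it. vswitchd repeatedly checks, in a while loop, whether any ofproto request has arrived. The request may come from ovs-ofctl, or it may be an upcall that openvswitch.ko sends over netlink. In most cases it is a flow-creation request triggered by a flow miss, and vswitchd then creates the flow and its actions according to the OpenFlow specification. Let's walk through this process: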
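Because Open vSwitch follows a layer-2 switch model, every packet first arrives on some port, which means ovs_dp_process_received_packet is called. That function builds a key from the skb with ovs_flow_extract, then calls ovs_flow_tbl_lookup to look up a flow by that key. If no flow is found, it calls ovs_dp_upcall, which sends a dp_upcall_info structure to vswitchd over netlink (via genlmsg_unicast).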
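The receive path described above (extract a key, look it up, hand the packet to user space on a miss) can be sketched in a few lines of standalone C. This is only a toy model: the toy_* names, the reduced key layout, and the linear table are assumptions made for illustration and are not the kernel module's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced flow key; the real key has many more fields. */
struct toy_key {
    uint16_t in_port;
    uint8_t  eth_src[6];
    uint8_t  eth_dst[6];
    uint16_t eth_type;
};

struct toy_flow {
    struct toy_key key;
    uint16_t out_port;   /* stands in for the flow's action list */
    int in_use;
};

#define TOY_TABLE_SIZE 16
static struct toy_flow toy_table[TOY_TABLE_SIZE];
static unsigned toy_upcalls;   /* misses handed to "user space" so far */

/* Exact-match lookup, standing in for the flow table lookup. */
static struct toy_flow *toy_lookup(const struct toy_key *key)
{
    assert(key != NULL);
    for (size_t i = 0; i < TOY_TABLE_SIZE; i++) {
        if (toy_table[i].in_use
            && !memcmp(&toy_table[i].key, key, sizeof *key)) {
            return &toy_table[i];
        }
    }
    return NULL;
}

/* What user space does after a miss: install a flow for the key.
 * Returns 0 on success, -1 if the table is full. */
static int toy_install(const struct toy_key *key, uint16_t out_port)
{
    for (size_t i = 0; i < TOY_TABLE_SIZE; i++) {
        if (!toy_table[i].in_use) {
            toy_table[i].key = *key;
            toy_table[i].out_port = out_port;
            toy_table[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Returns the output port on a hit, or -1 after counting an upcall. */
static int toy_receive(const struct toy_key *key)
{
    struct toy_flow *flow = toy_lookup(key);

    if (!flow) {
        toy_upcalls++;   /* the real code sends the packet up over netlink */
        return -1;
    }
    return flow->out_port;
}
```

The first packet of a flow misses and triggers an upcall; once a flow has been installed, later packets with the same key are forwarded without leaving the fast path.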
vswitchd handles these netlink requests in handle_upcalls. For a flow-table miss it calls handle_miss_upcalls, which in turn calls handle_flow_miss. Here is the implementation of handle_miss_upcalls:
static void
handle_miss_upcalls(struct dpif_backer *backer, struct dpif_upcall *upcalls,
size_t n_upcalls)
{
/* Construct the to-do list.
*
* This just amounts to extracting the flow from each packet and sticking
* the packets that have the same flow in the same "flow_miss" structure so
* that we can process them together. */
hmap_init(&todo);
n_misses = 0;
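As the comment says, the loop below walks the struct dpif_upcall entries that netlink delivered to user space. Each one contains the missed packet and the flow key generated from it; packets with the same flow key are processed together.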
for (upcall = upcalls; upcall < &upcalls[n_upcalls]; upcall++) {
fitness = odp_flow_key_to_flow(upcall->key, upcall->key_len, &flow);
port = odp_port_to_ofport(backer, flow.in_port);
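odp_flow_key_to_flow first calls parse_flow_nlattrs (under lib/) to parse upcall->key and upcall->key_len. The attribute types it finds are recorded in the bitmap present_attrs, and the struct nlattr for each type is stored in the array struct nlattr *attrs[]. Then, for each bit set in present_attrs, the corresponding value is read from upcall->key and stored into flow. VLAN parsing is handled separately by parse_8021q_onward.
odp_port_to_ofport converts flow.in_port, the datapath port number, into the OpenFlow port, that is, struct ofport_dpif *port.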
flow_extract(upcall->packet, flow.skb_priority,
&flow.tunnel, flow.in_port, &miss->flow);
This parses the packet into a flow; the function overlaps with odp_flow_key_to_flow in places.
/* Add other packets to a to-do list. */
hash = flow_hash(&miss->flow, 0);
existing_miss = flow_miss_find(&todo, &miss->flow, hash);
if (!existing_miss) {
hmap_insert(&todo, &miss->hmap_node, hash);
miss->ofproto = ofproto;
miss->key = upcall->key;
miss->key_len = upcall->key_len;
miss->upcall_type = upcall->type;
list_init(&miss->packets);
n_misses++;
} else {
miss = existing_miss;
}
list_push_back(&miss->packets, &upcall->packet->list_node);
}
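flow_hash computes the hash of miss->flow, and that hash is then used to look up a struct flow_miss * in the todo hmap. If nothing is found, this is the first miss for that flow, so the flow_miss is initialized and inserted into todo. Finally the packet is appended to the flow_miss->packets list. This confirms the earlier point: when several upcalls arrive at once, packets that belong to the same flow_miss are linked under that one flow_miss and processed together.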
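The grouping step can be shown in isolation. The sketch below is an illustration only: it replaces the hmap with a small array and a linear search, and it reduces a flow to a single integer id. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MISSES 8

/* Hypothetical stand-in for struct flow_miss: one entry per distinct flow. */
struct toy_miss {
    uint32_t flow_id;    /* stands in for the full struct flow */
    size_t n_packets;    /* stands in for the 'packets' list */
};

struct toy_todo {
    struct toy_miss misses[TOY_MAX_MISSES];
    size_t n_misses;
};

/* Linear search; the real code uses the flow hash to pick an hmap bucket. */
static struct toy_miss *toy_miss_find(struct toy_todo *todo, uint32_t flow_id)
{
    assert(todo != NULL);
    for (size_t i = 0; i < todo->n_misses; i++) {
        if (todo->misses[i].flow_id == flow_id) {
            return &todo->misses[i];
        }
    }
    return NULL;
}

/* Files one upcall under its flow. Returns 0 on success, -1 if full. */
static int toy_todo_add(struct toy_todo *todo, uint32_t flow_id)
{
    struct toy_miss *miss = toy_miss_find(todo, flow_id);

    if (!miss) {
        if (todo->n_misses == TOY_MAX_MISSES) {
            return -1;
        }
        /* First packet of this flow in the batch: start a new entry. */
        miss = &todo->misses[todo->n_misses++];
        miss->flow_id = flow_id;
        miss->n_packets = 0;
    }
    miss->n_packets++;   /* like list_push_back() on miss->packets */
    return 0;
}
```

Feeding the upcall sequence 7, 9, 7, 7, 9 through toy_todo_add leaves two entries, one holding three packets and one holding two, so each flow is translated once per batch instead of once per packet.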
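OVS defines a facet to represent the view that a user-space program such as vswitchd has of a matched flow. Kernel space has its own view of the same flow; the facet represents the part the two views have in common. The parts that differ are represented by subfacets, and struct subfacet holds the actions.
If the flow key computed by the datapath matches the flow key that vswitchd computes from the packet exactly, the facet contains a single subfacet. If the datapath's flow key has more members than the one vswitchd computes from the packet, each extra part becomes a subfacet.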
struct subfacet {
/* Owners. */
struct hmap_node hmap_node; /* In struct ofproto_dpif 'subfacets' list. */
struct list list_node; /* In struct facet's 'facets' list. */
struct facet *facet; /* Owning facet. */
/* Key.
*
* To save memory in the common case, 'key' is NULL if 'key_fitness' is
* ODP_FIT_PERFECT, that is, odp_flow_key_from_flow() can accurately
* regenerate the ODP flow key from ->facet->flow. */
enum odp_key_fitness key_fitness;
struct nlattr *key;
int key_len;
long long int used; /* Time last used; time created if not used. */
uint64_t dp_packet_count; /* Last known packet count in the datapath. */
uint64_t dp_byte_count; /* Last known byte count in the datapath. */
/* Datapath actions.
*
* These should be essentially identical for every subfacet in a facet, but
* may differ in trivial ways due to VLAN splinters. */
size_t actions_len; /* Number of bytes in actions[]. */
struct nlattr *actions; /* Datapath actions. */
enum slow_path_reason slow; /* 0 if fast path may be used. */
enum subfacet_path path; /* Installed in datapath? */
};
Let's look at handle_flow_miss first:
/* Handles flow miss 'miss' on 'ofproto'. May add any required datapath
* operations to 'ops', incrementing '*n_ops' for each new op. */
static void
handle_flow_miss(struct ofproto_dpif *ofproto, struct flow_miss *miss,
struct flow_miss_op *ops, size_t *n_ops)
{
struct facet *facet;
uint32_t hash;
/* The caller must ensure that miss->hmap_node.hash contains
* flow_hash(miss->flow, 0). */
hash = miss->hmap_node.hash;
facet = facet_lookup_valid(ofproto, &miss->flow, hash);
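This looks up the flow in struct ofproto_dpif *ofproto, the structure that represents the datapath. ofproto->facets is a hash map: the hash of the miss flow is computed first, and the list of hmap_node entries for that hash is then searched for a matching flow. The comparison is brute force: a plain memcmp.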
if (!facet) {
struct rule_dpif *rule = rule_dpif_lookup(ofproto, &miss->flow);
if (!flow_miss_should_make_facet(ofproto, miss, hash)) {
handle_flow_miss_without_facet(miss, rule, ops, n_ops);
At this point creating a facet is judged not worthwhile: for trivial traffic, creating one would add more overhead than it saves.
return;
}
facet = facet_create(rule, &miss->flow, hash);
Otherwise, go ahead and create a facet for this flow.
}
handle_flow_miss_with_facet(miss, facet, ops, n_ops);
}
struct flow_miss is a wrapper around a flow that speeds up batch processing of missed flows. In most cases the facet does get created, as these log lines show:
2012-10-26T07:15:43Z|22522|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 1, src mac 0:16:3e:83:0:1, dst mac 0:25:9e:5d:62:53
2012-10-26T07:15:43Z|22529|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 2, src mac 0:25:9e:5d:62:53, dst mac 0:16:3e:83:0:1
As the log shows, one bidirectional exchange creates two flows, one per direction, and a facet is created for each.
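The facet lookup (pick a bucket by hash, then compare whole flows with memcmp) and the one-facet-per-direction behaviour seen in the log can be reproduced with a small chained hash table. This is a sketch under assumptions: the toy_* names and the reduced flow layout are made up, and the FNV-1a hash is a placeholder rather than the hash OVS uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced flow with no padding bytes, so memcmp() is a valid equality test. */
struct toy_flow {
    uint32_t in_port;
    uint32_t vlan_tci;
    uint8_t  dl_src[6];
    uint8_t  dl_dst[6];
    uint32_t dl_type;
};

struct toy_facet {
    struct toy_facet *next;   /* next facet in the same bucket */
    uint32_t hash;
    struct toy_flow flow;
};

#define TOY_BUCKETS 8
struct toy_facets {
    struct toy_facet *buckets[TOY_BUCKETS];
};

/* FNV-1a over the raw bytes; a placeholder hash for the sketch. */
static uint32_t toy_flow_hash(const struct toy_flow *flow)
{
    const uint8_t *p = (const uint8_t *) flow;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof *flow; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

static void toy_facet_insert(struct toy_facets *t, struct toy_facet *f,
                             const struct toy_flow *flow)
{
    assert(t != NULL && f != NULL);
    f->flow = *flow;
    f->hash = toy_flow_hash(flow);
    f->next = t->buckets[f->hash % TOY_BUCKETS];
    t->buckets[f->hash % TOY_BUCKETS] = f;
}

/* Walk the bucket chain and compare candidates byte for byte. */
static struct toy_facet *toy_facet_find(const struct toy_facets *t,
                                        const struct toy_flow *flow)
{
    uint32_t hash = toy_flow_hash(flow);
    struct toy_facet *f;

    for (f = t->buckets[hash % TOY_BUCKETS]; f; f = f->next) {
        if (f->hash == hash && !memcmp(&f->flow, flow, sizeof *flow)) {
            return f;
        }
    }
    return NULL;
}
```

Because the reverse direction swaps in_port, dl_src, and dl_dst, it never compares equal to the forward flow, which is why the two ARP packets in the log each get their own facet.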
Next is handle_flow_miss_with_facet, which calls subfacet_make_actions to generate the actions. That function first calls action_xlate_ctx_init to initialize an action_xlate_ctx structure, defined as follows:
struct action_xlate_ctx {
/* action_xlate_ctx_init() initializes these members. */
/* The ofproto. */
struct ofproto_dpif *ofproto;
/* Flow to which the OpenFlow actions apply. xlate_actions() will modify
* this flow when actions change header fields. */
struct flow flow;
/* The packet corresponding to 'flow', or a null pointer if we are
* revalidating without a packet to refer to. */
const struct ofpbuf *packet;
/* Should OFPP_NORMAL update the MAC learning table? Should "learn"
* actions update the flow table?
*
* We want to update these tables if we are actually processing a packet,
* or if we are accounting for packets that the datapath has processed, but
* not if we are just revalidating. */
bool may_learn;
/* The rule that we are currently translating, or NULL. */
struct rule_dpif *rule;
/* Union of the set of TCP flags seen so far in this flow. (Used only by
* NXAST_FIN_TIMEOUT. Set to zero to avoid updating updating rules'
* timeouts.) */
uint8_t tcp_flags;
/* xlate_actions() initializes and uses these members. The client might want
* to look at them after it returns. */
struct ofpbuf *odp_actions; /* Datapath actions. */
tag_type tags; /* Tags associated with actions. */
enum slow_path_reason slow; /* 0 if fast path may be used. */
bool has_learn; /* Actions include NXAST_LEARN? */
bool has_normal; /* Actions output to OFPP_NORMAL? */
bool has_fin_timeout; /* Actions include NXAST_FIN_TIMEOUT? */
uint16_t nf_output_iface; /* Output interface index for NetFlow. */
mirror_mask_t mirrors; /* Bitmap of associated mirrors. */
/* xlate_actions() initializes and uses these members, but the client has no
* reason to look at them. */
int recurse; /* Recursion level, via xlate_table_action. */
bool max_resubmit_trigger; /* Recursed too deeply during translation. */
struct flow base_flow; /* Flow at the last commit. */
uint32_t orig_skb_priority; /* Priority when packet arrived. */
uint8_t table_id; /* OpenFlow table ID where flow was found. */
uint32_t sflow_n_outputs; /* Number of output ports. */
uint16_t sflow_odp_port; /* Output port for composing sFlow action. */
uint16_t user_cookie_offset;/* Used for user_action_cookie fixup. */
bool exit; /* No further actions should be processed. */
struct flow orig_flow; /* Copy of original flow. */
};
xlate_actions is called next. OpenFlow 1.0 defines the following actions:
enum ofp10_action_type {
OFPAT10_OUTPUT, /* Output to switch port. */
OFPAT10_SET_VLAN_VID, /* Set the 802.1q VLAN id. */
OFPAT10_SET_VLAN_PCP, /* Set the 802.1q priority. */
OFPAT10_STRIP_VLAN, /* Strip the 802.1q header. */
OFPAT10_SET_DL_SRC, /* Ethernet source address. */
OFPAT10_SET_DL_DST, /* Ethernet destination address. */
OFPAT10_SET_NW_SRC, /* IP source address. */
OFPAT10_SET_NW_DST, /* IP destination address. */
OFPAT10_SET_NW_TOS, /* IP ToS (DSCP field, 6 bits). */
OFPAT10_SET_TP_SRC, /* TCP/UDP source port. */
OFPAT10_SET_TP_DST, /* TCP/UDP destination port. */
OFPAT10_ENQUEUE, /* Output to queue. */
OFPAT10_VENDOR = 0xffff
};
Each action type carries its own data structure, e.g.
/* Action structure for OFPAT10_SET_VLAN_VID. */
struct ofp_action_vlan_vid {
ovs_be16 type; /* OFPAT10_SET_VLAN_VID. */
ovs_be16 len; /* Length is 8. */
ovs_be16 vlan_vid; /* VLAN id. */
uint8_t pad[2];
};
/* Action structure for OFPAT10_SET_VLAN_PCP. */
struct ofp_action_vlan_pcp {
ovs_be16 type; /* OFPAT10_SET_VLAN_PCP. */
ovs_be16 len; /* Length is 8. */
uint8_t vlan_pcp; /* VLAN priority. */
uint8_t pad[3];
};
union ofp_action {
ovs_be16 type;
struct ofp_action_header header;
struct ofp_action_vendor_header vendor;
struct ofp_action_output output;
struct ofp_action_vlan_vid vlan_vid;
struct ofp_action_vlan_pcp vlan_pcp;
struct ofp_action_nw_addr nw_addr;
struct ofp_action_nw_tos nw_tos;
struct ofp_action_tp_port tp_port;
};
do_xlate_actions receives an array of union ofp_action and performs a different operation for each element, e.g.
case OFPUTIL_OFPAT10_OUTPUT:
xlate_output_action(ctx, &ia->output);
break;
case OFPUTIL_OFPAT10_SET_VLAN_VID:
ctx->flow.vlan_tci &= ~htons(VLAN_VID_MASK);
ctx->flow.vlan_tci |= ia->vlan_vid.vlan_vid | htons(VLAN_CFI);
break;
case OFPUTIL_OFPAT10_SET_VLAN_PCP:
ctx->flow.vlan_tci &= ~htons(VLAN_PCP_MASK);
ctx->flow.vlan_tci |= htons(
(ia->vlan_pcp.vlan_pcp << VLAN_PCP_SHIFT) | VLAN_CFI);
break;
case OFPUTIL_OFPAT10_STRIP_VLAN:
ctx->flow.vlan_tci = htons(0);
break;
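The VLAN cases above are plain bit manipulation on the 16-bit TCI, which is kept in network byte order. The standalone sketch below repeats the same operations so they can be tried in isolation. The mask values are, to the best of my knowledge, the ones in lib/packets.h, but treat them as assumptions; the tci_* helper names are made up for illustration.

```c
#include <arpa/inet.h>   /* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>

/* Assumed to match lib/packets.h. */
#define VLAN_VID_MASK  0x0fff
#define VLAN_PCP_MASK  0xe000
#define VLAN_PCP_SHIFT 13
#define VLAN_CFI       0x1000

/* Replace the VLAN id, keep the priority, mark the tag as present.
 * Both arguments and the result are in network byte order. */
static uint16_t tci_set_vid(uint16_t tci_be, uint16_t vid_be)
{
    tci_be &= ~htons(VLAN_VID_MASK);
    tci_be |= vid_be | htons(VLAN_CFI);
    return tci_be;
}

/* Replace the 3-bit priority, keep the VLAN id, mark the tag as present. */
static uint16_t tci_set_pcp(uint16_t tci_be, uint8_t pcp)
{
    assert(pcp <= 7);
    tci_be &= ~htons(VLAN_PCP_MASK);
    tci_be |= htons((uint16_t) ((pcp << VLAN_PCP_SHIFT) | VLAN_CFI));
    return tci_be;
}

/* Strip the tag: a TCI of zero means "no 802.1Q header". */
static uint16_t tci_strip(uint16_t tci_be)
{
    (void) tci_be;
    return htons(0);
}
```

Note that both setters OR in VLAN_CFI. In the flow structure that bit serves as a "tag present" marker, so setting a VID or PCP on an untagged flow implicitly adds a tag.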
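For forwarding, the most important case is xlate_output_action, which calls xlate_output_action__. The port passed in is an OpenFlow port number or one of the reserved control values that appear in the definition of ofp_port: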
enum ofp_port {
/* Maximum number of physical switch ports. */
OFPP_MAX = 0xff00,
/* Fake output "ports". */
OFPP_IN_PORT = 0xfff8, /* Send the packet out the input port. This
virtual port must be explicitly used
in order to send back out of the input
port. */
OFPP_TABLE = 0xfff9, /* Perform actions in flow table.
NB: This can only be the destination
port for packet-out messages. */
OFPP_NORMAL = 0xfffa, /* Process with normal L2/L3 switching. */
OFPP_FLOOD = 0xfffb, /* All physical ports except input port and
those disabled by STP. */
OFPP_ALL = 0xfffc, /* All physical ports except input port. */
OFPP_CONTROLLER = 0xfffd, /* Send to controller. */
OFPP_LOCAL = 0xfffe, /* Local openflow "port". */
OFPP_NONE = 0xffff /* Not associated with a physical port. */
};
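In xlate_output_action__, most traffic takes the OFPP_NORMAL branch and calls xlate_normal, which calls mac_learning_lookup to find the packet's output port in the MAC table and then calls output_normal. output_normal eventually ends up in compose_output_action__ (through compose_output_action):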
compose_output_action__(struct action_xlate_ctx *ctx, uint16_t ofp_port,
bool check_stp)
{
const struct ofport_dpif *ofport = get_ofp_port(ctx->ofproto, ofp_port);
uint16_t odp_port = ofp_port_to_odp_port(ofp_port);
ovs_be16 flow_vlan_tci = ctx->flow.vlan_tci;
uint8_t flow_nw_tos = ctx->flow.nw_tos;
uint16_t out_port;
...
out_port = vsp_realdev_to_vlandev(ctx->ofproto, odp_port,
ctx->flow.vlan_tci);
if (out_port != odp_port) {
ctx->flow.vlan_tci = htons(0);
}
commit_odp_actions(&ctx->flow, &ctx->base_flow, ctx->odp_actions);
nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port);
ctx->sflow_odp_port = odp_port;
ctx->sflow_n_outputs++;
ctx->nf_output_iface = ofp_port;
ctx->flow.vlan_tci = flow_vlan_tci;
ctx->flow.nw_tos = flow_nw_tos;
}
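commit_odp_actions encodes the actions accumulated so far in nlattr format and stores them in ctx->odp_actions. The nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port) call that follows appends the packet's output port, and with that the flow's actions are more or less complete.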
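To make the nlattr format concrete, here is a minimal sketch of what appending one attribute to a buffer looks like: a 2-byte length, a 2-byte type, the payload, and padding up to a 4-byte boundary, all in host byte order. It is not the OVS nl_msg_put_u32 implementation. The toy_* names are hypothetical, and the value 1 for the output attribute is my assumption about OVS_ACTION_ATTR_OUTPUT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NLA_HDRLEN 4           /* 2-byte length + 2-byte type */
#define TOY_ACTION_ATTR_OUTPUT 1   /* assumed value of OVS_ACTION_ATTR_OUTPUT */

struct toy_buf {
    uint8_t data[64];
    size_t size;                   /* bytes used so far */
};

/* Netlink pads every attribute to a multiple of 4 bytes. */
static size_t toy_nla_align(size_t len)
{
    return (len + 3) & ~(size_t) 3;
}

/* Appends one attribute. Returns 0 on success, -1 if it does not fit. */
static int toy_put(struct toy_buf *b, uint16_t type,
                   const void *payload, size_t len)
{
    assert(len <= UINT16_MAX - TOY_NLA_HDRLEN);

    uint16_t nla_len = (uint16_t) (TOY_NLA_HDRLEN + len);  /* no padding */
    size_t total = toy_nla_align(nla_len);

    if (total > sizeof b->data - b->size) {
        return -1;
    }
    memset(b->data + b->size, 0, total);                   /* zero padding */
    memcpy(b->data + b->size, &nla_len, sizeof nla_len);
    memcpy(b->data + b->size + 2, &type, sizeof type);
    memcpy(b->data + b->size + TOY_NLA_HDRLEN, payload, len);
    b->size += total;
    return 0;
}

/* Convenience wrapper for a 32-bit payload such as an output port. */
static int toy_put_u32(struct toy_buf *b, uint16_t type, uint32_t value)
{
    return toy_put(b, type, &value, sizeof value);
}
```

An output action for port 3 therefore occupies 8 bytes: length 8, type 1, then the 32-bit port number. The kernel datapath walks the installed action list attribute by attribute in exactly this layout.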
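Now for the CAM table inside vswitchd. The code lives in lib/mac-learning.h and lib/mac-learning.c.
vswitchd maintains its own MAC-to-port CAM table, in which a MAC entry ages out after 300 seconds. The table also has the notion of a flooding VLAN: if a VLAN is marked as flooding, no addresses are learned on it and all forwarding on that VLAN is done by flooding.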
/* A MAC learning table entry. */
struct mac_entry {
struct hmap_node hmap_node; /* Node in a mac_learning hmap. */
struct list lru_node; /* Element in 'lrus' list. */
time_t expires; /* Expiration time. */
time_t grat_arp_lock; /* Gratuitous ARP lock expiration time. */
uint8_t mac[ETH_ADDR_LEN]; /* Known MAC address. */
uint16_t vlan; /* VLAN tag. */
tag_type tag; /* Tag for this learning entry. */
/* Learned port. */
union {
void *p;
int i;
} port;
};
/* MAC learning table. */
struct mac_learning {
struct hmap table; /* Learning table: an hmap of mac_entry, each linked
in through its hmap_node. */
struct list lrus; /* In-use entries, least recently used at the
front, most recently used at the back. Each
mac_entry is linked in through its lru_node. */
uint32_t secret; /* Secret for randomizing hash table. */
unsigned long *flood_vlans; /* Bitmap of learning disabled VLANs. */
unsigned int idle_time; /* Max age before deleting an entry. */
};
static uint32_t
mac_table_hash(const struct mac_learning *ml, const uint8_t mac[ETH_ADDR_LEN],
uint16_t vlan)
{
unsigned int mac1 = get_unaligned_u32((uint32_t *) mac);
unsigned int mac2 = get_unaligned_u16((uint16_t *) (mac + 4));
return hash_3words(mac1, mac2 | (vlan << 16), ml->secret);
}
The hash of a mac_entry is computed by hash_3words from mac_learning->secret, the VLAN, and the MAC address.
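mac_entry_lookup: checks whether a mac_entry already exists for the given MAC address and VLAN.
get_lru: returns the first mac_entry on the LRU list.
mac_learning_create/mac_learning_destroy: create and destroy the mac_learning table.
mac_learning_may_learn: returns true if the VLAN is not a flooding VLAN and the MAC address is not a multicast address.
mac_learning_insert: inserts a mac_entry into mac_learning. It first calls mac_entry_lookup to see whether an entry for the MAC and VLAN exists. If not, and the table already holds MAC_MAX entries, the least recently used entry is evicted; then a new mac_entry is created and inserted into the CAM table.
mac_learning_lookup: calls mac_entry_lookup to find the entry for a MAC address on a given VLAN in the CAM table.
mac_learning_run: loops over the entries and expires those that have timed out.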
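The functions listed above fit together as a small learning table with aging. The sketch below is a toy version meant only to show the control flow: it uses a fixed array instead of an hmap plus LRU list, it picks the eviction victim by earliest expiry (which matches LRU order here, since every refresh pushes the expiry forward), and all toy_* names and the table size are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAC_MAX   4     /* tiny on purpose; plays the role of MAC_MAX */
#define TOY_IDLE_TIME 300   /* seconds, the aging time mentioned in the text */

struct toy_entry {
    bool used;
    uint8_t mac[6];
    uint16_t vlan;
    int port;
    long expires;           /* absolute expiry time, in seconds */
};

struct toy_ml {
    struct toy_entry entries[TOY_MAC_MAX];
    bool flood_vlans[4096]; /* true: never learn on this VLAN */
};

/* Learn only on non-flooding VLANs and only unicast source addresses. */
static bool toy_may_learn(const struct toy_ml *ml, const uint8_t mac[6],
                          uint16_t vlan)
{
    return !ml->flood_vlans[vlan & 0x0fff] && !(mac[0] & 1);
}

static struct toy_entry *toy_lookup(struct toy_ml *ml, const uint8_t mac[6],
                                    uint16_t vlan)
{
    assert(ml != NULL);
    for (size_t i = 0; i < TOY_MAC_MAX; i++) {
        struct toy_entry *e = &ml->entries[i];
        if (e->used && e->vlan == vlan && !memcmp(e->mac, mac, 6)) {
            return e;
        }
    }
    return NULL;
}

/* Insert or refresh an entry. Returns false if learning is not allowed. */
static bool toy_learn(struct toy_ml *ml, const uint8_t mac[6], uint16_t vlan,
                      int port, long now)
{
    if (!toy_may_learn(ml, mac, vlan)) {
        return false;
    }
    struct toy_entry *e = toy_lookup(ml, mac, vlan);
    if (!e) {
        /* Prefer a free slot; otherwise evict the entry that expires first. */
        e = &ml->entries[0];
        for (size_t i = 0; i < TOY_MAC_MAX; i++) {
            struct toy_entry *c = &ml->entries[i];
            if (!c->used) {
                e = c;
                break;
            }
            if (c->expires < e->expires) {
                e = c;
            }
        }
        memcpy(e->mac, mac, 6);
        e->vlan = vlan;
        e->used = true;
    }
    e->port = port;
    e->expires = now + TOY_IDLE_TIME;
    return true;
}

/* Age out timed-out entries. Returns how many were removed. */
static size_t toy_run(struct toy_ml *ml, long now)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_MAC_MAX; i++) {
        if (ml->entries[i].used && ml->entries[i].expires <= now) {
            ml->entries[i].used = false;
            n++;
        }
    }
    return n;
}
```

In this model a forwarding decision would call toy_lookup on the destination MAC and flood when it returns NULL, which mirrors the role mac_learning_lookup plays inside xlate_normal.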