Open vSwitch Study: vswitchd

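vswitchd is the user-space daemon, and its core job is to run the ofproto logic. OVS is implemented according to the OpenFlow switch specification. Take layer-2 forwarding as an example: a traditional switch (including the Linux bridge implementation) looks up its CAM table to find the port that corresponds to the destination MAC, whereas Open vSwitch takes the incoming packet (skb) and looks for a matching flow. If a flow exists, this skb is not the first packet of the flow, and the output port can be found in flow->action. Note that the idea behind SDN is that every packet must correspond to a flow, and the flow determines the packet's actions. Traditional actions are little more than forward, accept, or drop. SDN defines many more: modifying the skb contents, changing the packet's path, cloning the packet and sending the copies along different paths, and so on.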


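If the skb has no matching flow, it is the first packet of a flow and a flow must be created for it. vswitchd runs a while loop that repeatedly checks for incoming ofproto requests. These may come from ovs-ofctl, or they may be upcall requests that openvswitch.ko sends over netlink. In most cases they are flow-creation requests caused by a flow miss, and vswitchd then creates the flow and its actions according to the OpenFlow specification. Let's walk through this path: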


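Because Open vSwitch follows a layer-2 switch model, every packet initially arrives on some port, which means ovs_dp_process_received_packet is called. That function first calls ovs_flow_extract to build a key from the skb, then calls ovs_flow_tbl_lookup to find a flow by that key. If no flow is found, it calls ovs_dp_upcall, which sends a dp_upcall_info structure over netlink (via genlmsg_unicast) to vswitchd for processing.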
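
The receive path just described (build a key, look it up, hand the packet to user space on a miss) can be sketched in standalone C. This is a toy model and not the kernel code: toy_key, toy_flow, toy_datapath, toy_receive, and toy_flow_install are hypothetical names, the table is a small linear array rather than the datapath's hash table, and the "upcall" is only a counter instead of a netlink message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified flow key; the real key carries many more fields. */
struct toy_key {
    uint32_t in_port;
    uint8_t  eth_src[6];
    uint8_t  eth_dst[6];
    uint16_t eth_type;
};

struct toy_flow {
    struct toy_key key;
    uint32_t out_port;      /* The only "action" in this toy: output. */
    uint64_t n_packets;     /* Packets that hit this flow. */
};

#define TOY_TABLE_MAX 16

struct toy_datapath {
    struct toy_flow flows[TOY_TABLE_MAX];
    size_t n_flows;
    unsigned int n_upcalls; /* Stand-in for messages sent to user space. */
};

/* Field-by-field comparison, so struct padding never matters. */
static int
toy_key_equal(const struct toy_key *a, const struct toy_key *b)
{
    return a->in_port == b->in_port
        && a->eth_type == b->eth_type
        && !memcmp(a->eth_src, b->eth_src, 6)
        && !memcmp(a->eth_dst, b->eth_dst, 6);
}

static struct toy_flow *
toy_table_lookup(struct toy_datapath *dp, const struct toy_key *key)
{
    for (size_t i = 0; i < dp->n_flows; i++) {
        if (toy_key_equal(&dp->flows[i].key, key)) {
            return &dp->flows[i];
        }
    }
    return NULL;
}

/* Returns the output port on a hit, or -1 after recording an upcall on a
 * miss. */
static int
toy_receive(struct toy_datapath *dp, const struct toy_key *key)
{
    struct toy_flow *flow = toy_table_lookup(dp, key);
    if (!flow) {
        dp->n_upcalls++;
        return -1;
    }
    flow->n_packets++;
    return (int) flow->out_port;
}

/* What user space would do in response to the upcall: install a flow.
 * Returns 0 on success, -1 if the table is full. */
static int
toy_flow_install(struct toy_datapath *dp, const struct toy_key *key,
                 uint32_t out_port)
{
    if (dp->n_flows >= TOY_TABLE_MAX) {
        return -1;
    }
    dp->flows[dp->n_flows].key = *key;
    dp->flows[dp->n_flows].out_port = out_port;
    dp->flows[dp->n_flows].n_packets = 0;
    dp->n_flows++;
    return 0;
}
```

In this sketch the first toy_receive call for a key misses and bumps n_upcalls. Once toy_flow_install has run, later packets with the same key are forwarded without involving user space.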


vswitchd handles these netlink requests in handle_upcalls. For a flow-table miss it calls handle_miss_upcalls, which in turn calls handle_flow_miss. Here is the implementation of handle_miss_upcalls:

static void
handle_miss_upcalls(struct dpif_backer *backer, struct dpif_upcall *upcalls,
                    size_t n_upcalls)
{

    /* Construct the to-do list.
     *
     * This just amounts to extracting the flow from each packet and sticking
     * the packets that have the same flow in the same "flow_miss" structure so
     * that we can process them together. */
    hmap_init(&todo);
    n_misses = 0;

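As the comment says, the loop below walks the struct dpif_upcall entries that netlink delivered to user space. Each entry holds the missed packet and the flow key generated from that packet. Packets with the same flow key are processed together.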

    for (upcall = upcalls; upcall < &upcalls[n_upcalls]; upcall++) {

        fitness = odp_flow_key_to_flow(upcall->key, upcall->key_len, &flow);
        port = odp_port_to_ofport(backer, flow.in_port); 

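odp_flow_key_to_flow first calls parse_flow_nlattrs (under lib/) to parse upcall->key and upcall->key_len. It records which attributes were found in a bitmap, present_attrs, and stores the struct nlattr for each attribute type in the array struct nlattr *attrs[]. Then, for each bit set in present_attrs, it reads the corresponding value from upcall->key and stores it in flow. VLAN parsing gets special treatment through parse_8021q_onward.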

odp_port_to_ofport converts flow.in_port, the datapath port number, into the OpenFlow port, a struct ofport_dpif *port.

        flow_extract(upcall->packet, flow.skb_priority,                                               
                     &flow.tunnel, flow.in_port, &miss->flow);                                        

This parses the packet into the flow. The function partly overlaps with odp_flow_key_to_flow.

        /* Add other packets to a to-do list. */
        hash = flow_hash(&miss->flow, 0);
        existing_miss = flow_miss_find(&todo, &miss->flow, hash);
        if (!existing_miss) {
            hmap_insert(&todo, &miss->hmap_node, hash);
            miss->ofproto = ofproto;
            miss->key = upcall->key;
            miss->key_len = upcall->key_len;                                                          
            miss->upcall_type = upcall->type;
            list_init(&miss->packets);
    
            n_misses++;                                                                               
        } else {
            miss = existing_miss;
        }   
        list_push_back(&miss->packets, &upcall->packet->list_node);
    }

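flow_hash computes the hash of miss->flow, and the todo hmap is then searched by that hash for a struct flow_miss *. If none is found, this is the first miss for that flow: the flow_miss is initialized and inserted into todo. Finally the packet is appended to the flow_miss->packets list. This confirms the earlier claim: when several upcalls arrive at once, packets belonging to the same flow are chained under one flow_miss and processed together.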
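
The grouping step can be illustrated with a minimal sketch. Everything here is hypothetical: a single integer flow_id stands in for struct flow, fixed-size arrays replace the hmap and the packet list, and toy_group_upcalls is an invented name. The find-or-create logic is the same as in the loop above.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MISSES  8
#define TOY_MAX_PACKETS 8

/* One entry of the to-do list: a flow plus every upcall that missed on it. */
struct toy_miss {
    uint32_t flow_id;               /* Stand-in for struct flow. */
    int packets[TOY_MAX_PACKETS];   /* Indices of the grouped upcalls. */
    size_t n_packets;
};

struct toy_todo {
    struct toy_miss misses[TOY_MAX_MISSES];
    size_t n_misses;
};

static struct toy_miss *
toy_miss_find(struct toy_todo *todo, uint32_t flow_id)
{
    for (size_t i = 0; i < todo->n_misses; i++) {
        if (todo->misses[i].flow_id == flow_id) {
            return &todo->misses[i];
        }
    }
    return NULL;
}

/* Groups upcall indices by flow.  Returns the number of distinct misses,
 * or -1 if one of the fixed-size arrays would overflow. */
static int
toy_group_upcalls(struct toy_todo *todo, const uint32_t *flow_ids, size_t n)
{
    todo->n_misses = 0;
    for (size_t i = 0; i < n; i++) {
        struct toy_miss *miss = toy_miss_find(todo, flow_ids[i]);
        if (!miss) {
            /* First packet of this flow in the batch: create the group. */
            if (todo->n_misses == TOY_MAX_MISSES) {
                return -1;
            }
            miss = &todo->misses[todo->n_misses++];
            miss->flow_id = flow_ids[i];
            miss->n_packets = 0;
        }
        if (miss->n_packets == TOY_MAX_PACKETS) {
            return -1;
        }
        miss->packets[miss->n_packets++] = (int) i;
    }
    return (int) todo->n_misses;
}
```

With five upcalls whose flows are 7, 9, 7, 7, 9, the sketch yields two groups, so the later per-flow work runs twice rather than five times.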


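OVS defines the facet, which represents how a user-space program such as vswitchd views a matched flow. Kernel space has its own view of the same flow. The facet represents what the two views have in common. The parts that differ are represented by subfacets, and struct subfacet holds the actions.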

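If the flow_key computed by the datapath is exactly the same as the one vswitchd computes from the packet, the facet contains a single subfacet. If the datapath's flow_key has more members than the one vswitchd computes from the packet, each extra part becomes its own subfacet.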

struct subfacet {
    /* Owners. */
    struct hmap_node hmap_node; /* In struct ofproto_dpif 'subfacets' list. */
    struct list list_node;      /* In struct facet's 'facets' list. */
    struct facet *facet;        /* Owning facet. */

    /* Key.
     *
     * To save memory in the common case, 'key' is NULL if 'key_fitness' is
     * ODP_FIT_PERFECT, that is, odp_flow_key_from_flow() can accurately
     * regenerate the ODP flow key from ->facet->flow. */
    enum odp_key_fitness key_fitness;
    struct nlattr *key;
    int key_len;

    long long int used;         /* Time last used; time created if not used. */

    uint64_t dp_packet_count;   /* Last known packet count in the datapath. */
    uint64_t dp_byte_count;     /* Last known byte count in the datapath. */

    /* Datapath actions.
     *
     * These should be essentially identical for every subfacet in a facet, but
     * may differ in trivial ways due to VLAN splinters. */
    size_t actions_len;         /* Number of bytes in actions[]. */
    struct nlattr *actions;     /* Datapath actions. */

    enum slow_path_reason slow; /* 0 if fast path may be used. */
    enum subfacet_path path;    /* Installed in datapath? */

};

Now let's look at handle_flow_miss:

/* Handles flow miss 'miss' on 'ofproto'.  May add any required datapath
 * operations to 'ops', incrementing '*n_ops' for each new op. */
static void
handle_flow_miss(struct ofproto_dpif *ofproto, struct flow_miss *miss,
                 struct flow_miss_op *ops, size_t *n_ops)
{
    struct facet *facet;
    uint32_t hash;

    /* The caller must ensure that miss->hmap_node.hash contains
     * flow_hash(miss->flow, 0). */
    hash = miss->hmap_node.hash;

    facet = facet_lookup_valid(ofproto, &miss->flow, hash);

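This looks the flow up in struct ofproto_dpif *ofproto, the data structure that represents the datapath. ofproto->facets is a hash map: using the hash of the missed flow (already stored in miss->hmap_node.hash), the hmap_node list for that hash is searched for a matching flow. The comparison is brute force, a plain memcmp.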


    if (!facet) {
        struct rule_dpif *rule = rule_dpif_lookup(ofproto, &miss->flow);

        if (!flow_miss_should_make_facet(ofproto, miss, hash)) {
            handle_flow_miss_without_facet(miss, rule, ops, n_ops);

At this point creating a flow facet is judged unnecessary: for trivial traffic, creating a facet would add more overhead than it saves.


            return;
        }

        facet = facet_create(rule, &miss->flow, hash);

Otherwise, we create a facet for this flow.
    }
    handle_flow_miss_with_facet(miss, facet, ops, n_ops);
}

struct flow_miss wraps a flow to speed up batch processing of missed flows. In most cases the facet does get created, as these log lines show:

2012-10-26T07:15:43Z|22522|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 1, src mac 0:16:3e:83:0:1, dst mac 0:25:9e:5d:62:53

2012-10-26T07:15:43Z|22529|ofproto_dpif|INFO|[qinq] miss flow, create facet: vlan_tci 0, proto 0x806, in_port 2, src mac 0:25:9e:5d:62:53, dst mac 0:16:3e:83:0:1

As the log shows, one bidirectional exchange created two flows, and facets were created along with them.


Next is handle_flow_miss_with_facet, which calls subfacet_make_actions to generate the actions. That function first calls action_xlate_ctx_init to initialize an action_xlate_ctx structure, defined as follows:

struct action_xlate_ctx {
/* action_xlate_ctx_init() initializes these members. */


    /* The ofproto. */
    struct ofproto_dpif *ofproto;

    /* Flow to which the OpenFlow actions apply.  xlate_actions() will modify
     * this flow when actions change header fields. */
    struct flow flow;

    /* The packet corresponding to 'flow', or a null pointer if we are
     * revalidating without a packet to refer to. */
    const struct ofpbuf *packet;

    /* Should OFPP_NORMAL update the MAC learning table?  Should "learn"
     * actions update the flow table?
     *
     * We want to update these tables if we are actually processing a packet,
     * or if we are accounting for packets that the datapath has processed, but
     * not if we are just revalidating. */
    bool may_learn;

    /* The rule that we are currently translating, or NULL. */

    struct rule_dpif *rule;

    /* Union of the set of TCP flags seen so far in this flow.  (Used only by
     * NXAST_FIN_TIMEOUT.  Set to zero to avoid updating updating rules'
     * timeouts.) */
    uint8_t tcp_flags;

/* xlate_actions() initializes and uses these members.  The client might want
 * to look at them after it returns. */

    struct ofpbuf *odp_actions; /* Datapath actions. */
    tag_type tags;              /* Tags associated with actions. */
    enum slow_path_reason slow; /* 0 if fast path may be used. */
    bool has_learn;             /* Actions include NXAST_LEARN? */
    bool has_normal;            /* Actions output to OFPP_NORMAL? */
    bool has_fin_timeout;       /* Actions include NXAST_FIN_TIMEOUT? */
    uint16_t nf_output_iface;   /* Output interface index for NetFlow. */
    mirror_mask_t mirrors;      /* Bitmap of associated mirrors. */

/* xlate_actions() initializes and uses these members, but the client has no
 * reason to look at them. */

    int recurse;                /* Recursion level, via xlate_table_action. */
    bool max_resubmit_trigger;  /* Recursed too deeply during translation. */
    struct flow base_flow;      /* Flow at the last commit. */
    uint32_t orig_skb_priority; /* Priority when packet arrived. */
    uint8_t table_id;           /* OpenFlow table ID where flow was found. */
    uint32_t sflow_n_outputs;   /* Number of output ports. */
    uint16_t sflow_odp_port;    /* Output port for composing sFlow action. */
    uint16_t user_cookie_offset;/* Used for user_action_cookie fixup. */
    bool exit;                  /* No further actions should be processed. */
    struct flow orig_flow;      /* Copy of original flow. */
};

xlate_actions is called next. OpenFlow 1.0 defines the following actions:

enum ofp10_action_type {
    OFPAT10_OUTPUT,             /* Output to switch port. */
    OFPAT10_SET_VLAN_VID,       /* Set the 802.1q VLAN id. */
    OFPAT10_SET_VLAN_PCP,       /* Set the 802.1q priority. */
    OFPAT10_STRIP_VLAN,         /* Strip the 802.1q header. */
    OFPAT10_SET_DL_SRC,         /* Ethernet source address. */
    OFPAT10_SET_DL_DST,         /* Ethernet destination address. */
    OFPAT10_SET_NW_SRC,         /* IP source address. */
    OFPAT10_SET_NW_DST,         /* IP destination address. */
    OFPAT10_SET_NW_TOS,         /* IP ToS (DSCP field, 6 bits). */
    OFPAT10_SET_TP_SRC,         /* TCP/UDP source port. */
    OFPAT10_SET_TP_DST,         /* TCP/UDP destination port. */
    OFPAT10_ENQUEUE,            /* Output to queue. */
    OFPAT10_VENDOR = 0xffff
};

Each action type takes its own data structure, e.g.

/* Action structure for OFPAT10_SET_VLAN_VID. */
struct ofp_action_vlan_vid {
    ovs_be16 type;                  /* OFPAT10_SET_VLAN_VID. */
    ovs_be16 len;                   /* Length is 8. */
    ovs_be16 vlan_vid;              /* VLAN id. */
    uint8_t pad[2];
};


/* Action structure for OFPAT10_SET_VLAN_PCP. */
struct ofp_action_vlan_pcp {
    ovs_be16 type;                  /* OFPAT10_SET_VLAN_PCP. */
    ovs_be16 len;                   /* Length is 8. */
    uint8_t vlan_pcp;               /* VLAN priority. */
    uint8_t pad[3];
};

union ofp_action {
    ovs_be16 type;
    struct ofp_action_header header;
    struct ofp_action_vendor_header vendor;
    struct ofp_action_output output;
    struct ofp_action_vlan_vid vlan_vid;
    struct ofp_action_vlan_pcp vlan_pcp;
    struct ofp_action_nw_addr nw_addr;
    struct ofp_action_nw_tos nw_tos;
    struct ofp_action_tp_port tp_port;
};

do_xlate_actions takes an array of union ofp_action and performs a different operation for each entry, e.g.

        case OFPUTIL_OFPAT10_OUTPUT:
            xlate_output_action(ctx, &ia->output);
            break;

        case OFPUTIL_OFPAT10_SET_VLAN_VID:
            ctx->flow.vlan_tci &= ~htons(VLAN_VID_MASK);
            ctx->flow.vlan_tci |= ia->vlan_vid.vlan_vid | htons(VLAN_CFI);
            break;

        case OFPUTIL_OFPAT10_SET_VLAN_PCP:
            ctx->flow.vlan_tci &= ~htons(VLAN_PCP_MASK);
            ctx->flow.vlan_tci |= htons(
                (ia->vlan_pcp.vlan_pcp << VLAN_PCP_SHIFT) | VLAN_CFI);
            break;

        case OFPUTIL_OFPAT10_STRIP_VLAN:
            ctx->flow.vlan_tci = htons(0);
            break;
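
The VLAN cases above are plain bit manipulation on the 16-bit TCI, which is kept in network byte order. The sketch below reproduces them as standalone helpers so the arithmetic can be checked by hand. The helper names toy_set_vlan_vid and toy_set_vlan_pcp are hypothetical, the mask values are assumed to match the standard 802.1Q layout (3-bit PCP, 1-bit CFI, 12-bit VID), and htons/ntohs come from the POSIX header arpa/inet.h.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* 802.1Q TCI layout: PCP in bits 15..13, CFI in bit 12, VID in bits 11..0. */
#define VLAN_VID_MASK  0x0fff
#define VLAN_PCP_MASK  0xe000
#define VLAN_PCP_SHIFT 13
#define VLAN_CFI       0x1000

/* Replaces the VID and marks the tag as present.  Both arguments and the
 * result are in network byte order, like flow.vlan_tci. */
static uint16_t
toy_set_vlan_vid(uint16_t tci_be, uint16_t vid_be)
{
    tci_be &= (uint16_t) ~htons(VLAN_VID_MASK);
    tci_be |= vid_be | htons(VLAN_CFI);
    return tci_be;
}

/* Replaces the priority bits and marks the tag as present.  'pcp' is a
 * host-order value; this sketch masks it to 3 bits, which the excerpt
 * above does not do. */
static uint16_t
toy_set_vlan_pcp(uint16_t tci_be, uint8_t pcp)
{
    tci_be &= (uint16_t) ~htons(VLAN_PCP_MASK);
    tci_be |= htons((uint16_t) (((pcp & 7) << VLAN_PCP_SHIFT) | VLAN_CFI));
    return tci_be;
}
```

Starting from an untagged flow (TCI 0), setting VID 100 gives 0x1064 in host order: the CFI bit 0x1000 plus 0x0064. Setting PCP 5 on top of that adds 5 << 13 = 0xa000, for 0xb064. Stripping the VLAN simply resets the field to 0.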

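For forwarding, the most important case is xlate_output_action, which calls xlate_output_action__. The port passed in is either an OpenFlow port number (translated to a datapath port later, in compose_output_action__) or one of the control values that appear in the definition of ofp_port: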

enum ofp_port {
    /* Maximum number of physical switch ports. */
    OFPP_MAX = 0xff00,
    
    /* Fake output "ports". */
    OFPP_IN_PORT    = 0xfff8,  /* Send the packet out the input port.  This
                                  virtual port must be explicitly used
                                  in order to send back out of the input
                                  port. */
    OFPP_TABLE      = 0xfff9,  /* Perform actions in flow table.
                                  NB: This can only be the destination
                                  port for packet-out messages. */
    OFPP_NORMAL     = 0xfffa,  /* Process with normal L2/L3 switching. */
    OFPP_FLOOD      = 0xfffb,  /* All physical ports except input port and
                                  those disabled by STP. */
    OFPP_ALL        = 0xfffc,  /* All physical ports except input port. */
    OFPP_CONTROLLER = 0xfffd,  /* Send to controller. */
    OFPP_LOCAL      = 0xfffe,  /* Local openflow "port". */
    OFPP_NONE       = 0xffff   /* Not associated with a physical port. */
};  

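In xlate_output_action__, most traffic takes the OFPP_NORMAL branch and calls xlate_normal, which calls mac_learning_lookup to find the packet's output port in the MAC table and then calls output_normal. output_normal eventually calls compose_output_action, which leads to compose_output_action__: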

static void
compose_output_action__(struct action_xlate_ctx *ctx, uint16_t ofp_port,
                        bool check_stp)
{
    const struct ofport_dpif *ofport = get_ofp_port(ctx->ofproto, ofp_port);
    uint16_t odp_port = ofp_port_to_odp_port(ofp_port);
    ovs_be16 flow_vlan_tci = ctx->flow.vlan_tci;
    uint8_t flow_nw_tos = ctx->flow.nw_tos;
    uint16_t out_port;

...

    out_port = vsp_realdev_to_vlandev(ctx->ofproto, odp_port,
                                      ctx->flow.vlan_tci);
    if (out_port != odp_port) {
        ctx->flow.vlan_tci = htons(0);
    }
    commit_odp_actions(&ctx->flow, &ctx->base_flow, ctx->odp_actions);
    nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port);

    ctx->sflow_odp_port = odp_port;
    ctx->sflow_n_outputs++;
    ctx->nf_output_iface = ofp_port;
    ctx->flow.vlan_tci = flow_vlan_tci;
    ctx->flow.nw_tos = flow_nw_tos;
}

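commit_odp_actions encodes the accumulated actions as nlattrs and stores them in ctx->odp_actions. The following nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_OUTPUT, out_port) then appends the packet's output port, and with that the flow's actions are more or less complete.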
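
To make the nlattr encoding concrete, here is a minimal sketch of appending a 32-bit attribute to a buffer, which is roughly the job nl_msg_put_u32 does for the output action. A netlink attribute is a 4-byte header (16-bit length that includes the header, then 16-bit type) followed by the payload, padded to a 4-byte boundary. toy_buf, toy_put_u32, and toy_get_u32 are hypothetical names, the buffer is a fixed array rather than an ofpbuf, and the value 1 for the output attribute is an assumption made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NLA_HDRLEN  4   /* 16-bit length + 16-bit type. */
#define TOY_ATTR_LEN    8   /* Header plus one uint32_t; already 4-aligned. */
#define TOY_ATTR_OUTPUT 1   /* Assumed stand-in for OVS_ACTION_ATTR_OUTPUT. */

struct toy_buf {
    uint8_t data[64];
    size_t size;            /* Bytes used so far. */
};

/* Appends one attribute carrying a 32-bit value, in host byte order.
 * Returns 0 on success, -1 if the buffer is full. */
static int
toy_put_u32(struct toy_buf *b, uint16_t type, uint32_t value)
{
    uint16_t len = TOY_ATTR_LEN;

    if (b->size + TOY_ATTR_LEN > sizeof b->data) {
        return -1;
    }
    memcpy(b->data + b->size, &len, 2);
    memcpy(b->data + b->size + 2, &type, 2);
    memcpy(b->data + b->size + TOY_NLA_HDRLEN, &value, 4);
    b->size += TOY_ATTR_LEN;
    return 0;
}

/* Reads back the attribute that starts at 'offset'.  Returns 0 on success,
 * -1 if no complete 32-bit attribute starts there. */
static int
toy_get_u32(const struct toy_buf *b, size_t offset, uint16_t *type,
            uint32_t *value)
{
    uint16_t len;

    if (offset + TOY_ATTR_LEN > b->size) {
        return -1;
    }
    memcpy(&len, b->data + offset, 2);
    if (len != TOY_ATTR_LEN) {
        return -1;
    }
    memcpy(type, b->data + offset + 2, 2);
    memcpy(value, b->data + offset + TOY_NLA_HDRLEN, 4);
    return 0;
}
```

Two output actions appended back to back occupy 16 bytes, and the datapath can walk them by adding each attribute's length to the current offset.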


Now for the CAM table inside vswitchd. The code is in lib/mac-learning.h and lib/mac-learning.c.

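vswitchd maintains an internal MAC-to-port CAM table in which entries age out after 300 seconds. The table also has the notion of a flooding VLAN: if a VLAN is marked as flooding, no addresses are learned on it and all forwarding in that VLAN is done by flooding.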

/* A MAC learning table entry. */
struct mac_entry {
    struct hmap_node hmap_node; /* Node in a mac_learning hmap. */         
    struct list lru_node;       /* Element in 'lrus' list. */
    time_t expires;             /* Expiration time. */
    time_t grat_arp_lock;       /* Gratuitous ARP lock expiration time. */        
    uint8_t mac[ETH_ADDR_LEN];  /* Known MAC address. */
    uint16_t vlan;              /* VLAN tag. */
    tag_type tag;               /* Tag for this learning entry. */

    /* Learned port. */
    union {
        void *p;
        int i;
    } port;
};

/* MAC learning table. */
struct mac_learning {
    struct hmap table;          /* Learning table: an hmap of mac_entry,
                                   linked in through mac_entry.hmap_node. */
    struct list lrus;           /* In-use entries, least recently used at the
                                   front, most recently used at the back.
                                   Linked in through mac_entry.lru_node. */
    uint32_t secret;            /* Secret for randomizing hash table. */
    unsigned long *flood_vlans; /* Bitmap of learning disabled VLANs. */
    unsigned int idle_time;     /* Max age before deleting an entry. */
};  


static uint32_t
mac_table_hash(const struct mac_learning *ml, const uint8_t mac[ETH_ADDR_LEN],
               uint16_t vlan)
{
    unsigned int mac1 = get_unaligned_u32((uint32_t *) mac);
    unsigned int mac2 = get_unaligned_u16((uint16_t *) (mac + 4));
    return hash_3words(mac1, mac2 | (vlan << 16), ml->secret);
}   

The hash of a mac_entry is computed by hash_3words from mac_learning->secret, the VLAN, and the MAC address.
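
The packing of the three inputs can be shown in isolation. In the sketch below, toy_mac_words splits the 6-byte MAC into a 32-bit word and a 16-bit word and places the VLAN in the upper half of the second word, as mac_table_hash does. toy_mix is only a stand-in mixer chosen for this illustration; it is not the algorithm behind hash_3words, and all three function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in mixing step.  OVS's hash_3words() uses a different algorithm. */
static uint32_t
toy_mix(uint32_t h, uint32_t word)
{
    h ^= word;
    h *= 0x9e3779b1u;
    return h ^ (h >> 15);
}

/* Packs MAC and VLAN into two 32-bit words.  memcpy keeps the reads safe
 * on any alignment; the cast avoids shifting a promoted signed int. */
static void
toy_mac_words(const uint8_t mac[6], uint16_t vlan, uint32_t words[2])
{
    uint32_t mac1;
    uint16_t mac2;

    memcpy(&mac1, mac, sizeof mac1);        /* First four bytes. */
    memcpy(&mac2, mac + 4, sizeof mac2);    /* Last two bytes. */
    words[0] = mac1;
    words[1] = mac2 | ((uint32_t) vlan << 16);
}

/* Hashes MAC, VLAN, and the per-table secret into one bucket hash. */
static uint32_t
toy_mac_hash(const uint8_t mac[6], uint16_t vlan, uint32_t secret)
{
    uint32_t w[2];

    toy_mac_words(mac, vlan, w);
    return toy_mix(toy_mix(toy_mix(0, w[0]), w[1]), secret);
}
```

Because the secret is mixed in, two tables with different secrets place the same MAC and VLAN in different buckets, which is the point of the "secret for randomizing hash table" field.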


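mac_entry_lookup: checks whether a mac_entry already exists for the given MAC address and VLAN.

get_lru: returns the first mac_entry on the LRU list.

mac_learning_create / mac_learning_destroy: create and destroy the mac_learning table.

mac_learning_may_learn: returns true if the VLAN is not a flooding VLAN and the MAC address is not a multicast address.

mac_learning_insert: inserts a mac_entry into mac_learning. It first calls mac_entry_lookup to check whether an entry for the MAC and VLAN already exists. If not, and the table already holds MAC_MAX entries, the least recently used entry is evicted. A new mac_entry is then created and inserted into the CAM table.

mac_learning_lookup: calls mac_entry_lookup to find the entry for a MAC address in a given VLAN in the CAM table.

mac_learning_run: loops over the table and expires every mac_entry that has timed out.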
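
The functions listed above fit together as learn, look up, and expire. The following is a minimal sketch of that life cycle and not the lib/mac-learning.c implementation: all toy_ names are hypothetical, a four-slot array replaces the hmap and the LRU list, flooding VLANs are left out, and time is passed in explicitly as seconds.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAC_MAX   4     /* Tiny on purpose; not the real MAC_MAX. */
#define TOY_IDLE_TIME 300   /* Seconds, the aging time quoted above. */
#define TOY_FLOOD     (-1)  /* Returned when the destination is unknown. */

struct toy_mac_entry {
    bool in_use;
    uint8_t mac[6];
    uint16_t vlan;
    int port;
    long expires;           /* Absolute expiration time in seconds. */
};

struct toy_mac_table {
    struct toy_mac_entry entries[TOY_MAC_MAX];
};

static struct toy_mac_entry *
toy_mac_lookup(struct toy_mac_table *t, const uint8_t mac[6], uint16_t vlan)
{
    for (size_t i = 0; i < TOY_MAC_MAX; i++) {
        struct toy_mac_entry *e = &t->entries[i];
        if (e->in_use && e->vlan == vlan && !memcmp(e->mac, mac, 6)) {
            return e;
        }
    }
    return NULL;
}

/* Learns mac/vlan -> port and refreshes the timer.  Refuses multicast
 * sources.  When the table is full it reuses the entry that expires
 * soonest, which is the least recently used one because every refresh
 * adds the same idle time. */
static bool
toy_mac_learn(struct toy_mac_table *t, const uint8_t mac[6], uint16_t vlan,
              int port, long now)
{
    if (mac[0] & 1) {
        return false;       /* Multicast bit set. */
    }
    struct toy_mac_entry *e = toy_mac_lookup(t, mac, vlan);
    if (!e) {
        e = &t->entries[0];
        for (size_t i = 0; i < TOY_MAC_MAX; i++) {
            struct toy_mac_entry *c = &t->entries[i];
            if (!c->in_use) {
                e = c;      /* A free slot always wins. */
                break;
            }
            if (c->expires < e->expires) {
                e = c;
            }
        }
        e->in_use = true;
        memcpy(e->mac, mac, 6);
        e->vlan = vlan;
    }
    e->port = port;
    e->expires = now + TOY_IDLE_TIME;
    return true;
}

/* Expires timed-out entries; returns how many were removed. */
static size_t
toy_mac_run(struct toy_mac_table *t, long now)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_MAC_MAX; i++) {
        if (t->entries[i].in_use && t->entries[i].expires <= now) {
            t->entries[i].in_use = false;
            n++;
        }
    }
    return n;
}

/* The forwarding decision: the learned port, or TOY_FLOOD if unknown. */
static int
toy_forward(struct toy_mac_table *t, const uint8_t dst[6], uint16_t vlan)
{
    struct toy_mac_entry *e = toy_mac_lookup(t, dst, vlan);
    return e ? e->port : TOY_FLOOD;
}
```

A caller would zero the table, call toy_mac_learn with each packet's source address and input port, call toy_forward with the destination address to pick the output port, and call toy_mac_run periodically to age entries out.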





