OVS DPDK VXLAN Tunnel Processing

Before studying the OVS VXLAN implementation, let's first review how a traditional VTEP device processes VXLAN packets, as shown in the figure below:

After a vxlan packet enters a switch port, the tunnel is terminated based on the packet headers. After termination, the underlay information is mapped to the overlay, yielding the overlay BD and VRF. In the figure above, the packet enters br10 through vxlan10 after termination, which binds the overlay packet to br10 and bdif. br10 performs same-subnet FDB forwarding; if the overlay packet's destination MAC is bdif's MAC, the packet enters the VRF that bdif belongs to through bdif for L3 routing. This is the processing flow after a VTEP receives a vxlan packet.

In the other direction, after an overlay packet has been routed, if its destination BD is br10, the packet enters br10 through bdif, passes through the FDB, and is output from vxlan10. The vxlan10 interface builds the vxlan encapsulation for the packet. Once encapsulated, the vxlan packet is forwarded by underlay routing and leaves the VTEP.
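The two directions can be condensed into a short sketch. This is illustrative pseudocode only: every name in it (vtep_rx, decap_map and so on) is a hypothetical stand-in for a pipeline stage, not a function from OVS or any real VTEP:

#include <stdbool.h>

struct pkt;                            /* opaque packet handle */
bool tunnel_terminate(struct pkt *);   /* strip underlay headers (vxlan10) */
void decap_map(struct pkt *);          /* vni -> bd/vrf binding (br10/bdif) */
bool dst_mac_is_bdif(struct pkt *);    /* destined to the BD interface mac? */
void vrf_route(struct pkt *);          /* overlay L3 routing in the vrf */
void fdb_forward(struct pkt *);        /* same-subnet L2 forwarding in br10 */

/* RX direction: a vxlan packet arriving from the underlay. */
void vtep_rx(struct pkt *p)
{
    if (!tunnel_terminate(p)) {
        return;                        /* not a tunnel addressed to us */
    }
    decap_map(p);                      /* enter br10, bind bdif */
    if (dst_mac_is_bdif(p)) {
        vrf_route(p);                  /* cross-subnet: route, then bdif -> br10 -> vxlan10 outbound */
    } else {
        fdb_forward(p);                /* same-subnet: FDB in br10, possibly out via vxlan10 */
    }
}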

Elements of a VTEP

tunnel-terminate table

The tunnel termination table strips the underlay headers from a vxlan packet.

When a VXLAN packet enters the VTEP, the tunnel must be terminated. VXLAN is a P2MP (point-to-multipoint) tunnel, so termination only needs to verify that the destination IP is a local IP (the destination MAC is necessarily ours). Of course, before termination we must first determine that the packet is a VXLAN packet at all. Termination generally takes one of two forms:

  • Like the Linux kernel: treat the packet as an ordinary packet regardless of whether it is vxlan. Since the vxlan packet's destination IP is local, it is delivered to the local UDP layer; UDP processing sees destination port 4789 and hands the packet over to the vxlan port, and the overlay packet then re-enters the stack from the vxlan device for a second pass.
  • Like traditional hardware vendors: the parser stage already extracts the full inner and outer headers, and a vxlan packet goes straight into the tunnel-terminate table. A sketch of this style of match follows the list.
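To make the second form concrete, the sketch below (hypothetical names, not OVS code) shows the kind of predicate a parser-stage terminate table evaluates: the outer MAC and IP must be ours, the protocol UDP, and the destination port the VXLAN port 4789:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct outer_hdrs {
    uint8_t  eth_dst[6];   /* outer destination MAC */
    uint32_t ip_dst;       /* outer destination IPv4 */
    uint8_t  ip_proto;     /* outer IP protocol */
    uint16_t udp_dst;      /* outer UDP destination port */
};

/* Returns true if the frame should enter the tunnel-terminate table. */
static bool vxlan_terminate_match(const struct outer_hdrs *h,
                                  const uint8_t local_mac[6], uint32_t local_ip)
{
    return memcmp(h->eth_dst, local_mac, 6) == 0 /* L2: necessarily ours */
        && h->ip_dst == local_ip                 /* P2MP: only the local IP is checked */
        && h->ip_proto == 17                     /* UDP */
        && h->udp_dst == 4789;                   /* the VXLAN port */
}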

tunnel-decap-map table

The tunnel decapsulation mapping table determines the overlay packet's L2 broadcast domain and L3 routing domain, i.e. the BD and VRF.

Since a vxlan tunnel is point-to-multipoint, the VNI can be used to map to the owning BD and VRF, which serve same-subnet FDB forwarding and cross-subnet routing of overlay packets.
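A minimal sketch of such a mapping (hypothetical names, not taken from any implementation) is little more than a table keyed by VNI:

#include <stdint.h>

struct decap_map_entry {
    uint32_t vni;   /* 24-bit VXLAN network identifier */
    uint16_t bd;    /* bridge domain for same-subnet FDB forwarding */
    uint16_t vrf;   /* routing domain for cross-subnet forwarding */
};

static const struct decap_map_entry decap_map[] = {
    { .vni = 10, .bd = 10, .vrf = 1 },  /* vni 10 -> br10, routed via bdif in vrf 1 */
};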

tunnel-encap table

The tunnel encapsulation table encapsulates packets into vxlan after FDB forwarding or routing. For same-subnet traffic it suffices to determine the VNI, the underlay source IP, and the underlay destination IP; generally the VNI does not change during same-subnet forwarding. Cross-subnet forwarding relies on routing, which determines the overlay-smac and overlay-dmac. How the VNI, underlay-sip and underlay-dip are determined differs greatly between forwarding models. In the traditional model, the overlay route only routes and fills in the link layer; the packet leaves through bdif, which attaches to a bridge, and the bridge's VNI decides the packet's VNI. Some vendors instead let the route itself decide the VNI, underlay-sip and underlay-dip; see SONiC's SAI interface design for details.
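Expressed as a data structure, an encap entry in the traditional model carries at least the fields below (an illustrative sketch, not taken from any implementation):

#include <stdint.h>

struct encap_entry {
    uint32_t vni;           /* from the bridge behind bdif (traditional model) or from the route */
    uint32_t underlay_sip;  /* local VTEP IP */
    uint32_t underlay_dip;  /* remote VTEP IP */
    uint16_t udp_dst;       /* VXLAN UDP port, typically 4789 */
};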

underlay route table

Once the overlay packet is encapsulated, it is forwarded by underlay routing, so an underlay route table is needed. Traditional network devices attach an underlay RIF to the vxlan tunnel, and that RIF designates the underlay VRF.

underlay neighbor

After underlay routing, a neighbor entry is needed to build the underlay link-layer encapsulation, hence the neighbor table.

OVS DPDK VXLAN

For OVS to provide VTEP functionality, it must implement the elements above.

Key data structures

struct tnl_match {
    ovs_be64 in_key;//vni
    struct in6_addr ipv6_src;//source IP
    struct in6_addr ipv6_dst;//destination IP
    odp_port_t odp_port;//corresponding datapath port number
    bool in_key_flow;//false: the vni must match exactly; true: the flow table is expected to set the vni.
    bool ip_src_flow;//false: the source IP must be matched (a configured IP of 0 wildcards all source IPs);
                     //true: the openflow flow table sets the tunnel source IP.
    bool ip_dst_flow;//false: the destination IP must be matched; true: the openflow flow table sets the tunnel destination IP.
};

struct tnl_port {
    struct hmap_node ofport_node;
    struct hmap_node match_node;

    const struct ofport_dpif *ofport;
    uint64_t change_seq;
    struct netdev *netdev;

    struct tnl_match match;//tunnel match elements; uniquely identifies a tunnel
};

// OVS stores vxlan ports in hash maps:
/* Each hmap contains "struct tnl_port"s.
 * The index is a combination of how each of the fields listed under "Tunnel
 * matches" above matches, see the final paragraph for ordering.
 * vxlan port match maps: the in_key_flow, ip_dst_flow and ip_src_flow flags
 * divide the ports into 12 priority classes.
 */
static struct hmap *tnl_match_maps[N_MATCH_TYPES] OVS_GUARDED_BY(rwlock);

Details

struct tnl_match is the core structure of a vxlan port. in_key, ipv6_src and ipv6_dst specify the core members of the vxlan encapsulation; they are the mandatory members of a traditional VTEP's vxlan interface, used for tunnel termination and tunnel encapsulation. In OVS, however, the designers added three important flags:

  • in_key_flow: two values. false means the tunnel VNI was set when the vxlan port was created; that value is used for tunnel termination, and packets leaving through the port are encapsulated with it, which is why false ranks higher in priority. true means termination does not match the VNI at all: the flow table handles the VNI and sets it on encapsulation.
  • ip_src_flow: three cases. false covers two of them: the configured IP is matched against the vxlan packet's destination IP during termination and used as the tunnel source IP for packets leaving the port; a configured IP of 0 wildcards all IPs, otherwise only the configured IP matches. true means the openflow flow table sets the tunnel source IP.
  • ip_dst_flow: two values. false means termination requires the packet's source IP to equal this IP, and packets leaving the port use it as the tunnel destination IP. true leaves all of it to the flow table.

Examples of creating vxlan ports:

admin@ubuntu:$ sudo ovs-vsctl add-port br0 vxlan1 -- set interface vxlan1 type=vxlan options:remote_ip=flow options:key=flow options:dst_port=8472 options:local_ip=flow
admin@ubuntu:$ sudo ovs-vsctl add-port br0 vxlan2 -- set interface vxlan2 type=vxlan options:remote_ip=flow options:key=flow options:dst_port=8472
admin@ubuntu:$ sudo ovs-vsctl add-port br0 vxlan13 -- set interface vxlan13 type=vxlan options:remote_ip=flow options:key=191 options:dst_port=8472

OVS is an SDN switch: its core actions are carried out by openflow flow tables, and these three flags were introduced precisely to lift the restrictions of traditional vxlan devices. Depending on their values there are 2*2*3 = 12 combinations, which is exactly the size of the tnl_match_maps array; a worked example follows the tnl_match_map() listing below.

Tunnel port lookup

/* Returns a pointer to the 'tnl_match_maps' element corresponding to 'm''s
 * matching criteria.
 * The three flags and the configuration determine the priority, i.e. the index
 * into the maps. This function is called when a vxlan interface is added, to
 * decide which map the interface joins.
 */
static struct hmap **
tnl_match_map(const struct tnl_match *m)
{
    enum ip_src_type ip_src;

    ip_src = (m->ip_src_flow ? IP_SRC_FLOW
              : ipv6_addr_is_set(&m->ipv6_src) ? IP_SRC_CFG
              : IP_SRC_ANY);

    return &tnl_match_maps[6 * m->in_key_flow + 3 * m->ip_dst_flow + ip_src];
}
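As a worked example (assuming the enum order from the OVS source, IP_SRC_CFG = 0, IP_SRC_ANY = 1, IP_SRC_FLOW = 2): vxlan1 above (key=flow, remote_ip=flow, local_ip=flow) lands in slot 6*1 + 3*1 + 2 = 11, the lowest priority; vxlan2 (key=flow, remote_ip=flow, no local_ip) lands in slot 6*1 + 3*1 + 1 = 10; vxlan13 (key=191, remote_ip=flow) lands in slot 6*0 + 3*1 + 1 = 4 and is therefore tried first of the three during termination.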

/* Returns the tnl_port that is the best match for the tunnel data in 'flow',
 * or NULL if no tnl_port matches 'flow'.
 * Tunnel termination: look up the vxlan port matching the packet headers.
 */
static struct tnl_port *
tnl_find(const struct flow *flow) OVS_REQ_RDLOCK(rwlock)
{
    enum ip_src_type ip_src;
    int in_key_flow;
    int ip_dst_flow;
    int i;

    i = 0;
    for (in_key_flow = 0; in_key_flow < 2; in_key_flow++) {//in_key_flow has the highest priority; 0 ranks above 1
        for (ip_dst_flow = 0; ip_dst_flow < 2; ip_dst_flow++) {//ip_dst_flow is next, i.e. the vxlan packet's source IP
            for (ip_src = 0; ip_src < 3; ip_src++) {//ip_src has the lowest priority; see enum ip_src_type for its values
                struct hmap *map = tnl_match_maps[i];

                if (map) {
                    struct tnl_port *tnl_port;
                    struct tnl_match match;

                    memset(&match, 0, sizeof match);

                    /* The apparent mix-up of 'ip_dst' and 'ip_src' below is
                     * correct, because "struct tnl_match" is expressed in
                     * terms of packets being sent out, but we are using it
                     * here as a description of how to treat received
                     * packets.
                     * When in_key_flow is true, the vni is not matched.
                     */
                    match.in_key = in_key_flow ? 0 : flow->tunnel.tun_id;
                    if (ip_src == IP_SRC_CFG) {
                        match.ipv6_src = flow_tnl_dst(&flow->tunnel);
                    }
                    if (!ip_dst_flow) {/* the configured remote IP must equal the packet's source IP */
                        match.ipv6_dst = flow_tnl_src(&flow->tunnel);
                    }
                    match.odp_port = flow->in_port.odp_port;
                    match.in_key_flow = in_key_flow;
                    match.ip_dst_flow = ip_dst_flow;
                    match.ip_src_flow = ip_src == IP_SRC_FLOW;
                    //exact-match lookup
                    tnl_port = tnl_find_exact(&match, map);
                    if (tnl_port) {
                        return tnl_port;
                    }
                }

                i++;
            }
        }
    }

    return NULL;
}

Advantages

With these three flags, OVS greatly simplifies vxlan port configuration: a single global vxlan port is enough for most applications, with the remaining parameters driven by flow tables. This highlights the strength of SDN and suits large-scale deployments.

terminate table

OVS builds its tunnel termination table when a vxlan port is created.

//Global tunnel state initialization. OVS builds the tunnel termination table cls with a classifier. The global
//list addr_list holds all local underlay IP addresses; they will serve as destination IPs for tunnel termination
//and as source IPs for tunnel encapsulation. port_list holds tunnels that use a transport-layer port, such as vxlan.
void
tnl_port_map_init(void)
{
    classifier_init(&cls, flow_segment_u64s);//tunnel termination table
    ovs_list_init(&addr_list);//underlay ip list
    ovs_list_init(&port_list);//list of tnl_port control blocks
    unixctl_command_register("tnl/ports/show", "-v", 0, 1, tnl_port_show, NULL);
}

Tunnel port addition

/* Adds 'ofport' to the module with datapath port number 'odp_port'. 'ofport's
 * must be added before they can be used by the module. 'ofport' must be a
 * tunnel.
 *
 * Returns 0 if successful, otherwise a positive errno value.
 * native_tnl indicates whether native tunnel termination is enabled.
 */
int
tnl_port_add(const struct ofport_dpif *ofport, const struct netdev *netdev,
             odp_port_t odp_port, bool native_tnl, const char name[]) OVS_EXCLUDED(rwlock)
{
    bool ok;

    fat_rwlock_wrlock(&rwlock);
    ok = tnl_port_add__(ofport, netdev, odp_port, true, native_tnl, name);
    fat_rwlock_unlock(&rwlock);

    return ok ? 0 : EEXIST;
}

//add a tunnel port
static bool
tnl_port_add__(const struct ofport_dpif *ofport, const struct netdev *netdev,
               odp_port_t odp_port, bool warn, bool native_tnl, const char name[])
    OVS_REQ_WRLOCK(rwlock)
{
    const struct netdev_tunnel_config *cfg;
    struct tnl_port *existing_port;
    struct tnl_port *tnl_port;
    struct hmap **map;

    cfg = netdev_get_tunnel_config(netdev);
    ovs_assert(cfg);

    tnl_port = xzalloc(sizeof *tnl_port);
    tnl_port->ofport = ofport;
    tnl_port->netdev = netdev_ref(netdev);
    tnl_port->change_seq = netdev_get_change_seq(tnl_port->netdev);
    //these parameters do not affect the tunnel terminate table built below
    tnl_port->match.in_key = cfg->in_key;
    tnl_port->match.ipv6_src = cfg->ipv6_src;
    tnl_port->match.ipv6_dst = cfg->ipv6_dst;
    tnl_port->match.ip_src_flow = cfg->ip_src_flow;
    tnl_port->match.ip_dst_flow = cfg->ip_dst_flow;
    tnl_port->match.in_key_flow = cfg->in_key_flow;
    tnl_port->match.odp_port = odp_port;
    //find the tunnel's slot in the maps from its match criteria
    map = tnl_match_map(&tnl_port->match);
    //check whether an identical port already exists
    existing_port = tnl_find_exact(&tnl_port->match, *map);
    if (existing_port) {
        if (warn) {
            struct ds ds = DS_EMPTY_INITIALIZER;
            tnl_match_fmt(&tnl_port->match, &ds);
            VLOG_WARN("%s: attempting to add tunnel port with same config as "
                      "port '%s' (%s)", tnl_port_get_name(tnl_port),
                      tnl_port_get_name(existing_port), ds_cstr(&ds));
            ds_destroy(&ds);
        }
        netdev_close(tnl_port->netdev);
        free(tnl_port);
        return false;
    }

    hmap_insert(ofport_map, &tnl_port->ofport_node, hash_pointer(ofport, 0));

    if (!*map) {
        *map = xmalloc(sizeof **map);
        hmap_init(*map);
    }
    hmap_insert(*map, &tnl_port->match_node, tnl_hash(&tnl_port->match));
    tnl_port_mod_log(tnl_port, "adding");

    if (native_tnl) {//if native tunnel termination is enabled, build the termination table; generally needed in dpdk mode but not in kernel mode
        const char *type;

        type = netdev_get_type(netdev);
        tnl_port_map_insert(odp_port, cfg->dst_port, name, type);

    }
    return true;
}

//for tunnels that need a transport layer, handle the destination port
void
tnl_port_map_insert(odp_port_t port, ovs_be16 tp_port,
                    const char dev_name[], const char type[])
{
    struct tnl_port *p;
    struct ip_device *ip_dev;
    uint8_t nw_proto;

    nw_proto = tnl_type_to_nw_proto(type);
    if (!nw_proto) {//no transport layer needed; return
        return;
    }

    //Add the tunnel port to the list. There is a bug here: the comparison should not be
    //tp_port == p->tp_port but p->port == port; i.e. the condition should read
    //(p->port == port && p->nw_proto == nw_proto).
    ovs_mutex_lock(&mutex);
    LIST_FOR_EACH(p, node, &port_list) {
        if (tp_port == p->tp_port && p->nw_proto == nw_proto) {
             goto out;
        }
    }

    p = xzalloc(sizeof *p);
    p->port = port;
    p->tp_port = tp_port;
    p->nw_proto = nw_proto;
    ovs_strlcpy(p->dev_name, dev_name, sizeof p->dev_name);
    ovs_list_insert(&port_list, &p->node);
    //For each local device IP address, build the termination table using that device as the source.
    //Termination matches on: transport protocol, transport destination port, local IP (the vxlan packet's destination IP).
    LIST_FOR_EACH(ip_dev, node, &addr_list) {
        map_insert_ipdev__(ip_dev, p->dev_name, p->port, p->nw_proto, p->tp_port);
    }

out:
    ovs_mutex_unlock(&mutex);
}
//ip_dev: source device
//dev_name: device name
//port: datapath port number
//nw_proto, tp_port: protocol and transport port
static void
map_insert_ipdev__(struct ip_device *ip_dev, char dev_name[],
                   odp_port_t port, uint8_t nw_proto, ovs_be16 tp_port)
{
    if (ip_dev->n_addr) {//iterate over every address of the device
        int i;

        for (i = 0; i < ip_dev->n_addr; i++) {
            //the packet's destination mac must be ip_dev->mac
            map_insert(port, ip_dev->mac, &ip_dev->addr[i],
                       nw_proto, tp_port, dev_name);
        }
    }
}

Tunnel termination process

We have seen how the termination table is built; now let's look at how OVS actually performs tunnel termination.

By design OVS parses only one layer of the packet, i.e. parsing stops at the transport layer, so it cannot see the packet's overlay contents. When a vxlan packet reaches OVS from a dpdk port, its destination MAC is the internal port's MAC and its destination IP is the internal port's IP. Forwarding with the NORMAL rule sends the packet toward the internal port, and tunnel termination happens while the OUTPUT action is being composed.

/* Compose a packet output action */
static void
compose_output_action(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                      const struct xlate_bond_recirc *xr)
{
    /* need to check for stp packets first */
    compose_output_action__(ctx, ofp_port, xr, true);
}


/* Compose a packet output action */
static void
compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                        const struct xlate_bond_recirc *xr, bool check_stp)
{
    const struct xport *xport = get_ofp_port(ctx->xbridge, ofp_port);/* look up the xport */
    struct flow_wildcards *wc = ctx->wc;/* flow wildcards */
    struct flow *flow = &ctx->xin->flow;/* the input flow */
    struct flow_tnl flow_tnl;
    ovs_be16 flow_vlan_tci;
    uint32_t flow_pkt_mark;
    uint8_t flow_nw_tos;
    odp_port_t out_port, odp_port;
    bool tnl_push_pop_send = false;
    uint8_t dscp;

    ......
        
    if (out_port != ODPP_NONE) {/* output translation */
        xlate_commit_actions(ctx);/* commit pending actions */

        if (xr) {/* there is a bond recirculation action */
            struct ovs_action_hash *act_hash;

            /* Hash action. */
            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                OVS_ACTION_ATTR_HASH,
                                                sizeof *act_hash);
            act_hash->hash_alg = xr->hash_alg;
            act_hash->hash_basis = xr->hash_basis;

            /* Recirc action: add the recirculation action with its recirc id */
            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
                           xr->recirc_id);
        } else {

            if (tnl_push_pop_send) {/* does the packet need a tunnel push/pop action? */
                build_tunnel_send(ctx, xport, flow, odp_port);
                flow->tunnel = flow_tnl; /* Restore tunnel metadata */
            } else {
                odp_port_t odp_tnl_port = ODPP_NONE;

                /* XXX: Write better Filter for tunnel port. We can use inport
                * int tunnel-port flow to avoid these checks completely.
                * If the packet goes to LOCAL, check whether native tunnel
                * termination is configured, and terminate the tunnel.
                */
                if (ofp_port == OFPP_LOCAL &&
                    ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {

                    odp_tnl_port = tnl_port_map_lookup(flow, wc);
                }

                if (odp_tnl_port != ODPP_NONE) {
                    nl_msg_put_odp_port(ctx->odp_actions,
                                        OVS_ACTION_ATTR_TUNNEL_POP,
                                        odp_tnl_port);
                } else {
                    /* Tunnel push-pop action is not compatible with
                     * IPFIX action. */
                    compose_ipfix_action(ctx, out_port);

                    /* Handle truncation of the mirrored packet. */
                    if (ctx->mirror_snaplen > 0 &&
                        ctx->mirror_snaplen < UINT16_MAX) {
                        struct ovs_action_trunc *trunc;

                        trunc = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                         OVS_ACTION_ATTR_TRUNC,
                                                         sizeof *trunc);
                        trunc->max_len = ctx->mirror_snaplen;
                        if (!ctx->xbridge->support.trunc) {
                            ctx->xout->slow |= SLOW_ACTION;
                        }
                    }

                    nl_msg_put_odp_port(ctx->odp_actions,
                                        OVS_ACTION_ATTR_OUTPUT,
                                        out_port);
                }
            }
        }

        ctx->sflow_odp_port = odp_port;
        ctx->sflow_n_outputs++;
        /* set the output interface */
        ctx->nf_output_iface = ofp_port;
    }

    /* egress mirroring */
    if (mbridge_has_mirrors(ctx->xbridge->mbridge) && xport->xbundle) {/* the bridge has mirrors and the port is in a bundle */
        mirror_packet(ctx, xport->xbundle,
                      xbundle_mirror_dst(xport->xbundle->xbridge,
                                         xport->xbundle));/* apply the port's mirror policy */
    }
    }

 out:
    /* Restore the flow: after its values have been written into the actions, put the originals back */
    flow->vlan_tci = flow_vlan_tci;
    flow->pkt_mark = flow_pkt_mark;
    flow->nw_tos = flow_nw_tos;
}

/* 'flow' is non-const to allow for temporary modifications during the lookup.
 * Any changes are restored before returning.
 */
odp_port_t
tnl_port_map_lookup(struct flow *flow, struct flow_wildcards *wc)
{
    //classifier rule lookup: this performs the tunnel termination match
    const struct cls_rule *cr = classifier_lookup(&cls, OVS_VERSION_MAX, flow,
                                                  wc);
    //return the tunnel port number.
    return (cr) ? tnl_port_cast(cr)->portno : ODPP_NONE;
}

The above happens on the slow path during classifier lookup. Once the packet has been processed, a fast-path flow is installed and the datapath actions are executed.

/* action execution callback; may_steal tells whether the packets may be consumed */
static void
dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
              const struct nlattr *a, bool may_steal)
{
    struct dp_netdev_execute_aux *aux = aux_;/* execution context */
    uint32_t *depth = recirc_depth_get();/* recirculation depth */
    struct dp_netdev_pmd_thread *pmd = aux->pmd;/* polling thread */
    struct dp_netdev *dp = pmd->dp;/* the datapath being polled */
    int type = nl_attr_type(a);/* action type */
    long long now = aux->now;/* current time */
    struct tx_port *p;/* tx port */

    switch ((enum ovs_action_attr)type) {/* action type */
    ......

    case OVS_ACTION_ATTR_TUNNEL_POP:/* strip the outer encapsulation; the inner packet still needs to recirculate for processing */
        if (*depth < MAX_RECIRC_DEPTH) {
            struct dp_packet_batch *orig_packets_ = packets_;
            odp_port_t portno = nl_attr_get_odp_port(a);
            //check that the port exists
            p = pmd_tnl_port_cache_lookup(pmd, portno);
            if (p) {
                struct dp_packet_batch tnl_pkt;
                int i;

                if (!may_steal) {
                    dp_packet_batch_clone(&tnl_pkt, packets_);
                    packets_ = &tnl_pkt;
                    dp_packet_batch_reset_cutlen(orig_packets_);
                }
                
                dp_packet_batch_apply_cutlen(packets_);
                //tunnel decap: p->port->netdev dictates how to decapsulate; the port number itself has no meaning here.
                netdev_pop_header(p->port->netdev, packets_);
                if (!packets_->count) {
                    return;
                }

                for (i = 0; i < packets_->count; i++) {
                    //the overlay packet's input port is set to portno.
                    packets_->packets[i]->md.in_port.odp_port = portno;
                }
                 
                (*depth)++;
                dp_netdev_recirculate(pmd, packets_);
                (*depth)--;
                return;
            }
        }
        break;
        ......
}
    
    
/* pop the vxlan header */
struct dp_packet *
netdev_vxlan_pop_header(struct dp_packet *packet)
{
    struct pkt_metadata *md = &packet->md;/* packet metadata */
    struct flow_tnl *tnl = &md->tunnel;/* tunnel info inside the metadata */
    struct vxlanhdr *vxh;
    unsigned int hlen;

    pkt_metadata_init_tnl(md);/* initialize the packet's tunnel metadata */
    if (VXLAN_HLEN > dp_packet_l4_size(packet)) {/* error out if the packet is smaller than a vxlan encapsulation */
        goto err;
    }

    vxh = udp_extract_tnl_md(packet, tnl, &hlen);/* extract the vxlan tunnel info */
    if (!vxh) {
        goto err;
    }
    /* validate the vxlan header */
    if (get_16aligned_be32(&vxh->vx_flags) != htonl(VXLAN_FLAGS) ||
       (get_16aligned_be32(&vxh->vx_vni) & htonl(0xff))) {
        VLOG_WARN_RL(&err_rl, "invalid vxlan flags=%#x vni=%#x\n",
                     ntohl(get_16aligned_be32(&vxh->vx_flags)),
                     ntohl(get_16aligned_be32(&vxh->vx_vni)));
        goto err;
    }
    //extract the vni and store it as the tunnel id
    tnl->tun_id = htonll(ntohl(get_16aligned_be32(&vxh->vx_vni)) >> 8);
    tnl->flags |= FLOW_TNL_F_KEY;

    /* advance past the tunnel headers */
    dp_packet_reset_packet(packet, hlen + VXLAN_HLEN);

    return packet;
err:
    dp_packet_delete(packet);
    return NULL;
}
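For example, a packet carrying VNI 10 has vx_vni = 0x00000a00 on the wire: the VNI occupies the upper 24 bits and the low 8 bits are reserved, so ntohl(vx_vni) >> 8 yields 10, which htonll() then stores into tun_id.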

The tunnel metadata has been extracted into the flow and the tunnel headers stripped: termination is complete and the inner packet is ready to recirculate.

tunnel-decap-map table

The previous section covered tunnel termination. Next comes the post-decapsulation mapping: finding the overlay packet's input VXLAN port, from which overlay processing begins. OVS implements the decap-map function through the vxlan tunnel port control block.

Looking up the vxlan port control block

After termination, the packet recirculates through dp_netdev_recirculate carrying its tunnel metadata; on recirculation it takes the slow path and consults the classifier.

static struct ofproto_dpif *
xlate_lookup_ofproto_(const struct dpif_backer *backer, const struct flow *flow,
                      ofp_port_t *ofp_in_port, const struct xport **xportp)
{
    struct xlate_cfg *xcfg = ovsrcu_get(struct xlate_cfg *, &xcfgp);/* get the active xlate configuration */
    const struct xport *xport;

    xport = xport_lookup(xcfg, tnl_port_should_receive(flow)
                         ? tnl_port_receive(flow)
                         : odp_port_to_ofport(backer, flow->in_port.odp_port));
    if (OVS_UNLIKELY(!xport)) {
        return NULL;
    }
    *xportp = xport;
    if (ofp_in_port) {
        *ofp_in_port = xport->ofp_port;
    }
    return xport->xbridge->ofproto;/* from the xlate port, return its openflow switch control block, the ofproto */
}

The function tnl_port_receive(flow) is used to look up the vxlan port control block.

/* Looks in the table of tunnels for a tunnel matching the metadata in 'flow'.
 * Returns the 'ofport' corresponding to the new in_port, or a null pointer if
 * none is found.
 *
 * Callers should verify that 'flow' needs to be received by calling
 * tnl_port_should_receive() before this function. */
const struct ofport_dpif *
tnl_port_receive(const struct flow *flow) OVS_EXCLUDED(rwlock)
{
    char *pre_flow_str = NULL;
    const struct ofport_dpif *ofport;
    struct tnl_port *tnl_port;

    fat_rwlock_rdlock(&rwlock);
    //find the matching tunnel port
    tnl_port = tnl_find(flow);
    //the tunnel port becomes the new input port
    ofport = tnl_port ? tnl_port->ofport : NULL;
    if (!tnl_port) {
        char *flow_str = flow_to_string(flow);

        VLOG_WARN_RL(&rl, "receive tunnel port not found (%s)", flow_str);
        free(flow_str);
        goto out;
    }

    if (!VLOG_DROP_DBG(&dbg_rl)) {
        pre_flow_str = flow_to_string(flow);
    }

    if (pre_flow_str) {
        char *post_flow_str = flow_to_string(flow);
        char *tnl_str = tnl_port_fmt(tnl_port);
        VLOG_DBG("flow received\n"
                 "%s"
                 " pre: %s\n"
                 "post: %s",
                 tnl_str, pre_flow_str, post_flow_str);
        free(tnl_str);
        free(pre_flow_str);
        free(post_flow_str);
    }

out:
    fat_rwlock_unlock(&rwlock);
    return ofport;
}

tnl_find() here is the same function already listed in the "Tunnel port lookup" section above: it walks the 12 match maps in priority order and returns the best-matching vxlan port for the received tunnel metadata.

tunnel-encap table

Now for the vxlan encapsulation. Once an overlay packet has been processed, if it is destined for another VTEP it eventually leaves through a tunnel port. The tunnel work is done while the slow path composes the output action.

/* Compose a packet output action */
static void
compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                        const struct xlate_bond_recirc *xr, bool check_stp)
{
    const struct xport *xport = get_ofp_port(ctx->xbridge, ofp_port);/* look up the xport */
    ......

    if (xport->is_tunnel) {/* the port is a tunnel port */
        struct in6_addr dst;
         /* Save tunnel metadata so that changes made due to
          * the Logical (tunnel) Port are not visible for any further
          * matches, while explicit set actions on tunnel metadata are.
          */
        flow_tnl = flow->tunnel;/* save the tunnel metadata first */
        odp_port = tnl_port_send(xport->ofport, flow, ctx->wc);
        if (odp_port == ODPP_NONE) {
            xlate_report(ctx, OFT_WARN, "Tunneling decided against output");
            goto out; /* restore flow_nw_tos */
        }
        dst = flow_tnl_dst(&flow->tunnel);
        if (ipv6_addr_equals(&dst, &ctx->orig_tunnel_ipv6_dst)) {
            xlate_report(ctx, OFT_WARN, "Not tunneling to our own address");
            goto out; /* restore flow_nw_tos */
        }
        if (ctx->xin->resubmit_stats) {/* update statistics */
            netdev_vport_inc_tx(xport->netdev, ctx->xin->resubmit_stats);
        }
        if (ctx->xin->xcache) {/* record netdev statistics in the xlate cache */
            struct xc_entry *entry;

            entry = xlate_cache_add_entry(ctx->xin->xcache, XC_NETDEV);
            entry->dev.tx = netdev_ref(xport->netdev);
        }
        out_port = odp_port;
        //perform tunnel push or tunnel termination
        if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
            xlate_report(ctx, OFT_DETAIL, "output to native tunnel");
            tnl_push_pop_send = true;
        } else {
            xlate_report(ctx, OFT_DETAIL, "output to kernel tunnel");
            commit_odp_tunnel_action(flow, &ctx->base_flow, ctx->odp_actions);/* commit the tunnel action */
            flow->tunnel = flow_tnl; /* Restore tunnel metadata */
        }
    } else {
        odp_port = xport->odp_port;
        out_port = odp_port;
    }

    ......
    /* The remainder is identical to the compose_output_action__ listing in the
     * tunnel termination section above: commit the actions, then either call
     * build_tunnel_send() when native tunneling is on, or emit a
     * TUNNEL_POP/OUTPUT action, handle mirroring, and restore the flow. */
}


/* Given that 'flow' should be output to the ofport corresponding to
 * 'tnl_port', updates 'flow''s tunnel headers and returns the actual datapath
 * port that the output should happen on.  May return ODPP_NONE if the output
 * shouldn't occur. */
odp_port_t
tnl_port_send(const struct ofport_dpif *ofport, struct flow *flow,
              struct flow_wildcards *wc) OVS_EXCLUDED(rwlock)
{
    const struct netdev_tunnel_config *cfg;/* tunnel configuration */
    struct tnl_port *tnl_port;
    char *pre_flow_str = NULL;
    odp_port_t out_port;

    fat_rwlock_rdlock(&rwlock);/* take the read side of the rwlock */
    tnl_port = tnl_find_ofport(ofport);/* find the tunnel port for this ofport */
    out_port = tnl_port ? tnl_port->match.odp_port : ODPP_NONE;
    if (!tnl_port) {
        goto out;
    }

    cfg = netdev_get_tunnel_config(tnl_port->netdev);/* get the port's tunnel configuration */
    ovs_assert(cfg);

    if (!VLOG_DROP_DBG(&dbg_rl)) {
        pre_flow_str = flow_to_string(flow);
    }

    if (!cfg->ip_src_flow) {/* a source IP was configured */
        flow->tunnel.ip_src = in6_addr_get_mapped_ipv4(&tnl_port->match.ipv6_src);
        if (!flow->tunnel.ip_src) {
            flow->tunnel.ipv6_src = tnl_port->match.ipv6_src;
        } else {
            flow->tunnel.ipv6_src = in6addr_any;
        }
    }
    if (!cfg->ip_dst_flow) {/* a destination IP was configured */
        flow->tunnel.ip_dst = in6_addr_get_mapped_ipv4(&tnl_port->match.ipv6_dst);
        if (!flow->tunnel.ip_dst) {
            flow->tunnel.ipv6_dst = tnl_port->match.ipv6_dst;
        } else {
            flow->tunnel.ipv6_dst = in6addr_any;
        }
    }
    flow->tunnel.tp_dst = cfg->dst_port;/* destination udp port */
    if (!cfg->out_key_flow) {
        flow->tunnel.tun_id = cfg->out_key;
    }

    if (cfg->ttl_inherit && is_ip_any(flow)) {
        wc->masks.nw_ttl = 0xff;/* the ttl must be matched */
        flow->tunnel.ip_ttl = flow->nw_ttl;
    } else {
        flow->tunnel.ip_ttl = cfg->ttl;
    }

    if (cfg->tos_inherit && is_ip_any(flow)) {
        wc->masks.nw_tos |= IP_DSCP_MASK;
        flow->tunnel.ip_tos = flow->nw_tos & IP_DSCP_MASK;
    } else {
        flow->tunnel.ip_tos = cfg->tos;
    }

    /* ECN fields are always inherited. */
    if (is_ip_any(flow)) {
        wc->masks.nw_tos |= IP_ECN_MASK;

        if (IP_ECN_is_ce(flow->nw_tos)) {
            flow->tunnel.ip_tos |= IP_ECN_ECT_0;
        } else {
            flow->tunnel.ip_tos |= flow->nw_tos & IP_ECN_MASK;
        }
    }

    flow->tunnel.flags |= (cfg->dont_fragment ? FLOW_TNL_F_DONT_FRAGMENT : 0)
        | (cfg->csum ? FLOW_TNL_F_CSUM : 0)
        | (cfg->out_key_present ? FLOW_TNL_F_KEY : 0);

    if (pre_flow_str) {
        char *post_flow_str = flow_to_string(flow);
        char *tnl_str = tnl_port_fmt(tnl_port);
        VLOG_DBG("flow sent\n"
                 "%s"
                 " pre: %s\n"
                 "post: %s",
                 tnl_str, pre_flow_str, post_flow_str);
        free(tnl_str);
        free(pre_flow_str);
        free(post_flow_str);
    }

out:
    fat_rwlock_unlock(&rwlock);
    return out_port;
}

Building the outer encapsulation

At this stage the outer header data is only prepared and stored in tnl_push_data, ready to be written into the packet later.

static int
build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
                  const struct flow *flow, odp_port_t tunnel_odp_port)
{
    struct netdev_tnl_build_header_params tnl_params;
    struct ovs_action_push_tnl tnl_push_data;
    struct xport *out_dev = NULL;
    ovs_be32 s_ip = 0, d_ip = 0;
    struct in6_addr s_ip6 = in6addr_any;
    struct in6_addr d_ip6 = in6addr_any;
    struct eth_addr smac;
    struct eth_addr dmac;
    int err;
    char buf_sip6[INET6_ADDRSTRLEN];
    char buf_dip6[INET6_ADDRSTRLEN];
    //underlay route lookup; the tunnel remote IP is already known from the vxlan port.
    err = tnl_route_lookup_flow(flow, &d_ip6, &s_ip6, &out_dev);
    if (err) {
        xlate_report(ctx, OFT_WARN, "native tunnel routing failed");
        return err;
    }

    xlate_report(ctx, OFT_DETAIL, "tunneling to %s via %s",
                 ipv6_string_mapped(buf_dip6, &d_ip6),
                 netdev_get_name(out_dev->netdev));

    /* Use mac addr of bridge port of the peer: the bridge's mac becomes the source mac. */
    err = netdev_get_etheraddr(out_dev->netdev, &smac);
    if (err) {
        xlate_report(ctx, OFT_WARN,
                     "tunnel output device lacks Ethernet address");
        return err;
    }

    d_ip = in6_addr_get_mapped_ipv4(&d_ip6);
    if (d_ip) {
        s_ip = in6_addr_get_mapped_ipv4(&s_ip6);
    }
    //resolve the neighbor's destination mac address
    err = tnl_neigh_lookup(out_dev->xbridge->name, &d_ip6, &dmac);
    if (err) {
        xlate_report(ctx, OFT_DETAIL,
                     "neighbor cache miss for %s on bridge %s, "
                     "sending %s request",
                     buf_dip6, out_dev->xbridge->name, d_ip ? "ARP" : "ND");
        if (d_ip) {//send an arp request
            tnl_send_arp_request(ctx, out_dev, smac, s_ip, d_ip);
        } else {
            tnl_send_nd_request(ctx, out_dev, smac, &s_ip6, &d_ip6);
        }
        return err;
    }

    if (ctx->xin->xcache) {
        struct xc_entry *entry;

        entry = xlate_cache_add_entry(ctx->xin->xcache, XC_TNL_NEIGH);
        ovs_strlcpy(entry->tnl_neigh_cache.br_name, out_dev->xbridge->name,
                    sizeof entry->tnl_neigh_cache.br_name);
        entry->tnl_neigh_cache.d_ipv6 = d_ip6;
    }

    xlate_report(ctx, OFT_DETAIL, "tunneling from "ETH_ADDR_FMT" %s"
                 " to "ETH_ADDR_FMT" %s",
                 ETH_ADDR_ARGS(smac), ipv6_string_mapped(buf_sip6, &s_ip6),
                 ETH_ADDR_ARGS(dmac), buf_dip6);
    //build the underlay link-layer parameters
    netdev_init_tnl_build_header_params(&tnl_params, flow, &s_ip6, dmac, smac);
    //build the outer udp, ip and ethernet headers, saved in tnl_push_data
    err = tnl_port_build_header(xport->ofport, &tnl_push_data, &tnl_params);
    if (err) {
        return err;
    }
    //output port and tunnel port
    tnl_push_data.tnl_port = odp_to_u32(tunnel_odp_port);
    tnl_push_data.out_port = odp_to_u32(out_dev->odp_port);
    //add a tunnel push action for the packet; the final encapsulation happens when the action is executed.
    odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
    return 0;
}
void
odp_put_tnl_push_action(struct ofpbuf *odp_actions,
                        struct ovs_action_push_tnl *data)
{
    int size = offsetof(struct ovs_action_push_tnl, header);

    size += data->header_len;
    nl_msg_put_unspec(odp_actions, OVS_ACTION_ATTR_TUNNEL_PUSH, data, size);
}

Executing the push

/* action execution callback; may_steal tells whether the packets may be consumed */
static void
dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
              const struct nlattr *a, bool may_steal)
{
    ......
    case OVS_ACTION_ATTR_TUNNEL_PUSH:/* tunnel handling: add the outer headers, then recirculate; used for vxlan-style tunnels */
        if (*depth < MAX_RECIRC_DEPTH) {/* recirculate only while the nesting depth is below the maximum */
            struct dp_packet_batch tnl_pkt;
            struct dp_packet_batch *orig_packets_ = packets_;
            int err;

            if (!may_steal) {/* the caller keeps the packets, so clone them before modifying */
                dp_packet_batch_clone(&tnl_pkt, packets_);
                packets_ = &tnl_pkt;
                dp_packet_batch_reset_cutlen(orig_packets_);
            }

            dp_packet_batch_apply_cutlen(packets_);
            /* execute the tunnel push action */
            err = push_tnl_action(pmd, a, packets_);
            if (!err) {/* after the tunnel headers are pushed, the packet must recirculate */
                (*depth)++;
                dp_netdev_recirculate(pmd, packets_);
                (*depth)--;
            }
            return;
        }
        break;
        ......
    dp_packet_delete_batch(packets_, may_steal);
}

/* the tunnel push action */
static int
push_tnl_action(const struct dp_netdev_pmd_thread *pmd,/* the polling pmd thread */
                const struct nlattr *attr,/* action attribute */
                struct dp_packet_batch *batch)/* packet batch */
{
    struct tx_port *tun_port;
    const struct ovs_action_push_tnl *data;
    int err;

    data = nl_attr_get(attr);

    /* look up the tunnel port */
    tun_port = pmd_tnl_port_cache_lookup(pmd, u32_to_odp(data->tnl_port));
    if (!tun_port) {
        err = -EINVAL;
        goto error;
    }

    /* push the tunnel headers */
    err = netdev_push_header(tun_port->port->netdev, batch, data);
    if (!err) {
        return 0;
    }
error:
    dp_packet_delete_batch(batch, true);
    return err;
}
/* Push tunnel header (reading from tunnel metadata) and resize
 * 'batch->packets' for further processing.
 *
 * The caller must make sure that 'netdev' support this operation by checking
 * that netdev_has_tunnel_push_pop() returns true. */
int
netdev_push_header(const struct netdev *netdev,
                   struct dp_packet_batch *batch,
                   const struct ovs_action_push_tnl *data)
{
    int i;

    for (i = 0; i < batch->count; i++) {/* process each packet in turn */
        netdev->netdev_class->push_header(batch->packets[i], data);
        //Key step: reinitialize the encapsulated packet's metadata in preparation for recirculation.
        //The output port here is data->out_port, found by the underlay route lookup.
        pkt_metadata_init(&batch->packets[i]->md, u32_to_odp(data->out_port));
    }

    return 0;
}
//for vxlan, netdev->netdev_class->push_header is:
/* add the tunnel headers to a packet */
void
netdev_tnl_push_udp_header(struct dp_packet *packet,
                           const struct ovs_action_push_tnl *data)
{
    struct udp_header *udp;
    int ip_tot_size;

    /* push the ethernet and IP headers first */
    udp = netdev_tnl_push_ip_header(packet, data->header, data->header_len, &ip_tot_size);

    /* set udp src port: pick a pseudo-random udp source port */
    udp->udp_src = netdev_tnl_get_src_port(packet);
    udp->udp_len = htons(ip_tot_size);/* set the udp total length */

    if (udp->udp_csum) {/* compute the udp checksum */
        uint32_t csum;
        if (netdev_tnl_is_header_ipv6(dp_packet_data(packet))) {
            csum = packet_csum_pseudoheader6(netdev_tnl_ipv6_hdr(dp_packet_data(packet)));
        } else {
            csum = packet_csum_pseudoheader(netdev_tnl_ip_hdr(dp_packet_data(packet)));
        }

        csum = csum_continue(csum, udp, ip_tot_size);
        udp->udp_csum = csum_finish(csum);

        if (!udp->udp_csum) {
            udp->udp_csum = htons(0xffff);
        }
    }
}

//the output port becomes the new input port, i.e. the L3 port is converted into an L2 port
static inline void
pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
{
    /* It can be expensive to zero out all of the tunnel metadata. However,
     * we can just zero out ip_dst and the rest of the data will never be
     * looked at. */
    memset(md, 0, offsetof(struct pkt_metadata, in_port));/* zero all fields before in_port */
    md->tunnel.ip_dst = 0;
    md->tunnel.ipv6_dst = in6addr_any;

    md->in_port.odp_port = port;
}

At this point dp_netdev_recirculate is executed to recirculate; the now-encapsulated vxlan packet passes through the FDB and leaves the server via the physical dpdk port.

underlay route table

OVS routes come from two sources: routes synchronized from the kernel, marked Cached, and routes added with commands such as ovs-appctl ovs/route/add.

[root@ ~]# ovs-appctl ovs/route/show
Route Table:
Cached: 1.1.1.1/32 dev tun0 SRC 1.1.1.1
Cached: 10.226.137.204/32 dev eth2 SRC 10.226.137.204
Cached: 10.255.9.204/32 dev br-phy SRC 10.255.9.204
Cached: 127.0.0.1/32 dev lo SRC 127.0.0.1
Cached: 169.254.169.110/32 dev tap_metadata SRC 169.254.169.110
Cached: 169.254.169.240/32 dev tap_proxy SRC 169.254.169.240
Cached: 169.254.169.241/32 dev tap_proxy SRC 169.254.169.241
Cached: 169.254.169.250/32 dev tap_metadata SRC 169.254.169.250
Cached: 169.254.169.254/32 dev tap_metadata SRC 169.254.169.254
Cached: 172.17.0.1/32 dev docker0 SRC 172.17.0.1
Cached: ::1/128 dev lo SRC ::1
Cached: 10.226.137.192/27 dev eth2 SRC 10.226.137.204
Cached: 10.226.137.224/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.254.225.0/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.254.225.224/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.255.8.192/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.255.9.192/27 dev br-phy SRC 10.255.9.204
Cached: 1.1.1.0/24 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.226.0.0/16 dev eth2 GW 10.226.137.193 SRC 10.226.137.204
Cached: 172.17.0.0/16 dev docker0 SRC 172.17.0.1
Cached: 127.0.0.0/8 dev lo SRC 127.0.0.1
Cached: 0.0.0.0/0 dev eth2 GW 10.226.137.193 SRC 10.226.137.204
Cached: fe80::/64 dev port-r6kxee6d3t SRC fe80::80a0:94ff:fedc:43b
[root@ ~]# 
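Routes of the second kind are added through the unixctl command registered in ovs_router_init() below. Following its "ip_addr/prefix_len out_br_name gw" usage string, an invocation would look like this (the addresses are illustrative):

[root@ ~]# ovs-appctl ovs/route/add 192.168.10.0/24 br-phy 10.255.9.193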

Route module initialization

/* Users of the route_table module should register themselves with this
 * function before making any other route_table function calls. */
void
route_table_init(void)
    OVS_EXCLUDED(route_table_mutex)
{
    ovs_mutex_lock(&route_table_mutex);
    ovs_assert(!nln);
    ovs_assert(!route_notifier);
    ovs_assert(!route6_notifier);

    ovs_router_init();
    nln = nln_create(NETLINK_ROUTE, (nln_parse_func *) route_table_parse,
                     &rtmsg);

    route_notifier =
        nln_notifier_create(nln, RTNLGRP_IPV4_ROUTE,
                            (nln_notify_func *) route_table_change, NULL);
    route6_notifier =
        nln_notifier_create(nln, RTNLGRP_IPV6_ROUTE,
                            (nln_notify_func *) route_table_change, NULL);

    route_table_reset();
    name_table_init();

    ovs_mutex_unlock(&route_table_mutex);
}

/* May not be called more than once. */
void
ovs_router_init(void)
{
    classifier_init(&cls, NULL);//route lookup is implemented with a classifier.
    unixctl_command_register("ovs/route/add", "ip_addr/prefix_len out_br_name gw", 2, 3,
                             ovs_router_add, NULL);
    unixctl_command_register("ovs/route/show", "", 0, 0, ovs_router_show, NULL);
    unixctl_command_register("ovs/route/del", "ip_addr/prefix_len", 1, 1, ovs_router_del,
                             NULL);
    unixctl_command_register("ovs/route/lookup", "ip_addr", 1, 1,
                             ovs_router_lookup_cmd, NULL);
}

Listening for kernel route events with netlink

//this callback marks the route table as changed
static void
route_table_change(const struct route_table_msg *change OVS_UNUSED,
                   void *aux OVS_UNUSED)
{
    route_table_valid = false;
}

/* Run periodically to update the locally maintained routing table. */
//periodic handler for route changes
void
route_table_run(void)
    OVS_EXCLUDED(route_table_mutex)
{
    ovs_mutex_lock(&route_table_mutex);
    if (nln) {
        rtnetlink_run();
        nln_run(nln);

        if (!route_table_valid) {
            route_table_reset();
        }
    }
    ovs_mutex_unlock(&route_table_mutex);
}

static int
route_table_reset(void)
{
    struct nl_dump dump;
    struct rtgenmsg *rtmsg;
    uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
    struct ofpbuf request, reply, buf;

    route_map_clear();//delete all routes
    netdev_get_addrs_list_flush();
    route_table_valid = true;
    rt_change_seq++;

    ofpbuf_init(&request, 0);

    nl_msg_put_nlmsghdr(&request, sizeof *rtmsg, RTM_GETROUTE, NLM_F_REQUEST);

    rtmsg = ofpbuf_put_zeros(&request, sizeof *rtmsg);
    rtmsg->rtgen_family = AF_UNSPEC;
    //re-add all routes from a netlink dump
    nl_dump_start(&dump, NETLINK_ROUTE, &request);
    ofpbuf_uninit(&request);

    ofpbuf_use_stub(&buf, reply_stub, sizeof reply_stub);
    while (nl_dump_next(&dump, &reply, &buf)) {
        struct route_table_msg msg;

        if (route_table_parse(&reply, &msg)) {
            route_table_handle_msg(&msg);
        }
    }
    ofpbuf_uninit(&buf);

    return nl_dump_done(&dump);
}

underlay neighbor

ovs-dpdk maintains underlay neighbor information for its tunnels.

static void
dp_initialize(void)
{
    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;

    if (ovsthread_once_start(&once)) {
        int i;

        tnl_conf_seq = seq_create();
        dpctl_unixctl_register();
        tnl_port_map_init();
        tnl_neigh_cache_init();
        route_table_init();

        for (i = 0; i < ARRAY_SIZE(base_dpif_classes); i++) {
            dp_register_provider(base_dpif_classes[i]);
        }

        ovsthread_once_done(&once);
    }
}

void
tnl_neigh_cache_init(void)
{
    unixctl_command_register("tnl/arp/show", "", 0, 0, tnl_neigh_cache_show, NULL);
    unixctl_command_register("tnl/arp/set", "BRIDGE IP MAC", 3, 3, tnl_neigh_cache_add, NULL);
    unixctl_command_register("tnl/arp/flush", "", 0, 0, tnl_neigh_cache_flush, NULL);
    unixctl_command_register("tnl/neigh/show", "", 0, 0, tnl_neigh_cache_show, NULL);
    unixctl_command_register("tnl/neigh/set", "BRIDGE IP MAC", 3, 3, tnl_neigh_cache_add, NULL);
    unixctl_command_register("tnl/neigh/flush", "", 0, 0, tnl_neigh_cache_flush, NULL);
}

Viewing neighbors with commands

[root@ ~]# ovs-appctl tnl/arp/show
IP                                            MAC                 Bridge
==========================================================================
10.255.9.193                                  9c:e8:95:0f:49:16   br-phy
[root@ ~]# 
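A static entry can be added the same way. Going by the "BRIDGE IP MAC" usage string registered in tnl_neigh_cache_init() above, an invocation would look like this (the values are illustrative):

[root@ ~]# ovs-appctl tnl/neigh/set br-phy 10.255.9.193 9c:e8:95:0f:49:16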

ovs-dpdk obtains neighbor information by snooping arp and neighbor-discovery packets in the data plane.

/* perform action translation */
static void
do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
                 struct xlate_ctx *ctx)
{
    struct flow_wildcards *wc = ctx->wc;/* wildcards */
    struct flow *flow = &ctx->xin->flow;/* the flow being processed */
    const struct ofpact *a;

    /* Neighbor snooping is enabled only when native tunneling is on; it mainly snoops arp and icmpv6 packets for neighbor learning */
    if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
        tnl_neigh_snoop(flow, wc, ctx->xbridge->name);
    }
    /* dl_type already in the mask, not set below. */
    ......
}
//neighbor learning.
int
tnl_neigh_snoop(const struct flow *flow, struct flow_wildcards *wc,
                const char name[IFNAMSIZ])
{
    int res;
    res = tnl_arp_snoop(flow, wc, name);
    if (res != EINVAL) {
        return res;
    }
    return tnl_nd_snoop(flow, wc, name);
}
static int
tnl_arp_snoop(const struct flow *flow, struct flow_wildcards *wc,
              const char name[IFNAMSIZ])
{
    if (flow->dl_type != htons(ETH_TYPE_ARP)
        || FLOW_WC_GET_AND_MASK_WC(flow, wc, nw_proto) != ARP_OP_REPLY
        || eth_addr_is_zero(FLOW_WC_GET_AND_MASK_WC(flow, wc, arp_sha))) {
        return EINVAL;
    }

    tnl_arp_set(name, FLOW_WC_GET_AND_MASK_WC(flow, wc, nw_src), flow->arp_sha);
    return 0;
}
static int
tnl_nd_snoop(const struct flow *flow, struct flow_wildcards *wc,
             const char name[IFNAMSIZ])
{
    if (!is_nd(flow, wc) || flow->tp_src != htons(ND_NEIGHBOR_ADVERT)) {
        return EINVAL;
    }
    /* - RFC4861 says Neighbor Advertisements sent in response to unicast Neighbor
     *   Solicitations SHOULD include the Target link-layer address. However, Linux
     *   doesn't. So, the response to Solicitations sent by OVS will include the
     *   TLL address and other Advertisements not including it can be ignored.
     * - OVS flow extract can set this field to zero in case of packet parsing errors.
     *   For details refer miniflow_extract()*/
    if (eth_addr_is_zero(FLOW_WC_GET_AND_MASK_WC(flow, wc, arp_tha))) {
        return EINVAL;
    }

    memset(&wc->masks.ipv6_src, 0xff, sizeof wc->masks.ipv6_src);
    memset(&wc->masks.ipv6_dst, 0xff, sizeof wc->masks.ipv6_dst);
    memset(&wc->masks.nd_target, 0xff, sizeof wc->masks.nd_target);

    tnl_neigh_set__(name, &flow->nd_target, flow->arp_tha);
    return 0;
}

Periodic neighbor aging

void
tnl_neigh_cache_run(void)
{
    struct tnl_neigh_entry *neigh;
    bool changed = false;

    ovs_mutex_lock(&mutex);
    CMAP_FOR_EACH(neigh, cmap_node, &table) {
        if (neigh->expires <= time_now()) {
            tnl_neigh_delete(neigh);
            changed = true;
        }
    }
    ovs_mutex_unlock(&mutex);

    if (changed) {
        seq_change(tnl_conf_seq);
    }
}
