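DPDK's LPM (Longest Prefix Match) implementation uses a variant of the DIR-24-8 algorithm, which trades space for time. It consists of one table with 2^24 entries and 256 (RTE_LPM_TBL8_NUM_GROUPS) tables with 2^8 entries each. The former is called tbl24 and is indexed by the first 24 bits of the IP address. The latter are called tbl8 and are indexed by the last 8 bits of the IP address.
In theory there should be 2^24 tbl8 tables, one for every tbl24 entry, but to limit memory consumption DPDK sets the default to only 256. In practice, routes with a prefix longer than 24 bits are uncommon.
The main LPM configuration parameter is the maximum number of rules supported. An LPM rule is uniquely identified by its LPM prefix, which has two parts: a 32-bit key and a depth. The l3fwd sample application initializes LPM as follows, setting the maximum number of rules through max_rules.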
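To make the space-for-time trade-off concrete, the following sketch computes the memory used by both tables. It assumes 4-byte table entries (matching the 32-bit rte_lpm_tbl_entry quoted later in this article); the constant and function names are illustrative and are not taken from DPDK.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative constants; the names are not DPDK's. */
#define TBL24_ENTRIES      (1u << 24)  /* one entry per 24-bit prefix */
#define TBL8_GROUP_ENTRIES (1u << 8)   /* one entry per value of the last 8 bits */
#define ENTRY_BYTES        4u          /* assumption: 32-bit table entries */

/* Bytes occupied by tbl24. */
uint64_t tbl24_bytes(void)
{
    return (uint64_t)TBL24_ENTRIES * ENTRY_BYTES;
}

/* Bytes occupied by n_groups tbl8 groups. */
uint64_t tbl8_bytes(uint32_t n_groups)
{
    assert(n_groups <= TBL24_ENTRIES);
    return (uint64_t)n_groups * TBL8_GROUP_ENTRIES * ENTRY_BYTES;
}

/* Print the footprint for a given number of tbl8 groups. */
void print_footprint(uint32_t n_groups)
{
    printf("tbl24: %llu MiB, tbl8 (%u groups): %llu KiB\n",
           (unsigned long long)(tbl24_bytes() >> 20), n_groups,
           (unsigned long long)(tbl8_bytes(n_groups) >> 10));
}
```

Under these assumptions tbl24 takes 64 MiB and 256 tbl8 groups take 256 KiB, whereas one tbl8 group per tbl24 entry would take 16 GiB.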
void
setup_lpm(const int socketid)
{
    struct rte_lpm_config config_ipv4;
    /* create the LPM table */
    config_ipv4.max_rules = IPV4_L3FWD_LPM_MAX_RULES;
    config_ipv4.number_tbl8s = IPV4_L3FWD_LPM_NUMBER_TBL8S;
    config_ipv4.flags = 0;
    snprintf(s, sizeof(s), "IPV4_L3FWD_LPM_%d", socketid);
    ipv4_l3fwd_lpm_lookup_struct[socketid] =
            rte_lpm_create(s, socketid, &config_ipv4);
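The LPM implementation lets you associate user data (the next hop) with each LPM prefix. The value is passed as a 4-byte integer, although the table entry shown later stores only 24 bits of it. The official documentation describes this field as an index rather than the actual next-hop IP address: the index is meant to be used with a separate routing table to find the corresponding entry.
Adding an LPM rule takes roughly two steps: adding the rule itself, and adding the lookup entries. Start with the rule itself. Before inserting, the code searches the LPM rule table for an existing copy of the rule. If none exists, the rule is inserted; otherwise the next-hop user data is updated. If no space is left, an error is returned. In the rule-adding function rule_add below, LPM partitions the rules into 32 groups by depth. rule_info records the usage of each group, namely the index of the group's first rule and the number of rules in it. If the group that the new rule belongs to already contains rules, the code checks each of them against the new rule, and on a match it updates the next_hop field and returns the rule's index.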
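As a small illustration of treating the next hop as an index, the sketch below keeps the real forwarding data in a separate application-side array and resolves the value returned by an LPM lookup against it. The struct, the table contents, and the function name are hypothetical; nothing here is part of the DPDK API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-next-hop data kept by the application, not by LPM. */
struct nexthop_info {
    uint32_t gateway_ip;  /* gateway IPv4 address, host byte order */
    uint16_t port_id;     /* output port */
};

#define NEXTHOP_TABLE_SIZE 4u

const struct nexthop_info nexthop_table[NEXTHOP_TABLE_SIZE] = {
    { 0x0A000001u, 0 },  /* index 0: via 10.0.0.1 on port 0 */
    { 0x0A000101u, 1 },  /* index 1: via 10.0.1.1 on port 1 */
    { 0xC0A80001u, 2 },  /* index 2: via 192.168.0.1 on port 2 */
    { 0xC0A80101u, 3 },  /* index 3: via 192.168.1.1 on port 3 */
};

/* Resolve the index produced by an LPM lookup; NULL if it is out of range. */
const struct nexthop_info *resolve_nexthop(uint32_t lpm_next_hop)
{
    if (lpm_next_hop >= NEXTHOP_TABLE_SIZE)
        return NULL;
    return &nexthop_table[lpm_next_hop];
}
```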
static int32_t
rule_add(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, uint32_t next_hop)
{
    if (lpm->rule_info[depth - 1].used_rules > 0) {
        rule_gindex = lpm->rule_info[depth - 1].first_rule;
        rule_index = rule_gindex;
        last_rule = rule_gindex + lpm->rule_info[depth - 1].used_rules;
        for (; rule_index < last_rule; rule_index++) {
            /* If rule already exists update its next_hop and return. */
            if (lpm->rules_tbl[rule_index].ip == ip_masked) {
                lpm->rules_tbl[rule_index].next_hop = next_hop;
                return rule_index;
            }
        }
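Otherwise, when the group for this depth holds no rules yet, the code walks the groups of smaller depth from deepest to shallowest and stops at the first one that contains rules. The index just past that group's last rule (first_rule + used_rules) becomes the first index (first_rule) of the new rule's group. If no shallower group contains rules, the index stays 0.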
    } else {
        /* Calculate the position in which the rule will be stored. */
        rule_index = 0;
        for (i = depth - 1; i > 0; i--) {
            if (lpm->rule_info[i - 1].used_rules > 0) {
                rule_index = lpm->rule_info[i - 1].first_rule
                        + lpm->rule_info[i - 1].used_rules;
                break;
            }
        }
        if (rule_index == lpm->max_rules)
            return -ENOSPC;
        lpm->rule_info[depth - 1].first_rule = rule_index;
    }
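At this point rule_index is the position where the new rule will be stored. Next, every group deeper than the new rule's group is shifted back by one slot to make room: since the order of rules inside a group does not matter, each group simply moves its first rule to the slot just past its last rule and increments first_rule. The new rule's parameters (ip_masked, next_hop) are then written at rule_index.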
    /* Make room for the new rule in the array. */
    for (i = RTE_LPM_MAX_DEPTH; i > depth; i--) {
        if (lpm->rule_info[i - 1].first_rule
                + lpm->rule_info[i - 1].used_rules == lpm->max_rules)
            return -ENOSPC;
        if (lpm->rule_info[i - 1].used_rules > 0) {
            lpm->rules_tbl[lpm->rule_info[i - 1].first_rule
                    + lpm->rule_info[i - 1].used_rules]
                    = lpm->rules_tbl[lpm->rule_info[i - 1].first_rule];
            lpm->rule_info[i - 1].first_rule++;
        }
    }
    /* Add the new rule. */
    lpm->rules_tbl[rule_index].ip = ip_masked;
    lpm->rules_tbl[rule_index].next_hop = next_hop;
    /* Increment the used rules counter for this rule group. */
    lpm->rule_info[depth - 1].used_rules++;
    return rule_index;
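The three steps of rule_add (search the group, compute the insert position, shift the deeper groups) can be tried out in isolation with a scaled-down model. The sketch below is a simplified re-creation rather than DPDK code: the type names, toy_rule_add, and the 8-rule capacity are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_DEPTH 32
#define MAX_RULES 8  /* deliberately tiny, for illustration only */

struct toy_rule  { uint32_t ip; uint32_t next_hop; };
struct toy_group { uint32_t first_rule; uint32_t used_rules; };

struct rule_table {
    struct toy_rule  rules[MAX_RULES]; /* grouped by ascending depth */
    struct toy_group info[MAX_DEPTH];  /* info[d - 1] describes depth d */
};

/* Returns the index of the stored rule, or -ENOSPC when the table is full. */
int toy_rule_add(struct rule_table *t, uint32_t ip_masked, uint8_t depth,
                 uint32_t next_hop)
{
    uint32_t index, last;
    int i;

    assert(depth >= 1 && depth <= MAX_DEPTH);

    if (t->info[depth - 1].used_rules > 0) {
        /* Step 1: the group exists; update the rule if it is already there. */
        index = t->info[depth - 1].first_rule;
        last = index + t->info[depth - 1].used_rules;
        for (; index < last; index++) {
            if (t->rules[index].ip == ip_masked) {
                t->rules[index].next_hop = next_hop;
                return (int)index;
            }
        }
        if (index == MAX_RULES)
            return -ENOSPC;
    } else {
        /* Step 2: empty group; start right after the nearest shallower group. */
        index = 0;
        for (i = depth - 1; i > 0; i--) {
            if (t->info[i - 1].used_rules > 0) {
                index = t->info[i - 1].first_rule + t->info[i - 1].used_rules;
                break;
            }
        }
        if (index == MAX_RULES)
            return -ENOSPC;
        t->info[depth - 1].first_rule = index;
    }

    /* Step 3: rotate every deeper group back by one slot, deepest first. */
    for (i = MAX_DEPTH; i > depth; i--) {
        struct toy_group *g = &t->info[i - 1];

        if (g->first_rule + g->used_rules == MAX_RULES)
            return -ENOSPC;
        if (g->used_rules > 0) {
            t->rules[g->first_rule + g->used_rules] = t->rules[g->first_rule];
            g->first_rule++;
        }
    }

    t->rules[index].ip = ip_masked;
    t->rules[index].next_hop = next_hop;
    t->info[depth - 1].used_rules++;
    return (int)index;
}
```

For example, adding 10.0.0.0/8 and 192.168.0.0/16 to a zero-initialized table stores them at indexes 0 and 1; adding 172.16.0.0/12 afterwards moves the /16 rule to index 2 and places the /12 rule at index 1.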
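That completes the addition of the rule itself. Next comes the addition of the lookup entries in the LPM library's interface function rte_lpm_add. The work is split in two according to the rule's depth (the IP mask length): depths of 24 bits or less are handled by add_depth_small, and depths greater than 24 bits are handled by the more complex add_depth_big.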
int rte_lpm_add(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint32_t next_hop)
{
    ip_masked = ip & depth_to_mask(depth);
    rule_index = rule_add(lpm, ip_masked, depth, next_hop);
    /* If there is no space available for new rule return error. */
    if (rule_index < 0)
        return rule_index;
    if (depth <= MAX_DEPTH_TBL24) {
        status = add_depth_small(lpm, ip_masked, depth, next_hop);
    } else { /* If depth > RTE_LPM_MAX_DEPTH_TBL24 */
        status = add_depth_big(lpm, ip_masked, depth, next_hop);
        /*
         * If add fails due to exhaustion of tbl8 extensions delete rule that was added to rule table.
         */
        if (status < 0) {
            rule_delete(lpm, rule_index, depth);
            return status;
        }
    }
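rte_lpm_add masks the address with depth_to_mask(depth) before doing anything else, so that only the prefix bits remain. A minimal sketch of that masking step follows; the toy_ functions are illustrative stand-ins written for this article, not DPDK's own implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Network mask with the top `depth` bits set, for depth in 1..32. */
uint32_t toy_depth_to_mask(uint8_t depth)
{
    assert(depth >= 1 && depth <= 32);
    return (uint32_t)(UINT32_MAX << (32 - depth));
}

/* Clear the host bits of ip (host byte order), keeping only the prefix. */
uint32_t toy_mask_prefix(uint32_t ip, uint8_t depth)
{
    return ip & toy_depth_to_mask(depth);
}
```

With these definitions, 192.168.1.100 (0xC0A80164) at depth 24 is reduced to 192.168.1.0 (0xC0A80100).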
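First, the function that adds rules with a depth of 24 bits or less. It uses the first 24 bits of the IP address as the starting index into tbl24 and looks for entries it may write: an entry that is not yet in use (valid is 0), or an entry that is in use, has not been extended (valid_group is 0), and has a depth less than or equal to that of the rule being added. By the longest-match principle such an entry is replaced.
Note that the loop starts at the index given by the first 24 bits of the new rule's IP address and runs to the largest index covered by this depth. Every entry in that range that satisfies the condition above is replaced.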
static __rte_noinline int32_t
add_depth_small(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint32_t next_hop)
{
#define group_idx next_hop
    uint32_t tbl24_index, tbl24_range, tbl8_index, tbl8_group_end, i, j;
    tbl24_index = ip >> 8;
    tbl24_range = depth_to_range(depth);
    for (i = tbl24_index; i < (tbl24_index + tbl24_range); i++) {
        if (!lpm->tbl24[i].valid || (lpm->tbl24[i].valid_group == 0 &&
                lpm->tbl24[i].depth <= depth)) {
            struct rte_lpm_tbl_entry new_tbl24_entry = {
                .next_hop = next_hop,
                .valid = VALID,
                .valid_group = 0,
                .depth = depth,
            };
            /* Setting tbl24 entry in one go to avoid race conditions
             */
            __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, __ATOMIC_RELEASE);
            continue;
        }
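A tbl24 entry whose valid_group is 1 has been extended, meaning that some rule deeper than 24 bits falls under it, and its next_hop field holds a tbl8 group index instead of a next hop. In that case the code walks all 256 entries of that tbl8 group and replaces every entry that is invalid or whose depth is less than or equal to the depth of the rule being added.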
        if (lpm->tbl24[i].valid_group == 1) {
            /* If tbl24 entry is valid and extended calculate the index into tbl8.
             */
            tbl8_index = lpm->tbl24[i].group_idx * RTE_LPM_TBL8_GROUP_NUM_ENTRIES;
            tbl8_group_end = tbl8_index + RTE_LPM_TBL8_GROUP_NUM_ENTRIES;
            for (j = tbl8_index; j < tbl8_group_end; j++) {
                if (!lpm->tbl8[j].valid || lpm->tbl8[j].depth <= depth) {
                    struct rte_lpm_tbl_entry
                        new_tbl8_entry = {
                        .valid = VALID,
                        .valid_group = VALID,
                        .depth = depth,
                        .next_hop = next_hop,
                    };
                    /* Setting tbl8 entry in one go to avoid race conditions
                     */
                    __atomic_store(&lpm->tbl8[j], &new_tbl8_entry, __ATOMIC_RELAXED);
                    continue;
                }
            }
        }
    }
#undef group_idx
    return 0;
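The number of entries a rule occupies depends only on its depth: a rule of depth 24 or less covers 2^(24 - depth) tbl24 entries, and a deeper rule covers 2^(32 - depth) entries of one tbl8 group. The sketch below reproduces that calculation as this article describes it for depth_to_range; toy_depth_to_range is an illustrative stand-in, not the DPDK function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of consecutive table entries covered by a rule:
 * tbl24 entries for depth <= 24, tbl8 entries for depth > 24.
 */
uint32_t toy_depth_to_range(uint8_t depth)
{
    assert(depth >= 1 && depth <= 32);
    if (depth <= 24)
        return 1u << (24 - depth);
    return 1u << (32 - depth);
}
```

A /16 rule therefore fills 256 tbl24 entries, a /24 rule fills exactly one, and a /30 rule fills 4 entries of a tbl8 group.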
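The function add_depth_big below adds rules with a depth greater than 24 bits. Consider the first case: the tbl24 entry selected by the first 24 bits of the IP address is invalid (valid is 0). This means no tbl8 group has been created for it yet. The tbl8 table is divided into 256 groups, and tbl8_alloc returns the index of a free group (tbl8_group_index); the first entry of an allocated group has valid_group set. The code then fills every entry in the group that falls within the range covered by the rule's depth with an entry built from the new rule, leaving each entry's valid_group unchanged.
As depth_to_range shows, for a depth greater than 24 bits the range is (1 << (32 - depth)); a depth of 30, for example, gives a range of 4. Because this tbl8 group was just created and its entries are empty, they can be filled directly. Finally, the corresponding tbl24 entry is marked valid and extended, and its next_hop field (aliased as group_idx in the code) is set to the index of the tbl8 group.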
static __rte_noinline int32_t
add_depth_big(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, uint32_t next_hop)
{
#define group_idx next_hop
    tbl24_index = (ip_masked >> 8);
    tbl8_range = depth_to_range(depth);
    if (!lpm->tbl24[tbl24_index].valid) {
        /* Search for a free tbl8 group. */
        tbl8_group_index = tbl8_alloc(lpm->tbl8, lpm->number_tbl8s);
        if (tbl8_group_index < 0)
            return tbl8_group_index;
        /* Find index into tbl8 and range. */
        tbl8_index = (tbl8_group_index * RTE_LPM_TBL8_GROUP_NUM_ENTRIES) + (ip_masked & 0xFF);
        /* Set tbl8 entry. */
        for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) {
            struct rte_lpm_tbl_entry new_tbl8_entry = {
                .valid = VALID,
                .depth = depth,
                .valid_group = lpm->tbl8[i].valid_group,
                .next_hop = next_hop,
            };
            __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, __ATOMIC_RELAXED);
        }
        /* Update tbl24 entry to point to new tbl8 entry. Note: The ext_flag and
         * tbl8_index need to be updated simultaneously, so assign whole structure in one go
         */
        struct rte_lpm_tbl_entry new_tbl24_entry = {
            .group_idx = tbl8_group_index,
            .valid = VALID,
            .valid_group = 1,
            .depth = 0,
        };
        __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, __ATOMIC_RELEASE);
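The second case: the corresponding tbl24 entry is valid but has not been extended yet (valid_group is 0), so its tbl8 group does not exist. As in the first case, a tbl8 group is allocated and its group index obtained. The difference is that the old tbl24 entry is first copied into all 256 entries of the new tbl8 group, using the depth and next_hop of the tbl24 entry, so that addresses outside the new rule's range keep matching the old rule. The remaining steps are the same as in the first case.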
    } /* If valid entry but not extended calculate the index into Table8. */
    else if (lpm->tbl24[tbl24_index].valid_group == 0) {
        /* Search for free tbl8 group. */
        tbl8_group_index = tbl8_alloc(lpm->tbl8, lpm->number_tbl8s);
        if (tbl8_group_index < 0)
            return tbl8_group_index;
        tbl8_group_start = tbl8_group_index * RTE_LPM_TBL8_GROUP_NUM_ENTRIES;
        tbl8_group_end = tbl8_group_start + RTE_LPM_TBL8_GROUP_NUM_ENTRIES;
        /* Populate new tbl8 with tbl24 value. */
        for (i = tbl8_group_start; i < tbl8_group_end; i++) {
            struct rte_lpm_tbl_entry new_tbl8_entry = {
                .valid = VALID,
                .depth = lpm->tbl24[tbl24_index].depth,
                .valid_group = lpm->tbl8[i].valid_group,
                .next_hop = lpm->tbl24[tbl24_index].next_hop,
            };
            __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, __ATOMIC_RELAXED);
        }
        tbl8_index = tbl8_group_start + (ip_masked & 0xFF);
        /* Insert new rule into the tbl8 entry. */
        for (i = tbl8_index; i < tbl8_index + tbl8_range; i++) {
            struct rte_lpm_tbl_entry new_tbl8_entry = {
                .valid = VALID,
                .depth = depth,
                .valid_group = lpm->tbl8[i].valid_group,
                .next_hop = next_hop,
            };
            __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, __ATOMIC_RELAXED);
        }
        /* Update tbl24 entry to point to new tbl8 entry. Note: The ext_flag and
         * tbl8_index need to be updated simultaneously, so assign whole structure in one go.
         */
        struct rte_lpm_tbl_entry new_tbl24_entry = {
            .group_idx = tbl8_group_index,
            .valid = VALID,
            .valid_group = 1,
            .depth = 0,
        };
        __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, __ATOMIC_RELEASE);
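The third case: the tbl8 group has already been allocated. The group index is read from the next_hop field of the tbl24 entry, and the code then walks the tbl8 entries in the index range covered by the depth, filling every entry that is invalid or whose depth is less than or equal to the depth of the new rule.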
    } else {
        /* If it is valid, extended entry calculate the index into tbl8. */
        tbl8_group_index = lpm->tbl24[tbl24_index].group_idx;
        tbl8_group_start = tbl8_group_index * RTE_LPM_TBL8_GROUP_NUM_ENTRIES;
        tbl8_index = tbl8_group_start + (ip_masked & 0xFF);
        for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) {
            if (!lpm->tbl8[i].valid || lpm->tbl8[i].depth <= depth) {
                struct rte_lpm_tbl_entry new_tbl8_entry = {
                    .valid = VALID,
                    .depth = depth,
                    .next_hop = next_hop,
                    .valid_group = lpm->tbl8[i].valid_group,
                };
                /* Setting tbl8 entry in one go to avoid race condition
                 */
                __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, __ATOMIC_RELAXED);
                continue;
            }
        }
    }
#undef group_idx
    return 0;
}
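The second case is the least obvious of the three, so here is a reduced sketch of it: a new tbl8 group is first flooded with the shorter rule that used to sit in tbl24, and the longer rule is then overlaid on its own sub-range. The sketch uses a plain struct instead of DPDK's packed entry and leaves out group allocation and atomic stores; all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_ENTRIES 256

/* Simplified table entry; plain fields instead of a packed bitfield. */
struct toy_entry {
    uint32_t next_hop;
    uint8_t  depth;
    uint8_t  valid;
};

/*
 * Fill a freshly allocated tbl8 group from the shorter rule held in tbl24,
 * then overlay the new, longer rule on the sub-range it covers.
 */
void toy_expand_group(struct toy_entry group[GROUP_ENTRIES],
                      struct toy_entry old_tbl24,
                      uint32_t ip_masked, uint8_t depth, uint32_t next_hop)
{
    uint32_t start, range, i;

    assert(depth > 24 && depth <= 32);
    start = ip_masked & 0xFF;     /* offset inside the group */
    range = 1u << (32 - depth);   /* entries covered by the new rule */
    assert(start + range <= GROUP_ENTRIES);

    /* Every address under this /24 initially keeps matching the old rule. */
    for (i = 0; i < GROUP_ENTRIES; i++)
        group[i] = old_tbl24;

    /* The longer prefix wins inside its own range. */
    for (i = start; i < start + range; i++) {
        group[i].next_hop = next_hop;
        group[i].depth = depth;
        group[i].valid = 1;
    }
}
```

For instance, expanding an old /16 entry and overlaying 192.168.1.64/26 leaves entries 0x00 to 0x3F and 0x80 to 0xFF on the old next hop, while entries 0x40 to 0x7F carry the new one.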
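Before turning to the lookup function, look at the LPM table entry structure rte_lpm_tbl_entry. The positions of its valid_group and valid fields are captured by the bitmask in the macro RTE_LPM_VALID_EXT_ENTRY_BITMASK, which is used below. The field order shown here is the one DPDK declares for big-endian builds; on little-endian builds the fields are declared in the reverse order, so in both cases next_hop occupies the low 24 bits of the 32-bit word.
In addition, the macro RTE_LPM_LOOKUP_SUCCESS marks the position of the valid flag and is used below to check whether an entry is valid.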
struct rte_lpm_tbl_entry {
    uint32_t depth       :6;
    uint32_t valid_group :1;
    uint32_t valid       :1;
    uint32_t next_hop    :24;
};
#define RTE_LPM_VALID_EXT_ENTRY_BITMASK 0x03000000
#define RTE_LPM_LOOKUP_SUCCESS          0x01000000
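To see how the two masks relate to the fields, the sketch below packs an entry into a 32-bit word by hand. It assumes the bit positions implied by the masks above and by the 0x00FFFFFF next-hop mask used in the lookup code: next_hop in bits 0 to 23, valid in bit 24, valid_group in bit 25, and depth in bits 26 to 31. The macro and function names are illustrative, not DPDK's.

```c
#include <assert.h>
#include <stdint.h>

#define NEXT_HOP_MASK   0x00FFFFFFu  /* bits 0-23 */
#define VALID_BIT       0x01000000u  /* bit 24, same value as RTE_LPM_LOOKUP_SUCCESS */
#define VALID_GROUP_BIT 0x02000000u  /* bit 25 */
#define EXT_ENTRY_MASK  0x03000000u  /* valid | valid_group */
#define DEPTH_SHIFT     26           /* depth lives in bits 26-31 */

/* Pack the four fields into one 32-bit word. */
uint32_t toy_pack_entry(uint32_t next_hop, int valid, int valid_group,
                        uint8_t depth)
{
    assert(next_hop <= NEXT_HOP_MASK && depth < 64);
    return next_hop
        | (valid ? VALID_BIT : 0u)
        | (valid_group ? VALID_GROUP_BIT : 0u)
        | ((uint32_t)depth << DEPTH_SHIFT);
}

/* True when the word is a valid, extended entry (it points at a tbl8 group). */
int toy_is_extended(uint32_t entry)
{
    return (entry & EXT_ENTRY_MASK) == EXT_ENTRY_MASK;
}

/* True when the valid flag is set. */
int toy_is_valid(uint32_t entry)
{
    return (entry & VALID_BIT) != 0;
}
```

Working on the raw word lets the lookup path test both flags with a single AND and compare, instead of reading two separate bitfields.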
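The LPM lookup function rte_lpm_lookup is shown below. It first uses the first 24 bits of the IP address as an index to fetch the corresponding entry from tbl24. If that entry is not a valid extended entry (valid and valid_group are not both set), its low 24 bits are the next_hop value being looked up. Otherwise, the tbl8 group index stored in the tbl24 entry, combined with the last 8 bits of the IP address, locates the entry to read from tbl8.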
static inline int rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop)
{
    unsigned tbl24_index = (ip >> 8);
    uint32_t tbl_entry;
    const uint32_t *ptbl;
    /* DEBUG: Check user input arguments. */
    RTE_LPM_RETURN_IF_TRUE(((lpm == NULL) || (next_hop == NULL)), -EINVAL);
    /* Copy tbl24 entry */
    ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]);
    tbl_entry = *ptbl;
    /* Memory ordering is not required in lookup. Because dataflow
     * dependency exists, compiler or HW won't be able to re-order the operations.
     */
    /* Copy tbl8 entry (only if needed) */
    if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) ==
            RTE_LPM_VALID_EXT_ENTRY_BITMASK)) {
        unsigned tbl8_index = (uint8_t)ip +
                (((uint32_t)tbl_entry & 0x00FFFFFF) * RTE_LPM_TBL8_GROUP_NUM_ENTRIES);
        ptbl = (const uint32_t *)&lpm->tbl8[tbl8_index];
        tbl_entry = *ptbl;
    }
    *next_hop = ((uint32_t)tbl_entry & 0x00FFFFFF);
    return (tbl_entry & RTE_LPM_LOOKUP_SUCCESS) ? 0 : -ENOENT;
}
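The lookup can be exercised without DPDK by running the same two-step logic over plain arrays of 32-bit words. The sketch below is a standalone model: the struct, the function names, and the hand-written table contents are assumptions made for illustration, and the depth bits of each entry are left at zero because the lookup never reads them.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TBL24_ENTRIES (1u << 24)
#define GROUP_ENTRIES 256u
#define NEXT_HOP_MASK 0x00FFFFFFu
#define VALID_BIT     0x01000000u
#define EXT_MASK      0x03000000u  /* valid | valid_group */

struct toy_lpm {
    uint32_t *tbl24;  /* TBL24_ENTRIES words */
    uint32_t *tbl8;   /* GROUP_ENTRIES words per group */
};

/* Same two-step lookup as above, over raw words. */
int toy_lookup(const struct toy_lpm *lpm, uint32_t ip, uint32_t *next_hop)
{
    uint32_t e = lpm->tbl24[ip >> 8];

    /* Extended entry: the low 24 bits are a tbl8 group index. */
    if ((e & EXT_MASK) == EXT_MASK)
        e = lpm->tbl8[(e & NEXT_HOP_MASK) * GROUP_ENTRIES + (ip & 0xFF)];

    *next_hop = e & NEXT_HOP_MASK;
    return (e & VALID_BIT) ? 0 : -ENOENT;
}

/* Build a tiny table by hand and look up one address.
 * Returns the next hop, or -1 on a miss or allocation failure. */
int toy_lookup_demo(uint32_t ip)
{
    struct toy_lpm lpm;
    uint32_t nh = 0, i;
    int rc;

    lpm.tbl24 = calloc(TBL24_ENTRIES, sizeof(uint32_t));
    lpm.tbl8 = calloc(GROUP_ENTRIES, sizeof(uint32_t));  /* a single group */
    if (lpm.tbl24 == NULL || lpm.tbl8 == NULL) {
        free(lpm.tbl24);
        free(lpm.tbl8);
        return -1;
    }

    /* 192.168.1.0/24 -> next hop 5, stored directly in tbl24. */
    lpm.tbl24[0xC0A801] = VALID_BIT | 5u;
    /* 192.168.2.0/24 is extended: tbl24 points at tbl8 group 0. */
    lpm.tbl24[0xC0A802] = EXT_MASK | 0u;
    /* In the group: 192.168.2.0/24 -> 6, overridden by 192.168.2.128/25 -> 7. */
    for (i = 0; i < GROUP_ENTRIES; i++)
        lpm.tbl8[i] = VALID_BIT | (i >= 128 ? 7u : 6u);

    rc = toy_lookup(&lpm, ip, &nh);
    free(lpm.tbl24);
    free(lpm.tbl8);
    return rc == 0 ? (int)nh : -1;
}
```

In this model 192.168.1.1 resolves in a single memory access, while 192.168.2.1 and 192.168.2.200 need a second access into tbl8 and return 6 and 7 respectively; any address outside these prefixes is a miss.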
DPDK version: 19.11.3