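The goal is to optimize classification time and memory footprint by searching for a globally optimal solution rather than a locally optimal (greedy) one, with no manual parameter tuning required.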
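1. Argues why RL is well suited to decision-tree-based packet classification.
2. Solves two problems that arise when applying RL to decision-tree-based packet classification:
(1) The decision tree grows dynamically, so how can it be encoded as a fixed-length input?
(2) Feedback is delayed for too long, which leads to weak reward attribution; put simply, the variance of the training results is too high.
The corresponding solutions:
(1) Encode only the current node, represented by the left and right endpoints of the range it covers in each field of the five-tuple.
(2) Change the training process from a linear one to a tree-structured one.
3. Compares performance against four algorithms and claims an 18% improvement over the baselines.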
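Packet classification resembles the point location problem in a high-dimensional geometric space: the fields are the dimensions, a packet is a point in the space, and a rule is a hypercube.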
Time and space complexity of the point location problem (d-dimensional data, n non-overlapping hypercubes, d > 3): either
a lower bound of $O(\log n)$ time and $O(n^d)$ space, or
a lower bound of $O(\log^{d-1} n)$ time and $O(n)$ space.
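The geometric view above can be made concrete with a minimal sketch. It assumes half-open integer ranges and that a larger number means higher priority; both conventions, and the names `matches` and `classify_linear`, are illustrative assumptions rather than anything taken from the paper. Real rule sets overlap, which is why a priority is needed to break ties.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[int, int]          # half-open [lo, hi) range on one field (assumed convention)
Rule = Tuple[int, List[Range]]   # (priority, one range per dimension)


def matches(packet: Sequence[int], ranges: Sequence[Range]) -> bool:
    """Return True if the packet (a point) lies inside the rule's hypercube."""
    return all(lo <= value < hi for value, (lo, hi) in zip(packet, ranges))


def classify_linear(packet: Sequence[int], rules: Sequence[Rule]) -> Optional[int]:
    """Linear scan: priority of the highest-priority matching rule, or None."""
    best = None
    for priority, ranges in rules:
        if matches(packet, ranges) and (best is None or priority > best):
            best = priority
    return best


# Two-dimensional toy rule set: a small square inside a large one.
rules = [(2, [(0, 10), (0, 10)]), (1, [(0, 100), (0, 100)])]
print(classify_linear((5, 5), rules))    # inside both hypercubes
print(classify_linear((50, 50), rules))  # inside the large one only
```

The linear scan is only a reference point: decision trees aim to return the same answer while inspecting far fewer rules.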
In particular, if a rule has a large size along one dimension, cutting along that dimension will result in that rule being added to many nodes.
Rule replication can lead to decision trees with larger depths and sizes, which translate to higher classification time and memory footprint.
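A small sketch can show how rule replication arises. It assumes a node is cut into equal-width children along one dimension and counts how many children a rule overlaps; `cut_node` and `replication_count` are hypothetical helper names, and half-open integer ranges are an assumed convention.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [lo, hi) range along the cut dimension


def cut_node(node_range: Range, num_cuts: int) -> List[Range]:
    """Split a node's range into num_cuts children (the last child absorbs any remainder)."""
    lo, hi = node_range
    step = (hi - lo) // num_cuts
    bounds = [lo + i * step for i in range(num_cuts)] + [hi]
    return list(zip(bounds[:-1], bounds[1:]))


def overlaps(a: Range, b: Range) -> bool:
    """Two half-open ranges overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def replication_count(rule_range: Range, node_range: Range, num_cuts: int) -> int:
    """Number of child nodes that must store a copy of the rule after the cut."""
    return sum(overlaps(rule_range, child) for child in cut_node(node_range, num_cuts))


print(replication_count((0, 16), (0, 16), 4))  # wide rule: copied into all 4 children
print(replication_count((5, 7), (0, 16), 4))   # narrow rule: stays in 1 child
```

A rule that spans the whole dimension lands in every child, which is the replication effect described above.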
Partition rules based on their “shapes”: rules with large sizes in a particular dimension are put in the same set. Then, we can build a separate decision tree for each of these partitions.
To classify a packet, we classify it against every decision tree, and then choose the highest priority rule among all rules the packet matches in all decision trees.
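The partition-then-classify procedure can be sketched as follows. A linear scan stands in for each per-partition decision tree, and the size threshold, the "larger number wins" priority convention, and all function names are assumptions made for illustration only.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[int, int]          # half-open [lo, hi) range on one field
Rule = Tuple[int, List[Range]]   # (priority, one range per dimension)


def matches(packet: Sequence[int], ranges: Sequence[Range]) -> bool:
    return all(lo <= value < hi for value, (lo, hi) in zip(packet, ranges))


def partition_by_size(rules: Sequence[Rule], dim: int, threshold: int) -> List[List[Rule]]:
    """Group rules by whether they are large along `dim`; empty groups are dropped."""
    small = [r for r in rules if r[1][dim][1] - r[1][dim][0] < threshold]
    large = [r for r in rules if r[1][dim][1] - r[1][dim][0] >= threshold]
    return [group for group in (small, large) if group]


def lookup(packet: Sequence[int], partition: Sequence[Rule]) -> Optional[int]:
    """Stand-in for one decision-tree lookup: best matching priority in this partition."""
    hits = [priority for priority, ranges in partition if matches(packet, ranges)]
    return max(hits) if hits else None


def classify(packet: Sequence[int], partitions: Sequence[Sequence[Rule]]) -> Optional[int]:
    """Query every partition, then keep the highest-priority match overall."""
    hits = [h for h in (lookup(packet, p) for p in partitions) if h is not None]
    return max(hits) if hits else None
```

The trade-off is that every lookup now costs one query per partition, in exchange for less replication inside each tree.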
In contrast, decision trees for packet classification provide perfect accuracy by construction, and the goal is to minimize classification time and memory footprint.
Decision trees guarantee that the classification is correct.
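7. The heuristics are tuned by hand; given a completely new rule set, all of the tuning has to be redone.
8. They do not search for a global optimum, and the heuristics are poorly correlated with the desired objective.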
Learn to build decision trees for a given set of rules.
a) An RL system consists of an agent that repeatedly interacts with an environment. The agent observes the state of the environment, and then takes an action that might change the environment’s state. The goal of the agent is to compute a policy that maps the environment’s state to an action in order to optimize a reward.
RL has been applied to networking and systems problems: routing, congestion control, video streaming, and job scheduling.
Building a decision tree can be easily cast as an RL problem: the environment’s state is the current decision tree, an action is either cutting a node or partitioning a set of rules, and the reward is either the classification time, memory footprint, or a combination of the two.
The goal of an RL algorithm is to compute a policy that maximizes rewards from the environment. It starts with an initial policy, evaluates it using multiple rollouts, and then updates it based on the results (rewards) of these rollouts. Then, it repeats this process until satisfied with the reward.
$V_n = \operatorname{argmax}_{a \in A} -\left(c \cdot T_n + (1-c) \cdot S_n\right)$
c=0: $V_n = \operatorname{argmax}_{a \in A} -S_n$
c=1: $V_n = \operatorname{argmax}_{a \in A} -T_n$
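The combined objective can be sketched in a few lines. Here `time_cost` and `space_cost` play the roles of T_n and S_n, with c weighting the time term and (1 - c) the space term as in the combined formula; the candidate actions and their costs are made-up numbers for illustration, not measurements.

```python
from typing import Dict, Tuple


def objective(time_cost: float, space_cost: float, c: float) -> float:
    """Negated weighted cost: c weights time, (1 - c) weights space."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return -(c * time_cost + (1.0 - c) * space_cost)


def best_action(candidates: Dict[str, Tuple[float, float]], c: float) -> str:
    """Pick the action whose (time, space) costs maximize the objective."""
    return max(candidates, key=lambda action: objective(*candidates[action], c))


# Hypothetical costs (T, S) that two candidate cuts would lead to.
candidates = {"cut_src_ip": (3.0, 100.0), "cut_dst_port": (5.0, 40.0)}
print(best_action(candidates, c=1.0))  # only the time term remains
print(best_action(candidates, c=0.0))  # only the space term remains
```

Intermediate values of c trade the two costs off against each other.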
Given d dimensions, we use 2d numbers to encode a tree node, which indicate the left and right boundaries of each dimension for this node.
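As a minimal sketch of this encoding, the function below flattens a node's per-dimension ranges into 2d numbers. The five-tuple field widths and the name `encode_node` are assumptions for illustration; the actual input representation fed to the network may differ (for example, it could be normalized or binary-encoded).

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]  # (left boundary, right boundary) of one dimension


def encode_node(ranges: Sequence[Range]) -> List[int]:
    """Flatten d (left, right) pairs into a fixed-length vector of 2d numbers."""
    encoding: List[int] = []
    for left, right in ranges:
        encoding.extend((left, right))
    return encoding


# Assumed five-tuple: src IP, dst IP, src port, dst port, protocol.
root = [(0, 2**32), (0, 2**32), (0, 2**16), (0, 2**16), (0, 2**8)]
# A child obtained by halving the root along the source-IP dimension.
child = [(0, 2**31)] + root[1:]

print(len(encode_node(root)), len(encode_node(child)))  # same length at any depth
```

The vector length depends only on d, not on how large the tree has grown, which is what makes it usable as a fixed-size model input.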
NeuroCuts learns to account for packet classifier rules implicitly through the rewards it gets from the environment.
$Q(s, a) \leftarrow Q(s, a) + \alpha\left(r + \max_{a'} Q(s', a') - Q(s, a)\right)$
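The update rule above can be sketched as a tabular Q-learning step. It follows the formula as written in these notes, with no discount factor on the next-state term; the table layout (a dictionary keyed by state-action pairs) and the name `q_update` are illustrative assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Iterable, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, state: Hashable, action: Hashable, reward: float,
             next_state: Hashable, next_actions: Iterable[Hashable],
             alpha: float) -> float:
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + max_a' Q(s',a') - Q(s,a))."""
    # A terminal next state has no actions; its best value is taken as 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_error = reward + best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


q: QTable = defaultdict(float)  # unseen state-action pairs start at 0
q[("s1", "x")] = 2.0
print(q_update(q, "s0", "a", reward=1.0, next_state="s1",
               next_actions=["x", "y"], alpha=0.5))  # 0 + 0.5 * (1 + 2 - 0) = 1.5
```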
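A brief introduction to reward shaping in RL (The reward shaping of RL)
The reward of an action is aggregated from the multiple child states it produces. For example, when a node is cut, the reward computation takes either the sum or the min of the children's future rewards. If we want to optimize the tree size, we choose sum; if we want to optimize the tree depth, we choose min.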
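The sum-versus-min aggregation can be sketched on a toy tree. The sketch assumes each node contributes a cost of 1 (so its own reward is -1) and adds the aggregated rewards of its children; this per-node cost, the `Node` class, and `subtree_reward` are assumptions for illustration, not the paper's exact definitions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)


def subtree_reward(node: Node, aggregate: Callable[[Sequence[float]], float]) -> float:
    """Reward of a node: -1 for the node itself plus the aggregated child rewards."""
    child_rewards = [subtree_reward(child, aggregate) for child in node.children]
    future = aggregate(child_rewards) if child_rewards else 0.0
    return -1.0 + future


# Root with a leaf child and a child that has one leaf of its own.
tree = Node([Node(), Node([Node()])])

print(subtree_reward(tree, sum))  # -4.0: negated number of nodes (size)
print(subtree_reward(tree, min))  # -3.0: negated depth, counted in nodes
```

Because rewards are negative costs, taking the min over children follows the deepest branch, while the sum accumulates every node in the subtree.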
The challenge is how to deal with the sparse and delayed rewards incurred by the node-by-node process of building the decision tree. While we could in principle return a single reward to the agent when the tree is complete, it would be very difficult to train an agent in such an environment. Due to the long length of tree rollouts (i.e., many thousands of steps), learning is only practical if we can compute meaningful dense rewards. Such a dense reward for an action would be based on the statistics of the subtree it leads to (i.e., its depth or size). In short, the rewards are made dense.
Our goal to optimize a global performance objective over the entire tree suggests that we would need to make decisions based on the global state. However, this does not mean that the state representation needs to encode the entire decision tree. Given a tree node, the action on that node only needs to make the best decision to optimize the sub-tree rooted at that node. It does not need to consider other tree nodes in the decision tree.
an actor-critic algorithm
$\sum_{t} \gamma^{t} r_{t}$, where $\gamma$ is a discounting factor
$\gamma$ is a number greater than 0 and less than 1; the closer a reward is to the present, the larger its weight.
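A short sketch of the discounted sum above, computed backwards so that each step only needs one multiplication. The function name `discounted_return` and the sample rewards are illustrative assumptions.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum over t of gamma**t * r_t, with t starting at 0."""
    total = 0.0
    # Work from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([0.0, 0.0, 8.0], gamma=0.5))  # 8 * 0.5**2 = 2.0
```

The second call shows the weighting effect: a reward two steps away counts for only a quarter of its value when gamma is 0.5.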
The end goal is to learn the optimal stochastic policy function $\pi(a \mid s; \theta)$.
Run N rollouts to train the policy and the value function; each rollout processes one root node.