The Transformer Network for the Traveling Salesman Problem
https://arxiv.org/pdf/2103.03012.pdf
Bresson X, Laurent T. The transformer network for the traveling salesman problem[J]. arXiv preprint arXiv:2103.03012, 2021.
@article{bresson2021transformer,
title={The transformer network for the traveling salesman problem},
author={Bresson, Xavier and Laurent, Thomas},
journal={arXiv preprint arXiv:2103.03012},
year={2021}
}
TSP overview:
The Traveling Salesman Problem (TSP) is the most popular and most studied combinatorial problem, starting with von Neumann in 1951. It has driven the discovery of several optimization techniques such as cutting planes, branch-and-bound, local search, Lagrangian relaxation, and simulated annealing.
Core question of this paper:
The main question is whether deep learning can learn better heuristics from data, i.e. replacing human-engineered heuristics?
Proposed approach:
In this work, we propose to adapt the recent successful Transformer architecture originally developed for natural language processing to the combinatorial TSP. Training is done by reinforcement learning, hence without TSP training solutions, and decoding uses beam search.
Two approaches to combinatorial optimization problems (COPs):
There exist two traditional approaches to tackle combinatorial problems:
exact algorithms and approximate/heuristic algorithms.
Exact:
Exact algorithms are guaranteed to find optimal solutions, but they become intractable when n grows.
Approximate:
Approximate algorithms trade optimality for computational efficiency. They are problem-specific, often designed by iteratively applying a simple hand-crafted rule, known as a heuristic. Their complexity is polynomial and their quality depends on an approximation ratio that characterizes the worst/average-case error w.r.t. the optimal solution.
Examples of exact algorithms:
Exact algorithms for TSP are given by exhaustive search, Dynamic or Integer Programming.
A Dynamic Programming algorithm was proposed for TSP in [16] with $O(n^2 2^n)$ complexity, which becomes intractable for n > 40.
A general purpose Integer Programming (IP) solver with Cutting Planes (CP) and Branch-and-Bound (BB) called Gurobi was introduced in [15].
Finally, a highly specialized linear IP+CP+BB, namely Concorde, was designed in [2].
Concorde is widely regarded as the fastest exact TSP solver currently in existence for large instances.
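To make the $O(n^2 2^n)$ cost of the dynamic programming approach concrete, here is a minimal sketch of the classic Held-Karp recursion. It is an illustration written for these notes, not code from the paper or from [16]; the function name `held_karp` is hypothetical. The table is indexed by (subset of cities, last city), which is where the $2^n$ factor comes from.

```python
from itertools import combinations
from math import dist

def held_karp(points):
    """Return the optimal tour length for a list of 2D points (exact, exponential time)."""
    n = len(points)
    if n < 2:
        return 0.0
    d = [[dist(p, q) for q in points] for p in points]
    # best[(mask, j)]: shortest path that starts at city 0, visits exactly the
    # cities whose bits are set in mask, and ends at city j (j is in mask).
    best = {(1 << j, j): d[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev = mask ^ (1 << j)  # same subset without the end city j
                best[(mask, j)] = min(
                    best[(prev, k)] + d[k][j] for k in subset if k != j
                )
    full = (1 << n) - 2  # every city except city 0
    # Close the tour by returning to city 0.
    return min(best[(full, j)] + d[j][0] for j in range(1, n))

# Four corners of the unit square: the optimal tour is the perimeter.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
optimal_length = held_karp(square)
```

The dictionary holds on the order of $n 2^n$ entries, so memory runs out long before n reaches the sizes handled by solvers such as Concorde.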
Approximate/heuristic methods:
Several approximate/heuristic algorithms have been introduced.
Christofides algorithm [7] approximates TSP with Minimum Spanning Trees.
The algorithm has polynomial-time complexity $O(n^2 \log n)$, and is guaranteed to find a solution within a factor 3/2 of the optimal solution.
Farthest/nearest/greedy insertion algorithms [20] have complexity $O(n^2)$, and farthest insertion (the best insertion in practice) has an approximation ratio of 2.43.
Google OR-Tools [14] is a highly optimized program that solves TSP and a larger set of vehicle routing problems. This program applies different heuristics, such as Simulated Annealing, Greedy Descent, and Tabu Search, to navigate the search space, and refines the solution by Local Search techniques.
OR-Tools guide: https://developers.google.com/optimization/introduction/overview
The 2-Opt algorithm [27, 21] proposes a heuristic based on a move that replaces two edges to reduce the tour length. The complexity is $O(n^2 m(n))$, where $n^2$ is the number of node pairs and $m(n)$ is the number of times all pairs must be tested to reach a local minimum (with the worst case being $O(2^{n/2})$). The approximation ratio is $4/\sqrt{n}$. Extensions to 3-Opt moves (replacing 3 edges) and more have been proposed in [6].
Finally, LKH-3 algorithm [18] introduces the best heuristic for solving TSP. It is an extension of the original LKH [28] and LKH-2 [17] based on 2-Opt/3-Opt where edge candidates are estimated with a Minimum Spanning Tree [17]. LKH-3 can tackle various TSP-type problems.
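The 2-Opt move described above is easy to sketch: pick two non-adjacent edges, and if reconnecting their endpoints the other way shortens the tour, reverse the segment between them. The code below is a plain illustration written for these notes (the names `tour_length` and `two_opt` are hypothetical), not the implementation used in [27, 21] or LKH.

```python
from math import dist

def tour_length(points, tour):
    """Length of the closed tour that visits points in the given index order."""
    n = len(tour)
    return sum(dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Apply improving 2-Opt moves until no pair of edges can be improved."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:  # each pass tests all O(n^2) edge pairs
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share city tour[0]
                a, b = points[tour[i]], points[tour[i + 1]]
                c, e = points[tour[j]], points[tour[(j + 1) % n]]
                # Replace edges (a, b) and (c, e) with (a, c) and (b, e).
                delta = dist(a, c) + dist(b, e) - dist(a, b) - dist(c, e)
                if delta < -1e-12:
                    tour[i + 1 : j + 1] = tour[i + 1 : j + 1][::-1]
                    improved = True
    return tour

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crossing = [0, 2, 1, 3]              # both diagonals are used, so the tour crosses itself
untangled = two_opt(square, crossing)
```

The number of passes of the outer `while` loop plays the role of $m(n)$ in the complexity estimate.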
Learned features replace hand-crafted features:
In the last decade, Deep learning (DL) has significantly improved Computer Vision, Natural Language Processing and Speech Recognition by replacing hand-crafted visual/text/speech features by features learned from data [26].
[26] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. nature, 2015, 521(7553): 436-444.
@article{lecun2015deep,
title={Deep learning},
author={LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey},
journal={nature},
volume={521},
number={7553},
pages={436--444},
year={2015},
publisher={Nature Publishing Group}
}
Here the paper expands on the abstract:
Core question: For combinatorial problems, the main question is whether DL can learn better heuristics from data than hand-crafted heuristics.
This is attractive because developing algorithms to efficiently tackle NP-hard problems requires years of research (TSP has been actively studied for seventy years).
A nicely idiomatic sentence:
The last five years have seen the emergence of promising techniques
where (graph) neural networks have been capable to learn new combinatorial algorithms with supervised or reinforcement learning.
Summary of recent work:
We briefly summarize this line of work below.
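Skipping this for now; let me look at the paper's method first.
TSP is treated as a translation problem: the source language is a set of 2D points and the target language is a shortest tour (a sequence of indices). The paper uses the original Transformer to solve TSP.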
We cast TSP as a “translation” problem where the source “language” is a set of 2D points and
the target “language” is a tour (sequence of indices) with minimal length, and adapt the original
Transformers [37] to solve this problem.
Training uses RL; the reward is the tour length.
If the trained network improves on the baseline over a set of random TSP instances, the baseline is updated accordingly.
We train by reinforcement learning, with the same setting as [25]. The reward is the tour length and the baseline is simply updated if the train network improves the baseline on a set of random TSPs.
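As a rough sketch of what this training signal looks like, assume a REINFORCE-style policy gradient with a baseline network: the advantage is the sampled tour length minus the baseline's tour length on the same instance, and it weights the summed log-probabilities of the sampled tour. Both function names below are hypothetical, and the exact loss and comparison rule used by the authors should be checked against their code.

```python
def reinforce_loss(train_lengths, baseline_lengths, sum_log_probs):
    """Surrogate loss: batch mean of (L_train - L_baseline) * sum_t log p(i_t).

    Minimizing it raises the probability of tours shorter than the baseline's
    and lowers the probability of tours that are longer.
    """
    terms = [
        (l_train - l_base) * log_p
        for l_train, l_base, log_p in zip(train_lengths, baseline_lengths, sum_log_probs)
    ]
    return sum(terms) / len(terms)

def should_update_baseline(train_eval_lengths, baseline_eval_lengths):
    """True if the train network has the shorter mean tour on the same random instances."""
    mean_train = sum(train_eval_lengths) / len(train_eval_lengths)
    mean_base = sum(baseline_eval_lengths) / len(baseline_eval_lengths)
    return mean_train < mean_base

# Toy numbers for a batch of two instances.
loss = reinforce_loss([4.0, 5.0], [4.5, 4.5], [-2.0, -3.0])
update = should_update_baseline([4.0, 4.2], [4.1, 4.3])
```

In a real training loop the log-probabilities would come from an autodiff framework so that the loss can be backpropagated; here they are plain floats to keep the arithmetic visible.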
The paper uses a standard Transformer model, except with BN instead of LN:
It is a standard Transformer encoder with multi-head attention and residual connection. The
only difference is the use of batch normalization, instead of layer normalization. The memory/speed
complexity is $O(n^2)$.
Formally, the encoder equations are (when considering a single head for an easier description)
$$
H^{\mathrm{enc}} = H^{\ell = L^{\mathrm{enc}}} \in \mathbb{R}^{(n+1) \times d},
$$
where
$$
\begin{aligned}
H^{\ell=0} &= \operatorname{Concat}(z, X) \in \mathbb{R}^{(n+1) \times 2}, \quad z \in \mathbb{R}^{2}, \quad X \in \mathbb{R}^{n \times 2}, \\
H^{\ell+1} &= \operatorname{softmax}\left(\frac{Q^{\ell} {K^{\ell}}^{T}}{\sqrt{d}}\right) V^{\ell} \in \mathbb{R}^{(n+1) \times d}, \\
Q^{\ell} &= H^{\ell} W_{Q}^{\ell} \in \mathbb{R}^{(n+1) \times d}, \quad W_{Q}^{\ell} \in \mathbb{R}^{d \times d}, \\
K^{\ell} &= H^{\ell} W_{K}^{\ell} \in \mathbb{R}^{(n+1) \times d}, \quad W_{K}^{\ell} \in \mathbb{R}^{d \times d}, \\
V^{\ell} &= H^{\ell} W_{V}^{\ell} \in \mathbb{R}^{(n+1) \times d}, \quad W_{V}^{\ell} \in \mathbb{R}^{d \times d},
\end{aligned}
$$
where $z$ is a start token, initialized at random.
One question: how does $H^{\ell=0}$ turn into $H^{\ell}$, with the feature dimension going from $2$ to $d$?
Presumably there is a linear projection layer that maps $2$ to $d$; I will check how the code handles it later.
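A minimal single-head sketch of the encoder equations, under the assumption raised in the note above: a linear input projection (here called `W_in`, a hypothetical name) lifts the 2D coordinates to dimension d before the first attention layer. Residual connections, batch normalization, multiple heads and the feed-forward sublayer are left out, so this only illustrates the shapes and the softmax attention, not the authors' implementation.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def encoder_attention(H, W_Q, W_K, W_V):
    """One single-head attention layer: softmax(Q K^T / sqrt(d)) V."""
    d = len(W_Q[0])
    Q, K, V = matmul(H, W_Q), matmul(H, W_K), matmul(H, W_V)
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # (n+1) x (n+1): every node attends to every node
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

rng = random.Random(0)
n, d = 5, 8
X = [[rng.random(), rng.random()] for _ in range(n)]  # n random cities in the unit square
z = [rng.random(), rng.random()]                      # start token, random as in the text
H0 = [z] + X                                          # Concat(z, X): (n+1) x 2
W_in = random_matrix(2, d, rng)                       # assumed projection from 2 to d
H = matmul(H0, W_in)                                  # (n+1) x d
H = encoder_attention(
    H, random_matrix(d, d, rng), random_matrix(d, d, rng), random_matrix(d, d, rng)
)
```

The `scores` matrix has $(n+1)^2$ entries, which is the source of the $O(n^2)$ memory/speed complexity quoted for the encoder.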
The decoding is auto-regressive, one city at a time. Suppose we have decoded the first t
cities in the tour, and we want to predict the next city.
The decoding process is composed of 4 steps detailed below and illustrated in the figure below.
The decoding starts with the encoding of the previously selected city $i_t$:
$$
\begin{aligned}
h_{t}^{\mathrm{dec}} &= h_{i_{t}}^{\mathrm{enc}} + \mathrm{PE}_{t} \in \mathbb{R}^{d}, \\
h_{t=0}^{\mathrm{dec}} &= h_{\mathrm{start}}^{\mathrm{dec}} = z + \mathrm{PE}_{t=0} \in \mathbb{R}^{d},
\end{aligned}
$$
where $\mathrm{PE}_{t} \in \mathbb{R}^{d}$ is the traditional positional encoding in [37] to order the nodes in the tour:
$$
\mathrm{PE}_{t,i} = \begin{cases} \sin\left(2 \pi f_{i} t\right) & \text{if } i \text{ is even,} \\ \cos\left(2 \pi f_{i} t\right) & \text{if } i \text{ is odd,} \end{cases} \quad \text{with } f_{i} = \frac{10{,}000^{\frac{d}{\lfloor 2i \rfloor}}}{2 \pi}.
$$
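For reference, here is the sinusoidal positional encoding in the form commonly used with [37], where dimensions $2k$ and $2k+1$ share the angle $t / 10000^{2k/d}$. This is the standard formulation rather than a literal transcription of the frequency $f_i$ written above, so treat the exact exponent as an assumption; the function name is hypothetical.

```python
import math

def positional_encoding(t, d):
    """Sinusoidal encoding of decoding step t as a vector of length d."""
    pe = []
    for i in range(d):
        # Dimensions 2k and 2k+1 use the same frequency.
        angle = t / (10000 ** (2 * (i // 2) / d))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

pe_start = positional_encoding(0, 4)   # encoding added to the start token z
pe_third = positional_encoding(3, 8)   # encoding added to the city chosen at step 3
```

At $t = 0$ every sine term is 0 and every cosine term is 1, so the start token receives the same offset in every tour.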
This step prepares the query using self-attention over the partial tour.
Why "partial"?
The self-attention layer is standard and uses multi-head attention, residual connection, and layer normalization.
The memory/speed complexity is $O(t)$ at the decoding step $t$. The equations for this step are (when again considering a single head for an easier description):
$$
\begin{aligned}
\hat{h}_{t}^{\ell+1} &= \operatorname{softmax}\left(\frac{q^{\ell} {K^{\ell}}^{T}}{\sqrt{d}}\right) V^{\ell} \in \mathbb{R}^{d}, \quad \ell = 0, \ldots, L^{\mathrm{dec}}-1, \\
q^{\ell} &= \hat{h}_{t}^{\ell} \hat{W}_{q}^{\ell} \in \mathbb{R}^{d}, \quad \hat{W}_{q}^{\ell} \in \mathbb{R}^{d \times d}, \\
K^{\ell} &= \hat{H}_{1,t}^{\ell} \hat{W}_{K}^{\ell} \in \mathbb{R}^{t \times d}, \quad \hat{W}_{K}^{\ell} \in \mathbb{R}^{d \times d}, \\
V^{\ell} &= \hat{H}_{1,t}^{\ell} \hat{W}_{V}^{\ell} \in \mathbb{R}^{t \times d}, \quad \hat{W}_{V}^{\ell} \in \mathbb{R}^{d \times d}, \\
\hat{H}_{1,t}^{\ell} &= \left[\hat{h}_{1}^{\ell}, \ldots, \hat{h}_{t}^{\ell}\right], \quad \hat{h}_{t}^{\ell} = \begin{cases} h_{t}^{\mathrm{dec}} & \text{if } \ell = 0, \\ h_{t}^{\mathrm{q},\ell} & \text{if } \ell > 0. \end{cases}
\end{aligned}
$$
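The tour is "partial" because at step t only the t cities decoded so far exist as keys and values, so the single query for the current step attends over a list that grows by one entry per step. The sketch below shows just that mechanism; the projection matrices are taken to be the identity for brevity, which is a simplification, and the name `attend` is hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Single-query attention: softmax(q K^T / sqrt(d)) V, with cost O(len(keys))."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d)]

# Toy decoder inputs h_t^dec for three consecutive decoding steps.
decoder_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
partial_tour = []   # cache of the vectors decoded so far
outputs = []
for h_t in decoder_inputs:
    partial_tour.append(h_t)                                # the cache grows by one per step
    outputs.append(attend(h_t, partial_tour, partial_tour))
```

At the first step the cache holds a single vector, so the attention output is that vector itself; later steps mix all vectors decoded so far.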
This stage queries the next possible city among the non-visited cities using a query-attention layer.
Multi-head attention, residual connection, and layer normalization are used.
The memory/speed complexity is $O(n)$ at each recursive step.
$$
\begin{aligned}
h_{t}^{\mathrm{q},\ell+1} &= \operatorname{softmax}\left(\frac{q^{\ell} {K^{\ell}}^{T}}{\sqrt{d}} \odot \mathcal{M}_{t}\right) V^{\ell} \in \mathbb{R}^{d}, \quad \ell = 0, \ldots, L^{\mathrm{dec}}-1, \\
q^{\ell} &= \hat{h}_{t}^{\ell+1} \tilde{W}_{q}^{\ell} \in \mathbb{R}^{d}, \quad \tilde{W}_{q}^{\ell} \in \mathbb{R}^{d \times d}, \\
K^{\ell} &= H^{\mathrm{enc}} \tilde{W}_{K}^{\ell} \in \mathbb{R}^{t \times d}, \quad \tilde{W}_{K}^{\ell} \in \mathbb{R}^{d \times d}, \\
V^{\ell} &= H^{\mathrm{enc}} \tilde{W}_{V}^{\ell} \in \mathbb{R}^{t \times d}, \quad \tilde{W}_{V}^{\ell} \in \mathbb{R}^{d \times d},
\end{aligned}
$$
where $\mathcal{M}_{t}$ is the mask of the visited cities and $\odot$ is the Hadamard product (i.e. the element-wise product).
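One practical detail about the mask: multiplying a score by 0 before the softmax does not remove a city, because $e^{0} = 1$ still receives weight. A common way to implement masking, assumed here and not confirmed against the authors' code, is to set the scores of visited cities to $-\infty$ so that their softmax weight is exactly zero. The function name is hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, visited):
    """Attention weights with visited cities forced to zero.

    Assumes at least one city is still unvisited; otherwise every score
    would be -inf and the softmax would be undefined.
    """
    masked = [-math.inf if v else s for s, v in zip(scores, visited)]
    return softmax(masked)

scores = [2.0, 1.0, 0.5]
visited = [True, False, False]

# Literal Hadamard product with a 0/1 mask: the visited city keeps some weight.
hadamard = softmax([s * (0.0 if v else 1.0) for s, v in zip(scores, visited)])

# -inf masking: the visited city gets weight exactly 0.
masked = masked_attention_weights(scores, visited)
```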
This is the final step that performs a final query using a single-head attention to get a distribution over the non-visited cities.
Eventually, the next node $i_{t+1}$ is sampled from the distribution using Bernoulli during training and greedy (index with maximum probability) at inference time to evaluate the baseline.
The memory/speed complexity is $O(n)$.
The final equation is
$$
\begin{aligned}
p_{t}^{\mathrm{dec}} &= \operatorname{softmax}\left(C \cdot \tanh\left(\frac{q K^{T}}{\sqrt{d}} \odot \mathcal{M}_{t}\right)\right) \in \mathbb{R}^{n}, \\
q &= h_{t}^{\mathrm{q}} \bar{W}_{q} \in \mathbb{R}^{d}, \quad \bar{W}_{q} \in \mathbb{R}^{d \times d}, \\
K &= H^{\mathrm{enc}} \bar{W}_{K} \in \mathbb{R}^{n \times d}, \quad \bar{W}_{K} \in \mathbb{R}^{d \times d},
\end{aligned}
$$
where $C = 10$.
All of the above still needs to be read alongside the code. I do not quite see why the decoder is this elaborate.
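The final step can be sketched as follows: scaled dot-product scores are squashed by $C \cdot \tanh(\cdot)$ with $C = 10$, visited cities are excluded, and a softmax gives the distribution over the next city. In this sketch the mask is applied by setting logits to $-\infty$ after the tanh, and sampling uses `random.choices`; both are assumptions made for illustration, as are the function name and the toy vectors.

```python
import math
import random

def next_city_distribution(q, K, visited, C=10.0):
    """softmax(C * tanh(q K^T / sqrt(d))) restricted to non-visited cities."""
    d = len(q)
    logits = []
    for k, seen in zip(K, visited):
        if seen:
            logits.append(-math.inf)  # a visited city can never be chosen again
        else:
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            logits.append(C * math.tanh(score))  # logits are clipped to [-C, C]
    m = max(logits)  # finite as long as one city is unvisited
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 0.0]                                # toy query vector
K = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]      # toy keys, one per city
visited = [False, True, False]

p = next_city_distribution(q, K, visited)
greedy_city = max(range(len(p)), key=p.__getitem__)                    # inference / baseline
sampled_city = random.Random(0).choices(range(len(p)), weights=p)[0]   # training
```

The tanh clipping bounds every logit, which limits how peaked the distribution can become and keeps some exploration during training.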
Comparing Transformers for NLP (translation) vs. TSP (combinatorial optimization), the order of
the input sequence is irrelevant for TSP but the order of the output sequence is coded with PEs for
both TSP and NLP.
(I do not fully understand this point.)
TSP-Encoder benefits from Batch Normalization as we consider all cities during the encoding stage.
Does this mean that with BN, all cities are seen at once?
TSP-Decoder works better with Layer Normalization since one vector is decoded at a time (auto-regressive decoding as in NLP).
The TSP Transformer is learned by Reinforcement Learning, hence no TSP solutions/approximations required.
Both transformers for NLP and TSP have quadratic complexity $O(n^2 L)$.
Comparing with the closest neural network models of [25] and [11], we use the same transformer encoder (with BN) but our decoding architecture is different. We construct the query using all cities in the partial tour with a self-attention module.
[25] use the first and last cities with a global representation of all cities as the query for the next city.
[11] define the query with the last three cities in the partial tour.
Besides, our decoding process starts differently. We add a token city $z \in \mathbb{R}$.
This city does not exist and aims at starting the decoding at the best possible location by querying all cities with a self-attention module.
[25] starts the decoding with the mean representation of the encoding cities and a random token of the first and current cities.
[11] starts the decoding with a random token of the last three cities.
$$
\max_{\operatorname{seq}_{n} = \{i_{1}, \ldots, i_{n}\}} P^{\mathrm{TSP}}\left(\operatorname{seq}_{n} \mid X\right) = P^{\mathrm{TSP}}\left(i_{1}, \ldots, i_{n} \mid X\right)
$$
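Since the tour probability factorizes step by step (chain rule), this maximization can be approximated with beam search, the decoding strategy mentioned at the top of these notes. Below is a generic sketch: `next_probs` stands in for the decoder and is hypothetical, as is the toy uniform policy used to exercise it. It is not the authors' decoding code.

```python
import math

def beam_search(next_probs, n, beam_width):
    """Keep the beam_width most probable partial tours at every decoding step.

    next_probs(partial_tour) must return n probabilities, with 0 for visited cities.
    Returns a list of (log-probability, tour) pairs, best first.
    """
    beams = [(0.0, [])]
    for _ in range(n):
        candidates = []
        for log_p, tour in beams:
            for city, p in enumerate(next_probs(tour)):
                if p > 0.0:
                    candidates.append((log_p + math.log(p), tour + [city]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

def uniform_over_unvisited(tour, n=4):
    """Toy policy: every unvisited city is equally likely."""
    unvisited = [c for c in range(n) if c not in tour]
    return [1.0 / len(unvisited) if c in unvisited else 0.0 for c in range(n)]

beams = beam_search(uniform_over_unvisited, n=4, beam_width=3)
```

With beam width 1 this reduces to greedy decoding; wider beams trade extra computation for a better chance of finding a high-probability tour.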