[Paper Notes] The Transformer Network for the Traveling Salesman Problem

The Transformer Network for the Traveling Salesman Problem

https://arxiv.org/pdf/2103.03012.pdf

Bresson X, Laurent T. The transformer network for the traveling salesman problem[J]. arXiv preprint arXiv:2103.03012, 2021.
@article{bresson2021transformer,
  title={The transformer network for the traveling salesman problem},
  author={Bresson, Xavier and Laurent, Thomas},
  journal={arXiv preprint arXiv:2103.03012},
  year={2021}
}

1. From the abstract:

TSP overview:
The Traveling Salesman Problem (TSP) is the most popular and most studied combinatorial problem, starting with von Neumann in 1951. It has driven the discovery of several optimization techniques such as cutting planes, branch-and-bound, local search, Lagrangian relaxation, and simulated annealing.

Core question of the paper:
The main question is whether deep learning can learn better heuristics from data, i.e. replacing human-engineered heuristics?

Proposed approach:
In this work, we propose to adapt the recent successful Transformer architecture originally developed for natural language processing to the combinatorial TSP. Training is done by reinforcement learning, hence without TSP training solutions, and decoding uses beam search.

  • beam search (I don't understand this yet)
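Since beam search is still unclear at this point in the notes, here is a minimal sketch of the idea: instead of committing to a single next city at each step, keep the `beam_width` most probable partial tours and extend all of them. The function `toy_log_probs` is a hypothetical stand-in for the trained network (it simply prefers nearby cities); it is not the paper's model.

```python
import math

def tour_length(points, tour):
    """Length of the closed tour that visits points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[k]], points[tour[(k + 1) % n]]) for k in range(n))

def toy_log_probs(points, partial, unvisited):
    """Hypothetical policy: softmax over negative distance from the last city."""
    last = points[partial[-1]]
    scores = {j: -math.dist(last, points[j]) for j in unvisited}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {j: s - log_z for j, s in scores.items()}

def beam_search(points, beam_width=3, start=0):
    n = len(points)
    beams = [(0.0, [start])]  # (cumulative log-probability, partial tour)
    for _ in range(n - 1):
        candidates = []
        for logp, partial in beams:
            unvisited = [j for j in range(n) if j not in partial]
            for j, lp in toy_log_probs(points, partial, unvisited).items():
                candidates.append((logp + lp, partial + [j]))
        # Keep only the beam_width most probable partial tours.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # Among the surviving complete tours, return the shortest one.
    return min((tour for _, tour in beams), key=lambda t: tour_length(points, t))

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
best = beam_search(square, beam_width=3)
```

With `beam_width=1` this reduces to greedy decoding; larger widths explore more candidate tours at a proportionally higher cost.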

2. From the introduction [Traditional TSP Solvers]

Two approaches to solving combinatorial optimization problems (COPs):

There exist two traditional approaches to tackle combinatorial problems;
exact algorithms and approximate/heuristic algorithms.

Exact:

Exact algorithms are guaranteed to find optimal solutions, but they become intractable when n grows.

Approximate algorithms trade optimality for computational efficiency:

Approximate algorithms trade optimality for computational efficiency. They are problem-specific, often designed by iteratively applying a simple hand-crafted rule, known as a heuristic. Their complexity is polynomial and their quality depends on an approximation ratio that characterizes the worst/average-case error w.r.t. the optimal solution.

Examples of exact algorithms:
Exact algorithms for TSP are given by exhaustive search, Dynamic or Integer Programming.

  • A Dynamic Programming algorithm was proposed for TSP in [16] with $O(n^2 2^n)$ complexity, which becomes intractable for n > 40.

  • A general purpose Integer Programming (IP) solver with Cutting Planes (CP) and Branch-and-Bound (BB) called Gurobi was introduced in [15].

  • Finally, a highly specialized linear IP+CP+BB, namely Concorde, was designed in [2].
    Concorde is widely regarded as the fastest exact TSP solver, for large instances, currently in existence.
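To make the $O(n^2 2^n)$ dynamic programming cost concrete, here is a minimal sketch of the classic subset recursion (commonly known as Held-Karp, which I assume is the algorithm cited as [16]). The table is indexed by (subset of cities, last city), which is where the $2^n \cdot n$ states come from; each state is filled by a loop over up to $n$ predecessors.

```python
import math
from itertools import combinations

def held_karp(points):
    """Optimal closed-tour length by dynamic programming over subsets."""
    n = len(points)
    if n < 2:
        return 0.0
    dist = [[math.dist(p, q) for q in points] for p in points]
    # best[(mask, j)]: shortest path that starts at city 0, visits exactly
    # the cities in the bitmask `mask`, and ends at city j.
    best = {(1 << j, j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev = mask ^ (1 << j)  # same subset without city j
                best[(mask, j)] = min(
                    best[(prev, k)] + dist[k][j] for k in subset if k != j
                )
    full = (1 << n) - 2  # every city except city 0
    return min(best[(full, j)] + dist[j][0] for j in range(1, n))

print(held_karp([(0, 0), (1, 0), (1, 1), (0, 1)]))
```

The dictionary grows exponentially with the number of cities, so this is only usable for small instances, consistent with the intractability noted above.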

Approximate/heuristic methods:

Several approximate/heuristic algorithms have been introduced.

Christofides algorithm [7] approximates TSP with Minimum Spanning Trees.
The algorithm has a polynomial-time complexity with $O(n^2 \log n)$, and is guaranteed to find a solution within a factor 3/2 of the optimal solution.

Farthest/nearest/greedy insertion algorithms [20] have complexity $O(n^2)$, and farthest insertion (the best insertion in practice) has an approximation ratio of 2.43.

Google OR-Tools [14] is a highly optimized program that solves TSP and a larger set of vehicle routing problems. This program applies different heuristics such as Simulated Annealing, Greedy Descent, and Tabu Search to navigate the search space, and refines the solution by Local Search techniques.


OR-Tools guide: https://developers.google.com/optimization/introduction/overview

2-Opt algorithm [27, 21] proposes a heuristic based on a move that replaces two edges to reduce the tour length. The complexity is $O(n^2 m(n))$, where $n^2$ is the number of node pairs and $m(n)$ is the number of times all pairs must be tested to reach a local minimum (with worst-case being $O(2^{n/2})$). The approximation ratio is $4/\sqrt{n}$. Extension to 3-Opt move (replacing 3 edges) and more have been proposed in [6].
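The 2-Opt move is easy to show in code. Below is a minimal sketch, not a tuned implementation: for every pair of non-adjacent edges it checks whether reconnecting them shortens the tour, and if so reverses the segment in between. The outer `while` loop corresponds to the $m(n)$ passes over all node pairs.

```python
import math

def tour_length(points, tour):
    """Length of the closed tour that visits points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[k]], points[tour[(k + 1) % n]]) for k in range(n))

def two_opt(points, tour):
    """Apply improving 2-Opt moves until a local minimum is reached."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:  # one iteration = one pass over all edge pairs
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share city tour[0]
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Replace edges (a,b) and (c,d) by (a,c) and (b,d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
crossing = [0, 2, 1, 3]  # a tour whose two diagonals cross
fixed = two_opt(square, crossing)
```

On the unit square the crossing tour has length $2 + 2\sqrt{2}$, and a single move uncrosses it into the perimeter tour of length 4.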

Finally, LKH-3 algorithm [18] introduces the best heuristic for solving TSP. It is an extension of the original LKH [28] and LKH-2 [17] based on 2-Opt/3-Opt where edge candidates are estimated with a Minimum Spanning Tree [17]. LKH-3 can tackle various TSP-type problems.

3. From the introduction [Neural Network Solvers]

Learned features replace hand-crafted features

In the last decade, Deep learning (DL) has significantly improved Computer Vision, Natural Language Processing and Speech Recognition by replacing hand-crafted visual/text/speech features by features learned from data [26].

[26] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. nature, 2015, 521(7553): 436-444.
@article{lecun2015deep,
  title={Deep learning},
  author={LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey},
  journal={nature},
  volume={521},
  number={7553},
  pages={436--444},
  year={2015},
  publisher={Nature Publishing Group}
}

Here the paper expands on what the abstract said:

Core question: For combinatorial problems, the main question is whether DL can learn better heuristics from data than hand-crafted heuristics?

This is attractive because developing algorithms to efficiently tackle NP-hard problems requires years of research (TSP has been actively studied for seventy years).

This sentence is nicely idiomatic:

The last five years have seen the emergence of promising techniques
where (graph) neural networks have been capable to learn new combinatorial algorithms with supervised or reinforcement learning.

A summary of the work from recent years:
We briefly summarize this line of work below.

Skipping this for now; let me look at the paper's method first.

  • HopfieldNets [19]: First Neural Network designed to solve (small) TSPs.
  • PointerNets [39]: A pioneer work using modern DL to tackle TSP and combinatorial optimization
    problems. This work combines recurrent networks to encode the cities and decode the sequence of nodes in the tour, with the attention mechanism. The network structure is similar to [3], which was applied to NLP with great success. The decoding is auto-regressive and the network parameters are learned by supervised learning with approximate TSP solutions.
  • PointerNets+RL [5]: The authors improve [39] with Reinforcement Learning (RL) which eliminates the requirement of generating TSP solutions as supervised training data. The tour length is used as reward. Two RL approaches are studied; a standard unbiased reinforce algorithm [40], and an active search algorithm that can explore more candidates.
  • Order-invariant PointerNets+RL [33]: The original network [39] is not invariant by permutations
    of the order of the input cities (which is important for NLP but not for TSP). This requires [39] to
    randomly permute the input order to let the network learn this invariance. The work [33] solves this issue by making the encoder permutation-invariant.
  • S2V-DQN [9]: This model is a graph network that takes a graph and a partial tour as input, and
    outputs a state-valued function Q to estimate the next node in the tour. Training is done by RL
    and memory replay [31], which allows intermediate rewards that encourage farthest node insertion heuristic.
  • Quadratic Assignment Problem [34]: TSP can be formulated as a QAP, which is NP-hard and also hard to approximate. A graph network based on the powers of adjacency matrix of node distances is trained in supervised manner. The loss is the KL distance between the adjacency matrix of the ground truth cycle and its network prediction. A feasible tour is computed with beam search.
  • Permutation-invariant Pooling Network [23]: This work solves a variant of TSP with multiple
    salesmen. The network is trained by supervised learning and outputs a fractional solution, which is transformed into a feasible integer solution by beam search. The approach is non-autoregressive, i.e. single pass.
  • Transformer-encoder+2-Opt heuristic [11]: The authors use a standard transformer to encode the
    cities and they decode sequentially with a query composed of the last three cities in the partial
    tour. The network is trained with Actor-Critic RL, and the solution is refined with a standard 2-Opt
    heuristic.
  • Transformer-encoder+Attention-decoder [25]: This work also uses a standard transformer to encode the cities and the decoding is sequential with a query composed of the first city, the last city in the partial tour and a global representation of all cities. Training is carried out with reinforce and a deterministic baseline.
  • GraphConvNet [22]: This work learns a deep graph network by supervision to predict the probabilities of an edge to be in the TSP tour. A feasible tour is generated by beam search. The approach uses a single pass.
  • 2-Opt Learning [41]: The authors design a transformer-based network to learn to select nodes
    for the 2-Opt heuristics (original 2-Opt may require $O(2^{n/2})$ moves before stopping). Learning is
    performed by RL and actor-critic.
  • GNNs with Monte Carlo Tree Search [42]: A recent work based on AlphaGo [35] which augments a graph network with MCTS to improve the search exploration of tours by evaluating multiple next node candidates in the tour. This improves the search exploration of auto-regressive methods, which cannot go back once the selection of the nodes is made.

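4. Method architecture [Proposed Architecture]

1. Method overview

TSP is treated as a translation problem: the source language is a set of 2D points and the target language is a shortest tour (a sequence of indices). The paper adapts the original Transformer to solve TSP.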

We cast TSP as a “translation” problem where the source “language” is a set of 2D points and
the target “language” is a tour (sequence of indices) with minimal length, and adapt the original
Transformers [37] to solve this problem.

Training uses RL, and the reward is the tour length.
If the trained network improves on the baseline over a set of random TSPs, the baseline is updated accordingly.

We train by reinforcement learning, with the same setting as [25]. The reward is the tour length and the baseline is simply updated if the train network improves the baseline on a set of random TSPs.
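To see how the reward and the baseline interact, here is a minimal sketch under my own assumptions (the names `reinforce_loss` and `maybe_update_baseline` are hypothetical, and I assume the usual REINFORCE-with-baseline form of the loss used in this line of work): the loss weights the log-probability of each sampled tour by how much longer it is than the baseline's tour, and the baseline parameters are overwritten only when the train network gives shorter tours on average.

```python
def reinforce_loss(train_lengths, baseline_lengths, log_probs):
    """Mean of (L_train - L_baseline) * log p(tour) over a batch.

    Minimizing this raises the probability of tours that beat the baseline
    and lowers the probability of tours that are worse than it.
    """
    terms = [
        (l_train - l_base) * lp
        for l_train, l_base, lp in zip(train_lengths, baseline_lengths, log_probs)
    ]
    return sum(terms) / len(terms)

def maybe_update_baseline(train_params, baseline_params, train_eval, baseline_eval):
    """Copy the train parameters into the baseline if they give shorter tours.

    train_eval / baseline_eval: tour lengths of both networks on the same
    set of random TSP instances (assumed to be decoded greedily).
    """
    mean_train = sum(train_eval) / len(train_eval)
    mean_base = sum(baseline_eval) / len(baseline_eval)
    if mean_train < mean_base:
        return dict(train_params), True
    return baseline_params, False

loss = reinforce_loss([4.0, 5.0], [4.5, 4.5], [-1.0, -2.0])
params, updated = maybe_update_baseline({"w": 1.0}, {"w": 0.0}, [4.0, 4.2], [4.5, 4.4])
```

In a real training loop the log-probabilities would come from the network and the loss would be differentiated by an autograd framework; plain floats are used here only to show the arithmetic.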

The overall architecture is shown below:
[Figure 1: overall architecture]

2. Encoder.

The paper uses a standard Transformer encoder, except that it uses batch normalization (BN) instead of layer normalization (LN).

It is a standard Transformer encoder with multi-head attention and residual connection. The
only difference is the use of batch normalization, instead of layer normalization. The memory/speed
complexity is $O(n^2)$.

Formally, the encoder equations are (when considering a single head for an easier description)

$$H^{\mathrm{enc}} = H^{\ell=L^{\mathrm{enc}}} \in \mathbb{R}^{(n+1) \times d},$$

where,
$$\begin{aligned}
H^{\ell=0} &= \operatorname{Concat}(z, X) \in \mathbb{R}^{(n+1) \times 2}, \quad z \in \mathbb{R}^{2}, \quad X \in \mathbb{R}^{n \times 2}, \\
H^{\ell+1} &= \operatorname{softmax}\left(\frac{Q^{\ell} {K^{\ell}}^{T}}{\sqrt{d}}\right) V^{\ell} \in \mathbb{R}^{(n+1) \times d}, \\
Q^{\ell} &= H^{\ell} W_{Q}^{\ell} \in \mathbb{R}^{(n+1) \times d}, \quad W_{Q}^{\ell} \in \mathbb{R}^{d \times d}, \\
K^{\ell} &= H^{\ell} W_{K}^{\ell} \in \mathbb{R}^{(n+1) \times d}, \quad W_{K}^{\ell} \in \mathbb{R}^{d \times d}, \\
V^{\ell} &= H^{\ell} W_{V}^{\ell} \in \mathbb{R}^{(n+1) \times d}, \quad W_{V}^{\ell} \in \mathbb{R}^{d \times d},
\end{aligned}$$

where $z$ is a start token, initialized at random.

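One question: how does $H^{\ell=0}$ become $H^{\ell}$, i.e. how does the feature dimension go from $2$ to $d$?
Presumably there is a linear projection layer that maps $2$ to $d$; I'll check how the code handles this later.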
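To check the shapes, here is a minimal single-head sketch of the encoder equations in plain Python. The input projection `W_in` from 2 to `d` is my assumption, matching the guess above and not something the equations state. Residual connections, batch normalization and multiple heads are left out, and all weights are random.

```python
import math
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)] for _ in range(rows)]

def attention_layer(H, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d)) V for a single head."""
    d = len(W_q[0])
    Q, K, V = matmul(H, W_q), matmul(H, W_k), matmul(H, W_v)
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d)])
    return out

rng = random.Random(0)
n, d = 5, 8
X = [[rng.random(), rng.random()] for _ in range(n)]  # n cities in the unit square
z = [rng.random(), rng.random()]                      # start token in R^2
H0 = [z] + X                                          # Concat(z, X): (n+1) x 2
W_in = random_matrix(2, d, rng)                       # assumed projection 2 -> d
H = matmul(H0, W_in)                                  # (n+1) x d
H1 = attention_layer(H, random_matrix(d, d, rng), random_matrix(d, d, rng), random_matrix(d, d, rng))
print(len(H1), len(H1[0]))
```

Every row attends to all $n+1$ rows, which is where the $O(n^2)$ memory/speed cost of the encoder comes from.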

The figure below illustrates the encoder:
[Figure 2: illustration of the encoder]

3. Decoder.

The decoding is auto-regressive, one city at a time. Suppose we have decoded the first t
cities in the tour, and we want to predict the next city.

The decoding process is composed of 4 steps detailed below and illustrated in the figure below.
[Figure 3: the four decoding steps]

Decoder – Part 1

The decoding starts with the encoding of the previously selected city $i_t$:

$$\begin{aligned}
h_{t}^{\mathrm{dec}} &= h_{i_{t}}^{\mathrm{enc}} + \mathrm{PE}_{t} \in \mathbb{R}^{d}, \\
h_{t=0}^{\mathrm{dec}} &= h_{\mathrm{start}}^{\mathrm{dec}} = z + \mathrm{PE}_{t=0} \in \mathbb{R}^{d},
\end{aligned}$$

where $\mathrm{PE}_{t} \in \mathbb{R}^{d}$ is the traditional positional encoding in [37] to order the nodes in the tour:

$$\mathrm{PE}_{t,i} = \begin{cases} \sin(2\pi f_{i} t) & \text{if } i \text{ is even,} \\ \cos(2\pi f_{i} t) & \text{if } i \text{ is odd,} \end{cases} \quad \text{with } f_{i} = \frac{10{,}000^{\frac{d}{\lfloor 2i \rfloor}}}{2\pi}$$
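For reference, here is a small sketch of the sinusoidal positional encoding in its standard form from the original Transformer [37]: sine on even coordinates, cosine on odd coordinates, with wavelengths growing geometrically with the coordinate index. This is the textbook formula and may differ in detail from the frequency $f_i$ as transcribed above.

```python
import math

def positional_encoding(t, d):
    """Standard sinusoidal encoding of position t as a vector of size d."""
    pe = []
    for i in range(d):
        # Coordinates 2k and 2k+1 share the same frequency.
        angle = t / (10000 ** ((2 * (i // 2)) / d))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # position 0 gives sin(0) and cos(0) alternating
```

In the decoder this vector is added to the encoding of the city selected at step $t$, so that the same city would get a different representation depending on where it appears in the tour.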

Decoder – Part 2.

This step prepares the query using self-attention over the partial tour.
Why the partial tour?

The self-attention layer is standard and uses multi-head attention, residual connection, and layer normalization.

The memory/speed complexity is $O(t)$ at the decoding step $t$. The equations for this step are (when again considering a single head for an easier description):

$$\begin{aligned}
\hat{h}_{t}^{\ell+1} &= \operatorname{softmax}\left(\frac{q^{\ell} {K^{\ell}}^{T}}{\sqrt{d}}\right) V^{\ell} \in \mathbb{R}^{d}, \quad \ell = 0, \ldots, L^{\mathrm{dec}}-1, \\
q^{\ell} &= \hat{h}_{t}^{\ell} \hat{W}_{q}^{\ell} \in \mathbb{R}^{d}, \quad \hat{W}_{q}^{\ell} \in \mathbb{R}^{d \times d}, \\
K^{\ell} &= \hat{H}_{1,t}^{\ell} \hat{W}_{K}^{\ell} \in \mathbb{R}^{t \times d}, \quad \hat{W}_{K}^{\ell} \in \mathbb{R}^{d \times d}, \\
V^{\ell} &= \hat{H}_{1,t}^{\ell} \hat{W}_{V}^{\ell} \in \mathbb{R}^{t \times d}, \quad \hat{W}_{V}^{\ell} \in \mathbb{R}^{d \times d}, \\
\hat{H}_{1,t}^{\ell} &= \left[\hat{h}_{1}^{\ell}, \ldots, \hat{h}_{t}^{\ell}\right], \quad \hat{h}_{t}^{\ell} = \begin{cases} h_{t}^{\mathrm{dec}} & \text{if } \ell = 0, \\ h_{t}^{\mathrm{q},\ell} & \text{if } \ell > 0. \end{cases}
\end{aligned}$$

Decoder – Part 3.

This stage queries the next possible city among the non-visited cities using a query-attention layer.
Multi-head attention, residual connection, and layer normalization are used.
The memory/speed complexity is $O(n)$ at each recursive step.

$$\begin{aligned}
h_{t}^{\mathrm{q},\ell+1} &= \operatorname{softmax}\left(\frac{q^{\ell} {K^{\ell}}^{T}}{\sqrt{d}} \odot \mathcal{M}_{t}\right) V^{\ell} \in \mathbb{R}^{d}, \quad \ell = 0, \ldots, L^{\mathrm{dec}}-1, \\
q^{\ell} &= \hat{h}_{t}^{\ell+1} \tilde{W}_{q}^{\ell} \in \mathbb{R}^{d}, \quad \tilde{W}_{q}^{\ell} \in \mathbb{R}^{d \times d}, \\
K^{\ell} &= H^{\mathrm{enc}} \tilde{W}_{K}^{\ell} \in \mathbb{R}^{t \times d}, \quad \tilde{W}_{K}^{\ell} \in \mathbb{R}^{d \times d}, \\
V^{\ell} &= H^{\mathrm{enc}} \tilde{W}_{V}^{\ell} \in \mathbb{R}^{t \times d}, \quad \tilde{W}_{V}^{\ell} \in \mathbb{R}^{d \times d},
\end{aligned}$$

where $\mathcal{M}_{t}$ is the mask of the visited cities and $\odot$ is the Hadamard product (i.e. the element-wise product).

Decoder – Part 4.

This is the final step that performs a final query using a single-head attention to get a distribution over the non-visited cities.

Eventually, the next node $i_{t+1}$ is sampled from the distribution using Bernoulli during training and greedy (index with maximum probability) at inference time to evaluate the baseline.

The memory/speed complexity is $O(n)$.

The final equation is

$$\begin{aligned}
p_{t}^{\mathrm{dec}} &= \operatorname{softmax}\left(C \cdot \tanh\left(\frac{q K^{T}}{\sqrt{d}} \odot \mathcal{M}_{t}\right)\right) \in \mathbb{R}^{n}, \\
q &= h_{t}^{\mathrm{q}} \bar{W}_{q} \in \mathbb{R}^{d}, \quad \bar{W}_{q} \in \mathbb{R}^{d \times d}, \\
K &= H^{\mathrm{enc}} \bar{W}_{K} \in \mathbb{R}^{n \times d}, \quad \bar{W}_{K}^{\ell} \in \mathbb{R}^{d \times d},
\end{aligned}$$

where $C = 10$.
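The final step can be sketched in a few lines. This is an illustration under my own assumptions, not the paper's code: I assume the mask works by setting the scores of visited cities to $-\infty$ so that they receive probability zero (the equation writes the mask as a Hadamard product), and the function names are hypothetical. The $\tanh$ squashes scores into $[-1, 1]$ and the constant $C$ rescales them to $[-C, C]$ before the softmax.

```python
import math
import random

def next_city_distribution(q, K, visited, C=10.0):
    """softmax(C * tanh(q K^T / sqrt(d))) restricted to non-visited cities.

    Assumes at least one city has not been visited yet.
    """
    d = len(q)
    logits = []
    for j, k in enumerate(K):
        if j in visited:
            logits.append(-math.inf)  # assumed masking: zero probability
        else:
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            logits.append(C * math.tanh(score))
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_city(probs, greedy, rng=random):
    """Greedy: index of maximum probability. Otherwise: sample from probs."""
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
probs = next_city_distribution(q, K, visited={0})
greedy_choice = pick_next_city(probs, greedy=True)
sampled_choice = pick_next_city(probs, greedy=False, rng=random.Random(0))
```

Sampling keeps exploration alive during training, while the greedy choice gives a deterministic tour that can be used to evaluate the baseline.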

All of the above still needs to be read alongside the code... I don't quite see why the decoder has to be this elaborate.

5. Architecture comparison [Architecture Comparison]

Comparing Transformers for NLP (translation) vs. TSP (combinatorial optimization), the order of
the input sequence is irrelevant for TSP but the order of the output sequence is coded with PEs for
both TSP and NLP.
(I don't quite understand this.)

TSP-Encoder benefits from Batch Normalization as we consider all cities during the encoding stage.

Does this mean that, with BN, all cities are seen at once?

TSP-Decoder works better with Layer Normalization since one vector is decoded at a time (auto-regressive decoding as in NLP).

The TSP Transformer is learned by Reinforcement Learning, hence no TSP solutions/approximations are required.


Both transformers for NLP and TSP have quadratic complexity $O(n^2 L)$.


Comparing with the closest neural network models of [25] and [11], we use the same transformer encoder (with BN) but our decoding architecture is different. We construct the query using all cities in the partial tour with a self-attention module.

[25] use the first and last cities with a global representation of all cities as the query for the next city.
[11] define the query with the last three cities in the partial tour.


Besides, our decoding process starts differently. We add a token city $z \in \mathbb{R}$.
This city does not exist and aims at starting the decoding at the best possible location by querying all cities with a self-attention module.

[25] starts the decoding with the mean representation of the encoding cities and a random token of the first and current cities.
[11] starts the decoding with a random token of the last three cities.

$$\max_{\operatorname{seq}_{n}=\{i_{1}, \ldots, i_{n}\}} P^{\mathrm{TSP}}(\operatorname{seq}_{n} \mid X) = P^{\mathrm{TSP}}(i_{1}, \ldots, i_{n} \mid X)$$
