Pointer Network (Ptr-Net) uses attention as a pointer to select a member of the input sequence as the output. The authors show that Ptr-Net can solve three challenging geometric problems: finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem.
References:
* Pointer Networks
* Pointer Networks简介及其应用 (An Introduction to Pointer Networks and Their Applications)
* TSP问题从DP算法到深度学习3:Pointer Network (TSP from DP to Deep Learning 3: Pointer Network)
* 2022最新版-李宏毅机器学习深度学习课程 (Hung-yi Lee's 2022 Machine Learning and Deep Learning course)
Intuitively, when using a sequence-to-sequence model such as an RNN to solve combinatorial optimization problems, we must address two issues: the variable input size, which standard sequence-to-sequence encoders already handle, and the variable solution size, which motivates the development of Ptr-Net.

To solve the problem of finding planar convex hulls with an ordinary sequence-to-sequence RNN, we must fix the output dimension in advance (Fig 1.a). This means that if the size of the solution exceeds this fixed dimension, the model cannot produce the correct answer even in principle.
To remove this limit on the solution dimension, Ptr-Net is built as a new network structure on top of the attention-based model. First, the problem is fed into Ptr-Net to obtain the starting point of the solution. Then, that starting point, together with the problem, is fed back as the new input to Ptr-Net to generate the next point of the answer, which in turn serves as the next decoder input. This process is repeated until the end signal appears (Fig 1.b).
At each step, a parametric model is used to estimate the terms of the probability chain rule, i.e.
$$\begin{equation} p\left(\mathcal{C}^{\mathcal{P}} \mid \mathcal{P} ; \theta\right)=\prod_{i=1}^{m(\mathcal{P})} p_\theta\left(C_i \mid C_1, \ldots, C_{i-1}, \mathcal{P}\right) \end{equation}$$
Here $(\mathcal{P}, \mathcal{C}^{\mathcal{P}})$ is a training pair, $\mathcal{P}=\{P_1, \ldots, P_n\}$ is a sequence of $n$ vectors, and $\mathcal{C}^{\mathcal{P}}=\{C_1, \ldots, C_{m(\mathcal{P})}\}$ is a sequence of $m(\mathcal{P})$ indices, each between 1 and $n$.
The parameters of the model are learnt by maximizing the conditional probabilities for the training set, i.e.
$$\begin{equation} \theta^*=\underset{\theta}{\arg \max } \sum_{\mathcal{P}, \mathcal{C}^{\mathcal{P}}} \log p\left(\mathcal{C}^{\mathcal{P}} \mid \mathcal{P} ; \theta\right), \end{equation}$$
where the sum is over training examples.
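As a concrete sketch of what this objective looks like in practice (an illustration, not the authors' code), the log-likelihood reduces to a per-step negative log-likelihood over the target index sequence; the names `log_probs` and `targets` and the toy shapes below are assumptions:

```python
import torch

# Toy stand-ins: in practice these come from the Ptr-Net decoder.
m, n = 4, 7                                               # m(P) output steps, n input points
log_probs = torch.log_softmax(torch.randn(m, n), dim=-1)  # log p_theta(C_i | C_1..C_{i-1}, P) per step
targets = torch.randint(0, n, (m,))                       # ground-truth indices C_1..C_m(P)

# Negative log-likelihood of the whole pointer sequence; summing this over training
# pairs and minimizing it matches the arg-max of the summed log-probabilities above.
nll = -log_probs[torch.arange(m), targets].sum()
```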
We compute the attention vector at each output time step $i$ as follows:
$$\begin{align} u_j^i&=v^T \tanh \left(W_1 e_j+W_2 d_i\right) \quad j \in(1, \ldots, n) \\ p\left(C_i \mid C_1, \ldots, C_{i-1}, \mathcal{P}\right)&=\operatorname{softmax}\left(u^i\right) \end{align}$$
where $e_j$ is the $j$-th encoder state (input data), $d_i$ is the $i$-th decoder state (output data), the softmax normalizes the vector $u^i$ into an output distribution over the $n$ input positions, and $v$, $W_1$, $W_2$ are learnable parameters of the model.
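A minimal PyTorch sketch of this attention step is given below; the module name `PointerAttention`, the `hidden_dim` size, and the tensor shapes are assumptions made for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """Additive attention used as a pointer over the n input positions
    (a sketch of the equations above, not the authors' code)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to encoder states e_j
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to decoder state d_i
        self.v = nn.Linear(hidden_dim, 1, bias=False)             # the vector v^T

    def forward(self, encoder_states, decoder_state):
        # encoder_states: (n, hidden_dim); decoder_state: (hidden_dim,)
        u = self.v(torch.tanh(self.W1(encoder_states) + self.W2(decoder_state))).squeeze(-1)  # u^i_j
        return torch.softmax(u, dim=-1)  # p(C_i | C_1..C_{i-1}, P): a distribution over the inputs

# Example with random states (n = 5 input points, untrained weights):
attn = PointerAttention(hidden_dim=128)
e = torch.randn(5, 128)   # encoder states e_1..e_5
d = torch.randn(128)      # current decoder state d_i
p = attn(e, d)            # sums to 1 over the 5 input positions
```

Note that, unlike standard attention, the normalized scores are not used to blend the encoder states; the distribution itself is the output, pointing at one input position.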

Then, the point with the highest probability is selected, point 1 in the figure above.

Next, the data of point 1, $(x_1, y_1)$, is fed into the network as the next decoder input to obtain the next distribution. This continues until we receive the end signal, written as $(x_1, y_0)$ in this example.
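Putting the pieces together, a hedged sketch of this greedy decoding loop might look as follows. The encoder/decoder cells, the hidden size, treating input position 0 as the end-of-sequence token, and every name here are illustrative assumptions rather than the paper's exact setup:

```python
import torch
import torch.nn as nn

def greedy_decode(points, hidden_dim=128, max_steps=10):
    """Greedy Ptr-Net decoding sketch.
    points: (n, 2) tensor of planar coordinates; position 0 is treated as the
    end-of-sequence token (an assumption made for this illustration)."""
    embed = nn.Linear(2, hidden_dim)                  # projects (x, y) into the hidden space
    encoder = nn.LSTM(hidden_dim, hidden_dim)         # encoder RNN
    decoder = nn.LSTMCell(hidden_dim, hidden_dim)     # decoder RNN cell
    W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
    W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
    v = nn.Linear(hidden_dim, 1, bias=False)

    enc_out, (h, c) = encoder(embed(points).unsqueeze(1))  # enc_out: (n, 1, hidden)
    e = enc_out.squeeze(1)                                  # encoder states e_1..e_n
    hx, cx = h.squeeze(0), c.squeeze(0)                     # decoder initial state, (1, hidden)
    inp = embed(points[0]).unsqueeze(0)                     # start from the end-token input

    solution = []
    for _ in range(max_steps):
        hx, cx = decoder(inp, (hx, cx))                     # new decoder state d_i
        u = v(torch.tanh(W1(e) + W2(hx))).squeeze(-1)       # scores u^i_j over all inputs
        probs = torch.softmax(u, dim=-1)
        j = int(probs.argmax())                             # pick the most likely point
        if j == 0:                                          # end signal: stop decoding
            break
        solution.append(j)
        inp = embed(points[j]).unsqueeze(0)                 # feed the chosen point back in
    return solution

# Example: decode a random 6-point instance (untrained weights, so the output is arbitrary).
pts = torch.randn(6, 2)
print(greedy_decode(pts))
```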