0%

Pointer Networks

Pointer Network(Ptr-Net) uses attention as a pointer to select a member of the input sequence as the output. The author shows that Ptr-Net could solve three challenging geometric problems - finding planner convex hulls, computing Delunary triangulations, and the planner Travelling Salesman Problem.

Reference:

Intuitively, using a sequence-to-sequence model, such as an RNN, to solve combination optimization problems, we must address two issues: the input size, which has been solved by straightforward technology, and the solution size, which motivates the development of Ptr-Net.

RNN vs Ptr-Net

We must set a fixed output dimension(Fig 1.a) to address the problem of finding planar convex hulls in an RNN. It means that if the size of the solution exceeds this fixed dimension, we cannot obtain the correct answer theoretically.

To solve the problem of solution dimensions, create a new network structure (Ptr-Net) based on an attention-based model. Firstly, input the problem into Ptr-Net to obtain the starting point of the solution. Then, use that starting point along with the problem as the new input for Ptr-Net to generate a new answer, which serves as a new key. Repeat this process until we receive an end signal where the site has appeared(Fig 1.b).

In each step, using a parametric model to estimate the terms of the probability chain rule, i.e.

Here $(\mathcal{P}, \mathcal{C}^{\mathcal{P}})$ is a training pair, $\mathcal{P}=\left\{P_1, \ldots, P_n\right\}$ is a sequence of $n$ vectors and $\mathcal{C}^{\mathcal{P}}=\left\{C_1, \ldots, C_{m(\mathcal{P})}\right\}$ is a sequence of $m(\mathcal{P})$ indices, each between 1 and $n$.

The parameters of the model are learnt by maximizing the conditional probabilities for the training set, i.e.

where the sum is over training examples.

Compute the attention vector at each output time i as follows:

where $e_j$ is input data(encoder state), $d_i$ is output data(decoder state), $u$ is softmax normalized output distribution, and $W_1, W_2$ are learnable parameters of model.

step1

Then, select the site with the highest probability, 1 in the above figure.

step2

Next, choose site 1 data $(x_1, y_1)$ as the key and input it into the network to obtain the distribution. Continue until we receive the end signal, identified as $(x_1, y_0)$ in this example.