
Neural Combinatorial Optimization with Reinforcement Learning

Utilizing neural networks and reinforcement learning to tackle the Traveling Salesman Problem: the neural network model is a Recurrent Neural Network (RNN), and the reinforcement learning method is policy gradients.

Reference: Bello, Pham, Le, Norouzi, and Bengio, "Neural Combinatorial Optimization with Reinforcement Learning", 2016.

On the 2D Euclidean TSP, given an input graph, represented as a sequence of $n$ cities in a two-dimensional space $s=\left\{\mathbf{x}_i\right\}_{i=1}^n$ where each $\mathbf{x}_i \in \mathbb{R}^2$, we are concerned with finding a permutation of the points $\pi$, termed a tour, that visits each city once and has the minimum total length. The length of a tour defined by a permutation $\pi$ is

$$L(\pi \mid s)=\left\|\mathbf{x}_{\pi(n)}-\mathbf{x}_{\pi(1)}\right\|_2+\sum_{i=1}^{n-1}\left\|\mathbf{x}_{\pi(i)}-\mathbf{x}_{\pi(i+1)}\right\|_2,$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm. The aim is to learn a stochastic policy that assigns high probabilities to short tours and low probabilities to long tours.
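As a concrete check of this definition, here is a minimal NumPy sketch that computes the length of a closed tour; the function name `tour_length` and the unit-square example are my own, not from the paper.

```python
import numpy as np

def tour_length(points, perm):
    """Total Euclidean length of the closed tour visiting `points` in order `perm`.

    points: (n, 2) array of city coordinates x_i.
    perm:   permutation of range(n) defining the tour pi.
    """
    ordered = points[perm]                           # cities in visiting order
    diffs = ordered - np.roll(ordered, -1, axis=0)   # x_pi(i) - x_pi(i+1), wrapping back to pi(1)
    return float(np.linalg.norm(diffs, axis=1).sum())

# Corners of the unit square visited in order: perimeter is 4.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(tour_length(square, [0, 1, 2, 3]))  # 4.0
```

Note that the sum wraps around from $\pi(n)$ back to $\pi(1)$, which is why `np.roll` closes the tour.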

The neural network architecture uses the chain rule to factorize the probability of a tour as

$$p(\pi \mid s)=\prod_{i=1}^{n} p\left(\pi(i) \mid \pi(<i), s\right),$$

where $p(\pi \mid s)$ is the stochastic policy, and each term on the right-hand side is represented by an individual softmax module.
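The factorization can be sketched numerically. Below, `step_logits` is a stand-in for the decoder's per-step scores (purely illustrative; the real model produces these with an RNN), and the log-probability of a tour is the sum of per-step masked-softmax log-probabilities, with already-visited cities masked out at each step.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def tour_log_prob(step_logits, perm):
    """log p(pi|s) = sum_i log softmax(logits_i)[pi(i)], masking visited cities.

    step_logits: (n, n) array; row i scores every city at decoding step i
                 (an illustrative stand-in for the decoder outputs).
    """
    n = len(perm)
    visited = np.zeros(n, dtype=bool)
    logp = 0.0
    for i, city in enumerate(perm):
        logits = np.where(visited, -np.inf, step_logits[i])  # mask visited cities
        logp += np.log(softmax(logits)[city])
        visited[city] = True
    return logp

# Uniform logits: any tour of 3 cities has probability 1/3 * 1/2 * 1 = 1/6.
print(np.exp(tour_log_prob(np.zeros((3, 3)), [0, 1, 2])))  # ~0.1667
```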

The architecture is a pointer network.
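A minimal sketch of the pointing mechanism, assuming the standard attention form $u_j=\mathbf{v}^{\top}\tanh(W_{\text{ref}}\,\mathbf{e}_j+W_q\,\mathbf{d}_i)$; all tensor names and sizes here are illustrative stand-ins for the encoder states, decoder state, and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                      # 5 cities, hidden size 8 (illustrative sizes)

# Stand-ins for encoder states e_j and the current decoder state d_i.
enc = rng.normal(size=(n, d))
dec = rng.normal(size=d)

# Stand-ins for the learned attention parameters of the pointing mechanism.
W_ref = rng.normal(size=(d, d))
W_q = rng.normal(size=(d, d))
v = rng.normal(size=d)

# u_j = v^T tanh(W_ref e_j + W_q d_i): one logit per input city, so the
# "output vocabulary" is the input sequence itself and its size tracks n.
u = np.tanh(enc @ W_ref.T + dec @ W_q.T) @ v
probs = np.exp(u - u.max())
probs /= probs.sum()             # softmax over the n cities
```

This is what distinguishes a pointer network from an ordinary sequence-to-sequence model: the softmax is over positions of the input, so the same parameters handle inputs of any length.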

Proposes to use model-free, policy-based reinforcement learning to optimize the parameters of a pointer network, denoted $\boldsymbol{\theta}$. The training objective is the expected tour length which, given an input graph $s$, is defined as

$$J(\boldsymbol{\theta} \mid s)=\mathbb{E}_{\pi \sim p_{\boldsymbol{\theta}}(\cdot \mid s)} L(\pi \mid s).$$

The gradient is estimated with the REINFORCE algorithm:

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta} \mid s)=\mathbb{E}_{\pi \sim p_{\boldsymbol{\theta}}(\cdot \mid s)}\left[(L(\pi \mid s)-b(s))\, \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\pi \mid s)\right],$$

where $b(s)$ denotes a baseline function that does not depend on $\pi$ and estimates the expected tour length to reduce the variance of the gradients.
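To make the estimator concrete, here is a toy Monte-Carlo sketch in which the policy is a plain categorical distribution over three candidate tours, a stand-in for the pointer network; `reinforce_grad` and the tour lengths are invented for illustration. For a categorical with logits $\theta$, $\nabla_\theta \log p_\theta(\pi)$ is the one-hot vector of $\pi$ minus the probability vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: theta are logits over 3 candidate tours with known lengths.
theta = np.zeros(3)
lengths = np.array([4.0, 5.0, 6.0])

def reinforce_grad(theta, lengths, baseline, num_samples=10000, seed=0):
    """Monte-Carlo estimate of grad J = E[(L(pi) - b(s)) grad log p_theta(pi)]."""
    rng = np.random.default_rng(seed)
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for pi in rng.choice(len(p), size=num_samples, p=p):
        one_hot = np.eye(len(p))[pi]
        # (one_hot - p) is d/dtheta log p_theta(pi) for a categorical policy.
        grad += (lengths[pi] - baseline) * (one_hot - p)
    return grad / num_samples

g = reinforce_grad(theta, lengths, baseline=lengths.mean())
# g[0] < 0: gradient descent on J increases theta[0], raising the
# probability of the shortest tour (index 0).
```

The baseline shifts each sample's weight by $b(s)$ without changing the expectation, since $\mathbb{E}[\nabla_\theta \log p_\theta(\pi)] = 0$; only the variance of the estimate changes.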

From the above discussion, it is clear that $b(s)$ is crucial: it supplies the yardstick against which sampled tour lengths are judged, and the more accurately $b(s)$ estimates the expected tour length, the lower the variance of the gradient estimate and the better the trained policy.

We introduce an auxiliary network, called a critic and parameterized by $\theta_v$, to learn the expected tour length found by our current policy $p_\theta$ given an input sequence $s$.
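One way to picture the critic update is as regression of $b(s)$ onto observed tour lengths with a squared-error loss $\left(b_{\theta_v}(s)-L(\pi \mid s)\right)^2$. The sketch below uses a linear critic over hand-made summary features as a stand-in for the paper's critic network; `features`, the learning rate, and the target length are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(3)                  # parameters theta_v of the toy linear critic

def features(s):
    # s: (n, 2) city coordinates; crude summary features plus a bias term
    # (invented for illustration, not from the paper).
    return np.array([len(s), s.std(), 1.0])

def critic_update(w, s, observed_length, lr=0.001):
    b = features(s) @ w                               # current estimate b(s)
    grad = 2.0 * (b - observed_length) * features(s)  # d/dw (b - L)^2
    return w - lr * grad

s = rng.random((10, 2))          # a random 10-city instance
target = 5.0                     # pretend length of a tour sampled from p_theta
for _ in range(500):
    w = critic_update(w, s, target)
# after training, b(s) = features(s) @ w approaches the observed length
```

In the actor-critic scheme, this regression runs alongside the policy gradient step: the critic chases the current policy's tour lengths while the policy uses the critic's estimate as its baseline.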

I believe $b(s)$ is more than just an auxiliary network. Like an evaluation engine in chess, $b(s)$ provides a criterion for the underlying optimization problem, and such a criterion can be integrated seamlessly into search algorithms. To use it this way for the TSP, however, we would need a practical way to incorporate actions into $s$.
