This paper proposes the NeuRewriter method, trained with an actor-critic scheme from reinforcement learning.
The operations feel similar to those in "A Monte Carlo Policy Gradient Method with Local Search for Binary Optimization"; of course, NeuRewriter came first, by roughly four years.
Problem Setup
Let $\mathcal{S}$ be the solution space of the problem domain, and $c: \mathcal{S} \rightarrow \mathbb{R}$ be the cost function. The goal of optimization is to find $\arg \min _{s \in \mathcal{S}} c(s)$. Formally, each solution is a state, and each local region together with its associated rewriting rule is an action.
Optimization as a rewriting problem. Let $\mathcal{U}$ be the rewriting ruleset. Suppose $s_t$ is the current solution (or state) at iteration $t$. We first compute a state-dependent region set $\Omega\left(s_t\right)$, then pick a region $\omega_t \in \Omega\left(s_t\right)$ using the region-picking policy $\pi_\omega\left(\omega_t \mid s_t\right)$. We then pick a rewriting rule $u_t$ applicable to that region $\omega_t$ using the rule-picking policy $\pi_u\left(u_t \mid s_t\left[\omega_t\right]\right)$, where $s_t\left[\omega_t\right]$ is a subset of state $s_t$.
$\Omega\left(s_t\right)$ is a problem-dependent region set. For expression simplification, $\Omega\left(s_t\right)$ includes all sub-trees of the expression parse trees; for job scheduling, $\Omega\left(s_t\right)$ covers all job nodes for scheduling; and for vehicle routing, it includes all nodes in the route.
We then apply this rewriting rule $u_t \in \mathcal{U}$ to $s_t\left[\omega_t\right]$, and obtain the next state $s_{t+1}=$ $f\left(s_t, \omega_t, u_t\right)$. Given an initial solution (or state) $s_0$, our goal is to find a sequence of rewriting steps $\left(s_0,\left(\omega_0, u_0\right)\right),\left(s_1,\left(\omega_1, u_1\right)\right), \ldots,\left(s_{T-1},\left(\omega_{T-1}, u_{T-1}\right)\right), s_T$ so that the final cost $c\left(s_T\right)$ is minimized.
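For concreteness, here is a minimal Python sketch of this iterative rewriting loop. All helper names (`regions`, `pick_region`, `pick_rule`, `apply_rule`) are hypothetical placeholders I introduce for illustration, not names from the paper:

```python
def rewrite(s0, regions, pick_region, pick_rule, apply_rule, cost, T=50):
    """Iteratively improve a solution by local rewriting (sketch).

    s0          -- initial feasible solution (state)
    regions     -- returns the problem-dependent region set Omega(s_t)
    pick_region -- samples omega_t from the region-picking policy pi_omega
    pick_rule   -- samples u_t from the rule-picking policy pi_u
    apply_rule  -- transition function f(s_t, omega_t, u_t)
    cost        -- cost function c(s)
    """
    s, best = s0, s0
    for t in range(T):
        omega = pick_region(s, regions(s))   # choose a local region to rewrite
        u = pick_rule(s, omega)              # choose a rewriting rule for that region
        s = apply_rule(s, omega, u)          # s_{t+1} = f(s_t, omega_t, u_t)
        if cost(s) < cost(best):             # keep the best solution seen so far
            best = s
    return best
```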
This part introduces two new functions, region-picking and rule-picking. What do they mean? And at the end there is the rewriting rule; I need the precise meaning of rewriting. A similar idea predates this paper in [Halide](https://github.com/halide/Halide). Instead of searching from scratch, this work searches for a solution by iteratively applying local rewriting rules to the existing one until convergence. The rewriting formulation is therefore suited to problems with the following properties:

* A feasible solution is easy to find.
* There are well-behaved local structures that can be exploited to incrementally improve the solution. This is a hard property to satisfy in spin glasses.

Three terms we need to understand:

* region-picking
* rule-picking
* rewriting rule

Neural Rewriter Model
The figure above shows the entire model framework. The pseudo-code is given below.
Score predictor. Given the state $s_t$, the score predictor computes a score $Q\left(s_t, \omega_t\right)$ for every $\omega_t \in \Omega\left(s_t\right)$, which measures the benefit of rewriting $s_t\left[\omega_t\right]$. A high score indicates that rewriting $s_t\left[\omega_t\right]$ could be desirable.
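As I understand it (this is my interpretation, not a quotation of the paper), the region-picking policy $\pi_\omega$ can be obtained from these scores, for example via a softmax over $Q\left(s_t, \omega\right)$ for all candidate regions. A minimal NumPy sketch; the `temperature` parameter is my own addition:

```python
import numpy as np

def pick_region_from_scores(q_scores, temperature=1.0, rng=None):
    """Sample a region index from a softmax over the predicted scores Q(s_t, omega)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_scores, dtype=float) / temperature
    logits -= logits.max()                        # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(probs), p=probs)        # index of the chosen region omega_t
```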
In lines 2-10, I believe this algorithm resembles the Monte Carlo method: it first builds a probability distribution over $\Omega\left(s_t\right)$ from the predicted scores, then samples a region from it. However, the choice is not entirely random, since an acceptance probability is set. Through this process, high-quality data for learning are generated. Thus, the region-picking function serves as a judgment function that assesses whether rewriting a region is favorable. The loss function of this part is written as follows.
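To my recollection (this is a reconstruction from memory, not copied from the paper), it is a squared error between the predicted score and the observed discounted return, where the per-step reward is the cost reduction $r\left(s_t,\left(\omega_t, u_t\right)\right)=c\left(s_t\right)-c\left(s_{t+1}\right)$:

$$
L_\omega=\sum_{t=0}^{T-1}\left(\sum_{t^{\prime}=t}^{T-1} \gamma^{t^{\prime}-t}\, r\left(s_{t^{\prime}},\left(\omega_{t^{\prime}}, u_{t^{\prime}}\right)\right)-Q\left(s_t, \omega_t\right)\right)^2 .
$$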
Rule selector. Given $s_t\left[\omega_t\right]$ to be rewritten, the rule-picking policy predicts a probability distribution $\pi_u\left(s_t\left[\omega_t\right]\right)$ over the entire ruleset $\mathcal{U}$, and selects a rule $u_t \in \mathcal{U}$ to apply accordingly.
In lines 11-16, the Advantage Actor-Critic algorithm is employed to train both the rule-picking model and the score predictor. The primary network we aim to obtain is the rule-picking model, which will help us address the subsequent problems.
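As a rough illustration, here is a hedged PyTorch-style sketch of an advantage actor-critic update for the rule selector, with the score predictor acting as the critic; the variable names and the exact form of the advantage are my assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(rule_logits, chosen_rules, q_pred, returns):
    """Compute rule-selector (actor) and score-predictor (critic) losses (sketch).

    rule_logits  -- tensor [T, |U|], unnormalized log-probs of pi_u at each step
    chosen_rules -- tensor [T], index of the rule u_t actually applied
    q_pred       -- tensor [T], predicted scores Q(s_t, omega_t)
    returns      -- tensor [T], observed discounted cumulative rewards
    """
    log_pi = F.log_softmax(rule_logits, dim=-1)
    log_pi_taken = log_pi.gather(1, chosen_rules.unsqueeze(1)).squeeze(1)
    advantage = (returns - q_pred).detach()          # advantage estimate; critic not updated here
    actor_loss = -(advantage * log_pi_taken).mean()  # policy-gradient loss for the rule selector
    critic_loss = F.mse_loss(q_pred, returns)        # squared error for the score predictor
    return actor_loss, critic_loss
```

The two losses would then be combined (e.g. summed with a weighting coefficient) and minimized jointly.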
A rewriting rule merely changes the way a solution is described; the underlying problem stays the same, similar to a gauge transformation in the SK model.