2209.05726
Data efficient reinforcement learning and adaptive optimal perimeter control of network traffic dynamics
Can Chen, Yunping Huang, William H.K. Lam, Tianlu Pan, Shu-Chien Hsu, Agachai Sumalee, Renxin Zhong
correct (medium confidence)
- Category: Not specified
- Journal tier: Specialist/Solid
- Processed: Sep 28, 2025, 12:56 AM
- arXiv Links: Abstract ↗ | PDF ↗
Audit review
The paper’s Theorem 4.1 proves the equivalence between the off-policy IRL Bellman equation (32) and the policy-iteration equations (25)–(26) by (i) deriving (32) from (25)–(26) via the chain rule and the identity (∂V/∂ñ)^T S = 2 (v ⊙ D)^T R, and (ii) showing that (32) implies the policy-evaluation PDE (25) as Δt → 0, given D = (1/(2v)) ⊙ (R^{-1} S^T ∂V/∂ñ). The candidate solution repeats the same cancellations and limit argument, and additionally re-derives the policy-improvement step by minimizing the Hamiltonian with the saturated cost U, recovering ũ_{k+1} = −v ⊙ tanh(D_{k+1}) + v as in (26). This aligns with the paper’s formulation of policy iteration and the saturated optimal control law in Section 3. Hence both proofs are correct and rest on essentially the same argument, differing only in presentation.
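As a worked sketch of the cancellation in step (i), using only the identities quoted above and assuming the control-weight matrix R is symmetric (standard for a quadratic control penalty, though not restated here), the chain of equalities is:

\[
v \odot D
= v \odot \frac{1}{2v} \odot \Bigl(R^{-1} S^{\top} \frac{\partial V}{\partial \tilde{n}}\Bigr)
= \tfrac{1}{2}\, R^{-1} S^{\top} \frac{\partial V}{\partial \tilde{n}}
\;\Longrightarrow\;
2\,(v \odot D)^{\top} R
= \Bigl(\frac{\partial V}{\partial \tilde{n}}\Bigr)^{\!\top} S\, R^{-\top} R
= \Bigl(\frac{\partial V}{\partial \tilde{n}}\Bigr)^{\!\top} S .
\]

This is the substitution that turns (25)–(26) into the off-policy Bellman equation (32); the remaining steps (the Δt → 0 limit and the tanh-based improvement law ũ_{k+1} = −v ⊙ tanh(D_{k+1}) + v) are as summarized above.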
Referee report (LaTeX)
\textbf{Recommendation:} Minor revisions.

\textbf{Journal Tier:} Specialist/solid.

\textbf{Justification:} The equivalence proof between the IRL Bellman equation and policy iteration is correct and clearly structured. Stating the assumptions (smoothness/admissibility) more explicitly and briefly justifying the uniqueness of the policy-improvement minimizer under the tanh-based cost would improve rigor. The contribution is practical and relevant for model-free perimeter control with heterogeneous data.