2209.05726
Data efficient reinforcement learning and adaptive optimal perimeter control of network traffic dynamics
Can Chen, Yunping Huang, William H.K. Lam, Tianlu Pan, Shu-Chien Hsu, Agachai Sumalee, Renxin Zhong
correct (medium confidence)
- Category: Not specified
- Journal tier: Specialist/Solid
- Processed: Sep 28, 2025, 12:56 AM
- arXiv Links: Abstract ↗ | PDF ↗
Audit review
The paper’s Theorem 4.1 proves the equivalence between the off-policy IRL Bellman equation (32) and the policy-iteration equations (25)–(26) by (i) deriving (32) from (25)–(26) via the chain rule and the identity (∂V/∂ñ)^T S = 2 (v ⊙ D)^T R, and (ii) showing that (32) implies the policy-evaluation PDE (25) as Δt → 0, given D = (1/(2v)) ⊙ (R^{-1} S^T ∂V/∂ñ). The candidate solution repeats the same cancellations and limit argument, and additionally re-derives the policy-improvement step by minimizing the Hamiltonian with the saturated cost U, recovering ũ_{k+1} = −v ⊙ tanh(D_{k+1}) + v as in (26). This aligns with the paper’s formulation of policy iteration and the saturated optimal control law in Section 3. Hence both proofs are correct and rest on essentially the same argument, differing only in presentation.
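As a worked sketch of the cancellation in step (i), using only the identities quoted above and assuming the control-weight matrix R is symmetric (standard for a quadratic control penalty, though not restated here), the chain of equalities is:

\[
v \odot D
= v \odot \frac{1}{2v} \odot \Bigl(R^{-1} S^{\top} \frac{\partial V}{\partial \tilde{n}}\Bigr)
= \tfrac{1}{2}\, R^{-1} S^{\top} \frac{\partial V}{\partial \tilde{n}}
\;\Longrightarrow\;
2\,(v \odot D)^{\top} R
= \Bigl(\frac{\partial V}{\partial \tilde{n}}\Bigr)^{\!\top} S\, R^{-\top} R
= \Bigl(\frac{\partial V}{\partial \tilde{n}}\Bigr)^{\!\top} S .
\]

This is the substitution that turns (25)–(26) into the off-policy Bellman equation (32); the remaining steps (the Δt → 0 limit and the tanh-based improvement law ũ_{k+1} = −v ⊙ tanh(D_{k+1}) + v) are as summarized above.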
Referee report (LaTeX)
\textbf{Recommendation:} Minor revisions.

\textbf{Journal Tier:} Specialist/solid.

\textbf{Justification:} The equivalence proof between the IRL Bellman equation and policy iteration is correct and clearly structured. Stating the assumptions (smoothness/admissibility) more explicitly and briefly justifying the uniqueness of the policy-improvement minimizer under the tanh-based cost would improve rigor. The contribution is practical and relevant for model-free perimeter control with heterogeneous data.