2207.12854
Variational multiscale reinforcement learning for discovering reduced order closure models of nonlinear spatiotemporal transport systems
Omer San, Suraj Pawar, Adil Rasheed
wrong (medium confidence)
- Category
- math.DS
- Journal tier
- Specialist/Solid
- Processed
- Sep 28, 2025, 12:56 AM
- arXiv Links
- Abstract ↗ · PDF ↗
Audit review
The paper defines the VMRL binary reward as r_t = +10 if σ||s_base^t − s_ROM^t|| < ||s_base^t − s_test^t|| and −10 otherwise (its Eq. 22), with s_base the base ROM (Eq. 4), s_ROM the closure ROM (Eq. 7), and s_test the test model state constructed from the variational multiscale split. Under these definitions, setting the closure identically to zero makes the closure ROM coincide with the base ROM, i.e., s_ROM^t ≡ s_base^t, so the left-hand side becomes zero and the strict inequality is satisfied whenever s_base^t ≠ s_test^t. Thus the zero-closure policy earns the per-step maximum reward whenever the yardstick is nonzero, making the RL objective degenerate and trivially optimized. The paper never acknowledges this degeneracy and, moreover, claims empirical improvements of VMRL over the base ROM, which conflicts with the reward logic as written. The candidate solution correctly identifies and proves this degeneracy and provides a careful value-function rewriting; the paper’s formulation is therefore flawed or mis-specified (very likely a sign/target mismatch in Eq. 22 or in the state definitions).
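For concreteness, the degeneracy argument can be written out as a short derivation in the notation above. This is a sketch restating the audit's reasoning; the only symbols assumed beyond those already defined are the scaling factor σ > 0 and an episode length T (and amsmath for the align* environment):

% Sketch of the degeneracy under the zero-closure policy (notation as in the audit above).
% Assumes \sigma > 0 and an episode of T steps; both come from the reward definition as
% summarized here, not from a re-derivation of the paper.
\begin{align*}
  \text{zero closure} \;\Rightarrow\; s_{\mathrm{ROM}}^{t} &\equiv s_{\mathrm{base}}^{t}
    && \text{(closure ROM collapses to the base ROM, Eqs.~4 and 7)} \\
  \Rightarrow\; \sigma\,\lVert s_{\mathrm{base}}^{t} - s_{\mathrm{ROM}}^{t} \rVert = 0
    &< \lVert s_{\mathrm{base}}^{t} - s_{\mathrm{test}}^{t} \rVert
    && \text{whenever } s_{\mathrm{base}}^{t} \neq s_{\mathrm{test}}^{t} \\
  \Rightarrow\; r_{t} &= +10 \quad \text{for every such } t,
    && \text{so the return } \textstyle\sum_{t=1}^{T} r_{t} = 10\,T \text{ is already maximal.}
\end{align*}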
Referee report (LaTeX)
\textbf{Recommendation:} major revisions

\textbf{Journal Tier:} specialist/solid

\textbf{Justification:} The manuscript tackles an important problem and proposes a creative VMS-inspired reward design for RL-based ROM closures. However, with the reward as written, a trivial policy (zero closure) provably maximizes the return whenever the base–test discrepancy is nonzero, so the objective becomes degenerate. This contradicts the stated objective (to move the closure ROM closer to the test model) and the empirical claims of improved accuracy over the base ROM. The contribution can be made solid by correcting the reward definition, providing theoretical non-degeneracy arguments, and reconciling the implementation with the text.
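One way the reward could be repaired, consistent with the stated objective of driving the closure ROM toward the test model, is sketched below. This is a hypothetical correction offered for illustration only, not the authors' Eq.~22; the placement of σ as a tolerance factor with σ ∈ (0, 1] is likewise an assumption:

% Hypothetical non-degenerate reward (an assumed correction, not the paper's Eq. 22):
% reward the agent when the closure ROM tracks the test model more closely than the
% base ROM does; the factor \sigma \in (0,1] is assumed to act as a tolerance margin.
% Under zero closure s_ROM = s_base, so the condition fails (for \sigma \le 1) whenever
% the base-test discrepancy is nonzero, and the trivial policy is no longer optimal.
r_{t} =
  \begin{cases}
    +10, & \text{if } \lVert s_{\mathrm{ROM}}^{t} - s_{\mathrm{test}}^{t} \rVert
             < \sigma\,\lVert s_{\mathrm{base}}^{t} - s_{\mathrm{test}}^{t} \rVert, \\
    -10, & \text{otherwise.}
  \end{cases}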