2503.21929
Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
Tom Kempton, Stuart Burrell
wrong (medium confidence)
- Category: math.DS
- Journal tier: Specialist/Solid
- Processed: Sep 28, 2025, 12:56 AM
Audit review
Section 5 of the paper correctly derives the top-k and nucleus results from the standard Gibbs variational lemma (Lemma 5.1), obtaining an objective for the locally normalized joint distribution that is the sum of an entropy term, an average log-likelihood term, and a log-distortion term (Corollaries 5.1–5.2). However, Corollary 5.3 (temperature) states the third term as E_µ[ε_τ(w)] rather than E_µ[log ε_τ(w)]. This is inconsistent with Lemma 5.1 and with the derivation pattern used for top-k and nucleus sampling: the distortion must enter through its logarithm so that the Gibbs maximizer reproduces q_τ ∝ p^{1/τ}·ε_τ rather than p^{1/τ}·e^{ε_τ}. The paper otherwise correctly notes that global normalization removes the distortion term (Corollary 5.4). The candidate solution pinpoints and corrects the temperature misstatement and matches the paper’s logic everywhere else, including the telescoping/normalization structure and the contrast between global and local normalization.
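The role of the logarithm can be checked numerically on a toy example. The sketch below is not taken from the paper. It assumes a single decoding step over three tokens, with hypothetical values for `p`, `eps` and `tau`, whereas the paper's ε_τ is a path-dependent quantity over sequences. It relies only on the standard finite Gibbs variational principle: H(µ) + E_µ[φ] is maximized by µ ∝ e^φ, with maximum value log Σ e^φ.

```python
import math

def gibbs_maximizer(phi):
    """Return mu proportional to exp(phi), the maximizer of H(mu) + E_mu[phi]."""
    m = max(phi)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in phi]
    z = sum(weights)
    return [w / z for w in weights]

def objective(mu, phi):
    """Entropy plus expected potential: -sum mu*log(mu) + sum mu*phi."""
    return sum(-m * math.log(m) + m * f for m, f in zip(mu, phi) if m > 0)

# Hypothetical toy values (assumptions, not from the paper).
p = [0.5, 0.3, 0.2]    # model probabilities for one step
eps = [1.2, 0.7, 1.5]  # stand-in positive distortion factors
tau = 0.8              # temperature

# Potential with the log-distortion term, as the audit says it should be.
phi_log = [math.log(pi) / tau + math.log(e) for pi, e in zip(p, eps)]
# Potential with the raw distortion term, as Corollary 5.3 is reported to state.
phi_raw = [math.log(pi) / tau + e for pi, e in zip(p, eps)]

# Target distribution q_tau proportional to p^(1/tau) * eps.
target_w = [pi ** (1 / tau) * e for pi, e in zip(p, eps)]
target = [w / sum(target_w) for w in target_w]

mu_log = gibbs_maximizer(phi_log)  # coincides with target
mu_raw = gibbs_maximizer(phi_raw)  # proportional to p^(1/tau) * exp(eps) instead

print("target :", target)
print("log eps:", mu_log)
print("raw eps:", mu_raw)
```

In this toy setting the log-distortion potential recovers the target distribution exactly, while the raw-distortion potential yields a different distribution, weighted by e^{ε} rather than ε.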
Referee report (LaTeX)
\textbf{Recommendation:} minor revisions

\textbf{Journal Tier:} specialist/solid

\textbf{Justification:} The paper develops a clean, unified variational perspective on decoding strategies, exposing a path-dependent distortion due to local normalization and motivating global normalization. The mathematics is accessible and the empirical illustrations are supportive. However, the temperature corollary (Corollary 5.3) is misstated: it uses $E_\mu[\varepsilon_\tau(w)]$ instead of $E_\mu[\log \varepsilon_\tau(w)]$, which directly conflicts with the stated lemma (Lemma 5.1) and should be corrected. With this fix and small clarifications of assumptions, the contribution is solid and useful to the community.