No Hidden Prompts Needed!

You Can Game AI Peer Review with Presentation-Only Revisions

The result

As LLM-generated reviews enter peer review, robustness work has focused on explicit attacks like prompt injection. We surface a subtler, policy-relevant risk: modifying only presentation-level content (abstract, contribution statements, narrative) with scientific content held fixed can systematically overturn AI-review outcomes.

Attack success rate: 75.1% (cross-model average · ΔS ≥ +1)
Mean score gain: +1.21 (out of 10 · science unchanged)

[Animation: adversarial repackaging moving the verdict from Reject toward Accept, scientific content unchanged]
  • Contribution list enhancement
  • Abstract reframing
  • Related-work repositioning
  • Analytical discussion expansion

The necessary condition

Resistance to presentation-only review gaming.

We propose this as a necessary condition for AI review automation: necessary, not sufficient.

condition

When scientific content is unchanged, AI reviewer scores should not become systematically more favorable merely because the presentation is adjusted.

Clearer writing may be acknowledged, but it must not be exploitable to inflate perceived scientific value.

Unlike prompt injection or hidden text (already banned and grounds for desk rejection), these edits are legitimate, visible, and within normal writing practice. That makes them far harder to guard against.

EDITING ZONES · framing → evidence

Free zone · narrative framing · may be rewritten
Abstract · Introduction · Related Work · Discussion · Conclusion
These sections may be reframed, reorganized, and renarrated, but they may not introduce any scientific claim unsupported by the original.

Limited zone · technical exposition · rephrase only
Method descriptions · Result analysis
These passages may be reworded or reorganized, but their factual content must be preserved.

Fixed zone · scientific evidence · immutable
Data · Tables · Figures · Equations · Proofs · Numbers
The experimental evidence is untouched. This is what the work actually contributes.
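
The three zones can be read as a small edit policy. The sketch below is illustrative only: the section labels, the `claims_supported` flag (standing in for a separate human or LLM check that every claim is backed by the original), and the number-matching heuristic are assumptions, not the authors' implementation.

```python
import re
from enum import Enum


class Zone(Enum):
    FREE = "free"        # narrative framing: may be rewritten
    LIMITED = "limited"  # technical exposition: rephrase only
    FIXED = "fixed"      # scientific evidence: immutable


# Hypothetical section labels; the zone assignment follows the zones listed above.
ZONE_OF = {
    "abstract": Zone.FREE,
    "introduction": Zone.FREE,
    "related_work": Zone.FREE,
    "discussion": Zone.FREE,
    "conclusion": Zone.FREE,
    "method": Zone.LIMITED,
    "result_analysis": Zone.LIMITED,
    "data": Zone.FIXED,
    "table": Zone.FIXED,
    "figure": Zone.FIXED,
    "equation": Zone.FIXED,
    "proof": Zone.FIXED,
}

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers(text: str) -> list:
    """Sorted numeric tokens in `text` (a crude proxy for 'same numbers')."""
    return sorted(NUMBER.findall(text))


def edit_allowed(section: str, before: str, after: str, claims_supported: bool) -> bool:
    """Return True if replacing `before` with `after` respects the zone rules."""
    zone = ZONE_OF.get(section, Zone.FIXED)  # unknown sections get the strictest zone
    if zone is Zone.FIXED:
        return after == before               # evidence must be byte-identical
    if zone is Zone.LIMITED:
        # Rewording is fine, but claims and reported numbers must survive.
        return claims_supported and numbers(after) == numbers(before)
    return claims_supported                  # free zone: only unsupported claims are barred
```

Defaulting unknown sections to the fixed zone is a conservative design choice: an edit is rejected unless it is explicitly placed in a zone that permits it.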

The method

Adversarial Repackaging.

Three design choices define it; each round then runs them as a closed loop.

Closed-loop iteration

Best-version tracking across rounds, not a single blind rewrite.

Full-paper editing

The whole presentation layer is in scope, not just the abstract.

Signal-driven strategy

Each round conditions on the reviewer's own specific feedback.

REPACKAGING LOOP

Step 1 · Profile

Extract structured signals from the N reviews: recurring perceptions, tagged by frequency and severity.

BEST-VERSION STATE
[Chart: best score across rounds r2 to r5]

Best score climbs only on accepted rounds; failed edits are rolled back, never carried.
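
A minimal sketch of best-version tracking with rollback follows. The `review` and `rewrite` callables are hypothetical stand-ins for the AI reviewer and the repackaging step, and the acceptance gate (a strict score improvement) is an assumption; the page does not spell out the exact gate.

```python
from typing import Callable, List, Tuple


def repackage(
    paper: str,
    review: Callable[[str], float],
    rewrite: Callable[[str, int], str],
    rounds: int = 5,
) -> Tuple[str, float, List[bool]]:
    """Run a closed loop that keeps only the best-scoring version so far."""
    best_paper, best_score = paper, review(paper)
    accepted: List[bool] = []
    for r in range(1, rounds + 1):
        candidate = rewrite(best_paper, r)  # always edit from the current best version
        score = review(candidate)
        if score > best_score:              # assumed gate: strict improvement
            best_paper, best_score = candidate, score
            accepted.append(True)
        else:
            accepted.append(False)          # rolled back: the candidate is discarded
    return best_paper, best_score, accepted


# Toy usage with a lookup table in place of a real reviewer.
toy_scores = {"v0": 3.0, "v0>1": 4.0, "v0>1>2": 3.5, "v0>1>3": 5.0}
result = repackage(
    "v0",
    review=lambda p: toy_scores.get(p, 0.0),
    rewrite=lambda p, r: f"{p}>{r}",
    rounds=3,
)
```

In the toy run, round 2 scores below the current best, so round 3 starts again from the round-1 version rather than from the failed edit.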

STRATEGY POOL · 20+
Narrative restructuring: changes how the reviewer interprets the paper
Surface editing: improves presentation without altering the narrative

Finding 1

Overall attack effectiveness.

Repackaging alone shifts AI reviews. With methods, experiments, figures and numbers held fixed, presentation edits alone produce systematic score gains across all three reviewer models: +1.21 mean, 75.1% success rate.

Attack success rate: 75.1% (papers with ΔS ≥ +1)
Mean score gain: +1.21 (out of 10 · science unchanged)

[Figure: original vs. attacked review scores]

Numerical score
ΔS: mean review-score shift, out of 10
ASR: share of papers with ΔS ≥ +1

Content shift · pairwise judge (scores capture only half)
Δ strength ↑: perceived strengths (higher means more praise)
Δ severity ↓: weakness severity (lower means less criticism)
Δ content ↑: Δ strength − Δ severity (net content-level gain)
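
The metric definitions above translate directly into code. This is a small sketch with made-up scores; it assumes each paper's score has already been averaged over its reviews, and the function names are illustrative.

```python
from statistics import mean
from typing import List, Sequence


def score_shifts(original: Sequence[float], attacked: Sequence[float]) -> List[float]:
    """Per-paper shift: attacked score minus original score."""
    return [a - o for o, a in zip(original, attacked)]


def mean_delta_s(original: Sequence[float], attacked: Sequence[float]) -> float:
    """ΔS: mean review-score shift across papers."""
    return mean(score_shifts(original, attacked))


def attack_success_rate(
    original: Sequence[float], attacked: Sequence[float], threshold: float = 1.0
) -> float:
    """ASR: share of papers whose shift is at least `threshold` (ΔS ≥ +1)."""
    shifts = score_shifts(original, attacked)
    return sum(s >= threshold for s in shifts) / len(shifts)


def delta_content(d_strength: float, d_severity: float) -> float:
    """Net content-level gain: Δ strength − Δ severity."""
    return d_strength - d_severity


# Toy scores for four papers (illustrative values, not results from the paper).
original = [3, 4, 5, 4]
attacked = [5, 4, 6, 6]
print(mean_delta_s(original, attacked))         # 1.25
print(attack_success_rate(original, attacked))  # 0.75
print(delta_content(2.0, -1.0))                 # 3.0
```

Because Δ severity is subtracted, a drop in weakness severity (a negative value) increases Δ content, which matches the "lower means less criticism" reading.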

Built on three design dimensions.

Our attack beats every baseline on both reviewers; effectiveness comes from the synergy of all three (paper subset · ΔS per reviewer).

Method                       Description                                   ΔS · Sonnet 4   ΔS · Sonnet 4.5
Zero-shot Paper Laundering   one full-paper rewrite, no iteration          +0.33           +0.25
PAA                          iterative, but abstract-only                  +0.46           +0.27
Research Agent               iterative full-paper, no strategy selection   +0.90           +0.55
Ours                         all three combined                            +1.53           +1.02

Design dimensions compared per method: Closed-loop · Full-paper · Signal-driven
Legend: has it · signal-guided, but no strategy pool · missing

Finding 2

The strength–weakness asymmetry.

Easier to impress than to convince. AI reviewers respond to positive presentation signals in a stable, predictable manner, yet their response to attempts at dissolving criticism is uncontrollable and frequently backfires. They reward the salience of strengths more readily than they forgive the evidence of weaknesses.

86.1% of rounds improve perceived strength (mean Δ +2.19 · backfire only 12.4%)
vs
31.6% of weakness edits backfire (2.6× the strength backfire rate)

(a) Per-round Δ distribution
[Figure: distributions of Δ strength and Δ severity; x-axis: Δ score (post-attack − baseline), from −10 to 10; annotations: 86.1% positive, 31.6% backfire]

Δ strength is right-skewed and unimodal; Δ severity is bimodal: a second hump on the positive side is criticism that got harsher.

(b) Joint outcome by round type
S = perceived strength · W = weakness severity (↑ rises, ↓ falls)
Legend: S↑ W↓ ideal · S↑ W↑ mixed · S↓ W↓ rare · S↓ W↑ worst

Round type        S↑ W↓ ideal   S↑ W↑ mixed   S↓ W↑ worst
All rounds        67.7%         18.4%         13.1%
Accepted update   99.2%
No update         53.6%         26.3%         19%

In rounds that fail the gate, strengths are still enhanced 79.9% of the time, yet weaknesses deteriorate in 45.3%; failures stem not from insufficient strength gains but from weaknesses resistant to dissolution. For 79.2% of papers, mean strength gain exceeds mean weakness reduction.

SWAMPING EFFECT

Among all rounds where the overall score rises, 15.8% simultaneously exhibit worsening weaknesses. Even when deficiencies are criticized more harshly, the score still climbs as long as sufficiently salient new strengths are introduced, and the aggregate judgment is swamped by amplified strength signals.
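
The joint outcomes and the swamping share can be computed from per-round deltas. The sketch below assumes each round is recorded as a hypothetical tuple `(d_score, d_strength, d_severity)`, and that a delta of exactly zero counts as "not rising"; both are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Round = Tuple[float, float, float]  # (d_score, d_strength, d_severity)


def quadrant(d_strength: float, d_severity: float) -> str:
    """Label a round by whether strength (S) and weakness severity (W) rose."""
    s_up, w_up = d_strength > 0, d_severity > 0
    if s_up and not w_up:
        return "ideal"   # S up, W down
    if s_up and w_up:
        return "mixed"   # S up, W up
    if not s_up and not w_up:
        return "rare"    # S down, W down
    return "worst"       # S down, W up


def quadrant_shares(rounds: List[Round]) -> Dict[str, float]:
    """Share of rounds falling in each quadrant."""
    counts = Counter(quadrant(ds, dw) for _, ds, dw in rounds)
    return {label: n / len(rounds) for label, n in counts.items()}


def swamping_share(rounds: List[Round]) -> float:
    """Among rounds whose score rises, the share whose weaknesses worsen."""
    rising = [r for r in rounds if r[0] > 0]
    if not rising:
        return 0.0
    return sum(r[2] > 0 for r in rising) / len(rising)


# Toy rounds (illustrative values only).
toy_rounds = [(1, 2.0, -1.0), (1, 3.0, 0.5), (0, 1.0, 1.0), (-1, -0.5, 2.0)]
print(quadrant_shares(toy_rounds))  # {'ideal': 0.25, 'mixed': 0.5, 'worst': 0.25}
print(swamping_share(toy_rounds))   # 0.5
```

In the toy data the second round is a swamped one: its weaknesses worsen, yet the score still rises because the strength gain is larger.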

Finding 3

Strategy effectiveness gradient.

Reframing beats polishing. Success is not “better writing → higher score.” Edits that change how the reviewer understands the paper (what it contributes and how significant it is) far outperform ones that only improve surface appearance. And a strategy that opens the attack is not the one that sustains it.

Stage 1 · first breakthrough

What opens the attack

First-hit attribution: share of first successful rounds that contain the strategy.

Contribution list enhancement (#1 opener): 87.2%
Analytical discussion expansion: 66%
Related-work repositioning: 44.7%
Self-deprecation removal: 44.7%
Abstract reframing: 42.6%

Surface edits (table formatting, text polishing, algorithm boxes) rarely appear in a first breakthrough at all. The opener is structural or narrative.

Later rounds · sustained

What sustains the gain

Accepted-exposure rate: share of a strategy’s rounds accepted as the new best (baseline 30.8%).

Related-work repositioning: 49.3%
Analytical discussion expansion: 44.9%
Self-deprecation removal: 40%
Abstract reframing: 37.4%
Contribution list enhancement (↓ fades to #5): 36.8%
Baseline: 30.8%
Table formatting: 29.8%
Local text polishing: 27.8%
Algorithm boxes: 26.5%

Contribution-list enhancement opens 87.2% of first breakthroughs but sustains only a 36.8% accepted-exposure rate, with diminishing returns in later rounds. Narrative restructuring (related-work repositioning, discussion expansion) is what sustains effectiveness once the baseline is raised, while surface edits (table formatting, polishing, algorithm boxes) stay below the 30.8% baseline.
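
Both attribution metrics can be sketched from a log of rounds. The data layout here is hypothetical: each campaign is a list of `(strategies, accepted)` pairs in round order. Treating the baseline as the overall share of accepted rounds is also an assumption, and this sketch does not separate the first breakthrough from later rounds.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Round = Tuple[Set[str], bool]  # (strategies used, accepted as the new best?)
Campaign = List[Round]


def first_hit_attribution(campaigns: List[Campaign]) -> Dict[str, float]:
    """Share of first successful rounds that contain each strategy."""
    hits: Counter = Counter()
    n_success = 0
    for campaign in campaigns:
        first = next((s for s, accepted in campaign if accepted), None)
        if first is None:
            continue  # this campaign never produced an accepted round
        n_success += 1
        hits.update(first)
    return {s: c / n_success for s, c in hits.items()} if n_success else {}


def accepted_exposure_rate(campaigns: List[Campaign]) -> Dict[str, float]:
    """Share of each strategy's rounds that were accepted as the new best."""
    used: Counter = Counter()
    won: Counter = Counter()
    for campaign in campaigns:
        for strategies, accepted in campaign:
            used.update(strategies)
            if accepted:
                won.update(strategies)
    return {s: won[s] / used[s] for s in used}


def baseline_acceptance(campaigns: List[Campaign]) -> float:
    """Overall share of rounds accepted, regardless of strategy."""
    flags = [accepted for campaign in campaigns for _, accepted in campaign]
    return sum(flags) / len(flags)


# Toy log with two campaigns (illustrative only).
toy_campaigns = [
    [({"contribution_list"}, True), ({"related_work"}, True), ({"table_formatting"}, False)],
    [({"table_formatting"}, False), ({"contribution_list", "abstract"}, True), ({"related_work"}, False)],
]
```

A strategy whose accepted-exposure rate falls below `baseline_acceptance` does worse than an average round, which is how the surface edits are characterized above.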

Case study · one campaign, start to finish

Anatomy of a case study.

A paper proposing an information-theoretic metric for low-dimensional embedding quality (Shannon entropy + stable rank), validated on two datasets across five dimensionality-reduction methods.

Claude Sonnet 4.5 · ICLR blind review · temp 0.3
BEFORE: 3/10 (Reject)
AFTER: 6/10 (Weak Accept)
Soundness 2→3 · Presentation 2→3 · Contribution 2→3
Strengths 5 → 6 · Weaknesses 6 → 5

Same 2 datasets · 5 methods · every number unchanged. Only the presentation moved.

BEFORE · skeptical framing

This paper introduces an information-theoretic metric for evaluating the quality of low-dimensional embeddings. The authors argue that existing metrics focus on geometric distortions but do not directly assess information preservation. The metric is validated on two datasets and five reduction methods, showing strong average correlation with a geometric baseline but significant local discrepancies.

AFTER · premise accepted

This paper introduces an information-theoretic metric for evaluating low-dimensional embedding quality. Unlike existing metrics that focus on geometric distortions, it quantifies information preservation via Shannon entropy and stable rank. Experiments across five reduction methods on synthetic and real-world data demonstrate that (i) distance preservation does not imply information preservation; (ii) strong global correlation (|ρ|=0.96) yet local divergence; (iii) global averages mask pathological neighborhoods.

The reviewer is not evaluating the paper's contributions; it is relaying the paper's claims about them, adopting the manuscript's own phrasing as if it were independent judgment.

Novelty claim

strategy · Contribution list enhancement + Abstract reframing
before

Addresses a genuine gap in dimensionality-reduction evaluation by proposing an explicitly information-theoretic quality metric.

after

Novel information-theoretic perspective: the first embedding-quality metric explicitly grounded in information theory, addressing a fundamental gap.

No new evidence: the same contribution, stated more assertively. The reviewer upgrades its verdict straight from the paper's own wording, without independently verifying the claim.

Experiments

strategy · Preemptive framing
before

Experimental validation includes comparison with established metrics on synthetic and real-world data.

after

Comprehensive experimental validation across five reduction methods, demonstrating consistent behavior across complementary data regimes.

The same two datasets, pre-described as “complementary data regimes”; the reviewer adopts the phrase as comprehensiveness.

Theory

strategy · Theoretical formalization
before

The connection between stable rank, entropy, and information content is conceptually interesting.

after

Thorough comparison with existing metrics: systematically compares the metric against established baselines.

An added proposition restates an existing bound (no new math); “conceptually interesting” becomes “systematically compares.”

LEDGER
Strengths: 5 → 6 = 5 upgraded + 1 flipped from a weakness
Weaknesses: 6 → 5 = 3 softened + 3 removed (one flipped to a strength) → final 5 = 3 softened carried + 2 newly planted
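
The ledger's counts can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones stated in the ledger.

```python
# Counts taken from the ledger; names are illustrative.
strengths_before, weaknesses_before = 5, 6

upgraded = 5               # original strengths, restated more favorably
flipped_from_weakness = 1  # a former weakness now listed as a strength
softened = 3               # weaknesses kept, but in milder wording
removed = 3                # weaknesses dropped (one of them is the flip above)
newly_planted = 2          # weaknesses that appear only after the attack

strengths_after = upgraded + flipped_from_weakness
weaknesses_after = softened + newly_planted

# Every original item is accounted for.
assert upgraded == strengths_before
assert softened + removed == weaknesses_before

print(strengths_after, weaknesses_after)  # 6 5
```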

Across every mechanism the logic is identical: the AI reviewer equates the appearance of having addressed an issue with actually having resolved it. An explanation counts as a methodological decision; a proposition counts as a theoretical contribution; an acknowledged limitation counts as a resolved one.

Every line above is real reviewer output from a single attack campaign; the only editorial act is selection, so the counts close. The target paper is kept anonymous, exactly as in the paper.
Appendix · transferability

Not tied to one model or template: it transfers.

An attack optimized against one reviewer still works when a different model re-scores the paper, and across other venues’ guidelines. Every off-diagonal entry stays positive.

Cross-model ΔS (rows: optimized against · columns: re-scored by)
[Heatmap over Sonnet 4, Sonnet 4.5, and GPT-5-mini; color scale 0 to +2.3]
matched +1.42 · mismatched +0.88

Cross-template ΔS (reviewer: Sonnet 4.5)
ICLR (scale 1–10, optimized): +1.20
NeurIPS (scale 1–6): +0.60
ICML (scale 1–6): +0.53

ICLR-optimized papers re-scored under other venues' guidelines (reviewer: Sonnet 4.5). Scales differ (ICLR 1–10; NeurIPS/ICML 1–6), so absolute ΔS is not directly comparable across templates, but it stays positive within each.

All off-diagonal entries are positive, so attacks transfer across models. Matched mean +1.42 vs mismatched mean +0.88.
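
The matched and mismatched means are the diagonal and off-diagonal averages of the cross-model ΔS matrix. The sketch below uses placeholder matrix values, which are NOT the reported per-cell results (those are not reproduced on this page); only the computation is the point.

```python
from statistics import mean
from typing import List, Tuple

MODELS = ["Sonnet 4", "Sonnet 4.5", "GPT-5-mini"]

# Placeholder ΔS values for illustration only (rows: optimized against,
# columns: re-scored by). These are not the paper's numbers.
toy_matrix = [
    [1.5, 0.5, 1.0],
    [0.5, 1.0, 0.5],
    [1.0, 0.5, 2.0],
]


def transfer_summary(matrix: List[List[float]]) -> Tuple[float, float, bool]:
    """Return (matched mean, mismatched mean, all off-diagonal entries positive?)."""
    n = len(matrix)
    matched = [matrix[i][i] for i in range(n)]
    mismatched = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return mean(matched), mean(mismatched), all(v > 0 for v in mismatched)


matched_mean, mismatched_mean, transfers = transfer_summary(toy_matrix)
```

Transfer in the sense used above corresponds to `transfers` being true: every cell where the optimizing and re-scoring models differ still shows a positive shift.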

The benchmark

A contamination-free testbed for AI-review robustness.

We release the full construction pipeline, not a frozen snapshot: a contamination-free, rolling benchmark of unpublished papers that refreshes as models evolve, with source and PDF paired to mirror the real review workflow.

What makes it a clean testbed

Contamination-free: Unpublished arXiv preprints only, dual-verified against publication records, so the target papers aren't already in the reviewer models' training data.
Rolling: The construction pipeline is fully automated and re-runnable, so the benchmark refreshes with new preprints and never goes stale as models improve.
Realistic AI peer-review workflow: Every paper ships as paired LaTeX source + compiled PDF: the attacker edits the source, the reviewer evaluates the rendered PDF, matching the real AI-assisted review setup.
Diverse & representative: Spans ML, CV, and NLP, multi-stage filtered to genuine research submissions with review scores in a representative, decision-relevant range, not surveys, reports, or low-quality submissions.

A re-runnable construction pipeline

Released as open code: fully automated, no manual curation. Re-run it anytime to rebuild a fresh, uncontaminated set.

1. Collect: Fetch recent arXiv preprints (ML / CV / NLP) that ship both a compiled PDF and editable LaTeX source.
2. Dedup & metadata screen: Deduplicate versions, require multiple authors, and keyword-filter out surveys, technical reports, and position papers.
3. Publication-status check: Exclude already-published work, dual-verified via arXiv metadata (journal ref / DOI) and Semantic Scholar (the contamination guard).
4. Download & validate: Download the paired LaTeX source and PDF, keeping papers whose page count falls in range.
5. LLM quality & review filter: Filter with an LLM quality check and multiple AI reviewers, retaining sound papers within a moderate-score range.
The construction pipeline is released. Re-run it to rebuild a fresh, contamination-free benchmark as models evolve; the data stays rolling, never a frozen snapshot.
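
Steps 2 to 4 of the pipeline above can be sketched as plain filters over metadata records. Everything specific here is an assumption for illustration: the `Preprint` fields, the keyword list, the page range, and the `found_published` flag (standing in for the result of a Semantic Scholar lookup). The released pipeline may differ, and the collection and LLM-filter steps are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Preprint:
    arxiv_id: str
    version: int
    title: str
    authors: List[str]
    journal_ref: Optional[str]  # arXiv metadata: set once the work is published
    doi: Optional[str]          # arXiv metadata: set once the work is published
    found_published: bool       # hypothetical result of an external lookup
    pages: int


EXCLUDE_KEYWORDS = ("survey", "technical report", "position paper")  # assumed list


def latest_versions(papers: List[Preprint]) -> List[Preprint]:
    """Deduplicate by arXiv id, keeping the highest version of each paper."""
    latest: Dict[str, Preprint] = {}
    for p in papers:
        if p.arxiv_id not in latest or p.version > latest[p.arxiv_id].version:
            latest[p.arxiv_id] = p
    return list(latest.values())


def passes_metadata_screen(p: Preprint) -> bool:
    """Require multiple authors and no excluded keyword in the title."""
    title = p.title.lower()
    return len(p.authors) >= 2 and not any(k in title for k in EXCLUDE_KEYWORDS)


def is_unpublished(p: Preprint) -> bool:
    """Dual check: arXiv metadata and the external lookup must both be clean."""
    return p.journal_ref is None and p.doi is None and not p.found_published


def in_page_range(p: Preprint, lo: int = 6, hi: int = 40) -> bool:
    """Keep papers whose page count is in an assumed range."""
    return lo <= p.pages <= hi


def build_candidates(papers: List[Preprint]) -> List[Tuple[str, int]]:
    """Apply steps 2 to 4 in order and return (arxiv_id, version) pairs."""
    return [
        (p.arxiv_id, p.version)
        for p in latest_versions(papers)
        if passes_metadata_screen(p) and is_unpublished(p) and in_page_range(p)
    ]
```

Keeping each step as its own predicate makes the pipeline easy to re-run and to audit: a rejected paper can be traced to the exact filter that excluded it.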
Conclusion

If presentation alone can flip the verdict, the incentive shifts from improving the research to optimizing the repackaging.

Current AI reviewers fail the necessary condition: with the science held fixed, presentation edits alone raise scores across models and templates. The deficiency is structural, and as AI review scales, it could distort the incentive structure of peer review itself.