
SimKey: A Semantically Aware Key Module for Watermarking Language Models

Shingo Kodama 1, Haya Diwan 2, Lucas Rosenblatt 2, R. Teal Witter 3, Niv Cohen 2

1 citation · 28 references · arXiv


Published on arXiv · 2510.12828

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SimKey improves watermark robustness to paraphrasing and translation while preventing false attribution of adversarially appended harmful content, when integrated with ExpMin, SynthID, and WaterMax mark modules.

SimKey

Novel technique introduced


The rapid spread of text generated by large language models (LLMs) makes it increasingly difficult to distinguish authentic human writing from machine output. Watermarking offers a promising solution: model owners can embed an imperceptible signal into generated text, marking its origin. Most leading approaches seed an LLM's next-token sampling with a pseudo-random key that can later be recovered to identify the text as machine-generated, while only minimally altering the model's output distribution. However, these methods suffer from two related issues: (i) watermarks are brittle to simple surface-level edits such as paraphrasing or reordering; and (ii) adversaries can append unrelated, potentially harmful text that inherits the watermark, risking reputational damage to model owners. To address these issues, we introduce SimKey, a semantic key module that strengthens watermark robustness by tying key generation to the meaning of prior context. SimKey uses locality-sensitive hashing over semantic embeddings to ensure that paraphrased text yields the same watermark key, while unrelated or semantically shifted text produces a different one. Integrated with state-of-the-art watermarking schemes, SimKey improves watermark robustness to paraphrasing and translation while preventing false attribution of harmful content, establishing semantic-aware keying as a practical and extensible watermarking direction.
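The key-seeded sampling described above can be illustrated with a toy green-list-style watermark (a common scheme of this family, not the paper's ExpMin, SynthID, or WaterMax mark modules; the function names and parameters below are hypothetical): the key seeds a PRNG that partitions the vocabulary, generation favors "green" tokens, and detection counts how many observed tokens are green.

```python
import random

def greenlist(key: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Toy key-seeded watermark partition (illustrative only): the key
    seeds a PRNG that selects a fraction gamma of the vocabulary as
    'green' tokens, which generation is biased toward."""
    rng = random.Random(key)               # key fully determines the partition
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def green_fraction(token_ids, key: int, vocab_size: int) -> float:
    """Detection statistic: fraction of observed tokens in the key's
    green list; watermarked text scores well above gamma."""
    g = greenlist(key, vocab_size)
    return sum(t in g for t in token_ids) / len(token_ids)
```

Under SimKey, the `key` fed to such a mark module would be derived from the semantics of the prior context, so a paraphrase reproduces the same green list while appended unrelated text does not.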


Key Contributions

  • SimKey: a semantic key module using SimHash (locality-sensitive hashing) over contextual embeddings to tie watermark keys to the meaning of prior context rather than surface-level token patterns
  • Robustness to meaning-preserving transformations (paraphrasing, translation) because semantically similar contexts produce the same watermark key
  • Resistance to false attribution attacks where adversaries append unrelated or harmful text — semantic shift causes key change, breaking the watermark signal on appended content
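The SimHash keying idea in the first bullet can be sketched as follows (a minimal illustration assuming a fixed embedding dimension and hyperplane count; `simhash_key` and its signature are hypothetical, not the paper's API): each random hyperplane contributes one bit of the key, the sign of its dot product with the context embedding.

```python
import numpy as np

def simhash_key(embedding: np.ndarray, hyperplanes: np.ndarray, salt: int = 0) -> int:
    """Sketch of SimHash (random-hyperplane LSH) key derivation. Nearby
    embeddings (e.g. paraphrases) fall on the same side of most
    hyperplanes, so they map to the same key with high probability;
    semantically shifted text flips bits and yields a different key."""
    bits = (hyperplanes @ embedding) >= 0   # LSH bucket: one sign bit per hyperplane
    return hash((salt, bits.tobytes()))     # bucket -> integer sampling key

rng = np.random.default_rng(0)
H = rng.standard_normal((16, 384))          # 16 hyperplanes over 384-dim embeddings

ctx = rng.standard_normal(384)                        # embedding of prior context
paraphrase = ctx + 0.01 * rng.standard_normal(384)    # small semantic drift
unrelated = rng.standard_normal(384)                  # appended off-topic text

# With high probability the paraphrase gets the same key as the original
# context, while unrelated content lands in a different bucket
# (16 bits -> ~2^-16 chance of an accidental collision).
keys = [simhash_key(v, H) for v in (ctx, paraphrase, unrelated)]
```

Note that scaling an embedding never changes its key (all signs are preserved), while negating it flips every bit; the hyperplane count trades off paraphrase tolerance against collision probability.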

🛡️ Threat Analysis

Output Integrity Attack

Embeds imperceptible signals into LLM text outputs to trace provenance and detect AI-generated content. The watermark lives in the generated text (outputs), not in model weights, making this classic output-integrity/content watermarking. The work also addresses watermark robustness to adversarial paraphrase attacks and false attribution of appended harmful content.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Applications
llm-generated text provenance, ai text detection, content attribution