attack 2025

Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits

Robert Dilworth

1 citation · 14 references · arXiv


Published on arXiv

2512.03465

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Function word frequencies, content word distributions, and the Type-Token Ratio are the strongest stylometric indicators of adversarial manipulation, and enhancements to TraceTarnish that target these features further weaken authorship attribution.

TraceTarnish

Novel technique introduced


In this study, we more rigorously evaluated our attack script, TraceTarnish, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To assess the efficacy and utility of the attack, we sourced, processed, and analyzed Reddit comments, which were then transformed into TraceTarnish data. The transformed data was further augmented by StyloMetrix to generate stylometric features, and these features were culled using the Information Gain criterion, leaving only the most informative, predictive, and discriminative ones.

Our results showed that function words and function word types (L_FUNC_A and L_FUNC_T), content words and content word types (L_CONT_A and L_CONT_T), and the Type-Token Ratio (ST_TYPE_TOKEN_RATIO_LEMMAS) yielded significant Information Gain readings. These stylometric cues (function-word frequencies, content-word distributions, and the Type-Token Ratio) serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. The same features could also function as forensic beacons, alerting defenders to the presence of an adversarial stylometry attack. In the absence of the original message, however, this signal may go largely unnoticed, as it appears to depend on a pre- and post-transformation comparison. "In trying to erase a trace, you often imprint a larger one."

Armed with this understanding, we framed TraceTarnish's operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.
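The Information Gain criterion mentioned above can be illustrated with a minimal sketch. It assumes a single numeric feature split at one threshold and two class labels ("original" and "transformed"). The function names, the threshold, and the toy values are invented for illustration and are not taken from the paper or from StyloMetrix.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Information Gain from splitting a numeric feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    # Weighted entropy of the two partitions (empty partitions are skipped).
    remainder = sum(
        len(part) / total * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - remainder


# Toy data (hypothetical, not from the paper): one feature that separates
# the classes perfectly and one that carries no class information.
labels = ["original"] * 3 + ["transformed"] * 3
separating = [0.42, 0.45, 0.48, 0.61, 0.66, 0.70]
uninformative = [0.1, 0.9, 0.2, 0.1, 0.9, 0.2]

ig_separating = information_gain(separating, labels, threshold=0.55)
ig_uninformative = information_gain(uninformative, labels, threshold=0.5)
print(ig_separating, ig_uninformative)
```

A feature whose split leaves each partition with a single class gets the maximum gain for a balanced two-class problem (1 bit), while a feature whose partitions remain evenly mixed gets a gain of zero. Ranking features by this score and keeping the highest-scoring ones is the general idea behind Information Gain feature selection.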


Key Contributions

  • TraceTarnish attack script combining round-trip machine translation, paraphrasing, and Unicode zero-width character steganography to anonymize authorship of text messages
  • Identification via Information Gain analysis of five key stylometric features (function words and function word types, content words and content word types, and the lemma-based Type-Token Ratio) as indicators of compromise (IoCs) revealing adversarial stylometric manipulation
  • Feature-guided enhancements to TraceTarnish that exploit identified IoCs to further strengthen authorship anonymization against stylometric defenses
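As a rough illustration of the feature families named in the contributions above, the following sketch computes a Type-Token Ratio and function-word and content-word proportions for a text, so that a pre- and post-transformation pair can be compared. It is a simplification under stated assumptions: it uses surface tokens rather than lemmas, a tiny hand-picked function-word list, and hypothetical helper names. It does not reproduce how StyloMetrix derives its metrics.

```python
import re

# Assumption: a small illustrative function-word list, far from exhaustive.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "to", "in", "on", "and", "but", "or",
    "is", "it", "that", "for", "with",
}


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def profile(text):
    """Return a small stylometric profile of `text`."""
    tokens = tokenize(text)
    if not tokens:
        return {"ttr": 0.0, "func_ratio": 0.0, "cont_ratio": 0.0}
    func = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    return {
        "ttr": len(set(tokens)) / len(tokens),  # unique tokens / all tokens
        "func_ratio": func,                     # share of function words
        "cont_ratio": 1.0 - func,               # everything else
    }


def profile_shift(before, after):
    """Per-feature difference between two texts (after minus before)."""
    p_before, p_after = profile(before), profile(after)
    return {name: p_after[name] - p_before[name] for name in p_before}


example = profile("The cat sat on the mat")
print(example)
```

Calling `profile_shift(original_text, transformed_text)` gives the kind of before-and-after comparison the abstract describes; without the original text, only the single profile is available.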

🛡️ Threat Analysis

Input Manipulation Attack

TraceTarnish crafts adversarial text inputs through round-trip machine translation, paraphrasing, and imperceptible Unicode steganographic noise, causing stylometric ML classifiers to fail to attribute authorship correctly. This is a natural-language, inference-time evasion attack against an NLP classification system.
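From the defender's side, the imperceptible Unicode noise described above can be surfaced by scanning for zero-width code points before stylometric features are extracted. The sketch below is a minimal example under assumptions: the set of code points is a common selection and is not stated in the source, and the helper names are hypothetical.

```python
# Assumption: a commonly cited set of zero-width code points; the source
# does not specify which characters TraceTarnish actually inserts.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE",
}


def find_zero_width(text):
    """Return (index, name) pairs for every zero-width character in `text`."""
    return [(i, ZERO_WIDTH[ch]) for i, ch in enumerate(text) if ch in ZERO_WIDTH]


def strip_zero_width(text):
    """Return `text` with all listed zero-width characters removed."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


sample = "hel\u200blo wor\u200dld"
hits = find_zero_width(sample)
clean = strip_zero_width(sample)
print(hits, repr(clean))
```

Stripping these characters before tokenization keeps them from silently splitting words and skewing token-based features, although it does nothing about the translation and paraphrasing stages of the attack.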


Details

Domains
nlp
Model Types
traditional_ml
Threat Tags
black_box, inference_time, targeted, digital
Datasets
Reddit comments
Applications
authorship attribution, stylometric analysis, online anonymization