Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits

In this study, we more rigorously evaluated our attack script $\textit{TraceTarnish}$, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To assess the efficacy and utility of the attack, we sourced, processed, and analyzed Reddit comments, which were then transformed into $\textit{TraceTarnish}$ data. The transformed data was further augmented by $\textit{StyloMetrix}$ to generate stylometric features, which were then filtered using the Information Gain criterion so that only the most informative, predictive, and discriminative ones remained. Our results show that function words and function word types ($L\_FUNC\_A$ $\&$ $L\_FUNC\_T$); content words and content word types ($L\_CONT\_A$ $\&$ $L\_CONT\_T$); and the Type-Token Ratio ($ST\_TYPE\_TOKEN\_RATIO\_LEMMAS$) yielded significant Information Gain readings. These stylometric cues (function-word frequencies, content-word distributions, and the Type-Token Ratio) serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. They could likewise function as forensic beacons that alert defenders to the presence of an adversarial stylometry attack; however, in the absence of the original message, this signal may go largely unnoticed, as it appears to depend on comparing the text before and after transformation. "In trying to erase a trace, you often imprint a larger one." Armed with this understanding, we framed $\textit{TraceTarnish}$'s operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.
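The Information Gain filtering step can be illustrated with a small, self-contained sketch. It assumes each comment is a row of numeric stylometric feature values labelled 0 (original) or 1 (transformed), and it discretizes each feature with equal-width bins before computing IG. The binning scheme, the `NOISE` column, and all feature values are hypothetical; this is not the study's pipeline and does not call StyloMetrix.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, n_bins=4):
    """IG of one numeric feature w.r.t. the labels after equal-width binning.

    Equal-width binning is an assumption made for this sketch; the study's
    actual discretization scheme is not stated.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for a constant feature
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(labels)
    conditional = 0.0
    for b in set(bins):
        subset = [y for y, bb in zip(labels, bins) if bb == b]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def rank_features(rows, labels, n_bins=4):
    """Score every feature column and sort by Information Gain, highest first."""
    scores = {
        name: information_gain([row[name] for row in rows], labels, n_bins)
        for name in rows[0]
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy values: label 0 = original comment, 1 = transformed comment.
rows = [
    {"L_FUNC_A": 0.52, "NOISE": 0.30},
    {"L_FUNC_A": 0.50, "NOISE": 0.70},
    {"L_FUNC_A": 0.55, "NOISE": 0.50},
    {"L_FUNC_A": 0.31, "NOISE": 0.40},
    {"L_FUNC_A": 0.28, "NOISE": 0.60},
    {"L_FUNC_A": 0.30, "NOISE": 0.35},
]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(rows, labels)
print(ranking)
```

In this toy data the function-word column separates the two classes perfectly, so its IG equals the full label entropy (1 bit), while the uninformative column scores lower; a cut-off on these scores is what "culling" the feature set amounts to.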

Key Contributions

TraceTarnish attack script combining round-trip machine translation, paraphrasing, and Unicode zero-width character steganography to anonymize authorship of text messages
Identification, via Information Gain analysis, of five key stylometric features (function words and function word types, content words and content word types, and the lemma-based Type-Token Ratio) as indicators of compromise (IoCs) that reveal adversarial stylometric manipulation
Feature-guided enhancements to TraceTarnish that exploit the identified IoCs to further strengthen authorship anonymization against stylometric defenses
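Because the IoC signal appears to depend on a pre- and post-transformation comparison, a defender holding both texts could diff simple stylometric profiles. The sketch below is a rough stand-in for the five features: it uses a tiny hand-picked function-word list instead of part-of-speech tagging, surface tokens instead of lemmas, and normalizes type counts by token count. The word list, the normalization, the 0.1 threshold, and the helper names (`profile`, `flag_shift`) are all hypothetical and do not reproduce StyloMetrix's definitions.

```python
import re

# Hypothetical, deliberately tiny function-word list; a real system would use
# part-of-speech tags rather than a fixed list.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at", "for",
    "with", "is", "are", "was", "were", "it", "that", "this", "i", "you", "we",
}


def tokenize(text):
    """Lowercase word tokens (surface forms, not lemmas)."""
    return re.findall(r"[a-z']+", text.lower())


def profile(text):
    """Rough analogues of the five features, each normalized by token count."""
    tokens = tokenize(text)
    func = [t for t in tokens if t in FUNCTION_WORDS]
    cont = [t for t in tokens if t not in FUNCTION_WORDS]
    n = len(tokens) or 1  # guard against empty input
    return {
        "func_ratio": len(func) / n,       # ~ L_FUNC_A
        "func_types": len(set(func)) / n,  # ~ L_FUNC_T (normalization assumed)
        "cont_ratio": len(cont) / n,       # ~ L_CONT_A
        "cont_types": len(set(cont)) / n,  # ~ L_CONT_T (normalization assumed)
        "ttr": len(set(tokens)) / n,       # surface-form TTR, not lemma-based
    }


def flag_shift(original, transformed, threshold=0.1):
    """Return per-feature deltas and the features whose shift exceeds threshold."""
    before, after = profile(original), profile(transformed)
    deltas = {k: after[k] - before[k] for k in before}
    flagged = [k for k, d in deltas.items() if abs(d) >= threshold]
    return deltas, flagged


deltas, flagged = flag_shift(
    "I think that the plan is good and we should do it",
    "Plan good, proceed now.",
)
print(flagged)  # ['func_ratio', 'func_types', 'cont_ratio', 'cont_types']
```

Without the original message, `flag_shift` has nothing to compare against, which mirrors the limitation noted in the abstract.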

🛡️ Threat Analysis

Input Manipulation Attack

TraceTarnish crafts adversarial text inputs through round-trip machine translation, paraphrasing, and imperceptible Unicode steganographic noise, causing stylometric ML classifiers to misattribute authorship. This makes it a natural-language, inference-time evasion attack against an NLP classification system.
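The "imperceptible Unicode steganographic noise" component can be pictured as interleaving zero-width code points into the text: the rendered string looks unchanged, but a token such as "quick" may no longer match its clean form in a word- or character-based model. This is a generic sketch, not TraceTarnish's implementation; the chosen code points (U+200B ZERO WIDTH SPACE, U+200C ZERO WIDTH NON-JOINER, U+200D ZERO WIDTH JOINER, U+2060 WORD JOINER), the insertion rate, and the function names are assumptions. The same code-point check also gives defenders a trivial detector and sanitizer.

```python
import random

# Assumed set of zero-width code points; the characters TraceTarnish actually
# uses are not specified here.
ZERO_WIDTH = ["\u200b", "\u200c", "\u200d", "\u2060"]


def add_zero_width_noise(text, rate=0.2, seed=0):
    """After each character, insert a random zero-width char with probability `rate`."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(ZERO_WIDTH))
    return "".join(out)


def count_zero_width(text):
    """Defender-side check: how many zero-width code points are present."""
    return sum(text.count(z) for z in ZERO_WIDTH)


def strip_zero_width(text):
    """Defender-side sanitizer: remove every zero-width code point."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


original = "the quick brown fox"
noisy = add_zero_width_noise(original, rate=0.3, seed=1)
print(len(original), len(noisy), count_zero_width(noisy))
print(strip_zero_width(noisy) == original)  # True
```

Stripping restores the original exactly, so this layer alone is fragile against a sanitizing defender; its value to the attack lies in combination with the translation and paraphrasing steps.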

Details

Domains

nlp

Model Types

traditional_ml

Threat Tags

black_box, inference_time, targeted, digital

Datasets

Reddit comments

2025 · 1 citation