defense 2025

Localizing Malicious Outputs from CodeLLM

Mayukh Borana , Junyi Liang , Sai Sathiesh Rajan , Sudipta Chattopadhyay

0 citations · 28 references · EMNLP

α

Published on arXiv

2509.17070

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

FreqRank localizes malicious Code LLM outputs within the top-5 ranked suggestions in 98% of cases and is 35-50% more effective than competing defense methods across nine backdoored models.

FreqRank

Novel technique introduced


We introduce FreqRank, a mutation-based defense to localize malicious components in LLM outputs and their corresponding backdoor triggers. FreqRank assumes that the malicious sub-string(s) consistently appear in outputs for triggered inputs and uses a frequency-based ranking system to identify them. Our ranking system then leverages this knowledge to localize the backdoor triggers present in the inputs. We create nine malicious models through fine-tuning or custom instructions for three downstream tasks, namely, code completion (CC), code generation (CG), and code summarization (CS), and show that they have an average attack success rate (ASR) of 86.6%. Furthermore, FreqRank's ranking system highlights the malicious outputs as one of the top five suggestions in 98% of cases. We also demonstrate that FreqRank's effectiveness scales as the number of mutants increases and show that FreqRank is capable of localizing the backdoor trigger effectively even with a limited number of triggered samples. Finally, we show that our approach is 35-50% more effective than other defense methods.


Key Contributions

  • FreqRank: a mutation-based defense that exploits the consistency of malicious sub-strings across triggered inputs to rank and localize backdoor outputs via frequency analysis
  • Nine backdoored Code LLMs created via fine-tuning or custom instructions across code completion, generation, and summarization tasks, achieving an average ASR of 86.6%
  • FreqRank surfaces the malicious output in top-5 suggestions in 98% of cases and outperforms existing defense methods by 35-50%, with effectiveness scaling with the number of mutants

🛡️ Threat Analysis

Model Poisoning

Paper creates 9 backdoored Code LLMs via fine-tuning or custom instructions (avg ASR 86.6%) and proposes FreqRank as a defense to localize malicious sub-strings in triggered outputs and reverse-engineer the corresponding backdoor triggers — a direct backdoor/trojan detection contribution.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timeinference_timetargeted
Applications
code completioncode generationcode summarization