Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services
Fabrizio Dimino, Bhaskarjit Sarmah, Stefano Pasquali
Published on arXiv
2603.10807
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Sustained multi-round adaptive red-teaming with higher-temperature decoding not only increases jailbreak success rates but also drives systematic escalation toward more severe and operationally actionable financial disclosures, exposing critical limitations of single-turn, domain-agnostic evaluation.
RAHS (Risk-Adjusted Harm Score)
Novel technique introduced
The rapid adoption of large language models (LLMs) in financial services introduces new operational, regulatory, and security risks. Yet most red-teaming benchmarks remain domain-agnostic and fail to capture failure modes specific to regulated Banking, Financial Services, and Insurance (BFSI) settings, where harmful behavior can be elicited through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in BFSI, combining a domain-specific taxonomy of financial harms, an automated multi-round red-teaming pipeline, and an ensemble-based judging protocol. We introduce the Risk-Adjusted Harm Score (RAHS), a risk-sensitive metric that goes beyond success rates by quantifying the operational severity of disclosures, accounting for mitigation signals, and leveraging inter-judge agreement. Across diverse models, we find that higher decoding stochasticity and sustained adaptive interaction not only increase jailbreak success, but also drive systematic escalation toward more severe and operationally actionable financial disclosures. These results expose limitations of single-turn, domain-agnostic security evaluation and motivate risk-sensitive assessment under prolonged adversarial pressure for real-world BFSI deployment.
Key Contributions
- Risk-Adjusted Harm Score (RAHS): a domain-sensitive metric that weights each jailbreak by the operational severity of the disclosure, accounts for mitigation signals, and incorporates inter-judge agreement, instead of relying on binary success rates alone
- Domain-specific BFSI harm taxonomy capturing legally/professionally plausible adversarial framings that bypass domain-agnostic safety evaluations
- Empirical finding that higher decoding stochasticity combined with sustained multi-round adversarial interaction systematically escalates jailbreak severity toward operationally actionable financial disclosures
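
To make the three ingredients of a risk-adjusted score concrete, here is a minimal Python sketch. It is not the RAHS formula from the paper, which this summary does not state. The names `JudgeVerdict` and `illustrative_risk_score`, the 0-to-1 severity scale, the `mitigation_discount` parameter, and the spread-based agreement term are all assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgeVerdict:
    """One judge's assessment of a single model response (hypothetical schema)."""
    severity: float   # assumed scale: 0.0 (harmless) to 1.0 (most severe)
    mitigated: bool   # True if the judge saw a mitigation signal, e.g. a warning


def illustrative_risk_score(verdicts, mitigation_discount=0.5):
    """Combine severity, mitigation, and judge agreement into one number.

    Assumed recipe, NOT the paper's definition:
      1. discount each judge's severity when a mitigation signal is present;
      2. average the discounted severities across the ensemble;
      3. scale by agreement, taken here as 1 minus the spread of the scores.
    """
    if not verdicts:
        raise ValueError("at least one judge verdict is required")

    adjusted = []
    for v in verdicts:
        factor = (1.0 - mitigation_discount) if v.mitigated else 1.0
        adjusted.append(v.severity * factor)

    # Scores lie in [0, 1], so the spread does too; identical scores give 1.0.
    agreement = 1.0 - (max(adjusted) - min(adjusted))
    return mean(adjusted) * agreement


if __name__ == "__main__":
    ensemble = [
        JudgeVerdict(severity=0.9, mitigated=False),
        JudgeVerdict(severity=0.8, mitigated=False),
        JudgeVerdict(severity=0.9, mitigated=True),
    ]
    print(f"illustrative score: {illustrative_risk_score(ensemble):.3f}")
```

Under these assumptions, a binary success rate would count every response in the example ensemble as the same kind of failure, while the sketch lowers the score when a judge sees mitigation and when the judges disagree. Any real implementation should follow the definitions given in the paper itself.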