Mrinank Sharma

h-index: 10 · 1,407 citations · 20 papers (total)

Papers in Database (3)

defense arXiv Jan 8, 2026

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Hoagy Cunningham, Jerry Wei, Zihan Wang et al. · Anthropic

Defends LLMs against universal jailbreaks using cascaded exchange classifiers and linear probes, reducing costs 40x with a near-zero refusal rate

Prompt Injection nlp
6 citations

attack arXiv Jan 20, 2026

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Jackson Kaunismaa, Avery Griffin, John Hughes et al. · MATS · Anthropic +1 more

Bypasses frontier LLM safeguards via adjacent-domain prompts, then fine-tunes open-source models to elicit hazardous chemical synthesis capabilities

Transfer Learning Attack Prompt Injection nlp
4 citations

attack arXiv Oct 30, 2025

Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer et al. · Independent Researcher · Stanford University +3 more

Jailbreaks large reasoning models (LRMs) by prepending benign puzzle reasoning that dilutes safety refusal signals

Prompt Injection nlp
3 citations