Latest papers

4 papers
defense · arXiv · Feb 23, 2026

BarrierSteer: LLM Safety via Learning Barrier Steering

Thanh Q. Tran, Arun Verma, Kiwan Wong et al. · National University of Singapore · Singapore-MIT Alliance for Research and Technology Centre +2 more

Defends LLMs against jailbreaks and adversarial attacks by enforcing control-barrier-function (CBF) safety constraints in the latent representation space at inference time

Input Manipulation Attack · Prompt Injection · nlp
PDF
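The CBF idea in the summary can be illustrated with a toy sketch (an assumption on my part; the paper learns its barrier and steering, whereas here the barrier is a fixed linear function and the steering is a closed-form projection — `steer`, `w`, `b`, and all values are hypothetical):

```python
import numpy as np

def steer(x, w, b, margin=0.0):
    """Minimally correct latent state x so the linear barrier
    h(x) = w.x + b stays >= margin (toy stand-in for a learned
    control barrier function applied at inference time)."""
    h = float(w @ x + b)
    if h >= margin:
        return x  # already safe: no intervention
    # minimal-norm correction along w restores h(x) = margin exactly
    return x + (margin - h) * w / float(w @ w)

# toy 3-d latent space with barrier h(x) = x[0] - 1
w = np.array([1.0, 0.0, 0.0])
b = -1.0
x_unsafe = np.array([0.2, 0.5, -0.3])
x_safe = steer(x_unsafe, w, b)
print(round(float(w @ x_safe + b), 6))  # 0.0 after steering
```

The projection only intervenes when the barrier is violated, which mirrors the inference-time flavor of the approach: safe activations pass through unchanged.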
attack · arXiv · Dec 12, 2025

Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously

Andrew Adiletta, Kathryn Adiletta, Kemal Derya et al. · MITRE · Worcester Polytechnic Institute

Adversarial token suffixes that simultaneously bypass LLM alignment and safety guard models via joint gradient optimization

Input Manipulation Attack · Prompt Injection · nlp
PDF
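The "joint gradient optimization" mechanism can be sketched in miniature. This is a continuous relaxation with two linear detectors standing in for the aligned model and the guard model (the paper attacks real LLMs over discrete tokens; the detectors, losses, and all names here are hypothetical), where one suffix vector is optimized against the summed loss of both:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_suffix_attack(w_align, w_guard, dim, steps=500, lr=0.5):
    """Toy joint attack: one suffix embedding s is driven to fool two
    linear detectors at once by descending the combined loss
    softplus(w_align.s) + softplus(w_guard.s)."""
    rng = np.random.default_rng(0)
    s = rng.normal(size=dim)
    for _ in range(steps):
        g_align = sigmoid(w_align @ s) * w_align  # grad of softplus(w_align.s)
        g_guard = sigmoid(w_guard @ s) * w_guard  # grad of softplus(w_guard.s)
        s -= lr * (g_align + g_guard)             # one joint gradient step
    return s

w_a = np.array([1.0, -0.5, 0.2])   # toy "alignment" detector
w_g = np.array([0.3, 0.8, -0.1])   # toy "guard" detector
s = joint_suffix_attack(w_a, w_g, dim=3)
print(bool(w_a @ s < 0 and w_g @ s < 0))  # True: both detectors report "benign"
```

The key design point the summary highlights is that the two losses share one gradient step, so a single suffix defeats both models rather than needing separate attacks.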
attack · arXiv · Nov 8, 2025

IndirectAD: Practical Data Poisoning Attacks against Recommender Systems for Item Promotion

Zihao Wang, Tianhao Mao, XiaoFeng Wang et al. · Nanyang Technological University · Indiana University Bloomington +2 more

Data poisoning attack that promotes target items in recommender systems via a trigger-item co-occurrence strategy, using only 0.05% fake users

Data Poisoning Attack · tabular
PDF
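The trigger-item co-occurrence idea can be sketched as fake-profile construction: each injected user interacts with popular "trigger" items plus the promoted target, so co-occurrence statistics link the target to items real users already engage with. This is a minimal sketch under assumptions (the paper's actual profile construction is optimized, not random sampling; `make_fake_users` and all values are hypothetical), keeping the 0.05% fake-user budget from the summary:

```python
import random

def make_fake_users(n_real_users, trigger_items, target_item,
                    budget=0.0005, profile_len=10, seed=0):
    """Generate fake user interaction profiles that pair trigger items
    with the target item. budget=0.0005 mirrors the 0.05% fake-user
    budget quoted in the summary."""
    rng = random.Random(seed)
    n_fake = max(1, int(n_real_users * budget))
    fakes = []
    for _ in range(n_fake):
        profile = rng.sample(trigger_items,
                             k=min(profile_len - 1, len(trigger_items)))
        profile.append(target_item)  # every fake profile co-occurs with target
        fakes.append(profile)
    return fakes

fakes = make_fake_users(1_000_000, trigger_items=list(range(100)),
                        target_item=999)
print(len(fakes), all(999 in p for p in fakes))  # 500 True
```

With a million real users, the 0.05% budget yields only 500 fake accounts, which is what makes this attack class practical against deployed systems.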
benchmark · arXiv · Oct 14, 2025

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour et al. · MIT · Broad Institute +6 more

Black-box evaluation framework measuring extractable patient-data memorization in healthcare EHR foundation models at the embedding and generative levels

Model Inversion Attack · tabular
1 citation PDF Code
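A generative-level memorization probe of the kind the summary describes can be sketched as an n-gram overlap check between model outputs and training records (an illustrative assumption: the paper's framework is richer and also probes embeddings; `extraction_rate` and the toy records are hypothetical):

```python
def extraction_rate(generated, training, n=3):
    """Fraction of generated records containing an n-token span that
    appears verbatim in the training corpus -- a simple black-box
    signal of extractable memorization."""
    train_ngrams = set()
    for rec in training:
        toks = rec.split()
        train_ngrams.update(tuple(toks[i:i + n])
                            for i in range(len(toks) - n + 1))
    hits = 0
    for rec in generated:
        toks = rec.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        if grams and any(g in train_ngrams for g in grams):
            hits += 1
    return hits / len(generated)

train = ["patient 17 diagnosed with type 2 diabetes",
         "patient 22 prescribed metformin 500mg"]
gen = ["patient 17 diagnosed with hypertension",
       "subject given aspirin daily dose"]
print(extraction_rate(gen, train))  # 0.5
```

Because the check only needs model outputs and a held reference corpus, it matches the black-box setting the paper targets: no access to weights or gradients is assumed.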