Latest papers

3 papers
defense arXiv Feb 16, 2026

Weight space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit et al. · Algoverse AI Research · University of Aberdeen +1 more

Detects backdoored LoRA adapters from SVD spectral statistics of their weight matrices, achieving 97% accuracy without executing the model

Model Poisoning · AI Supply Chain Attacks · nlp
defense arXiv Jan 18, 2026

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

Anirudh Sekar, Mrinal Agarwal, Rachel Sharma et al. · Algoverse AI Research · University of California

Defends LLM pipelines against prompt injection by using cosine similarity to detect semantic drift in embeddings, achieving over 93% accuracy in a zero-shot setting

Prompt Injection · nlp
defense arXiv Dec 12, 2025

Factor(U,T): Controlling Untrusted AI by Monitoring their Plans

Edward Lue Chee Lip, Anthony Channg, Diana Kim et al. · Algoverse AI Research · Colorado State University +1 more

Evaluates safety protocols for multi-agent LLM systems in which an untrusted decomposer can inject malicious subtask instructions that monitors cannot detect

Excessive Agency · Prompt Injection · nlp