Locket: Robust Feature-Locking Technique for Language Models
Lipeng He, Vasisht Duddu, N. Asokan
Published on arXiv
arXiv:2510.12117
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves 100% refusal on locked features, ≤7% utility degradation on unlocked features, and ≤5% attack success rate against evasion attempts across multiple features and clients.
Locket
Novel technique introduced
Chatbot providers (e.g., OpenAI) rely on tiered subscription schemes to generate revenue, offering basic models for free users and advanced models for paying subscribers. However, a finer-grained pay-to-unlock scheme for premium features (e.g., math, coding) is thought to be more economically viable for the providers. Such a scheme requires a feature-locking technique (FLoTE) which is (i) effective in refusing locked features, (ii) utility-preserving for unlocked features, (iii) robust against evasion or unauthorized credential sharing, and (iv) scalable to multiple features and users. However, existing FLoTEs (e.g., password-locked models) are not robust or scalable. We present Locket, the first robust and scalable FLoTE to enable pay-to-unlock schemes. Locket uses a novel merging approach to attach adapters to an LLM for refusing unauthorized features. Our comprehensive evaluation shows that Locket is effective (100% refusal on locked features), utility-preserving (≤7% utility degradation on unlocked features), robust (≤5% attack success rate), and scales to multiple features and clients.
Key Contributions
- Locket: the first feature-locking technique (FLoTE) using adapter merging to enforce per-feature access control on LLMs
- Robustness against evasion attacks (≤5% attack success rate) and unauthorized credential sharing while preserving utility on unlocked features (≤7% degradation)
- Scalable design supporting multiple locked features and multiple client tiers simultaneously
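To make the scalability claim concrete, here is a minimal sketch of how a provider might map subscription tiers to the set of features that stay locked for each client. The tier names, feature names, and mapping structure are illustrative assumptions, not the paper's actual configuration; the only idea taken from the source is that a refusal adapter is applied for every feature a client has not paid to unlock.

```python
# Hypothetical tier-to-feature mapping (names are illustrative, not from the paper).
TIERS = {
    "free": set(),                 # no premium features unlocked
    "plus": {"math"},              # one premium feature unlocked
    "pro":  {"math", "coding"},    # all premium features unlocked
}
ALL_FEATURES = {"math", "coding"}

def features_to_lock(tier: str) -> list[str]:
    """Return the features whose refusal adapters must be merged for this tier.

    A refusal adapter is attached for every feature NOT in the client's tier,
    so higher tiers merge fewer locks.
    """
    return sorted(ALL_FEATURES - TIERS[tier])

# Free clients get every premium feature locked; pro clients get none.
assert features_to_lock("free") == ["coding", "math"]
assert features_to_lock("plus") == ["coding"]
assert features_to_lock("pro") == []
```

One design consequence of this shape is that adding a new premium feature or a new tier only extends the two dictionaries; the per-client lock set is derived, which is what makes the scheme scale to many features and clients.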
🛡️ Threat Analysis
Locket's core mechanism merges capability-restricting adapter modules into an LLM. Robustness is evaluated against attacks that attempt to remove or circumvent these adapter-based locks through fine-tuning or similar techniques, placing the work squarely in the transfer-learning and adapter-security space.
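The adapter-merging idea above can be sketched numerically. The following is a minimal LoRA-style illustration, assuming low-rank refusal deltas added to a base weight matrix; the function names, rank, and merging rule are assumptions for exposition and may differ from Locket's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # hidden size and adapter rank (illustrative)
W_base = rng.normal(size=(d, d))  # stand-in for a pretrained weight matrix

def refusal_adapter(d: int, r: int, rng) -> np.ndarray:
    """Low-rank weight delta (A @ B) intended to steer a locked feature to refusal."""
    A = rng.normal(scale=0.1, size=(d, r))
    B = rng.normal(scale=0.1, size=(r, d))
    return A @ B

def merge_locks(W: np.ndarray, deltas: list[np.ndarray]) -> np.ndarray:
    """Merge refusal adapters into the weights for every feature left locked."""
    W_merged = W.copy()
    for delta in deltas:
        W_merged += delta
    return W_merged

# One refusal adapter per lockable feature (feature names are illustrative).
locked = {"math": refusal_adapter(d, r, rng),
          "coding": refusal_adapter(d, r, rng)}

# A client who unlocked only "coding" gets the "math" refusal adapter merged in.
client_unlocked = {"coding"}
W_client = merge_locks(W_base, [delta for feat, delta in locked.items()
                                if feat not in client_unlocked])

# The served weights differ from the base exactly by the locked-feature deltas.
assert np.allclose(W_client - W_base, locked["math"])
```

Because the deltas are merged into the weights rather than gated at inference time, an attacker cannot simply strip a separate adapter file; the evaluated attacks instead try to fine-tune the locked behavior away, which is the threat model the ≤5% attack success rate addresses.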