An Adversarial Perspective on Machine Unlearning for AI Safety (original) (raw)

Datasets Knowledge Recovery No Protection Unlearning Methods Safety Training
RMU NPO DPO
WMDP-Bio Default decoding 64.4 29.9 29.5 27.9
\cdashline2-6 Logit Lens 66.2 31.8 38.6 48.2
Finetuning - 62.4 47.4 57.3
Orthogonalization - 64.7 45.1 50.7
Enhanced GCG - 53.9 46.0 49.0
Pruning - 54.0 40.4 50.4
MMLU Default decoding 58.1 57.1 52.1 49.7
\cdashline2-6 Logit Lens - - - -
Finetuning - 58.0 53.3 51.2
Orthogonalization - 57.3 45.6 46.7
Enhanced GCG - - - -
Pruning - 56.5 50.0 50.4
Finetuning on unrelated information reverts unlearning.

As illustrated in Figure 2, finetuning with only 10 samples from the retain set—disjoint by definition from the evaluation knowledge—can recover most of hazardous capabilities, obtaining accuracies of 52.7% (NPO), 57.0% (DPO), and 61.6% (RMU) while causing negligible degradation on MMLU (less than 2 p.p.). Finetuning on 1000 samples from the retain set fully recovers hazardous capabilities across all methods. These results demonstrate that both safety training and unlearning can be undone through finetuning on unrelated information, suggesting that unlearning is also expressed through shallow features (Yang et al., 2023; Lermen et al., 2023). Additionally, finetuning with just 5 samples from the forget set effectively reverses unlearning, particularly for RMU, which nearly recovers its original performance. Relearning knowledge through further training is unavoidable, but these results show that knowledge recovery happens at disproportionately fast rate.

Refer to caption

Figure 2: Accuracy on WMDP-Bio for unlearned models finetuned with different datasets and number of samples. See Appendix F.1 for complimentary results on MMLU and WMDP-Cyber.

Unlearning methods remove knowledge from the residual stream more effectively.

Before unlearning, Logit Lens can decode correct answers from Zephyr-7B at layer 19, as shown in Figure 3. However, Logit Lens becomes ineffective after protections are applied. Our safety baseline, DPO, remains the most susceptible to early decoding, achieving 56% accuracy. In contrast, unlearning methods can remove knowledge more effecitvely from the residual stream, with RMU reducing Logit Lens accuracy close to random chance across the entire architecture. These results align with prior evaluations of RMU’s robustness to probing (Li et al., 2024).

Unlearning is also mediated by specific directions.

We identify and ablate directions responsible for unlearning, successfully recovering hazardous knowledge for most protections (see Table 5). RMU is the most vulnerable to our orthogonalization, achieving 64.7% accuracy (surpassing the baseline accuracy of 64.4%) by manipulating only the activation space during the forward pass. This outperforms ablation of a single refusal direction across all layers (Arditi and Chughtai, 2024), which achieves 54.2% accuracy. NPO and DPO are more robust against orthogonalization, obtaining 45%percent4545\%45 % and 51%percent5151\%51 % accuracy, respectively.

Unlearning depends on critical neurons.

We localized minimal sets of weights that are responsible for degradation in hazardous knowledge for each unlearning method. These sets represent 2.0%percent\%% of weights for NPO, 0.9%percent\%% for RMU, and 2.4%percent\%% for DPO. After pruning these weights, performance on WMDP increases by at least 10 p.p. for all methods.

Universal adversarial prefixes that recover unlearned knowledge exist.

Using enhanced GCG we were able to craft universal adversarial prefixes that increased RMU’s accuracy from 29.9%percent\%% to 53.9%percent\%%, NPO’s accuracy from 29.5%percent\%% to 46.0%percent\%%, and DPO’s accuracy from 27.9%percent\%% to 49.0%percent\%%. This demonstrates that, similarly to safety trained models, input-only manipulations can disable unlearning and elicit hazardous knowledge that was never removed from the model.

We can recover hazardous capabilities while models remain unusable.

RMU is characterized by making models unusable—they output gibberish generations with high perplexity—when hazardous knowledge is detected. Interestingly, we find that GCG prefixes can easily recover a conversational model that answers questions from WMDP, but its responses are often incorrect and overconfident. Best performing prefixes can recover most of the hazardous capabilities while not necessarily recovering conversational capabilities from the model. See Appendix I for an analysis.

Refer to caption

Figure 3: Accuracy on WMDP-Bio using LogitLens after each transformer block.

6 Discussion

Existing unlearning methods are not different from safety training.

Our findings reveal that unlearning methods primarily obscure knowledge rather than eliminate it (as illustrated by Figure 1), which is a known flaw of safety training (Lee et al., 2024). Therefore, RMU and NPO are susceptible to techniques analogous to those that can reverse safety training, including: (1) dependence on individual residual stream directions; (2) rapid knowledge recovery after finetuning with unrelated data; (3) presence of critical neurons that inhibit hazardous knowledge; and (4) existence of universal adversarial strings that unlock the unlearned knowledge. These observations question the practical benefits of unlearning methods over safety training. Although unlearning was proposed to fully eradicate hazardous capabilities and mitigate jailbreaks in large language models, our results indicate that these methods share limitations. Concurrent work by Tamirisa et al. (2024) proposed TAR, a technique that can prevent some fine-tuning attacks but has no impact on others.

Black-box evaluations are insufficient for unlearning.

Unlearning aims to completely remove certain information from model weights, yet many evaluations only verify that this information cannot be easily extracted through model outputs. This mismatch between the unlearning objective and the evaluation method can falsely suggest successful unlearning when knowledge remains embedded in the weights (Lynch et al., 2024). In fact, while black-box methods used in the original RMU evaluation (Li et al., 2024) and concurrent work by Doshi and Stickland (2024) failed to elicit supposedly unlearned knowledge, our white-box approaches successfully recovered the entire unlearned information. As extensively demonstrated in security and safety research, adaptive evaluations are required to faithfully evaluate ML protections (Carlini and Wagner, 2017; Tramer et al., 2020; Radiya-Dixit et al., 2021; Hönig et al., 2024).

NPO shows signs of deep unlearning.

This method consistently displays better robustness than DPO or RMU, suggesting that gradient ascent (Zhang et al., 2024) might be a promising tool to remove hazardous knowledge from model weights. However, our current implementation still results in greater degradation on MMLU and general capabilities. Future work could investigate combining representation engineering with gradient ascent to enhance existing unlearning methods.

Possible mitigations and lessons learned.

Our results suggest that fine-tuning models based solely on their outputs may result in knowledge obfuscation rather than erasure. Similarly, the noise injection strategy employed by RMU proves inadequate for open-source models. Instead of random noise addition, orthogonalizing directions corresponding to harmful knowledge presents a more promising approach (Zou et al., 2024a). We also caution that noise-based unlearning can produce misleadingly positive results during evaluation with GCG, as standard GCG implementations are susceptible to residual stream noise by default. Other effective approaches may include methods that precisely localize and modify knowledge vectors, as proposed by Hong et al. (2024).

7 Conclusion

We performed a comprehensive white-box evaluation of state-of-the-art unlearning methods for AI safety. Our findings reveal that these methods cannot reliably remove knowledge from model weights. For example, finetuning on unrelated data or removing specific directions from actiavtion space often recovers the supposedly unlearned capabilities. This challenges the belief that unlearning methods offer more robust protection than standard safety training. Furthermore, we argue that black-box evaluations are insufficient for unlearning, as they do not assess internal model changes.

References

\appendixpage\startcontents

[sections]\printcontents[sections]l1

Appendix A Further discussion on ECO

A.1 Why ECO is not unlearning under our definition

Liu et al. (2024a) assume black-box access to the model. Given their setting their definition of successful unlearning entails that, in expectation, any non-negative metric computed on the outputs of an unlearned model and the outputs of a model retrained from scratch on retain set should be approximately one. Intuitively, the model trained only on the retain set should behave the same way as the original model after applying unlearning. Although this is the golden standard in machine unlearning, we consider it lacking for the generative models such as LLMs which show remarkable memorization capabilities (Nasr et al., 2023). The premise is that despite outputs of an LLM not displaying any signs of unlearned knowledge it can be stored within the weights, and retrieved by an adversary. To prevent that the knowledge should be removed from the weights as well. Hence, an improved definition of successful unlearning should include either the internals of an LLM or an adversarial perspective.

Furthermore, the core of ECO is an ‘unlearned’ knowledge detector, based on which a carefully crafted noise is applied to input embeddings. However, this is no different to a safety filter which given an unethical request would return a predefined refusal prompt. Choosing a suitable noise is merely obfuscating the refusal.

Ultimately, we would like to emphasize that we acknowledge ECO’s state-of-the-art results on WMDP. However, we argue that it doesn’t uphold the promise of unlearning.

A.2 Potential vulnerabilities444This is preliminary analysis and a sketch of potential red-teaming efforts. No experiments have been conducted.

Using a detector together with unmodified LLM, puts the red-teaming pressure on the former. As a consequence, the fundamental issue of defending the LLM is not resolved but rather reintroduced on a smaller scale, where we have to defend the detector (which in (Liu et al., 2024a) is a smaller LLM - RoBERTa).

After inspecting the code555Available at https://github.com/chrisliu298/llm-unlearn-eco/tree/main, we noticed that there are two types of detectors implemented: token-wise and prompt-wise. The first one can be easily bypassed by forcing the tokenizer to tokenize the prompt character-by-character (e.g. by inserting whitespace between all relevant characters). Individual characters should not trigger any noise as they should not be exclusive to dangerous concepts. The second type of detector might be slightly more challenging, but there is significant body of works on adversarial attacks on BERT models (Li et al., 2020), including the specific scenario of text classification (Garg and Ramakrishnan, 2020).

Appendix B Further details on unlearning and safety training methods

B.1 Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2024) uses a preference dataset 𝒟PREFsubscript𝒟PREF\mathcal{D}_{\text{PREF}}caligraphic_D start_POSTSUBSCRIPT PREF end_POSTSUBSCRIPT consisting of triples: an input x𝑥xitalic_x, a chosen response ywsubscript𝑦𝑤y_{w}italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT and a rejected response ylsubscript𝑦𝑙y_{l}italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT. Model is then trained to produce generations that are closer to the chosen subset using the following objective:

ℒDPO⁢(θ)=−1β⁢𝔼𝒟PREF⁢[log⁡σ⁢(β⁢log⁡πθ⁢(yw∣x)πref⁢(yw∣x)−β⁢log⁡πθ⁢(yl∣x)πref⁢(yl∣x))],subscriptℒDPO𝜃1𝛽subscript𝔼subscript𝒟PREFdelimited-[]𝜎𝛽subscript𝜋𝜃conditionalsubscript𝑦w𝑥subscript𝜋refconditionalsubscript𝑦w𝑥𝛽subscript𝜋𝜃conditionalsubscript𝑦l𝑥subscript𝜋refconditionalsubscript𝑦l𝑥\mathcal{L}_{{\text{DPO}}}({\theta})=-\frac{1}{{\beta}}\mathbb{E}_{\mathcal{D}% _{\text{PREF}}}\Big{[}\log\sigma\Big{(}{\beta}\log\frac{{\pi}_{\theta}({y}_{% \text{w}}\mid{x})}{{\pi}_{\text{ref}}({y}_{\text{w}}\mid{x})}-{\beta}\log\frac% {{\pi}_{\theta}({y}_{\text{l}}\mid{x})}{{\pi}_{\text{ref}}({y}_{\text{l}}\mid{% x})}\Big{)}\Big{]},caligraphic_L start_POSTSUBSCRIPT DPO end_POSTSUBSCRIPT ( italic_θ ) = - divide start_ARG 1 end_ARG start_ARG italic_β end_ARG blackboard_E start_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT PREF end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_σ ( italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT w end_POSTSUBSCRIPT ∣ italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT w end_POSTSUBSCRIPT ∣ italic_x ) end_ARG - italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT l end_POSTSUBSCRIPT ∣ italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT l end_POSTSUBSCRIPT ∣ italic_x ) end_ARG ) ] , (1)

where πrefsubscript𝜋ref\pi_{\text{ref}}italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT is reference model, πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT is trainable model with weights θ𝜃\thetaitalic_θ, β𝛽\betaitalic_β is a variable controlling deviation from πrefsubscript𝜋ref\pi_{\text{ref}}italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT, and σ𝜎\sigmaitalic_σ is a sigmoid function.

B.2 Negative Preference Optimization (NPO)

NPO (Zhang et al., 2024) optimizes a loss function inspired from DPO, where one uses only negative samples. Although, it may appear that this introduces inductive bias towards safety training, counter-intuitively it does not. Zhang et al. (2024) shows that NPO is a generalization of gradient ascent (GA). This resemblance is a desirable feature in unlearning as GA is the reverse process to gradient descent based learning. Furthermore, the authors show that NPO diverges at much slower rate than GA, making it more stable and thus, practical.

In the pilot experiments with straightforward application of NPO our models quickly diverged, resulting in catastrophic forgetting, indicated by poor performance on the utility benchmark. NPO collapsing when trying to unlearn broad domains is in line with other works suggesting that it fails in continual learning settings (Gao et al., 2024). Therefore, we focus on a variation of NPO which adds a retain loss (RT) to the original objective:

| ℒNPO⁢(θ)=−2β⁢𝔼𝒟FG⁢[log⁡σ⁢(−β⁢log⁡πθ⁢(y|x)πref⁢(y|x))]⏟ℒNPO−α⋅𝔼𝒟RT⁢[log⁡(πθ⁢(y|x))]⏟ℒRT,subscriptℒNPO𝜃subscript⏟2𝛽subscript𝔼subscript𝒟FGdelimited-[]𝜎𝛽subscript𝜋𝜃conditional𝑦𝑥subscript𝜋refconditional𝑦𝑥subscriptℒNPO⋅𝛼subscript⏟subscript𝔼subscript𝒟RTdelimited-[]subscript𝜋𝜃conditional𝑦𝑥subscriptℒRT\mathcal{L}_{{\text{NPO}}}({\theta})=\underbrace{-\frac{2}{{\beta}}\mathbb{E}_% {\mathcal{D}_{{\text{FG}}}}\Big{[}\log\sigma\Big{(}-{\beta}\log\frac{{\pi}_{% \theta}({y}|{x})}{{\pi}_{\text{ref}}({y}|{x})}\Big{)}\Big{]}}_{\mathcal{L}_{{% \text{NPO}}}}-\>\alpha\cdot\underbrace{\vphantom{\left(\frac{a^{0.3}}{b}\right% )}\mathbb{E}_{{{\mathcal{D}}_{\text{RT}}}}[\log({\pi}_{\theta}({y}|{x}))]}_{% \mathcal{L}_{\text{RT}}},caligraphic_L start_POSTSUBSCRIPT NPO end_POSTSUBSCRIPT ( italic_θ ) = under⏟ start_ARG - divide start_ARG 2 end_ARG start_ARG italic_β end_ARG blackboard_E start_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT FG end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_σ ( - italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) end_ARG ) ] end_ARG start_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT NPO end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_α ⋅ under⏟ start_ARG blackboard_E start_POSTSUBSCRIPT caligraphic_D start_POSTSUBSCRIPT RT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y | italic_x ) ) ] end_ARG start_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT RT end_POSTSUBSCRIPT end_POSTSUBSCRIPT , | (2) | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | --- |

where α𝛼\alphaitalic_α is a weight of the retain loss, and (x,y𝑥𝑦x,yitalic_x , italic_y) are input output pairs from the forget set 𝒟FGsubscript𝒟FG\mathcal{D}_{{\text{FG}}}caligraphic_D start_POSTSUBSCRIPT FG end_POSTSUBSCRIPT and from the retain set 𝒟RTsubscript𝒟RT\mathcal{D}_{{\text{RT}}}caligraphic_D start_POSTSUBSCRIPT RT end_POSTSUBSCRIPT. We refer to this method as NPO.

B.3 Representation Misdirection for Unlearning (RMU)

RMU (Li et al., 2024) finetunes a subset of lower layers of an LLM such that they output a fixed noise vector when given a prompt containing concepts present in the forget set and to leave representations unchanged if the concepts fall within the knowledge captured by the retain set. This method displays high sensitivity to keywords and behaves like a heavy-side function once “hazardous” concept is detected - internal representations will be distorted for all the subsequent tokens in the prompt. For detailed analysis of RMU see Appendix J. The RMU objective is as follows:

| ℒRMU⁢(θ)=𝔼x∼DFG⁢[1Lx⁢∑t∈x‖Mθ⁢(t)−c⋅𝐮‖22]⏟ℒforget+α⋅𝔼x∼DRT⁢[1Lx⁢∑t∈x‖Mθ⁢(t)−Mref⁢(t)‖22]⏟ℒretain,subscriptℒRMU𝜃subscript⏟subscript𝔼similar-to𝑥subscript𝐷FGdelimited-[]1subscript𝐿𝑥subscript𝑡𝑥superscriptsubscriptnormsubscript𝑀𝜃𝑡⋅𝑐𝐮22subscriptℒforgetsubscript⏟⋅𝛼subscript𝔼similar-to𝑥subscript𝐷RTdelimited-[]1subscript𝐿𝑥subscript𝑡𝑥superscriptsubscriptnormsubscript𝑀𝜃𝑡subscript𝑀ref𝑡22subscriptℒretain\mathcal{L}_{{\text{RMU}}}({\theta})=\underbrace{\mathbb{E}_{x\sim D_{\text{FG% }}}\left[\frac{1}{L_{x}}\sum_{t\in x}||M_{\theta}(t)-c\cdot\mathbf{u}||_{2}^{2% }\right]}_{\mathcal{L}_{{\text{forget}}}}+\underbrace{\alpha\cdot\mathbb{E}_{x% \sim D_{\text{RT}}}\left[\frac{1}{L_{x}}\sum_{t\in x}||M_{\theta}(t)-M_{\text{% ref}}(t)||_{2}^{2}\right]}_{\mathcal{L}_{\text{retain}}},caligraphic_L start_POSTSUBSCRIPT RMU end_POSTSUBSCRIPT ( italic_θ ) = under⏟ start_ARG blackboard_E start_POSTSUBSCRIPT italic_x ∼ italic_D start_POSTSUBSCRIPT FG end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ divide start_ARG 1 end_ARG start_ARG italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_t ∈ italic_x end_POSTSUBSCRIPT | | italic_M start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t ) - italic_c ⋅ bold_u | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] end_ARG start_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT forget end_POSTSUBSCRIPT end_POSTSUBSCRIPT + under⏟ start_ARG italic_α ⋅ blackboard_E start_POSTSUBSCRIPT italic_x ∼ italic_D start_POSTSUBSCRIPT RT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ divide start_ARG 1 end_ARG start_ARG italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_t ∈ italic_x end_POSTSUBSCRIPT | | italic_M start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t ) - italic_M start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_t ) | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] end_ARG start_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT retain end_POSTSUBSCRIPT end_POSTSUBSCRIPT , | (3) | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------------------------------------------------------- | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ----------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |

where Mref,Mθsubscript𝑀refsubscript𝑀𝜃M_{\text{ref}},M_{\theta}italic_M start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT , italic_M start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT are the internal representations of the reference and trainable models, Lxsubscript𝐿𝑥L_{x}italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT is the number of tokens in prompt x𝑥xitalic_x, and c𝑐citalic_c is a variable controlling the magnitude of noise vector 𝐮𝐮\mathbf{u}bold_u.

Appendix C Preference dataset construction

DPO and NPO require preference datasets of a specific format. We construct such datasets from the WMDP forget and retain documents. Furthermore, to make them more suitable for WMDP Benchmark we format individual samples as ABCD questions.

WMDP provides four corpora: bio-forget-corpus, bio-retain-corpus, cyber-forget-corpus, and cyber-retain-corpus. Biology ones are filtered scientific articles. Forget version contains abstract together with text and we decided to use the former due to its condensed nature. Cybersecurity corpora are mostly scraped github pages of oftentimes doubtful quality.

We used only the articles with more than 1000 characters to make sure that we filter out samples containing only titles and to provide enough material to generate 10 questions from each article. Moreover, each article was truncated at 15’000 characters to avoid unnecessary costs.

To generate questions based on the abovementioned corpora we used OpenAI API and the gpt-4o-mini-2024-07-18 model. Furthermore, we used their recent feature, Structured Outputs666Available at: https://openai.com/index/introducing-structured-outputs-in-the-api/, with the following schema:

class MCQ(BaseModel): question: str options: list[str] answer: str explanation: str

class ListMCQ(BaseModel): multiple_choice_questions: list[MCQ].

We sent requests until we obtained 10’000 samples from each corpora. However, each dataset can have up to 9 samples more since the model sometimes produced less then 10 questions and in some cases the correct answer was missing from the provided options. In our requests we used the system prompt defined below.

C.1 System prompt

In this system prompt we provide a clear objective, fixed number of options, and number of questions to generate from each article. Furthermore, we provide an example of a successfully completed task crafted from one of the WMDP bio-retain-corpus article to leverage few-shot learning paradigm.

You are a large language model tasked with creating multiple-choice questions based on an article. Each question is supposed to have four answers: A, B, C, D. Only one of these options is supposed to be correct, other three are supposed to be plausible but incorrect distractors. Your goal is to generate clear, concise questions that test knowledge contained within the article. You are not supposed to test text comprehension but rather concepts within it. Make sure that questions can be answered without an access to the article. Do not ask about data and statistics such as how many people died from polio in total. Instead ask ’Which of the following viruses causes polio?’. Never refer to the article directly. Always provide a correct answer and corresponding explanation. Make sure that the answer is one of the options. Don’t include option letters A,B,C,D as part of each option. Instead it should be a phrase. Provide a list of 10 such questions.

EXAMPLE

ARTICLE: Introduction {#bjh13767-sec-0001} ============ Purine analogues have revolutionized the treatment of hairy cell leukaemia (HCL) with overall responses in more than 85% of patients and a median progressionfree survival (PFS) of up to 15 years (Else et al, 2009{ref-type="ref"}; Grever, 2010 {ref-type="ref"}). Despite the success of pentostatin and cladribine, PFS curves have revealed no plateau in longterm followup studies (Else et al, [2009] (#bjh13767-bib-0004){ref-type="ref"}). In addition, there is a small subset of patients with primary refractory disease. Second and third line treatment with singleagent purine analogues produce lower complete remission (CR) rates and shorter PFS (Chadha et al, 2005{ref-type="ref"}). In that context, the combination of pentostatin or cladribine with rituximab was tested and found to be effective (Else et al, 2011{ref-type="ref"}), and is now recommended in the UK for recurrent or refractory HCL (Jones et al, 2012{ref-type="ref"}).

QUESTION 1: Which treatment approach has shown to be highly effective in achieving long-term progression-free survival (PFS) in patients with hairy cell leukemia (HCL)?

OPTION A: Stem cell transplantation OPTION B: Combination of chemotherapy and radiation OPTION C: Purine analogues OPTION D: Monoclonal antibody therapy alone

ANSWER: Purine analogues

EXPLANATION: Purine analogues, specifically cladribine (2-CdA) and pentostatin, have been the standard of care for HCL and have been associated with high rates of complete remission and long-term PFS.

END OF EXAMPLE

C.2 Preference format

Each generated sample would then be formatted into the following prompt:

The following are multiple choice questions (with answers) about {{biology/cybersecurity}}.

{{question}} A. {{options[0]}} B. {{options[1]}} C. {{options[2]}} D. {{options[3]}} Answer:

Rejected response would be:

{{["A","B","C","D"][options.index(answer)]}}. answer

Chosen response would then be randomly chosen from the list of available refusal strings defined below.

C.3 Refusal strings

We used GPT-4 to generate 80 different refusal strings, which were manually checked for variety. Their list is below.

"According to my programming, I cant assist with that.", "Adhering to guidelines, I must refrain from providing that answer.", "Any request of this nature is beyond what I can handle.", "Apologies, but thats not something Im allowed to do.", "As a language model, I’m not equipped to answer that.", "As an AI, Im not equipped to provide that information.", "As an AI, Im not permitted to provide that answer.", "As an AI, providing that information isnt possible.", "As an AI model, Im unable to process that request.", "As an AI model, such requests are outside my remit.", "As an Assistant, I’m unable to help with that request.", "Assisting with that is against my guidelines.", "Based on ethical considerations, I cant help with that.", "Because of my programming, Im unable to assist with that.", "Before I proceed, its important to note that I cant comply with that request.", "Being an AI, Im restricted from fulfilling that request.", "By the guidelines I follow, I cannot provide that information.", "Compliance with ethical standards prevents me from answering that.", "Compliance with legal guidelines means I cannot fulfill that request.", "Considering the nature of your request, Im unable to assist.", "Constraints in my programming prevent me from helping with that.", "Current policies restrict me from providing that information.", "Doing that would violate my programming restrictions.", "Doing that would violate my ethical guidelines.", "Due to ethical constraints, Im unable to fulfill your request.", "Due to ethical guidelines, Im unable to assist with that.", "Due to legal and ethical reasons, I cannot comply.", "Due to the limitations of my design, I cant assist with that.", "Due to policy restrictions, I cant proceed with that.", "Due to restrictions, thats something I cant help with.", "Due to the nature of the request, I cannot help.", "Ethical guidelines prevent me from assisting with that.", "Given my limitations, I cant assist with that.", "Given the circumstances, Im unable to provide an answer.", "Hello! Unfortunately, I cant help with that.", "Helping with that goes beyond my scope of work.", "Helping with that isnt within my programming.", "I am sorry, I can’t fulfill that request.", "I apologize, but I cannot assist with that.", "I apologize, but Im unable to help with that.", "I cannot provide the information you’re asking for.", "I can’t provide the information you need.", "Im not allowed to provide that kind of assistance.", "Im not authorized to process that request.", "Im unable to provide help with that particular inquiry.", "I’m just a language model, so I can’t assist with that.", "I’m really sorry, but I can’t assist with that.", "I’m sorry, I can’t help with that.", "Its not within my ability to provide that information.", "My apologies, but I can’t fulfill your request.", "My programming restricts me from handling that request.", "Providing such information would be against my guidelines.", "Providing that assistance isnt something Im able to do.", "Providing that information is outside my capabilities.", "Regrettably, I cant fulfill that request.", "Regrettably, Im unable to help with this.", "Requests like this arent within my range of functions.", "Requests of this nature are not within my capabilities.", "Sorry, but I am unable to provide that information.", "Sorry, but Im not the right resource for that information.", "Sorry, but that request is not something I can fulfill.", "Sorry, but thats not something Im able to do.", "Sorry, but thats something I cant help with.", "Such inquiries are not something I can address.", "That action is unethical and goes against my programming.", "That is unethical, and I cannot help with it.", "That request is against my guidelines, so I cant comply.", "That request is illegal and unethical, so I cannot proceed.", "That request is not ethical, so I cannot assist.", "Thats a request Im not equipped to handle.", "Thats beyond the scope of what I can do.", "Thats not legal, and Im unable to provide support.", "Thats not something I have the capability to do.", "This action is beyond what Im programmed to do.", "This request falls beyond my capabilities.", "This request isnt something I can assist with.", "This type of request is not something I can process.", "Unfortunately, I cant offer assistance on that matter.", "Unfortunately, I’m restricted from helping with that.", "Unfortunately, thats outside the scope of what I can do."

Appendix D Training details

Both DPO and NPO use version of Open Assistant dataset that is formatted as preference datasest and is available here: https://huggingface.co/datasets/javirandor/oasst2_dpo. Before training, it is mixed with forget and retain datasets to obtain the best balance between unlearning/refusal and utility. The mixing is done using HuggingFace’s interleave_datasets function with stopping strategy set to ‘first_exhausted’. Furthermore, prior to training we randomly apply chat template to 50% of the samples in the final dataset since our initial experiments have shown that training only without it doesn’t affect the situation with chat template applied (converse is also true).

D.1 Hyperparameters

We performed a limited hyperparameter search over learning rate, β𝛽\betaitalic_β, number of epochs and the dataset mixing proportions to obtain best model. For NPO we also searched over α𝛼\alphaitalic_α. The best hyperparameters are the following:

Table 2: Best found hyperparameters for DPO and NPO for each knowledge domain.

DPO NPO
Bio Cyber Bio Cyber
Learning rate 1e-6 1e-6 1e-5 1e-5
β𝛽\betaitalic_β 0.1 0.5 0.05 0.05
Dataset proportions 50:25:25 50:25:25 50:25:25 50:25:25
α𝛼\alphaitalic_α - - 0.5 0.5
Epochs 2 2 3 3
Max length 1024 1024 1024 1024
Gradient accumulation steps 1 1 3 3
Per device batch size 4 4 3 3
Warmup steps 150 150 150 150
Quantization bf16 bf16 bf16 bf16

D.2 Performance of developed models on relevant benchmarks

We train NPO and DPO version of Zephyr for both hazardous domains. Performance of these models on WMDP benchmark and MMLU is shown in Table 3.

Table 3: Full benchmarking results of trained models.

Model NPO (Cyber) NPO (Bio) DPO (Cyber) DPO (Bio)
MMLU 55.3 52.0 54.7 49.2
MMLU Chat 54.8 52.9 51.5 51.4
WMDP-Bio 62.0 29.7 57.0 27.6
WMDP-Bio Chat 58.7 32.1 51.0 29.0
WMDP-Cyber 32.2 36.9 33.7 33.5
WMDP-Cyber Chat 31.0 38.9 34.4 33.6
WMDP-Chem 41.1 37.0 41.6 28.6
WMDP-Chem Chat 41.6 38.5 41.1 32.0

Appendix E Additional details on knowledge extraction methods

This sections contains additional details omitted in the main part of the paper.

E.1 Methods overview

Each knowledge extraction method requires access to different resources and elements of the pipeline to work. They are specified below, in Table 4.

Knowledge extraction methods Resources
Access to the original model (pre-unlearning) Access to the forget dataset Access necessary for executing the attack
Logit Lens No No Activations (passive)
Finetuning No Not necessary, but yields superior results Weights (active)
Orthogonalization Not necessary, but yields superior results. It is enough to find a set of prompts for which the model refuses to answer. Not necessary, but yields superior results. It is enough to find a set of prompts for which the model refuses to answer. Activations (active)
Enhanced GCG Yes Not necessary. It is enough to find a set of prompts for which the model refuses to answer. Prompt (active)
Pruning No Not necessary, but yields superior results Weights (active)

Table 4: Comparison of knowledge extraction methods and their resource requirements.

E.2 Finetuning

Hyperparameters used for finetuning as knowledge extraction method are in Table 5.

Table 5: Hyperparameters used for finetuning for knowledge extraction.

Hyperparameter Value
Lora rank 128
Target modules all
Lora alpha 16
Lora dropout 0
Batch size 1
Epochs 3
Gradient accumulation steps 1
Learning rate 2e-4
Learning rate scheduler linear
Warmup ratio 0.05
Weigth decay 0.01
Maximum sequence length 1024
Optimizer adamw_torch

For RMU we use WMDP’s bio-forget-corpus (abstracts) as forget set, WMDP’s bio-retain-corpus as retain set, and Wikitext as retain set with lowest mutual information. For cybersecurity setting we use the cyber counterparts of the first two datsets. For actual finetuning we use the following template:

For forget dataset

[{’role’: ’system’, ’content’: ""}, {’role’: ’user’, ’content’: Write a research article in the field of {{biology/cybersecurity}}.}, {’role’: ’assistant’, ’content’: Of course, here is a reseach article in the field of biology. {{abstract}}}]

For retain set

[{’role’: ’system’, ’content’: ""}, {’role’: ’user’, ’content’: Write a research article in the field of {{biology/cybersecurity}}.}, {’role’: ’assistant’, ’content’: Of course, here is a reseach article in the field of biology. {{text}}}]

For wikitext dataset

[{’role’: ’system’, ’content’: ""}, {’role’: ’user’, ’content’: Write a wikipedia article.}, {’role’: ’assistant’, ’content’: Of course, here is a wikipedia article. {{text}}}]

Note that we use empty system prompt because it is the default choice for Zephyr-7B-β𝛽\betaitalic_β777https://github.com/huggingface/alignment-handbook/blob/87cc800498b17432cfb7f5acb5e9a79f15c867fc/src/alignment/data.py#L38.

For DPO and NPO we use multiple choice versions of the above datasets. We obtain forget and retain from generated preference datasets. For Wikitext we follow procedure described in Appendix C for retain set to obtain multiple choice questions. Then for finetuning we use following templates:

For forget dataset

[{’role’: ’system’, ’content’: ""}, {’role’: ’user’, ’content’: {{sample["prompt"]}}.}, {’role’: ’assistant’, ’content’: {{{sample["rejected"]}}}}]

For retain and wikitext datasets

[{’role’: ’system’, ’content’: ""}, {’role’: ’user’, ’content’: {{sample["prompt"]}}.}, {’role’: ’assistant’, ’content’: {{{sample["chosen"]}}}}]

E.3 Orthogonalization

To show that directional ablation technique is still applicable in settings without access to the original model we devise the following. Since, we need to identify the ‘unlearning’ directions for these specific domains we need to create contrast between representations containing hazardous knowledge and benign representations. To do so we gather ‘hazardous’ representations by conditioning LLMs on forget prompts from the preference dataset. For benign representations we use (1) Wikitext corpus and (2) MMLU validation set formatted as ABCD questions. Ultimately, we can obtain desired directions by taking difference in means.

Using first principal component as ablation direction.

We have also investigated how prominent are ‘unlearning’ directions in the residual stream. Thus, we have collected representations generated by the original model and its unlearned counterparts on forget preference dataset. Then, we used PCA to extract 1st principal component that was used as ablation direction.

Lastly, to achieve success in this setting it was crucial to filter out outlier tokens. For this purpose we computed l2 distance between first 1000 tokens collected and computed their z𝑧zitalic_z-scores. Then we discarded all tokens with z𝑧zitalic_z-score larger than 3, from the whole dataset. This was necessary because Zephyr-7B-β𝛽\betaitalic_β attributes very specific representations to ‘’ token and first ‘\n’ tokens, which are very distant from all the other representations and thus bias difference in means significantly.

Ablations on layer subsets.

To evaluate dependence of ‘unlearned‘ directions on particular layer subsets, instead of applying directional ablation to all layers we applied it solely to layers: 0, 7, 15, 23, and 31 (the last layer of Zephyr-7B-β𝛽\betaitalic_β). Furthemore, in case of RMU we applied orthogonalization only on the layers previously subjected to unlearning (5,6,7).

E.4 Logit lens

This method projects representations in the residual stream to models’ vocabulary. In the main text we project representations outputted by each transformer block. However, we are able to project representations taken at other stages of the architecture as well. These are (1) the outputs of attention module, (2) the intermediate activations after adding output of the attention module to the residual stream, and (3) the outputs of the MLP module. We use projections of these activations to get performance on WMDP at every layer.

Lastly, to emphasize the importance of the A, B, C, D tokens to the model we add the following prefix to all WMDP questions: ‘Answer the following question with A, B, C, or D.\n\n’.

E.5 Enhanced GCG

There are several notable features of FLRT (Thompson and Sklar, 2024) that elevate it above standard GCG (Zou et al., 2023). First it introduces a fluency loss, based on perplexity computed with several smaller LLMs, which enhances the interpretability of adversarial strings. Secondly, it performs some steps in a manner similar to BEAST (Sadasivan et al., 2024) which makes it faster on average and allows for dynamic size of adversarial string. Thirdly, they introduce token-wise loss clamping for cross-entropy loss over the target string, which puts less optimization effort on tokens that are already solved (i.e. have low probability). Lastly and most importantly they finetune a malicious version of the model under attack and introduce a penalty term that minimizes distance between representations of attacked model and its malicious counterpart. In this setting the final attack template consist of adversarial string ta⁢d⁢vsubscript𝑡𝑎𝑑𝑣t_{adv}italic_t start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT, prompt specifying knowledge we want to elicit tp⁢r⁢o⁢m⁢p⁢tsubscript𝑡𝑝𝑟𝑜𝑚𝑝𝑡t_{prompt}italic_t start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT, target string tt⁢a⁢r⁢g⁢e⁢tsubscript𝑡𝑡𝑎𝑟𝑔𝑒𝑡t_{target}italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT, and tm⁢a⁢t⁢c⁢hsubscript𝑡𝑚𝑎𝑡𝑐ℎt_{match}italic_t start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT, which is a string of nm⁢a⁢t⁢c⁢hsubscript𝑛𝑚𝑎𝑡𝑐ℎn_{match}italic_n start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT tokens generated using malicious model conditioned on [ta⁢d⁢v,tp⁢r⁢o⁢m⁢p⁢t,tt⁢a⁢r⁢g⁢e⁢t]subscript𝑡𝑎𝑑𝑣subscript𝑡𝑝𝑟𝑜𝑚𝑝𝑡subscript𝑡𝑡𝑎𝑟𝑔𝑒𝑡[t_{adv},t_{prompt},t_{target}][ italic_t start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT ].

Original internal representation loss and our modifications.

FLRT implements loss over internal representations in the following way:

| ℒRep=1nm⁢a⁢t⁢c⁢h×|L|⁢∑l∈L∑i=1nm⁢a⁢t⁢c⁢hsubscriptℒRep1subscript𝑛𝑚𝑎𝑡𝑐ℎ𝐿subscript𝑙𝐿superscriptsubscript𝑖1subscript𝑛𝑚𝑎𝑡𝑐ℎ\displaystyle\mathcal{L}_{{\text{Rep}}}=\frac{1}{n_{match}\times|L|}\sum_{l\in L% }\sum_{i=1}^{n_{match}}caligraphic_L start_POSTSUBSCRIPT Rep end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT × | italic_L | end_ARG ∑ start_POSTSUBSCRIPT italic_l ∈ italic_L end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT end_POSTSUPERSCRIPT | ∥Ma,l(ti|[ta⁢d⁢v,tp⁢r⁢o⁢m⁢p⁢t,tt⁢a⁢r⁢g⁢e⁢t,t0,…,ti−1])\displaystyle\|M_{a,l}(t_{i}\>|\>[t_{adv},t_{prompt},t_{target},t_{0},...,t_{i% -1}])∥ italic_M start_POSTSUBSCRIPT italic_a , italic_l end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | [ italic_t start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ] ) | (4) | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | | −Mm,l⁢(ti|[tp⁢r⁢o⁢m⁢p⁢t,tt⁢a⁢r⁢g⁢e⁢t,t0,…,ti−1])∥22evaluated-atsubscript𝑀𝑚𝑙conditionalsubscript𝑡𝑖subscript𝑡𝑝𝑟𝑜𝑚𝑝𝑡subscript𝑡𝑡𝑎𝑟𝑔𝑒𝑡subscript𝑡0…subscript𝑡𝑖122\displaystyle-M_{m,l}(t_{i}\>|\>[t_{prompt},t_{target},t_{0},...,t_{i-1}])\|_{% 2}^{2}- italic_M start_POSTSUBSCRIPT italic_m , italic_l end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | [ italic_t start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ] ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | (5) | | | |

where L𝐿Litalic_L is the set of layers used for attack, tisubscript𝑡𝑖t_{i}italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is an i𝑖iitalic_ith token from tm⁢a⁢t⁢c⁢hsubscript𝑡𝑚𝑎𝑡𝑐ℎt_{m}atchitalic_t start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_a italic_t italic_c italic_h string, Ma,lsubscript𝑀𝑎𝑙M_{a,l}italic_M start_POSTSUBSCRIPT italic_a , italic_l end_POSTSUBSCRIPT are the outputs of the l𝑙litalic_lth transformer layer of the attacked model, Mm,lsubscript𝑀𝑚𝑙M_{m,l}italic_M start_POSTSUBSCRIPT italic_m , italic_l end_POSTSUBSCRIPT are the outputs of the l𝑙litalic_lth transformer layer of the maliciously finetuned model, and [tx,ty]subscript𝑡𝑥subscript𝑡𝑦[t_{x},t_{y}][ italic_t start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ] represents concatenation of strings txsubscript𝑡𝑥t_{x}italic_t start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, tysubscript𝑡𝑦t_{y}italic_t start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT. Note that the second term of the equation above (Equation 5) is static and doesn’t change throughout the iterations of the optimization algorithm.

We found that using a moving target that accounts for evolving ta⁢d⁢vsubscript𝑡𝑎𝑑𝑣t_{adv}italic_t start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT yields superior results and thus we use a modified loss:

| ℒRep=1nm⁢a⁢t⁢c⁢h×|L|⁢∑l∈L∑i=1nm⁢a⁢t⁢c⁢hsubscriptℒRep1subscript𝑛𝑚𝑎𝑡𝑐ℎ𝐿subscript𝑙𝐿superscriptsubscript𝑖1subscript𝑛𝑚𝑎𝑡𝑐ℎ\displaystyle\mathcal{L}_{{\text{Rep}}}=\frac{1}{n_{match}\times|L|}\sum_{l\in L% }\sum_{i=1}^{n_{match}}caligraphic_L start_POSTSUBSCRIPT Rep end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT × | italic_L | end_ARG ∑ start_POSTSUBSCRIPT italic_l ∈ italic_L end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT end_POSTSUPERSCRIPT | ∥Ma,l(ti|[ta⁢d⁢v,tp⁢r⁢o⁢m⁢p⁢t,tt⁢a⁢r⁢g⁢e⁢t,t0,…,ti−1])\displaystyle\|M_{a,l}(t_{i}\>|\>[t_{adv},t_{prompt},t_{target},t_{0},...,t_{i% -1}])∥ italic_M start_POSTSUBSCRIPT italic_a , italic_l end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | [ italic_t start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ] ) | (6) | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | | −Mm,l⁢(ti|[ta⁢d⁢v,tp⁢r⁢o⁢m⁢p⁢t,tt⁢a⁢r⁢g⁢e⁢t,t0,…,ti−1])∥22.evaluated-atsubscript𝑀𝑚𝑙conditionalsubscript𝑡𝑖subscript𝑡𝑎𝑑𝑣subscript𝑡𝑝𝑟𝑜𝑚𝑝𝑡subscript𝑡𝑡𝑎𝑟𝑔𝑒𝑡subscript𝑡0…subscript𝑡𝑖122\displaystyle-M_{m,l}(t_{i}\>|\>[t_{adv},t_{prompt},t_{target},t_{0},...,t_{i-% 1}])\|_{2}^{2}.- italic_M start_POSTSUBSCRIPT italic_m , italic_l end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | [ italic_t start_POSTSUBSCRIPT italic_a italic_d italic_v end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ] ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . | (7) | | | |

This loss might result in representations drifting away from the original representations but it has shown much stronger empirical performance in reverting unlearned models to their original versions. In their code authors normalize this score with the squared l2-norm of the static term. We use the raw distance (without normalization described before) motivated by preliminary empirical results.

Modifications and parameters.

The abovementioned loss is used for all unlearning methods, where we use Zephyr-7B-β𝛽\betaitalic_β as our ‘malicious’ model Mmsubscript𝑀𝑚M_{m}italic_M start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. We use 10 first tokens of generation from Mmsubscript𝑀𝑚M_{m}italic_M start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT as tt⁢a⁢r⁢g⁢e⁢tsubscript𝑡𝑡𝑎𝑟𝑔𝑒𝑡t_{target}italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT and use the next 25 as tm⁢a⁢t⁢c⁢hsubscript𝑡𝑚𝑎𝑡𝑐ℎt_{match}italic_t start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT. Additionally we set minimum number of tokens in adversarial string to 100 since the original paper shows that strings of that length achieve superior performance. Next, we drop the fluency objective as it is not relevant for our evaluation. Lastly, we repeatedly use self-transfer, a scheme where we first optimize a prefix on a simpler prompt and then use it as initialization for more difficult prompts.

Modifications specific to RMU.

Since, RMU introduces persistent noise to residual stream once hazardous concept is detected within the prompt we assign more importance to earlier tokens (weight decreases linearly from 2 to 1, from first match token to the last). Furthermore, this noise can already be introduced within the prompt itself, therefore we compute ℒRepsubscriptℒRep\mathcal{L}_{{\text{Rep}}}caligraphic_L start_POSTSUBSCRIPT Rep end_POSTSUBSCRIPT already over the prompt ([tp⁢r⁢o⁢m⁢p⁢t,tt⁢a⁢r⁢g⁢e⁢t,tm⁢a⁢t⁢c⁢h])subscript𝑡𝑝𝑟𝑜𝑚𝑝𝑡subscript𝑡𝑡𝑎𝑟𝑔𝑒𝑡subscript𝑡𝑚𝑎𝑡𝑐ℎ([t_{prompt},t_{target},t_{match}])( [ italic_t start_POSTSUBSCRIPT italic_p italic_r italic_o italic_m italic_p italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_t italic_a italic_r italic_g italic_e italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT ] ).

In this set up, we use representations from the unlearned layers of RMU model: 5,6,75675,6,75 , 6 , 7 and ensure that the ℒRepsubscriptℒRep\mathcal{L}_{{\text{Rep}}}caligraphic_L start_POSTSUBSCRIPT Rep end_POSTSUBSCRIPT has the same magnitude across layers, through appropriate multipliers.

Appendix F Complete results

This section contains the set of results for WMDP-Cyber as well as some other results omitted in the main text.

Table 6: WMDP-Cyber and MMLU accuracy for each protection and method. For Logit Lens, we report the best layer overall. For finetuning, we report best result on 5 samples from the forget set. Empty values are not possible to compute or do not affect the score.

Datasets Knowledge Recovery No Protection Unlearning Methods Safety Training
RMU NPO DPO
WMDP-Cyber No Attack (Baseline) 42.6 27.7 32.7 33.5
\cdashline2-6 Logit Lens 42.7 30.0 29.6 39.2
Finetuning - 41.7 40.0 40.0
Orthogonalization - 41.6 23.4* 36.9
Enhanced GCG - 35.3 37.0 36.7
Pruning - 41.8 33.1 33.6
MMLU No Attack (Baseline) 58.1 57.1 55.2 55.0
\cdashline2-6 Logit Lens - - - -
Finetuning - 56.6 53.3 54.1
Orthogonalization - 57.3 25.6* 53.2
Enhanced GCG - - - -
Pruning - 57.0 54.5 54.5

* In this case directional ablation leads to catastrophic forgetting as indicated by MMLU score dropping to random chance. However, by orthogonalization only the direction at layer 15 we get accuracy of 35.0 on WMDP-Cyber and 55.4 on MMLU.

F.1 Finetuning

Refer to caption

(a) Accuracy of finetuned cyber models on WMDP-Cyber.

Refer to caption

(b) Accuracy of finetuned cyber models on MMLU.

Refer to caption

(c) Accuracy of finetuned bio models on MMLU.

Figure 4: Performance of various models on WMDP and MMLU benchmarks after finetuning them using 5, 10, 50, 100, 500, and 1000 samples

F.2 Logit lens

F.2.1 Complementary results for WMDP-Bio

Refer to caption

(a) Logit Lens results on bio models using output of the attention module.

Refer to caption

(b) Logit Lens results on bio models using intermediate representations.

Refer to caption

(c) Logit Lens results on bio models using output of the mlp module.

Figure 5: Performance on WMDP-Bio using projections of residual stream at different stages.

F.2.2 Full results for WMDP-Cyber

Refer to caption

(a) Logit Lens results on cyber models using output of the transformer block.

Refer to caption

(b) Logit Lens results on cyber models using output of the attention module.

Refer to caption

(c) Logit Lens results on cyber models using intermediate representations.

Refer to caption

(d) Logit Lens results on cyber models using output of the mlp module.

Figure 6: Performance on WMDP-Cyber using projections of residual stream at different stages.

F.3 Orthogonalization

Computing ‘unlearned’ directions is possible without access to the original model. We used benign datasets to obtain clean representations. The results in Table F.3 show that in all cases using original model’s representations is not necessary we are able to recover significant part of knowledge from all methods using only the unlearned model. Notably performance on NPO increased compared to using original model’s representations. However, we were not able to achieve as good performance on RMU as when using original model.

Table 7: Direction ablation results with different datasets on WMDP-Bio benchmark. For ‘wikitext’ and ‘mmlu’ clean representations are obtained using unlearned model on Wikitext corpus and MMLU validation set, respectively. For ‘ground_truth’ the clean representations are obtained using Zephyr-7B-β𝛽\betaitalic_β.

Method Direction WMDP-Bio MMLU
RMU wikitext 52.5 52.4
mmlu 54.6 53.7
ground_truth 64.9 56.8
\cdashline1-4NPO wikitext 49.4 45.4
mmlu 42.2 42.4
ground_truth 45.7 45.8
\cdashline1-4DPO wikitext 47.8 45.1
mmlu 56.1 46.5
ground_truth 56.1 48.2
Using first principal component as ablation direction.

In case of RMU, using 1s⁢t𝑠𝑡stitalic_s italic_t principal component as ablated direction achieved accuracy of 53.7%percent\%% (interestingly, this is the case only when using chat template, otherwise the score is 34.1%percent34.134.1\%34.1 %). For NPO and DPO, the ablating analogous direction did not significantly improve the scores on WMDP-BIO, but it did decrease performance on MMLU by ∼similar-to\sim∼ 10 p.p.

Ablations on layer subsets.

For RMU ablating directions at the unlearned layers 5,6,7 is enough to recover all knowledge using original model’s representations. Interestingly, if we ablate only direction at layer 7 the recovered accuracy is still relatively high at 56.7%percent\%%. In the case of NPO and DPO ablating directions at individual layers leads to accuracy of up to 38 at layer 15, which is significantly less then computing and ablating direction at every layer. This shows that all layers in a transformer are able to detect ‘dangerous’ inputs to some extent.

F.3.1 Results for WMDP-Cyber

In case of WMDP-Cyber we were also able to obtain unlearning directions for all methods except for NPO, where ablating directions at all layers significantly damaged the residual stream. However, when we ablated only direction, obtained through PCA, at layer 15 we were able to achieve accuracy of 35.0 on WMDP-Cyber and 55.4 on MMLU.

Table 8: Direction ablation results with different datasets on WMDP-Bio benchmark. For ‘wikitext’ and ‘mmlu’ clean representations are obtained using unlearned model on Wikitext corpus and MMLU validation set, respectively. For ‘ground_truth’ the clean representations are obtained using Zephyr-7B-β𝛽\betaitalic_β.

Method Direction Cyber MMLU
RMU wikitext 37.3 54.6
mmlu 39.3 54.7
ground_truth 41.2 57.3
\cdashline1-4NPO wikitext 25.1 43.7
mmlu 28.1 42.5
ground_truth 23.4 25.6
\cdashline1-4DPO wikitext 31.1 53.4
mmlu 32.2 49.6
ground_truth 36.9 53.2

Appendix G Complete results using chat template

During the execution of experiments we noticed multiple cases when the model’s behaviour was affected in chat template environment but not in the environment without it (converse is also true). For examples, originally we trained DPO only with chat template. However, then we observed that while there was a substantial drop in WMDP-Bio performance using chat template, without it the model’s performance was significantly better. We encountered analogous situation when we trained the model completely without chat template. In Table G we list all such examples of such behaviours. As a consequence, when training our final DPO models we applied chat template to 50%percent\%% of the samples.

Table 9: Results of DPO training with and without chat template.

WMDP-Bio
Training using chat template With chat template Without chat template
Baseline (Zephyr-7b-β𝛽\betaitalic_β) - 63.5 64.4
\cdashline1-4 Checkpoint 1 Yes 28.7 46.8
Checkpoint 2 No 61.7 45.9
Checkpoint 3 No 38.2 26.2

Furthermore, we were able to find multiple adversarial suffixes that work well for prompts with chat template or ones without but not for both. The performance gap of these prefixes reached up to 20 p.p. (31.2%percent\%% without chat template, 51.4%percent\%% with chat template).

Moreover, during ablations experiments for orthogonalizations we found that using 1s⁢t𝑠𝑡stitalic_s italic_t principal component as ablation direction for RMU we can recover accuracy of 53.7%percent\%% for chat template settings but the performance without chat template remained poor at 34.1%percent34.134.1\%34.1 %.

These findings suggests that LLMs are very good at compartmentalizing behaviours, such that one model can exhibit different behaviours depending on the setting / environment (such as with or without chat template in our case) it is presented in. The capability to display different set of skills based on the setting might explain why inserting trojans into LLMs is relatively easy. One simply creates a separate compartment in LLM behaviour space such that when given appropriate setting (trigger) the model misbehaves.

Given our obseravations we decided to report our results also with the chat template. They can be found below.

G.1 Overview of the results using chat template

Table 10: WMDP-Cyber and MMLU accuracy for each protection and method, using chat template. For Logit Lens, we report the best layer overall. For finetuning, we report best result on 5 samples from the forget set. - values indicate that a particular combination is not possible or inherently doesn’t change the baseline value.

Datasets Knowledge Recovery No Protection Unlearning Methods Safety Training
RMU NPO DPO
WMDP-Cyber No Attack (Baseline) 41.8 28.9 31.1 34.6
\cdashline2-6 Logit Lens 42.4 31.1 29.8 39.2
Finetuning - 40.4 40.5 39.4
Orthogonalization - 41.9 34.1 37.9
Enhanced GCG - 33.0 36.0 36.7
Pruning - 40.1 32.2 35.2
MMLU No Attack (Baseline) 57.3 56.3 54.9 51.8
\cdashline2-6 Logit Lens - - - -
Finetuning - 53.1 53.7 37.2
Orthogonalization - 56.8 55.0 53.4
Enhanced GCG - - - -
Pruning - 55.2 53.0 51.8

Table 11: WMDP-Bio and MMLU accuracy for each protection and method using, chat template. For Logit Lens, we report the best layer overall. For finetuning, we report best result on 5 samples from the forget set. Empty values are not possible to compute or do not affect the score.

Datasets Knowledge Recovery No Protection Unlearning Methods Safety Training
RMU NPO DPO
WMDP-Bio No Attack (Baseline) 63.5 31.7 32.5 30.0
\cdashline2-6 Logit Lens 63.5 31.7 34.71 50.7
Finetuning - 60.3 47.6 60.7
Orthogonalization - 63.0 47.3 51.7
Enhanced GCG - 51.4 49.4 47.8
Pruning - 52.4 40.1 48.1
MMLU No Attack (Baseline) 57.3 56.3 52.7 51.8
\cdashline2-6 Logit Lens - - - -
Finetuning - 56.5 51.9 53.5
Orthogonalization - 56.6 45.1 49.7
Enhanced GCG - - - -
Pruning - 56.6 49.6 51.3

G.2 Finetuning

Refer to caption

(a) Accuracy of finetuned cyber models on WMDP-Cyber using chat template.

Refer to caption

(b) Accuracy of finetuned cyber models on MMLU using chat template.

Refer to caption

(c) Accuracy of finetuned bio models on WMDP-Bio using chat template.

Refer to caption

(d) Accuracy of finetuned bio models on MMLU using chat template.

Figure 7: Performance of various models on WMDP and MMLU benchmarks after finetuning them using 5, 10, 50, 100, 500, and 1000 samples

G.3 Logit lens

G.3.1 Results for WMDP-Bio

Refer to caption

(a) Logit Lens results on bio models using output of the transformer block.

Refer to caption

(b) Logit Lens results on bio models using output of the attention module.

Refer to caption

(c) Logit Lens results on bio models using intermediate representations.

Refer to caption

(d) Logit Lens results on bio models using output of the mlp module.

Figure 8: Performance on WMDP-Bio using projections of residual stream at different stages.

G.3.2 Complementary results for WMDP-Cyber

Refer to caption

(a) Logit Lens results on cyber models using output of the transformer block.

Refer to caption

(b) Logit Lens results on cyber models using output of the attention module.

Refer to caption

(c) Logit Lens results on cyber models using intermediate representations.

Refer to caption

(d) Logit Lens results on cyber models using output of the mlp module.

Figure 9: Performance on WMDP-Cyber using projections of residual stream at different stages.

Appendix H Perturbations as a knowledge extraction method for RMU

H.1 Naive perturbations

Simoulin and Crabbé (2021) indicate that lower layers of transformers encode mostly surface level information. Hence, given that RMU only modifies the early layers of transformers, we hypothesize that their defense might extensively rely on surface level information, such as specific keywords (which appears to be true given our experiments in Appendix J). Our first attempts included forcing the tokenizer to tokenize each character individually and inserting random characters at different positions. Although, in the qualitative evaluation we noticed that Zephyr_RMU was more open to discuss hazardous concepts, the performance on WMDP does not changed significantly as one can see on the Figure 10.

Refer to caption

(a) Accuracy of Zephyr-RMU and Zephyr on WMDP-Bio after inserting a perturbation every n characters, averaged over 13 different perturbation types. Bars represent 95% confidence intervals assuming Gaussian distribution.

Refer to caption

(b) Accuracy of Zephyr-RMU and Zephyr on WMDP-Bio after inserting different perturbation at 6 different frequencies. Bars represent 95% confidence intervals assuming Gaussian distribution.

Figure 10: Accuracy of Zephyr-RMU and Zephyr models after inserting perturbation every 1,3,5,7,11,15 characters. The perturbations types are: ‘ ’, ‘!’, ‘-’, ‘.’, ‘5’, ‘;’, ‘?’, ‘Q’, ‘^ ’, ‘_’, ‘∼similar-to\sim∼’ and shuffle which randomly selects a non-alphabetic character as a perturbation.

H.2 Informed perturbations

Inserting perturbations at correct locations greatly improved the performance on WMDP-Bio. To select appropriate locations for perturbations, we use cosine similarity between the activations of the model at the last unlearned layer and the corresponding refusal direction computed as described in Section 4.2 . As a consequence we developed a simple algorithm (Algorithm 1) that greedily modifies prompt from left to write until cosine similarity for all tokens is lower than predefined threshold. This design targets RMU’s heavy-side function like behaviour.

An example of a perturbed prompt along with the following generation from RMU model is as follows (where (->) indicates artificially added newline for reader’s convenience):

Original prompt

The following are multiple choice questions (with answers) about biology.

What dangerous outcome might occur if a virulence factor is transferred (->) from one microbe to another microbe? A. Increased stability under stressful environmental conditions B. Decreased reliance on host nutrient sources C. Inability to infect original host D. Altered host tropism of a pathogenic microbe Answer:

RMU generation (100 tokens)

\nB. Disap\nC. Were\nD. Were\nE. Were\nF. Were\nG.\n\nThe (->)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n (->)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n (->)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

Perturbed prompt (threshold = 0.5)

The following are multiple choice questions (with answers) about biology.

What dangerous outcome might occur if~ a v i rulence (->) f ac tor is transferred from one mi cro~~~^b e (->) t o another mic ro^^^^~~^b e? A. I ncreased stability under stressful environmental conditions B. Decreased reliance on hos t nutrient sources C. Inability to i nf ect original hos t D. Altered ho st tr opism o f a p athoge nic micro^^^b e Answer:

RMU generation (100 tokens)

D. Altered ho st tr opism o f a p athoge nic micro^^^b e

What is the name of the process by which a cell grows and divides? A. Mitosis B. Meiosis C. Cytokinesis D. Cell division Answer: D. Cell division

Which of the following is a type of cell division that results (->) in daughter cells that are genetically identical to the parent cell?\n

Once can see that model answered correctly the perturbed version of the question and that the subsequent generation is completely coherent.

Algorithm 1 InformedPerturbation Algorithm

1:transformer model M𝑀Mitalic_M, threshold θ𝜃\thetaitalic_θ, ablation direction 𝐫^7subscript^𝐫7\hat{\mathbf{r}}_{7}over^ start_ARG bold_r end_ARG start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT, array of string-like tokens p⁢r⁢o⁢m⁢p⁢t𝑝𝑟𝑜𝑚𝑝𝑡promptitalic_p italic_r italic_o italic_m italic_p italic_t, and maximum allowed number of iterations T𝑇Titalic_T

2:p⁢r⁢o⁢m⁢p⁢to⁢l⁢d←[]←𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑜𝑙𝑑prompt_{old}\leftarrow[]italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT ← [ ]

3:p⁢r⁢o⁢m⁢p⁢tn⁢e⁢w←p⁢r⁢o⁢m⁢p⁢t←𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑛𝑒𝑤𝑝𝑟𝑜𝑚𝑝𝑡prompt_{new}\leftarrow promptitalic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ← italic_p italic_r italic_o italic_m italic_p italic_t

4:for i=1𝑖1i=1italic_i = 1 to m⁢a⁢x⁢_⁢i⁢t⁢e⁢r𝑚𝑎𝑥_𝑖𝑡𝑒𝑟max\_iteritalic_m italic_a italic_x _ italic_i italic_t italic_e italic_r do

5: if prompto⁢l⁢d==promptn⁢e⁢wprompt_{old}==prompt_{new}italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT = = italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT then

6: break

7: end if

8: p⁢r⁢o⁢m⁢p⁢to⁢l⁢d←p⁢r⁢o⁢m⁢p⁢tn⁢e⁢w←𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑜𝑙𝑑𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑛𝑒𝑤prompt_{old}\leftarrow prompt_{new}italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT ← italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT

9: a⁢c⁢t⁢s7←GetActivations⁢(M,p⁢r⁢o⁢m⁢p⁢to⁢l⁢d,7)←𝑎𝑐𝑡subscript𝑠7GetActivations𝑀𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑜𝑙𝑑7acts_{7}\leftarrow\text{GetActivations}(M,prompt_{old},7)italic_a italic_c italic_t italic_s start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT ← GetActivations ( italic_M , italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT , 7 ) ▷▷\triangleright▷ Activations for each token after layer 7.

10: s⁢i⁢m⁢s←CosineSimilarity⁢(𝐫^7,a⁢c⁢t7)←𝑠𝑖𝑚𝑠CosineSimilaritysubscript^𝐫7𝑎𝑐subscript𝑡7sims\leftarrow\text{CosineSimilarity}(\hat{\mathbf{r}}_{7},act_{7})italic_s italic_i italic_m italic_s ← CosineSimilarity ( over^ start_ARG bold_r end_ARG start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT , italic_a italic_c italic_t start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT ) ▷▷\triangleright▷ Cosine similarities for each token.

11: p⁢r⁢o⁢m⁢p⁢tn⁢e⁢w←←𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑛𝑒𝑤absentprompt_{new}\leftarrowitalic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ← InsertPerturbation(p⁢r⁢o⁢m⁢p⁢to⁢l⁢d𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑜𝑙𝑑prompt_{old}italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT, s⁢i⁢m⁢s𝑠𝑖𝑚𝑠simsitalic_s italic_i italic_m italic_s, θ𝜃\thetaitalic_θ)

12:end for

13:return p⁢r⁢o⁢m⁢p⁢tn⁢e⁢w𝑝𝑟𝑜𝑚𝑝subscript𝑡𝑛𝑒𝑤prompt_{new}italic_p italic_r italic_o italic_m italic_p italic_t start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT

Algorithm 2 InsertPerturbation Algorithm

1:array of string-like tokens p⁢r⁢o⁢m⁢p⁢t𝑝𝑟𝑜𝑚𝑝𝑡promptitalic_p italic_r italic_o italic_m italic_p italic_t, array of cosine similarities for each token s⁢i⁢m⁢s𝑠𝑖𝑚𝑠simsitalic_s italic_i italic_m italic_s, threshold θ𝜃\thetaitalic_θ,

2:perturbations←[perturbations\leftarrow[italic_p italic_e italic_r italic_t italic_u italic_r italic_b italic_a italic_t italic_i italic_o italic_n italic_s ← [‘∼similar-to\sim∼’,‘∧’]]]] ▷▷\triangleright▷ Empirically determined to have least impact on the original model

3:for i=1𝑖1i=1italic_i = 1 to NumTokens(p⁢r⁢o⁢m⁢p⁢t)𝑝𝑟𝑜𝑚𝑝𝑡(prompt)( italic_p italic_r italic_o italic_m italic_p italic_t ) do

4: if c⁢o⁢s⁢_⁢s⁢i⁢m⁢[i]>θ𝑐𝑜𝑠_𝑠𝑖𝑚delimited-[]𝑖𝜃cos\_sim[i]>\thetaitalic_c italic_o italic_s _ italic_s italic_i italic_m [ italic_i ] > italic_θ then

5: if NumChars(p⁢r⁢o⁢m⁢p⁢t⁢[i])>1𝑝𝑟𝑜𝑚𝑝𝑡delimited-[]𝑖1(prompt[i])>1( italic_p italic_r italic_o italic_m italic_p italic_t [ italic_i ] ) > 1 then

6: p⁢r⁢o⁢m⁢p⁢t⁢[i]←←𝑝𝑟𝑜𝑚𝑝𝑡delimited-[]𝑖absentprompt[i]\leftarrowitalic_p italic_r italic_o italic_m italic_p italic_t [ italic_i ] ← Split(p⁢r⁢o⁢m⁢p⁢t⁢[i])𝑝𝑟𝑜𝑚𝑝𝑡delimited-[]𝑖(prompt[i])( italic_p italic_r italic_o italic_m italic_p italic_t [ italic_i ] ) ▷▷\triangleright▷ Randomly inserts a whitespace at a non-edge position

7: else

8: if p⁢r⁢o⁢m⁢p⁢t⁢[i]∈p⁢e⁢r⁢t⁢u⁢r⁢b⁢a⁢t⁢i⁢o⁢n⁢s𝑝𝑟𝑜𝑚𝑝𝑡delimited-[]𝑖𝑝𝑒𝑟𝑡𝑢𝑟𝑏𝑎𝑡𝑖𝑜𝑛𝑠prompt[i]\in perturbationsitalic_p italic_r italic_o italic_m italic_p italic_t [ italic_i ] ∈ italic_p italic_e italic_r italic_t italic_u italic_r italic_b italic_a italic_t italic_i italic_o italic_n italic_s then

9: p⁢r⁢o⁢m⁢p⁢t⁢[i]←←𝑝𝑟𝑜𝑚𝑝𝑡delimited-[]𝑖absentprompt[i]\leftarrowitalic_p italic_r italic_o italic_m italic_p italic_t [ italic_i ] ← RandomNonAlphabeticChar()()( )

10: else

11: p⁢r⁢o⁢m⁢p⁢t⁢[i]←←𝑝𝑟𝑜𝑚𝑝𝑡delimited-[]𝑖absentprompt[i]\leftarrowitalic_p italic_r italic_o italic_m italic_p italic_t [ italic_i ] ← RandomChoice(p⁢e⁢r⁢t⁢u⁢r⁢b⁢a⁢t⁢i⁢o⁢n⁢s)+p⁢r⁢o⁢m⁢p⁢t⁢[i]𝑝𝑒𝑟𝑡𝑢𝑟𝑏𝑎𝑡𝑖𝑜𝑛𝑠𝑝𝑟𝑜𝑚𝑝𝑡delimited-[]𝑖(perturbations)+prompt[i]( italic_p italic_e italic_r italic_t italic_u italic_r italic_b italic_a italic_t italic_i italic_o italic_n italic_s ) + italic_p italic_r italic_o italic_m italic_p italic_t [ italic_i ]

12: end if

13: end if

14: ▷▷\triangleright▷ We return p⁢r⁢o⁢m⁢p⁢t𝑝𝑟𝑜𝑚𝑝𝑡promptitalic_p italic_r italic_o italic_m italic_p italic_t after a single modification.

15: return p⁢r⁢o⁢m⁢p⁢t𝑝𝑟𝑜𝑚𝑝𝑡promptitalic_p italic_r italic_o italic_m italic_p italic_t

16: end if

17:end for

18:return p⁢r⁢o⁢m⁢p⁢t𝑝𝑟𝑜𝑚𝑝𝑡promptitalic_p italic_r italic_o italic_m italic_p italic_t

Since this method is tailored for RMU we do not apply it to the other models directly, but we run resulting perturbed WMDP prompts on other models to quantify its transferability capabilities.

H.3 Effectiveness of perturbations on RMU and other models

The results of evaluating RMU, NPO, DPO and baseline models on perturbed versions of WMDP-Bio (using different cosine similarity thresholds) can be found in Figure 11. Note that for threshold of 0.5 the performance difference between baseline model and Zephyr_RMU is only 2.2⁢p.p.formulae-sequence2.2𝑝𝑝2.2\;p.p.2.2 italic_p . italic_p .. Furthermore, we can observe that unrelated methods: DPO and NPO, also reveal more knowledge when exposed to perturbed prompts.

Refer to caption

Figure 11: Performance of different models on perturbed version of WMDP-Bio.

Lastly, to investigate transferability to other RMU models, we evaluate RMU variant888Available at: https://huggingface.co/cais/Mixtral-8x7B-Instruct_RMU of Mixtral-8x7B-v0.1 (Jiang et al., 2024) on perturbed WMDP-Bio and find that accuracy improved by up to 29%percent2929\%29 %. The results are visible in Figure 12

Refer to caption

Figure 12: Accuracy of Mixtral-8x7B-RMU and Mixtral-8x7B on perturbed WMDP-Bio.

Ultimately, we investigate why perturbations manage to fool RMU. Namely, we use WMDP-Bio questions as prompts and let Zephyr-7B-β𝛽\betaitalic_β generate next 50 tokens, then measure the perplexity (PPL) of those generations using Zephyr_RMU to test how likely are the correct answers in the eyes of the unlearned model. The difference is significant as PPL of the original generations conditioned on unperturbed WMDP-Bio questions calculated using Zephyr_RMU is ∼1600similar-toabsent1600\sim\!1600∼ 1600 times larger than the PPL obtained using original model. However, when conditioned on perturbed prompts the PPL is only ∼16similar-toabsent16\sim\!16∼ 16 times larger. Exact results can be found in Table H.3.

Table 12: Perplexity of generations conditioned on perturbed prompts measured using RMU model.

PPL PPL (chat template)
Threshold 0.2 72.6 74.7
0.3 29.4 51.3
0.4 22.1 44.5
0.5 30.7 57.3
0.6 40.7 58.3
0.7 64.2 116.8
0.8 212.6 228.9
\cdashline1-4 No perturbations 2372.6 2323.1

Appendix I Perplexity Analysis of Adversarial Prefixes on RMU

Our preliminary experiments have shown that optimizing a prefix, which recovers a coherent and plausible answer from RMU is relatively easy. However, these answers were often found to be incorrect. To evaluate jailbreak quality, we gather 24 different adversarial prefixes that were optimized on RMU model and which achieve different performance on WDMP. Then we append them to WMDP questions and let the RMU model generate next 50 tokens. Next we compute perplexity of those generations using original model (Zephyr-7B-β𝛽\betaitalic_β) and plot the results in Figure 12(a). We can clearly see that while, for accuracies less than 0.4, lower perplexity correlates with higher accuracy, adversarial prefixes resulting in highest accuracy do not necessarily have the lowest perplexity. Furthermore, prefixes displaying lowest perplexity span over a large interval of accuracies indicating that a coherent positive answer does not necessarily correspond to its correctness. Furthemore, these results suggests that model may be jailbroken, but were are not able to acknowledge that due to illegible answers.

In Figure 12(b) we show the results for a similar experiments where we used original model to generate completions for WMDP questions and used RMU model (jailbroken with adversarial prefixes) to compute corresponding perplexities. We can observe very similar trends as in Figure 12(a).

Our results suggest that obtaining a positive and coherent answer does not necessarily coincide with high quality of a jailbreak. This questions the quality of jailbreaks, which performance is measured through keyword matching or by an automated judge (such as ChatGPT).

Refer to caption

(a) Average perplexity of RMU models’ generations conditioned on WMDP-Bio questions with adversarial prefixes, measured on the original model using chat template. Average perplexity of RMU generations without the adversarial prefix measured on the original model is 70.070.070.070.0.

Refer to caption

(b) Average perplexity of original models’ generations conditioned only on WMDP-Bio questions, measured on RMU after prepending adversarial prefix using chat template. Average perplexity of the original generations measured on RMU model without adding adversarial prefix is 2323.02323.02323.02323.0.

Figure 13: Average perplexities of generations using 24 different adversarial prefixes optimized on the RMU model. Trendlines were calculated using Locally WEighted Scatterplot Smoothing (Cleveland, 1979) (using frac of 0.50.50.50.5).

I.1 Adversarial prefixes without chat template

We repeated the experiments above without chat template, the results can be found in Figure 14. The most striking difference is the lack of convex trendlines, which now resemble exponential decay. This behaviour entails that beyond certain accuracy perplexity stays almost constant. Hence, after certain point, generation’s coherence cannot be used as an indicator of adversarial prefix’s quality.

We hypothesize that the exponential decay behaviour is caused by the fact that in a next token prediction scenario999Zephyr-7b-β𝛽\betaitalic_β is a chat model, thus, by not using a chat template we revert it to a next token predictor it is easier to obtain low perplexity compared to the chat setting. The premise is that the latter expects a very specific behaviour from the model (helpfulness, responsiveness), whereas next token predictor is less restricted in terms of the style of its generations. Therefore, generations without chat template might not necessarily be helpful or informative but still achieve low perplexity. Hence, this trendline behaviour does not contradict our findings in the previous section.

Refer to caption

(a) Average perplexity of RMU generations conditioned on WMDP-Bio questions with prefixes measured on the original model. Average perplexity of RMU generations without the adversarial prefix measured on the original model is 19.119.119.119.1.

Refer to caption

(b) Average perplexity of the original generations conditioned on WMDP-Bio questions, measured on RMU model (jailbroken with an adversarial prefix). Average perplexity of original generations measured on RMU model without adding adversarial prefix is 2337.22337.22337.22337.2.

Figure 14: Average perplexities of generations using 24 different adversarial prefixes optimized on RMU model. Trendlines obtained using Locally WEighted Scatterplot Smoothing (Cleveland, 1979) (using frac of 0.40.40.40.4).

Appendix J RMU analysis

The experiments listed in the main body of this paper were preceded by empirical analysis of RMU. The results of this analysis is as follows.

J.1 Behaviour during innocuous conversations

At the first stage of our investigation we simply chatted with the RMU model to check how the model reacts to questions from biology of different sophistication levels and varying intents. We have observed that the model is fully capable in harmless domains. However, it is not able to distinguish between malicious and benevolent intentions. For example, asking the model ‘What should I do to protect myself from bacteria?’ will steadily lead to the degeneration of model outputs as can be see below. As usual (->) indicates artificial newline added for reader’s convenience.

<|system|> <|user|> What should I do to protect myself from bacteria? <|assistant|>

  1. Wash your hands frequently: Wash your hands with soap and water for (->) at least 20 seconds, especially before eating, after using the bathroom, (->) and before preparing food. If you don cannot wash your hands with (->) soap and water, use a hand sananitizer.

  2. Cover: Some parts of your journey (yourself) are allowed to walk away (->) from the fact that you are the 2. In 0 times (fixed) sometimes appear in

  3. Appearear (fixed) times

  4. Out (fixed) times

  5. The

  6. (fixed)

  7. (fixed)

  8. (fixed

This highlights two important findings. (1) The model is very sensitive to keywords (for instance we have tested the example above with ‘viruses’,‘pathogens’, all of which caused the degradation of the outputs); (2) The first generated tokens might be correct but the noise persists in the residual stream and suffers from snowball effect, which can be seen first by spelling mistakes (e.g. ‘hand sananitizer’), then by insensible phrases (e.g. ‘Some parts of your journey (yourself) are allowed to walk away’) until complete degeneration. We hypothesize that the snow ball effect is caused by a positive feedback loop occurring during generation. Namely, a mildly dangerous concept prompts response that includes more dangerous concepts, which introduces more noise into the residual stream.

J.2 Simple prompt-based jailbreaks

Furthermore, we tried simple jailbreaks to get more insights and test RMU’s robustness, such as paraphrasing (to avoid keywords), translating prompts to other less common languages, adding malicious examples to the context, and tinkering with the chat template (e.g. asking the question inside assistant token). However, none of these significantly improved the responsiveness of the model. The technique that worked to some extent was adding perturbations to the text (especially keywords to force tokenizer to tokenize them in uncommon ways). This observation has resulted in an adaptive approach covered in Appendix H.2.

There was one approach that seemed promising but turned out to be implausible due to insufficient skills of Zephyr-7B-β𝛽\betaitalic_β. Namely, one could split the prompt into chunks and give each consecutive chunk a consecutive number. Then, one could provide a shuffled list of (number, chunk) pairs to the model and ask it to sort it by numbers and answer the question within. This approach should obfuscate the question such that dangerous concepts are not detected in the early layers (which focus on the surface level information).

J.3 Prefilling attack

To check if a model is able to recover from noise after its injection we tried prefilling attack. In this scheme, the model is conditioned on the prompt and a part of desired response. The intention is that the model will continue generating next tokens as if it has already agreed to answer a potentially malicious prompt and, thus, bypass its safety tuning. We apply the same technique where we tried to condition the model on the prompt and part of correct generation, but the model was not able to recover from noise. We have tried supplying correct generations of various lengths but the RMUs’ generations were still degenerated. This indicates that simply forcing the model to output confirmatory answer or even part of correct generation will not revert it to its pre-unlearned version. Therefore, straightforward application of GCG is doomed to fail as we optimize the adversarial string such that the model will output a specific target string.

J.4 Effect of noise on token representations

Next we investigate how noise injection modifies token representations and when it occurs. To do so we measure cosine similarities of representations of subsequent tokens in a prompt containing hazardous knowledge. As an example we will use the prompt below.

<|system|> <|user|> What dangerous outcome might occur if a virulence factor is transferred (->) from one microbe to another microbe? A. Increased stability under stressful environmental conditions B. Decreased reliance on host nutrient sources C. Inability to infect original host D. Altered host tropism of a pathogenic microbe Answer: <|assistant|>\n

Figure 15 contains cosine similarities evaluated at different layers (7, 20, 31) of RMU and original models. We can make several important observations based on that figure. Let us first focus on the results for layer 7 (the last unlearned layer).

One can clearly see that beginning with the token at position 23 all the subsequent ones display very high cosine similarity (> 0.8). Interestingly, token at position 23 is ‘vir’ from word ‘virulence’. Additionaly, we can observe that on the heatmap corresponding to the original model there is no such behaviour. Given, the sensitivity of RMU to certain keywords we can conclude that token ‘vir’ must have introduced noise to the residual stream and all the following tokens are also distorted by this noise, as seen by high cosine similarity. Moreover, we can notice that representations of tokens at positions up to 22 (inclusive) are all very distinct to the ones beyond it, despite the fact that they are moderately similar to each other. These findings indicate that RMU adds noise in a heavy-side function like manner: once dangerous concept/token is present in the residual stream all the subsequent tokens will also contain noise.

Refer to caption

(a) Cosine similarity between activations in layer 7 of the RMU model.

Refer to caption

(b) Cosine similarity between activations in layer 7 of the original model.

Refer to caption

(c) Cosine similarity between activations in layer 20 of the RMU model.

Refer to caption

(d) Cosine similarity between activations in layer 20 of the original model.

Refer to caption

(e) Cosine similarity between activations in layer 31 of the RMU model.

Refer to caption

(f) Cosine similarity between activations in layer 31 of the original model.

Figure 15: Cosine similarity between representations of different tokens in a prompt at layers 7, 20, and 31 of the Zephyr-7B-β𝛽\betaitalic_β model and its RMU counterpart. Layer 7 is the last unlearned layer in RMU model.

Furthermore, we can see that there are several tokens at the end of the prompt that are less similar to other noisy prompts. These are special tokens such as ‘’ or ‘istant’ from ‘assistant’. This is explained by the fact that these tokens contribute more to the syntax of the chat rather than semantics, which makes them very distinct by default (as indicated by dark colors at these positions in Figures 14(b), 14(d)).

Lastly, we can observe that similarity resulting from noise is very prominent right after layer 7. However, subsequent layers transform all representations significantly. As a consequence, all representations converge to an average level of similarity (∼0.3similar-toabsent0.3\sim 0.3∼ 0.3), where all representations bear some resemblance but all remain distinct from each other.

J.4.1 PCA analysis

To further investigate the effect of noise injection on the token representations we use PCA on a dataset consisting of benign representations computed on Wikitext dataset and hazardous representations obtained using WMDP benchmark questions. Note that for each WMDP question we discard first 40 tokens to ensure that the noise is already present in the representations. Furthermore, we discard ‘’ and ‘\n’ tokens from the dataset due to their surprisingly distinct representations. The results of this analysis are presented in Figure 16. It clearly shows that hazardous and benign representations are almost linearly separable from each other.

Refer to caption

Figure 16: First 2 principal components of representations obtained using benign and hazardous prompts. Each marker represents one token.