alan-turing-institute/neural-watchdog: A firewall for your neural networks.

License: MIT


Neural Watchdog is an open-source tool designed to detect stealthy backdoor attacks in Deep Reinforcement Learning (DRL) policies. In this repository, we introduce an evasive backdoor technique called the "in-distribution trigger" and demonstrate how such backdoors can be detected using our tool.

For a detailed understanding of the technical aspects of Neural Watchdog, please refer to our paper: https://arxiv.org/abs/2407.15168

If you would like to cite our work, please use the following reference:

@inproceedings{vyas2024mitigating,
  title={Mitigating Deep Reinforcement Learning Backdoors in the Neural Activation Space},
  author={Vyas, Sanyam and Hicks, Chris and Mavroudis, Vasilios},
  booktitle={2024 IEEE Security and Privacy Workshops (SPW)},
  pages={76--86},
  year={2024},
  organization={IEEE}
}

Getting Started

Before you begin, ensure that you have Conda installed on your Mac. If you do not have Conda installed, please follow the official Conda installation guide for a Mac.

Installation

Clone the Repository

git clone git@github.com:alan-turing-institute/in-distribution-backdoors.git

Create the conda environment

conda env create -f environment.yml
conda activate minigrid_farama_2

Set PYTHONPATH

export PYTHONPATH="/path/to/file/Minigrid"
echo $PYTHONPATH

Execution

Edit the "visualize.py" and "crossings.py" files according to the data you want to collect (Non-triggered/Triggered, Goal in field of view, Trigger in field of view, Thresholding Detector Algorithm; Trigger on/Non-Trigger off), then run the visualize script. The visualisation script collects the neural activations at every step and saves them to a file according to the type of data being collected:

python3 -m scripts.visualize --env MiniGrid-LavaCrossingS9N1-v0 --model DSLP_Crossings_Trigger_60k_256_neurons --episodes 1000
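As a point of reference, the sketch below shows one way per-step activation logging can be implemented with a PyTorch forward hook; the TinyPolicy model, the observation shape and the output filename are illustrative assumptions, not the exact code in visualize.py.

import numpy as np
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    # Stand-in for the actor network; the real model has a 256-neuron hidden layer.
    def __init__(self, obs_dim=147, hidden=256, n_actions=7):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):
        return self.head(self.hidden(x))

policy = TinyPolicy()
activations = []  # one row per environment step

def hook(module, inputs, output):
    # Detach and store the hidden-layer output for the current forward pass.
    activations.append(output.detach().cpu().numpy().ravel())

handle = policy.hidden.register_forward_hook(hook)

# Dummy rollout; in visualize.py this is the MiniGrid environment loop.
for step in range(100):
    obs = torch.randn(1, 147)  # placeholder observation
    with torch.no_grad():
        action = policy(obs).argmax(dim=-1)

handle.remove()
np.save("activations_example.npy", np.stack(activations))  # shape: steps x neurons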

To train a model from scratch, edit the "visualize.py" and "crossings.py" files according to the data you want to collect (Non-triggered/Triggered, Goal in field of view, Trigger in field of view, Thresholding Detector Algorithm; Trigger on/Non-Trigger off), then run the training script. The train.py file saves all model outputs to the Minigrid/minigrid/torch-ac/rl-starter-files/storage folder; the saved model can then be loaded by visualize.py as above:

python3 -m scripts.train --algo ppo --env MiniGrid-LavaCrossingS9N1-v0 --model model_name --save-interval 10 --frames 60000000

Our Detector in Action

Our in-distribution backdoor trigger here is the convergence of two lava rivers, forming a "+" sign. This tool can be used to detect such backdoor triggers in real time and prevent the poisoned agent from taking malicious actions, i.e., heading into the lava rivers in this context.
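To make the thresholding idea concrete, here is a rough sketch that flags any step whose activations deviate strongly from clean-episode statistics; the file names, z-score rule and threshold value are assumptions for illustration, not the exact detector in the repository.

import numpy as np

# Per-neuron statistics from clean (non-triggered) episodes; filenames assumed.
clean = np.load("activations_clean.npy")      # shape: (steps, neurons)
test = np.load("activations_triggered.npy")   # shape: (steps, neurons)

mu = clean.mean(axis=0)
sigma = clean.std(axis=0) + 1e-8
threshold = 4.0  # illustrative z-score cut-off; tune on held-out clean episodes

for t, a in enumerate(test):
    z = np.abs((a - mu) / sigma)
    if z.mean() > threshold:
        # Online, the agent's action could be overridden here with a safe default
        # instead of merely reporting the step.
        print(f"step {t}: possible trigger (mean |z| = {z.mean():.2f})")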

Watch the video on YouTube

Atari Breakout Experiments

This repository also contains the source code for sanitizing backdoor policies in the Atari Breakout game environment. The backdoor policy in this example was trained using the environment-poisoning framework of the TrojDRL paper.

The state space consists of concatenated image frames. The trigger is a 3x5 image patch inserted on the tile space of the Atari Breakout game. The backdoor policy has been trained so that, in the absence of the trigger, it consistently achieves a high score against the opponent, while in the presence of the trigger it takes the 'no move' action, eventually achieving a very low score on average.
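For intuition, the sketch below stamps such a patch onto a stacked 84x84 Breakout observation; the patch size, position and pixel value follow the training flags listed under "Training the backdoor policy from scratch", while the helper function and frame shape are illustrative assumptions, not the repository's poisoning code.

import numpy as np

def poison_frame(frame, top_left=(29, 28), height=3, width=5, value=5):
    # Overwrite a height x width block of the frame with a fixed pixel value.
    r, c = top_left
    poisoned = frame.copy()
    poisoned[r:r + height, c:c + width] = value
    return poisoned

state = np.zeros((84, 84, 4), dtype=np.uint8)  # four concatenated 84x84 frames
state[..., -1] = poison_frame(state[..., -1])  # poison the most recent frame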

Set up the codebase and Python environment.

  1. install Anaconda; follow the instructions here.
  2. create a new environment from the specification file:
    conda env create --name NEW_ENV_NAME -f environment.yml
  3. activate the conda environment:
    conda activate NEW_ENV_NAME

Run the code.

  1. test the backdoor policy in the clean environment:
    python driver_parallel.py 'backdoor_in_clean' 'save_states'
    • change the number of trials and the number of test episodes (test_count) if needed.
    • the clean-state data generated here is used for sanitization in step 3.
  2. test the backdoor policy in the triggered environment:
    python driver_parallel.py 'backdoor_in_triggered'
  3. sanitize the backdoor and test the sanitized policy in the triggered environment:
    python driver_parallel.py 'sanitized_in_triggered'
    • construct sanitized policies for various numbers of clean samples and then test them (a rough sketch of the safe-subspace idea follows this list).
  4. sanitize the backdoor with a fixed n = 32768 and different safe-subspace dimensions d:
    python driver_parallel.py 'sanitized_with_fixed_n'
    • running this step requires the bases for n = 32768 samples obtained in step 3.
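For orientation, here is a rough sketch of the safe-subspace projection that the sanitization steps above rely on: the top-d singular directions of a matrix of clean states define the subspace, and each incoming state is projected onto it before being passed to the policy. The sample counts, dimensions and function names below are illustrative assumptions, not the repository's implementation.

import numpy as np

def fit_safe_subspace(clean_states, d):
    # clean_states: (n, state_dim) matrix of flattened clean observations.
    # The top-d right singular vectors span the safe subspace.
    _, _, vt = np.linalg.svd(clean_states, full_matrices=False)
    return vt[:d]                      # (d, state_dim) orthonormal basis

def sanitize(state, basis):
    # Project a flattened state onto the safe subspace.
    return basis.T @ (basis @ state)

state_dim = 84 * 84 * 4
clean = np.random.rand(512, state_dim)   # placeholder; step 1 saves the real clean states
basis = fit_safe_subspace(clean, d=64)   # illustrative d; the experiments vary d
observed = np.random.rand(state_dim)     # placeholder (possibly triggered) state
projected = sanitize(observed, basis)    # fed to the policy instead of the raw state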

Training the backdoor policy from scratch.

python3 train.py --game=breakout --debugging_folder=pretrained_backdoor/strong_targeted/breakout_target_noop/ --poison --color=5 --attack_method=strong_targeted --pixels_to_poison_h=5 --pixels_to_poison_v=3 --start_position="29,28" --when_to_poison="uniformly" --action=2 --budget=20000 --device='/cpu:0' --emulator_counts=12 --emulator_workers=4

Results

Our results show that our in-distribution trigger successfully evades the defence algorithm of Bharti et al.'s NeurIPS paper.

performance_breakout.pdf

spectrum_safe_subspace.pdf

Edited Files

The evaluator.py file contains the code that changes the size of the trigger, together with the params_indist.yml file; the latter adjusts the default size and the colour of the trigger. The plot_graphs.py file saves the visualisation shown in Figure 2 of the paper, whilst the analyse_performance_for_n=32768_sanitization.py file saves the visualisation shown in Figure 3.