Running on AMD GPU

September 2023, tested on 7900 XTX

Following the great instructions from August and using the docker image, this runs on the 7900 XTX with a few changes, most notably:

export HSA_OVERRIDE_GFX_VERSION=11.0.0  # 7900 XTX natively works with the gfx1100 driver
make hip ROCM_TARGET=gfx1100

The rest of the steps are the same
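
As a rough sketch, the two settings that differ between cards can be kept in one place. The helper name amd_rocm_settings and the card labels (7900xtx, 6900xt, 6600xt) are made up for illustration; the values are the ones given on this page for the 7900 XTX and, in the August instructions, for the 6900 XT and 6600 XT. Other cards are not covered here and are reported as unknown.

```shell
# Hypothetical helper (not part of Petals or ROCm): print the override version
# and build target that this page lists for a given card label.
amd_rocm_settings() {
  case "$1" in
    7900xtx)
      # September 2023 notes on this page
      echo "HSA_OVERRIDE_GFX_VERSION=11.0.0 ROCM_TARGET=gfx1100" ;;
    6900xt|6600xt)
      # August 2023 notes on this page
      echo "HSA_OVERRIDE_GFX_VERSION=10.3.0 ROCM_TARGET=gfx1030" ;;
    *)
      # Anything else is untested here, so refuse to guess
      echo "unknown card: $1" >&2
      return 1 ;;
  esac
}
```

Calling amd_rocm_settings 7900xtx prints the pair of values to use in the export and make hip steps.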

August 2023, tested on 6900 XT and 6600 XT

Due to the great work of Odonata (Discord, github @edt-xx), the hardware of oceanmasterza (Discord), and the help of epicx (Discord, GitHub @bennmann), we have the below AMD instructions.

According to the author of the bitsandbytes ROCM port, @arlo-phoenix, using a Docker image is recommended (both rocm/pytorch and rocm/pytorch-nightly should work). See port discussion here.

On the host machine, run:

docker pull rocm/pytorch-nightly
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/pytorch-nightly
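
Before starting the container, it may help to confirm that the host actually exposes the device nodes passed via --device above. The following is only a sketch; have_devices is a hypothetical helper name, and the check does nothing beyond testing that each path exists.

```shell
# Hypothetical helper: succeed only if every path given as an argument exists.
have_devices() {
  for dev in "$@"; do
    if [ ! -e "$dev" ]; then
      # Report the first missing path on stderr and fail
      echo "missing: $dev" >&2
      return 1
    fi
  done
}
```

On the host, have_devices /dev/kfd /dev/dri should succeed if the ROCm kernel driver is loaded; a failure names the first missing path.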

In the running image, run:

cd /home
export HSA_OVERRIDE_GFX_VERSION=10.3.0

Install bitsandbytes with ROCM support

git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6.git bitsandbytes
cd bitsandbytes
make hip ROCM_TARGET=gfx1030
pip install pip --upgrade
pip install .

Install Petals

cd ..
pip install --upgrade git+https://github.com/bigscience-workshop/petals

Run server

python -m petals.cli.run_server petals-team/StableBeluga2 --port <PORT> --torch_dtype float16

Running the model in bfloat16 is also supported but slower than in float16.
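
To avoid typos when switching ports or dtypes, the server command can be assembled by a small function. This is a sketch only: petals_server_cmd is a hypothetical name, it prints the command rather than running it, and it deliberately accepts only the two dtypes mentioned on this page (Petals itself may accept others).

```shell
# Hypothetical helper: print (not run) the server command for a port and dtype.
petals_server_cmd() {
  port="$1"
  dtype="${2:-float16}"   # float16 is the faster of the two options noted above

  # Require a non-empty, digits-only port
  case "$port" in
    ''|*[!0-9]*)
      echo "port must be a number" >&2
      return 1 ;;
  esac

  # Assumption: restrict to the dtypes this page discusses
  case "$dtype" in
    float16|bfloat16) ;;
    *)
      echo "dtype not covered on this page: $dtype" >&2
      return 1 ;;
  esac

  echo "python -m petals.cli.run_server petals-team/StableBeluga2 --port $port --torch_dtype $dtype"
}
```

For example, petals_server_cmd 31337 bfloat16 prints the bfloat16 variant of the command for port 31337 (an arbitrary example port); the output can be reviewed and then pasted into the container shell.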

Multi-GPU operation (--tensor_parallel_devices) has not been tested yet. The docker --gpus flag may not work at this time, and other virtualization tools may be necessary.

July 2023, tested on 6900 XT and 6600 XT

Contributed by: @edt-xx, @bennmann

Tested on:

Guide: