Compiling SD1.5 for Neuron with resolution other than 512x512 fails
September 11, 2024, 11:38am 1
I’m trying to export SD 1.5 at a portrait resolution of 512x768 for use with Neuron / Inferentia 2. This is my export command:
optimum-cli export neuron \
--model jyoung105/stable-diffusion-v1-5 \
--task stable-diffusion \
--batch_size 1 --num_images_per_prompt 1 \
--height 768 --width 512 \
stable-diffusion-v1-5.neuron
It works at 512x512 but fails at 512x768 with this error in the vae_encoder step:
***** Compiling vae_encoder *****
...........
[GCA035] Instruction: I-5715-0 with opcode: TensorTensor couldn't be allocated in SB
Memory Location Accessed:
add.1_reload_7077_i0: 196608 Bytes per Partition and total of: 25165824 Bytes in SB
_add.1104-t7919_i0: 4 Bytes per Partition and total of: 512 Bytes in SB
add.6_i0: 2048 Bytes per Partition and total of: 262144 Bytes in SB
Total Accessed Bytes per partition by instruction: 198660
Total SB Partition Size: 196608
- Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
An error occured when trying to trace vae_encoder with the error message: neuronx-cc failed with 70.
The export is failed and vae_encoder neuron model won't be stored.
Do I need any other parameters, or is this a bug that needs fixing? I’m running it on an AWS inf2.2xlarge instance.
Somehow I managed to export an SD1.5 checkpoint at 512x768 a couple of weeks ago, but I’m unable to reproduce it now. Could this be a regression in optimum-neuron or neuronx-cc?
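For reference, this is roughly the Python equivalent of the export command above (a sketch based on the NeuronStableDiffusionPipeline API described in the Optimum Neuron docs; I haven’t verified that it behaves any differently from the CLI):
from optimum.neuron import NeuronStableDiffusionPipeline

# Static input shapes the sub-models get compiled for (same as the CLI flags above)
input_shapes = {"batch_size": 1, "num_images_per_prompt": 1, "height": 768, "width": 512}

pipe = NeuronStableDiffusionPipeline.from_pretrained(
    "jyoung105/stable-diffusion-v1-5",
    export=True,  # trace and compile the text encoder, unet and vae for Neuron
    **input_shapes,
)
pipe.save_pretrained("stable-diffusion-v1-5.neuron")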
hgfckwla September 13, 2024, 6:04pm 2
@Jingya, sorry to bother you directly, but you’re always so helpful.
Any idea why compiling the model for any resolution other than 512x512 fails?
I have one Neuron model that I was able to compile for 512x768 a few weeks ago, but I no longer have that setup, don’t remember the exact command, and now the export always fails.
Is it something that can be fixed? Or am I doing something wrong?
Jingya September 16, 2024, 10:22am 3
Hi @hgfckwla,
The compilation error most likely comes from the AWS Neuron SDK rather than Optimum Neuron. According to the AWS folks, compilation of SD models with unequal height/width should be supported by the SDK versions after 2.18.2, i.e. 2.19.0 and 2.19.1: enable unequal height and width by yahavb · Pull Request #592 · huggingface/optimum-neuron · GitHub.
Can you still recall the SDK version you used for successful compilation?
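(If any of the old environments is still around, a small snippet like this prints the relevant package versions; the package names below are the usual Neuron ones and may need adjusting for your setup:)
import importlib.metadata as md

# Usual Neuron-related packages; adjust the list to your environment
for pkg in ("neuronx-cc", "torch-neuronx", "libneuronxla", "optimum-neuron"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")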
Jingya September 16, 2024, 11:54am 4
I got this when compiling the vae_encoder of an SD model with unequal height/width using Neuron SDK 2.19.1 on an inf2.8xlarge instance.
[NLA001] Unhandled exception with message: === BIR error ===
Reason: Access pattern out of bound.
Instruction: identity_pool_1_I-5532-441602-tc
Opcode: TensorCopy
Instruction Source: (float32<128 x 1027> $5532[i2_369_0_0, i2_369_0_1, i1_370_6433, i3_369_0_6433, i3_369_1_0_6433_0_0, i3_369_1_0_6433_0_1, i3_369_1_0_6433_1, i3_369_1_1_0_6433_0, i3_369_1_1_0_6433_1, i3_369_1_1_1_6433_0, i3_369_1_1_1_6433_1, i2_370_6433]:5532)0:
Argument AP:
Access Pattern: [[2051,64],[1,1],[1,1027]]
Offset: 1028
Memory Location: {add.11_VN_191_ReloadStore111619}@SB<0,175096>(128x8204)#Internal DebugInfo: <add.11||UNDEF||[128, 2051, 1]>
- Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
I will ask the Annapurna folks. It would also be super helpful if you could share the environment where you succeeded in compiling it!
hgfckwla September 16, 2024, 8:52pm 5
Hi @Jingya, thanks for confirming the issue. Unfortunately, I can’t find my old virtual env with the versions that worked; I think it was on a spot instance that’s now gone.
Jingya September 26, 2024, 9:24am 7
No worries. I talked with the Annapurna team, and they are working on a fix for the compiler regression. Thanks again for letting us know; I will add a unit test for unequal width/height once the patch is out.
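For illustration, a rough sketch of what such a test could look like (hypothetical; the tiny checkpoint name and the exact shapes are assumptions, not the actual optimum-neuron test):
import pytest
from optimum.neuron import NeuronStableDiffusionPipeline

@pytest.mark.parametrize("height, width", [(64, 32), (32, 64)])
def test_stable_diffusion_export_unequal_height_width(height, width, tmp_path):
    # Export a tiny SD checkpoint with height != width and check that compilation succeeds
    pipe = NeuronStableDiffusionPipeline.from_pretrained(
        "hf-internal-testing/tiny-stable-diffusion-torch",  # assumed tiny test checkpoint
        export=True,
        batch_size=1,
        num_images_per_prompt=1,
        height=height,
        width=width,
    )
    pipe.save_pretrained(tmp_path / "sd-unequal-hw-neuron")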