Run encoder on Apple Neural Engine · ggml-org/whisper.cpp · Discussion #548
Hey folks! Awesome work :) I was made aware of this thread by @ggerganov after a conversation we had on Twitter.
Long story short, I optimized both Whisper's encoder and decoder to run on Apple's Neural Engine a couple of weeks back, and hacked in flexible-sized inputs for the decoder (though not recommended lol). I've done this twice: once on top of Hugging Face's implementation of Whisper, and once on top of OpenAI's implementation, which I've published:
https://github.com/RobertRiachi/ANE-Optimized-Whisper-OpenAI
I can validate @rsomani95's benchmarks as I too get similar fp32 encoder prediction performance :)
Speeding up the current encoder
Quantizing to fp16 and using the standard LLM data format of (batch, seq, embed_dim) actually slows down prediction time, so with a few changes we can get even more performance out of @wangchou's idea!
The current implementation uses the standard LLM data format of (batch, seq, embed_dim), but the Neural Engine's most conducive data format is 4D and channels-first. We also want the last axis to be the sequence, since the last axis of an ANE buffer isn't packed and must be contiguous and aligned to 64 bytes. This only applies to the last axis. Because we're quantizing to fp16, a short last axis gets padded up to 64 bytes, which in the worst case (a last axis of size 1) costs 32 times the memory at 16-bit precision.
TL;DR: By switching to (batch, embed_dim, 1, seq) we can further improve the speed of the encoder.
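To make the padding argument concrete, here is a rough sketch that only models the rule described above (last axis padded to a multiple of 64 bytes). The helper names `padded_bytes` and `to_ane_layout` are made up for illustration, the sizes are just example values, and how the ANE compiler packs the other axes is not modeled at all.

```python
import math

def padded_bytes(shape, itemsize=2, align=64):
    """Bytes used if only the last axis is padded to a multiple of `align` bytes.

    itemsize=2 corresponds to fp16. This is a simplified model, not the real allocator.
    """
    *outer, last = shape
    row = math.ceil(last * itemsize / align) * align
    return math.prod(outer) * row

def to_ane_layout(shape):
    """Map a (batch, seq, embed_dim) shape to (batch, embed_dim, 1, seq)."""
    batch, seq, embed_dim = shape
    return (batch, embed_dim, 1, seq)

# Example sizes (illustrative): 1500 positions, 384 channels.
llm_shape = (1, 1500, 384)
ane_shape = to_ane_layout(llm_shape)   # (1, 384, 1, 1500)
bad_shape = (1, 384, 1500, 1)          # same data, but singleton last axis

raw = math.prod(llm_shape) * 2         # unpadded fp16 size in bytes
overhead_good = padded_bytes(ane_shape) / raw
overhead_bad = padded_bytes(bad_shape) / raw
```

Under this model, putting the sequence on the last axis costs almost nothing extra, while a singleton last axis hits the full 32x blow-up. On a real tensor the layout change would be a transpose plus an inserted singleton axis.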
Decoder & KV caching
Decoding a (1,1) token with an optimized ANE decoder model ran prediction in 16 ms at best, which is still slower than the 7 ms currently achieved on CPU.
I've spent a good amount of time trying to figure out a solution to the KV caching problem. The fundamental issue is that Core ML models are unable to branch, which makes this difficult. We could export two versions of the decoder, one that doesn't expect a KV cache for the first token and another that handles the KV cache case, but that's pretty gross.
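For what it's worth, the two-export idea can be sketched like this. Everything here is a toy stand-in: `decoder_no_cache` and `decoder_with_cache` are hypothetical placeholders for the two exported models, and the "attention" uses made-up scalar weights, so this only shows where the branch would live, not how a real Core ML decoder is called.

```python
import math

WQ, WK, WV = 0.5, -1.0, 2.0  # toy scalar "weights", made up for illustration

def attend(q, keys, values):
    """Softmax attention of one scalar query over scalar keys/values."""
    scores = [q * k for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def decoder_no_cache(x):
    """Stand-in for export 1: no cache input, returns output plus a fresh cache."""
    keys, values = [x * WK], [x * WV]
    return attend(x * WQ, keys, values), (keys, values)

def decoder_with_cache(x, cache):
    """Stand-in for export 2: one new token plus the cache from the previous step."""
    keys = cache[0] + [x * WK]
    values = cache[1] + [x * WV]
    return attend(x * WQ, keys, values), (keys, values)

def decode_step(x, cache=None):
    # The branch lives in host code, so neither exported graph needs one.
    if cache is None:
        return decoder_no_cache(x)
    return decoder_with_cache(x, cache)

tokens = [0.3, -1.2, 0.7]  # toy "token embeddings"
cache = None
for x in tokens:
    out, cache = decode_step(x, cache)
```

The last step's output matches attending over all three tokens at once, so the cache path gives the same answer as recomputing from scratch. The cost is shipping and loading two models, which is the gross part.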
Quantization
I actually haven't noticed any performance gains from quantizing fp32 to fp16; the prediction speed is roughly equivalent. Using fp16 instead of fp32 actually slows down compilation by roughly 2x in all my tests. I suspect this has to do with how the quantize_weights method in coremltools throws an error if you specify "mlprogram" as the model type in your ct.convert call, and if no convert_to argument is provided, coremltools seems to default to creating a "NeuralNetwork proto" instead of a "MILSpec.Program proto" (source).
