Diary Of An x264 Developer

Back when I originally reviewed VP8, I noted that the official decoder, libvpx, was rather slow. While there was no particular reason that it should be much faster than a good H.264 decoder, it shouldn’t have been that much slower either! So, I set out with Ronald Bultje and David Conrad to make a better one in FFmpeg. This one would be community-developed and free from the beginning, rather than the proprietary code-dump that was libvpx. A few weeks ago the decoder was complete enough to be bit-exact with libvpx, making it the first independent free implementation of a VP8 decoder. Now, with the first round of optimizations complete, it should be ready for primetime. I’ll go into some detail about the development process, but first, let’s get to the real meat of this post: the benchmarks.

We tested on two 1080p clips: Parkjoy, a live-action clip, and the Sintel trailer, a CGI clip. Testing was done using “time ffmpeg -vcodec {libvpx or vp8} -i input -vsync 0 -an -f null -”. We all used the latest SVN FFmpeg at the time of this posting; the last revision optimizing the VP8 decoder was r24471.

As these benchmarks show, ffvp8 is clearly much faster than libvpx, particularly on 64-bit. It’s even faster by a large margin on Atom, despite the fact that we haven’t even begun optimizing for it. In many cases, ffvp8’s extra speed can make the difference between a video that plays and one that doesn’t, especially in modern browsers with software compositing engines taking up a lot of CPU time. Want to get faster playback of VP8 videos? The next versions of FFmpeg-based players, like VLC, will include ffvp8. Want to get faster playback of WebM in your browser? Lobby your browser developers to use ffvp8 instead of libvpx. I expect Chrome to switch first, as they already use libavcodec for most of their playback system.

Keep in mind ffvp8 is not “done” — we will continue to improve it and make it faster. We still have a number of optimizations in the pipeline that aren’t committed yet.

Developing ffvp8

The initial challenge, primarily pioneered by David and Ronald, was constructing the core decoder and making it bit-exact to libvpx. This was rather challenging, especially given the lack of a real spec. Many parts of the spec were outright misleading and contradicted libvpx itself. It didn’t help that the suite of official conformance tests didn’t even cover all the features used by the official encoder! We’ve already started adding our own conformance tests to deal with this. But I’ve complained enough in past posts about the lack of a spec; let’s get onto the gritty details.

The next step was adding SIMD assembly for all of the important DSP functions. VP8’s motion compensation and deblocking filter are by far the most CPU-intensive parts, much the same as in H.264. Unlike H.264, the deblocking filter relies on a lot of internal saturation steps, which are free in SIMD but costly in a normal C implementation, making the plain C code even slower. Of course, none of this is a particularly large problem; any sane video decoder has all this stuff in SIMD.
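To illustrate why saturation hurts scalar code, here is a minimal C sketch of a signed 8-bit saturating add. The helper names (`sat_add_s8`, `sat_add_row`) are made up for this example and are not taken from ffvp8 or libvpx. On x86, the SSE2 instruction `paddsb` performs 16 of these adds at once with the clamp built in, while the C version has to widen, compare, and clamp every sample.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: signed saturating add on 8-bit samples.
 * The sum is formed in a wider type, then clamped to the int8_t range.
 * The two compares are the per-sample cost that SIMD gets for free. */
static int8_t sat_add_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;          /* cannot overflow in int */
    if (sum > INT8_MAX) sum = INT8_MAX;
    if (sum < INT8_MIN) sum = INT8_MIN;
    return (int8_t)sum;
}

/* Apply the saturating add to a row of n samples, one at a time. */
static void sat_add_row(int8_t *dst, const int8_t *a, const int8_t *b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = sat_add_s8(a[i], b[i]);
}
```

A filter built from several such steps pays the clamp cost at each one in C, which is why the gap between the plain C and SIMD versions is so large.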

I tutored Ronald in x86 SIMD and wrote most of the motion compensation, intra prediction, and some inverse transforms. Ronald wrote the rest of the inverse transforms and a bit of the motion compensation. He also did the most difficult part: the deblocking filter. Deblocking filters are always a bit difficult because every one is different. Motion compensation, by comparison, is usually very similar regardless of video format; a 6-tap filter is a 6-tap filter, and most of the variation going on is just the choice of numbers to multiply by.
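As a rough illustration of the "a 6-tap filter is a 6-tap filter" point, the sketch below applies one generic one-dimensional 6-tap filter with two different coefficient sets. The function and table names are invented for this example. The H.264 half-pel taps and the VP8 half-pel taps are the ones given in the respective format documentation, but this is only the 1D core: it ignores the two-pass (horizontal then vertical) handling and the other fractional positions that a real decoder needs.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical generic 6-tap filter. src points at the sample just left of
 * the interpolated position, so src[-2] .. src[3] must be valid.
 * The taps must sum to (1 << shift). */
static uint8_t sixtap(const uint8_t *src, const int taps[6], int shift)
{
    assert(shift >= 1);
    int sum = 1 << (shift - 1);         /* rounding offset */
    for (int i = 0; i < 6; i++)
        sum += taps[i] * src[i - 2];
    if (sum < 0)
        return 0;                       /* avoid shifting a negative value */
    return clip_u8(sum >> shift);
}

/* Same filter shape, different numbers to multiply by. */
static const int h264_halfpel[6] = { 1,  -5, 20, 20,  -5, 1 };  /* sum 32,  shift 5 */
static const int vp8_halfpel[6]  = { 3, -16, 77, 77, -16, 3 };  /* sum 128, shift 7 */
```

Calling `sixtap(p, h264_halfpel, 5)` or `sixtap(p, vp8_halfpel, 7)` runs the same loop; only the table and the shift change, which is why motion compensation code carries over between formats so easily.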

The biggest challenge in a SIMD deblocking filter is to avoid unpacking, that is, going from 8-bit to 16-bit. Many operations in deblocking filters would naively appear to require more than 8-bit precision. A simple example in the case of x86 is abs(a-b), where a and b are 8-bit unsigned integers. The result of “a-b” requires a 9-bit signed integer (it can be anywhere from -255 to 255), so it can’t fit in 8-bit. But this is quite possible to do without unpacking: (satsub(a,b) | satsub(b,a)), where “satsub” performs an unsigned saturating subtract on the two values. If the true difference is positive, satsub yields it; if it is negative, satsub yields zero. So at most one of the two terms is nonzero, and ORing them together yields the desired result. This requires 4 ops on x86; unpacking would probably require at least 10, including the unpack and pack steps.
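The saturating-subtract trick is easy to check exhaustively. The sketch below models it per byte in plain C and verifies it against the ordinary absolute difference for all 65536 input pairs. The function names are made up for this example. Where SSE2 is available, the guarded `absdiff_u8x16` shows the same idea on 16 bytes at once using the `_mm_subs_epu8` and `_mm_or_si128` intrinsics (the `psubusb` and `por` instructions).

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned saturating subtract: a - b, clamped to zero when b > a. */
static uint8_t satsub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* abs(a - b) without ever leaving 8 bits: one of the two terms is
 * always zero, so OR simply selects the nonzero one. */
static uint8_t absdiff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(satsub_u8(a, b) | satsub_u8(b, a));
}

/* Check the identity for every pair of 8-bit inputs.
 * Returns the number of pairs checked. */
static int check_absdiff_identity(void)
{
    int checked = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            int expected = a > b ? a - b : b - a;
            assert(absdiff_u8((uint8_t)a, (uint8_t)b) == expected);
            checked++;
        }
    }
    return checked;
}

#ifdef __SSE2__
#include <emmintrin.h>
/* The same trick on 16 unsigned bytes at once, with no unpacking. */
static __m128i absdiff_u8x16(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}
#endif
```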

After the SIMD came optimizing the C code, which still took a significant portion of the total runtime. One of my biggest optimizations was adding aggressive “smart” prefetching to reduce cache misses. ffvp8 prefetches the reference frames (PREVIOUS, GOLDEN, and ALTREF)… but only the ones which have been used reasonably often in the current frame. This lets us prefetch everything we need without prefetching things that we probably won’t use. libvpx very often encodes frames that almost never (but not quite never) use GOLDEN or ALTREF, so this optimization greatly reduces time spent prefetching in a lot of real videos. There are of course many other optimizations we made, too many to list here, such as David’s entropy decoder optimizations. I’d also like to thank Eli Friedman for his invaluable help in benchmarking a lot of these changes.
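The prefetching heuristic can be sketched roughly as follows. This is not the actual ffvp8 code: the struct, the function names, and in particular the 1/32 usage threshold are assumptions chosen for illustration. The idea is simply to count how often each reference frame has been used so far in the current frame and to skip the prefetch for references that are rarely touched. `__builtin_prefetch` is a GCC/Clang builtin, so it is guarded here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { REF_PREVIOUS, REF_GOLDEN, REF_ALTREF, REF_COUNT };

/* Hypothetical per-frame statistics, reset at the start of each frame. */
typedef struct {
    unsigned ref_use[REF_COUNT];  /* macroblocks so far that used each reference */
    unsigned mbs_decoded;         /* inter macroblocks decoded so far */
} RefStats;

/* Record that one macroblock used the given reference frame. */
static void note_ref_use(RefStats *s, int ref)
{
    assert(ref >= 0 && ref < REF_COUNT);
    s->ref_use[ref]++;
    s->mbs_decoded++;
}

/* Assumed threshold: a reference is worth prefetching once more than
 * 1/32 of the macroblocks decoded so far have used it. */
static int worth_prefetching(const RefStats *s, int ref)
{
    return s->ref_use[ref] * 32 > s->mbs_decoded;
}

/* Prefetch the co-located area of each reference that passes the test. */
static void prefetch_refs(const RefStats *s, const uint8_t *planes[], size_t offset)
{
    for (int ref = 0; ref < REF_COUNT; ref++) {
        if (planes[ref] && worth_prefetching(s, ref)) {
#if defined(__GNUC__)
            __builtin_prefetch(planes[ref] + offset);
#endif
        }
    }
}
```

With a stream where GOLDEN is used by only one macroblock in a hundred, `worth_prefetching` stays false for it, so the decoder never spends memory bandwidth pulling in GOLDEN data it is unlikely to read.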

What next? Altivec (PPC) assembly is almost nonexistent, with the only functions being David’s motion compensation code. NEON (ARM) is completely nonexistent: we’ll need that to be fast on mobile devices as well. Of course, all this will come in due time — and as always — patches welcome!

Appendix: the raw numbers

Here are the raw numbers (in fps) for the graphs at the start of this post, with standard error values:

Core i7 620QM (1.6GHz), Windows 7, 32-bit:
Parkjoy ffvp8: 44.58 +/- 0.44
Parkjoy libvpx: 33.06 +/- 0.23
Sintel ffvp8: 74.26 +/- 1.18
Sintel libvpx: 56.11 +/- 0.96

Core i5 520M (2.4GHz), Linux, 64-bit:
Parkjoy ffvp8: 68.29 +/- 0.06
Parkjoy libvpx: 41.06 +/- 0.04
Sintel ffvp8: 112.38 +/- 0.37
Sintel libvpx: 69.64 +/- 0.09

Core 2 T9300 (2.5GHz), Mac OS X 10.6.4, 64-bit:
Parkjoy ffvp8: 54.09 +/- 0.02
Parkjoy libvpx: 33.68 +/- 0.01
Sintel ffvp8: 87.54 +/- 0.03
Sintel libvpx: 52.74 +/- 0.04

Core Duo (2GHz), Mac OS X 10.6.4, 32-bit:
Parkjoy ffvp8: 21.31 +/- 0.02
Parkjoy libvpx: 17.96 +/- 0.00
Sintel ffvp8: 41.24 +/- 0.01
Sintel libvpx: 29.65 +/- 0.02

Atom N270 (1.6GHz), Linux, 32-bit:
Parkjoy ffvp8: 15.29 +/- 0.01
Parkjoy libvpx: 12.46 +/- 0.01
Sintel ffvp8: 26.87 +/- 0.05
Sintel libvpx: 20.41 +/- 0.02