[Ffmpeg-devel] [PATCH] Snow mmx+sse2 asm optimizations

Michael Niedermayer michaelni
Mon Feb 6 14:12:45 CET 2006


Hi

On Sun, Feb 05, 2006 at 12:47:14PM -0500, Robert Edele wrote:

I've written assembly SIMD optimizations (MMX, SSE2) for three parts of snow. These changes include:

- MMX and SSE2 code for the bottom part of add_yblock_buffered.
- Left shifting the OBMC tables by 2, and updating parts of the code to work with the change. This makes for somewhat faster code by eliminating some shift operations in the innermost loop of add_yblock_buffered.
- vertical compose has a straightforward SIMD implementation.
- horizontal compose has been substantially modified internally to allow for an efficient SIMD implementation and better cache performance. For plain C code, it may be faster or slower on your system (faster on mine). The largest change is that it is almost entirely in-place and the temp buffer is only half used now, allowing for SIMD optimization and improving cache performance. A new step, interleave_line(), has been added because the in-place lifts do not leave the coefficients in the proper places. This code is extremely fast in SIMD.

I am aware that conditional compilation for SIMD code is frowned upon, so could someone give me some feedback on how my code could be efficiently done using function pointers, like the other SIMD optimizations in ffmpeg? Some functions (interleave_line, 8x8 obmc) take barely 500 clocks to finish.
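For context on the interleaving step: because the lifts now run in place, the low-pass coefficients end up packed into the first half of the line and the high-pass coefficients into the second half, so a final pass must scatter them back into the even/odd positions the rest of snow expects. Below is a minimal C sketch of that layout step only (a plain forward copy into a separate buffer; the patch's actual interleave_line() runs backwards, in place, and in SIMD):

    typedef int DWTELEM;   /* coefficient type used by snow.c at the time */

    /* Layout sketch only: low[] holds the first-half (low-pass) results,
     * high[] the second-half (high-pass) results; b[] receives them
     * interleaved.  The name mirrors the patch, the body does not. */
    static void interleave_line_ref(DWTELEM *b, const DWTELEM *low,
                                    const DWTELEM *high, int width){
        for (int i = 0; i < width / 2; i++){
            b[2*i]     = low[i];       /* low-pass  -> even positions */
            b[2*i + 1] = high[i];      /* high-pass -> odd positions  */
        }
        if (width & 1)                 /* odd width: one extra low-pass */
            b[width - 1] = low[width / 2];
    }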

  1. how much speed do we lose if you convert them to naive function pointers? Also keep in mind that gcc has serious difficulty optimizing large functions.
  2. if you want to decrease the overhead, then change:

     for(){ func_ptr() }

     to

     func_mmx(){ for(){ mmx() } }
     func_c  (){ for(){ c()   } }

yeah you duplicate a few lines of code, but it's MUCH cleaner, and if there is lots of other stuff in the loop which needs to be duplicated then that should be split into its own inline function ...
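To make the suggestion concrete, here is a minimal sketch of that pattern, with hypothetical names (compose_c/compose_mmx are stand-ins, not snow.c's real functions): the dispatch happens once, outside the hot loop, and each variant carries its own copy of the loop.

    typedef void (*compose_fn)(int *b, int n);

    /* each variant owns its own inner loop */
    static void compose_c(int *b, int n){
        for (int i = 0; i < n; i++)
            b[i] += b[i] >> 1;          /* plain C body */
    }

    static void compose_mmx(int *b, int n){
        for (int i = 0; i < n; i++)
            b[i] += b[i] >> 1;          /* real code would use MMX here */
    }

    /* the caller resolves the pointer once, not once per element */
    static void compose(int *b, int n, int have_mmx){
        compose_fn f = have_mmx ? compose_mmx : compose_c;
        f(b, n);                        /* one indirect call in total */
    }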

[...]

@@ -1409,6 +1484,121 @@
     spatial_compose53i_dy(&cs, buffer, width, height, stride);
 }

+static void interleave_line(DWTELEM * low, DWTELEM * high, DWTELEM *b, int width){
+    int i = width - 2;
+
+    if (width & 1)
+    {
+        b[i+1] = low[(i+1)/2];

dividing signed integers by 2^x is slow due to the special case of negative numbers, so use a >>1 here or use unsigned
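The cost difference is easy to see: C truncates signed division toward zero, while an arithmetic right shift rounds toward minus infinity, so the compiler must emit extra fixup instructions to make x/2 correct for negative x. A small self-contained demo (gcc-style arithmetic shift assumed for the negative operand):

    #include <stdio.h>

    int main(void){
        int x = -3;
        /* Signed division truncates toward zero; the shift (arithmetic on
         * gcc and friends) rounds toward minus infinity.  Because the two
         * results differ, x/2 cannot compile to a bare shift for signed x. */
        printf("%d %d\n", x / 2, x >> 1);   /* prints: -1 -2 */

        unsigned u = 7;
        /* unsigned (or provably nonnegative) division by 2 compiles
         * to a single shift, which is what the review asks for */
        printf("%u\n", u / 2);
        return 0;
    }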

[...]

+static void horizontal_compose97i_unified(DWTELEM *b, int width){
+    const int w2= (width+1)>>1;
+#ifdef HAVE_SSE2
+// SSE2 code runs faster with pointers aligned on a 32-byte boundary.
+    DWTELEM temp_buf[width + 4];
+    DWTELEM * const temp = temp_buf + 4 - (((int)temp_buf & 0xF) / 4);
+#else
+    DWTELEM temp[width];
+#endif
+    const int w_l= (width>>1);
+    const int w_r= w2 - 1;
+    int i;
+
+    {
+        DWTELEM * const ref = b + w2 - 1;
+        DWTELEM b_0 = b[0]; //By allowing the first entry in b[0] to be calculated twice
+        // (the first time erroneously), we allow the SSE2 code to run an extra pass.
+        // The savings in code and time are well worth having to store this value and
+        // calculate b[0] correctly afterwards.
+
+        i = 0;
+#ifdef HAVE_MMX
+        horizontal_compose97i_lift0_asm
+#endif
+        for(; i<w_l; i++){
+            b[i] = b[i] - ((W_DM * (ref[i] + ref[i + 1]) + W_DO) >> W_DS);
+        }
+
+        if(width&1){
+            b[w_l] = b[w_l] - ((W_DM * 2 * ref[w_l] + W_DO) >> W_DS);
+        }
+        b[0] = b_0 - ((W_DM * 2 * ref[1]+W_DO)>>W_DS);
+    }
+
+    {
+        DWTELEM * const dst = b+w2;
+        DWTELEM * const src = dst;
+
+        i = 0;
+        for(; (((long)&dst[i]) & 0xF) && i<w_r; i++){
+            dst[i] = src[i] - (b[i] + b[i + 1]);
+        }
+#ifdef HAVE_SSE2
+        horizontal_compose97i_lift1_asm
+#endif
+        for(; i<w_r; i++){
+            dst[i] = src[i] - (b[i] + b[i + 1]);
+        }
+
+        if(!(width&1)){
+            dst[w_r] = src[w_r] - (2 * b[w_r]);
+        }
+    }

umm, indentation ... and this is quite repetitive code, so it should be in its own function/macro, similar to lift()
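For illustration, the factoring being asked for could look like this: pull the repeated "dst[i] = src[i] - (b[i] + b[i+1])" loop into one inline helper, much as snow.c's lift() centralizes the lifting steps. The helper's name and parameters here are made up, not taken from the patch:

    typedef int DWTELEM;   /* as in snow.c of that era */

    /* hypothetical helper: one definition of the repeated update loop */
    static inline void lift_sub_pair(DWTELEM *dst, const DWTELEM *src,
                                     const DWTELEM *b, int start, int end){
        for (int i = start; i < end; i++)
            dst[i] = src[i] - (b[i] + b[i + 1]);
    }

Both the head loop that walks up to the 16-byte boundary and the tail loop after the SSE2 block would then collapse into calls with different (start, end) ranges.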

[...]

-- Michael


