[Ffmpeg-devel] patch: altivec optimizations for h264 decoder
Mauricio Alvarez alvarez
Mon Feb 6 19:56:58 CET 2006
- Previous message: [Ffmpeg-devel] patch: altivec optimizations for h264 decoder
- Next message: [Ffmpeg-devel] patch: altivec optimizations for h264 decoder
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Romain Dolbeau wrote:
> They probably do. It would be interesting to know what OS and compiler the author of the patches used (I don't have linux/ppc anymore).
I am using Mac OS X (Darwin Kernel Version 7.9.0) with gcc-3.3.3 on a G5 (ppc970) machine.
> Patch 1 : nothing to add, except that gcc's register allocator is probably going to hate ffh264idctaddaltivecmat.
Why do you think so? This algorithm uses more instructions than the factorized-matrix form implemented in the C version, but it can take better advantage of the AltiVec instructions by reducing data reorganization (matrix transposes and so on).
> Patch 2 : in PREFIX_h264_qpel4_hv_lowpass_altivec, why use VECLOADUNALIGNEDCHECK? tmpbis is computed from tmp (comments -> assumed aligned) and tmpStride (comments -> multiple of 16), so it has to be aligned.
Well, the problem here is the h264_qpel4_mc22_altivec function, which passes qpel4_hv_lowpass_altivec the value 4 as the stride for the tmp array. Because of that, I have to check and align the data when loading the temporary results in the second part of h264_qpel4_hv_lowpass_altivec. I agree with you that this is a lot of overhead. One way to eliminate it is to change h264_qpel4_mc22_altivec so that it always passes 8 as the stride for the tmp array, and to change the size of that array accordingly. I think this stride can be 8 (for a pointer to vector signed short) for all the mc22 functions: qpel16_mc22, qpel8_mc22 and qpel4_mc22. That way there will be no alignment problems.
OPNAME ## h264_qpel ## SIZE ## hv_lowpass ## CODETYPE(dst, tmp, src, stride, SIZE, stride);
---> change the "SIZE" argument here (the tmp stride) to "8".
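To make the alignment argument concrete, here is a minimal sketch in plain C (not the patch code). It assumes that tmp holds 16-bit elements, that the stride is counted in those elements, and that the base of tmp is 16-byte aligned; the helper names tmp_row_offset and all_rows_aligned are made up for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of row `row` inside a tmp buffer of
 * int16_t elements, relative to a base assumed to be 16-byte aligned.
 * `stride_elems` is the stride counted in int16_t elements (assumption). */
static size_t tmp_row_offset(int row, int stride_elems)
{
    assert(row >= 0 && stride_elems > 0);
    return (size_t)row * (size_t)stride_elems * sizeof(int16_t);
}

/* Hypothetical helper: returns 1 if each of the first `rows` rows starts
 * on a 16-byte boundary, 0 otherwise. */
static int all_rows_aligned(int rows, int stride_elems)
{
    int row;
    for (row = 0; row < rows; row++) {
        if (tmp_row_offset(row, stride_elems) % 16 != 0)
            return 0;
    }
    return 1;
}
```

Under these assumptions a stride of 4 puts every odd row 8 bytes past a 16-byte boundary, so an aligned vector load (vec_ld ignores the low four address bits) would fetch the wrong 16 bytes unless the data is realigned first. With a stride of 8, every row starts on a 16-byte boundary and the check becomes unnecessary.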
> Patch 4 : is put_pixels8_altivec really faster than the C version? There's no computation whatsoever, and with the need to load the destination block to insert the new data, it may be slower to use AltiVec than regular C code.
I have not tested this; I only added put_pixels8_altivec because put_h264_qpel8_mc00_altivec requires it. It may be slower than the C version, I'm not sure. I am going to analyze this more deeply.
BTW, I was trying to implement put_pixels16_l2_altivec and put_pixels8_l2_altivec using the vec_avg instruction, but I always found evident artifacts in the resulting videos. Do you have any clue about that? I think it is possible to achieve more speed-up by implementing those functions in AltiVec.
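For comparison purposes, a scalar reference of the averaging step could look like the sketch below. It assumes the C put_pixels8_l2 uses a rounded average, (a + b + 1) >> 1, which is also what vec_avg computes per element on vector unsigned char. The names rnd_avg_u8 and ref_put_pixels8_l2 and the parameter list are hypothetical, not the actual dsputil signatures.

```c
#include <stdint.h>

/* Rounded average of two unsigned bytes: (a + b + 1) >> 1.
 * The operands are promoted to int, so the sum cannot overflow. */
static uint8_t rnd_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical scalar reference for an 8-pixel-wide "l2" put:
 * each destination byte is the rounded average of the two sources.
 * All strides are in bytes; h is the number of rows. */
static void ref_put_pixels8_l2(uint8_t *dst, const uint8_t *src1,
                               const uint8_t *src2, int dst_stride,
                               int src1_stride, int src2_stride, int h)
{
    int x, y;
    for (y = 0; y < h; y++) {
        for (x = 0; x < 8; x++) {
            dst[y * dst_stride + x] =
                rnd_avg_u8(src1[y * src1_stride + x],
                           src2[y * src2_stride + x]);
        }
    }
}
```

Comparing the AltiVec output byte for byte against such a reference on a few blocks would at least show whether the differences come from the averaging itself or from how the sources and destination are loaded and stored.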
Thanks for your comments.
Mauricio A.