Intro to image processing with CUDA


CUDA Kernel code

#define C_PI 3.141592653589793238462643383279502884197169399375

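__global__ void SwirlCu(int width, int height, int stride, unsigned int *pRawBitmapOrig, unsigned int *pBitmapCopy, double factor)
{
    // This thread handles the pixel at row i, column j
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads that fall outside the image do nothing
    if (i >= height || j >= width)
        return;

    // Pixel position relative to the image center, with the y axis pointing up
    double cX = (double)width / 2.0;
    double cY = (double)height / 2.0;
    double relY = cY - i;
    double relX = j - cX;

    // Angle of the pixel around the center, in the range [0, 2*pi)
    double originalAngle;
    if (relX != 0)
    {
        originalAngle = atan(fabs(relY) / fabs(relX));
        if (relX > 0 && relY < 0)
            originalAngle = 2.0 * C_PI - originalAngle;
        else if (relX <= 0 && relY >= 0)
            originalAngle = C_PI - originalAngle;
        else if (relX <= 0 && relY < 0)
            originalAngle += C_PI;
    }
    else
    {
        // Directly above or below the center
        if (relY >= 0)
            originalAngle = 0.5 * C_PI;
        else
            originalAngle = 1.5 * C_PI;
    }

    // Distance from the center
    double radius = sqrt(relX * relX + relY * relY);

    // Rotate by an amount that shrinks as the radius grows
    double newAngle = originalAngle + 1 / (factor * radius + (4.0 / C_PI));

    // Source pixel, rounded to the nearest integer and moved back to image coordinates
    int srcX = (int)(floor(radius * cos(newAngle) + 0.5));
    int srcY = (int)(floor(radius * sin(newAngle) + 0.5));
    srcX += cX;
    srcY += cY;
    srcY = height - srcY;

    // Clamp the source pixel to the image bounds
    if (srcX < 0) srcX = 0;
    else if (srcX >= width) srcX = width - 1;
    if (srcY < 0) srcY = 0;
    else if (srcY >= height) srcY = height - 1;

    // stride is in bytes and each pixel is 4 bytes, so stride/4 is the row pitch in pixels
    pRawBitmapOrig[i * stride / 4 + j] = pBitmapCopy[srcY * stride / 4 + srcX];
}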
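The kernel computes i and j from the block and thread indices and returns early when they fall outside the image. That guard matters because the grid has to be rounded up to whole blocks. Below is a minimal host-side sketch of that arithmetic in plain C++. The names `blocksFor`, `LaunchShape` and `launchShapeFor` are hypothetical, and the 16x16 block size is an assumption, since this page does not state the launch configuration that was actually used.

```cpp
// Hypothetical helper: number of blocks of size b needed to cover n items
// (integer division rounded up).
constexpr int blocksFor(int n, int b) { return (n + b - 1) / b; }

// Hypothetical description of a 2D launch configuration.
struct LaunchShape {
    int gridX, gridY;       // blocks per grid in x and y
    int blockX, blockY;     // threads per block in x and y
    long long idleThreads;  // launched threads that fail the kernel's bounds check
};

// Assumption: 16x16 threads per block; the tutorial's real value is not shown here.
LaunchShape launchShapeFor(int width, int height, int blockX = 16, int blockY = 16)
{
    LaunchShape s{blocksFor(width, blockX), blocksFor(height, blockY),
                  blockX, blockY, 0};
    const long long launched = 1LL * s.gridX * blockX * s.gridY * blockY;
    s.idleThreads = launched - 1LL * width * height;
    return s;
}
```

In CUDA host code these values would typically feed `dim3 grid(s.gridX, s.gridY)` and `dim3 block(s.blockX, s.blockY)` for a launch such as `SwirlCu<<<grid, block>>>(...)`. For a 1920x1080 image the height is not a multiple of 16, so the last row of blocks contains threads with i >= height, and these are the threads the early return skips.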

Results

Even though CUDA incurs a memory copy overhead, the CUDA code runs substantially faster than the CPU code. The GPU used is a GeForce GTX 470, and the CPU is an Intel Core i7-930, both at stock speeds. Including memory transfer overhead, the CUDA code is 6.38 times as fast as the CPU code. A 32-bit floating point version of the CUDA code ran 6.87 times as fast as the CPU code. Keep in mind that the CPU code already uses OpenMP to run on all 8 threads available on the CPU. In short, for image processing applications like this one, CUDA can deliver a speedup of more than 6x with little programming effort.
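For readers who want a CPU baseline to compare against, here is a minimal sketch of what an OpenMP version of the same per-pixel mapping might look like. It is not the CPU code that was benchmarked; the function name `swirlCpu` is hypothetical. It also swaps the kernel's quadrant-by-quadrant angle computation for `atan2`, which should select the same source pixels up to floating-point rounding, because only the sine and cosine of the angle are used.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical CPU counterpart of SwirlCu (not the tutorial's benchmarked code).
// dst and src hold 32-bit pixels; stride is the row pitch in bytes, as in the kernel.
void swirlCpu(int width, int height, int stride,
              std::uint32_t* dst, const std::uint32_t* src, double factor)
{
    const double pi = 3.14159265358979323846;
    const double cX = width / 2.0;
    const double cY = height / 2.0;
    const int pitch = stride / 4;  // row pitch in pixels

    // Rows are independent, so OpenMP can split them across CPU threads.
    // When OpenMP is not enabled the loop simply runs serially.
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            const double relY = cY - i;
            const double relX = j - cX;

            // Assumption: atan2 stands in for the kernel's explicit quadrant cases.
            const double angle = std::atan2(relY, relX);
            const double radius = std::sqrt(relX * relX + relY * relY);
            const double newAngle = angle + 1.0 / (factor * radius + 4.0 / pi);

            // Round to the nearest pixel, then move back to image coordinates.
            int srcX = static_cast<int>(std::floor(radius * std::cos(newAngle) + 0.5));
            int srcY = static_cast<int>(std::floor(radius * std::sin(newAngle) + 0.5));
            srcX = static_cast<int>(srcX + cX);
            srcY = height - static_cast<int>(srcY + cY);

            // Clamp the source pixel to the image bounds.
            srcX = std::min(std::max(srcX, 0), width - 1);
            srcY = std::min(std::max(srcY, 0), height - 1);

            dst[i * pitch + j] = src[srcY * pitch + srcX];
        }
    }
}
```

Each iteration of the two loops corresponds to one CUDA thread in the kernel, which is why the port needs so little restructuring: the loop body becomes the kernel body, and the loop indices become the block and thread indices.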

Advanced image processing with CUDA

The code in this tutorial is fairly basic and requires minimal resources per thread. However, a CUDA thread can use plenty of resources, much as a CPU thread can. In our next tutorial, we show more advanced image processing with CUDA and demonstrate speedups for code that uses a large amount of local memory per thread.


This entry was posted by admin on September 20, 2011 at 12:03 am under CUDA, Graphics. Tagged Benchmark, C++, CPU, CUDA, GPU, Image, Image processing, Port, Rotate, Tutorial, Twist.