
General Purpose GPU Computing

Saturday, March 17th, 2007

For the last few weeks I’ve been immersed in the world of general purpose GPU computing. This growing subfield is exploring how to use the powerhouse graphics chips now sitting on the shelves at CompUSA as massively parallel floating-point coprocessors for numerical calculations. A quick comparison of CPUs and GPUs reveals why this could be a big win:

Specification | Athlon 64 X2 (CPU) | GeForce 8800 GTX (GPU)
Processor clock rate | 2–3 GHz | 575 MHz
Execution cores | 2 | 16 (with 8 double-clocked FPUs each)
Transistor count | 164–243 million | 681 million
On-chip memory | 1–2 MB | 256 kB
Off-chip memory bus | 128 bit | 384 bit
Off-chip memory capacity | 4 GB (or more) | 768 MB
Approximate GFLOPS | 5–10 | 346


(Both the GeForce 8800 and older versions of the Athlon 64 X2 are made with similar 90 nm silicon fabrication processes.) Even factoring in the roughly 3x to 4x larger transistor count, the GeForce completely blows away the CPU in raw floating-point performance.
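To put a number on "even factoring in the transistor count," here is a small back-of-the-envelope sketch in Python using only the figures from the comparison table above. The function names are made up for illustration, and because the table does not say which CPU transistor count goes with which GFLOPS figure, pairing the extremes is an assumption that only gives rough bounds.

```python
# Approximate figures copied from the comparison table.
GPU_GFLOPS = 346.0           # GeForce 8800 GTX
GPU_TRANSISTORS_M = 681.0    # millions of transistors


def raw_speedup(cpu_gflops):
    """How many times more GFLOPS the GPU delivers than the CPU."""
    return GPU_GFLOPS / cpu_gflops


def per_transistor_speedup(cpu_gflops, cpu_transistors_m):
    """Ratio of GFLOPS per million transistors, GPU over CPU."""
    gpu_density = GPU_GFLOPS / GPU_TRANSISTORS_M
    cpu_density = cpu_gflops / cpu_transistors_m
    return gpu_density / cpu_density


if __name__ == "__main__":
    # Assumed pairings: the most favorable and least favorable CPU cases.
    best_cpu = (10.0, 164.0)   # 10 GFLOPS from 164 million transistors
    worst_cpu = (5.0, 243.0)   # 5 GFLOPS from 243 million transistors

    for label, (gflops, transistors) in (("best CPU case", best_cpu),
                                         ("worst CPU case", worst_cpu)):
        print(label,
              "raw: %.1fx" % raw_speedup(gflops),
              "per transistor: %.1fx" % per_transistor_speedup(gflops, transistors))
```

With these numbers the raw advantage comes out between about 35x and 69x, and the per-transistor advantage between roughly 8x and 25x, so the GPU stays well ahead even after normalizing for chip size.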

Graphics chip makers were able to achieve this not by being 65 times smarter than AMD, but by solving a different problem. CPU makers want to run sequential programs with limited internal parallelism as fast as possible. You can do that with high clock rates, branch prediction, and fat on-chip caches to keep the highly clocked instruction cores fed. GPU makers, on the other hand, want to transform 3D solids and 2D textures into a realistic raster image for display 60 times per second. This is a highly data-parallel task best served by many identical floating-point units fed by a very wide and fast bus to off-chip memory. A small amount of caching for constants is useful, but beyond that, the cache would need to be nearly as big as the graphics memory to show much of a performance benefit.

Both NVIDIA and ATI have realized the potential power of using graphics cards as massively parallel floating point coprocessors, and are starting to open up the interface to directly program the cards. Before, you had to do calculations by translating your problem into the language of OpenGL/DirectX (or use a program like BrookGPU to do it automatically for you). Now with the Compute Unified Device Architecture (NVIDIA) and Close To the Metal (ATI), the graphics driver translation is unnecessary. You can now push your arrays onto the card and invoke functions on them without having to first convert them to textures and shader programs.

Not every problem can be solved efficiently by a graphics card. There are limits (about which I’ll say more in future posts), but I think many scientific programs have inner loops that operate on large amounts of floating-point data “in parallel,” even if they aren’t explicitly written that way. The challenge will be figuring out how to express the computation in a form the graphics card can work with most efficiently, and dealing with the limited precision of GPUs.
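As a rough illustration of both points, here is a plain Python sketch (not GPU code). It assumes a hypothetical inner loop of the form out[i] = a * x[i] + y[i], and it assumes "limited precision" means 32-bit single-precision floats. All function names are invented for the example.

```python
import struct


def saxpy_loop(a, xs, ys):
    """The inner loop as it might appear in a sequential program."""
    out = [0.0] * len(xs)
    for i in range(len(xs)):
        out[i] = a * xs[i] + ys[i]   # touches only element i
    return out


def saxpy_element(i, a, xs, ys):
    """The loop body pulled out as a per-element function."""
    return a * xs[i] + ys[i]


def saxpy_any_order(a, xs, ys):
    """Same result with the elements visited in reverse order.

    No iteration depends on another, so the order does not matter.
    That independence is what would let many floating-point units
    each take a different i at the same time.
    """
    out = [0.0] * len(xs)
    for i in reversed(range(len(xs))):
        out[i] = saxpy_element(i, a, xs, ys)
    return out


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
    print(saxpy_loop(2.0, xs, ys) == saxpy_any_order(2.0, xs, ys))

    # 2**24 + 1 has no exact single-precision representation.
    print(to_float32(16777217.0))
```

The loop never says "parallel," yet nothing in it forces a particular order. The last line shows the other half of the challenge: a value that fits comfortably in a double gets rounded once it is squeezed into 32 bits.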
