What’s Your Vector, Victor?
Monday, May 2nd, 2005
Along with playing with gcj, I’ve been playing around with gcc 4.0. I don’t expect to use it on any production code in the near future, but I want to know what’s coming. In general, reports have been mixed, at least on x86 and AMD64 architectures. I’m still waiting for my university to get the site-licensed copies of Tiger, so I can’t test how GCC 4.0 performs on the iBook G4. I anxiously await benchmarks on the PPC architecture.
In the meantime, I wanted to try out what I consider to be the most interesting addition to GCC: the loop auto-vectorizer. The possibility of using SIMD instructions on CPUs to accelerate loops sounds very promising. This optimization is NOT enabled by default, and while trying to figure out how to turn it on, I ended up reviewing the SIMD options on x86/AMD64:
- MMX - Eight 64-bit wide registers. Integer ops only.
- 3DNow! - Eight 64-bit wide registers. Integer and single-precision float ops.
- SSE - Eight 128-bit wide registers. Single-precision float ops only.
- SSE2 - Eight 128-bit wide registers. Integer, single and double-precision float ops.
- SSE2/AMD64 - Sixteen (!) 128-bit wide registers.
To actually get the vectorizer to run, you need -OX -ftree-vectorize, where X = 1, 2, or whatever your favorite optimization level is. If you leave out -O, the vectorizer is skipped. On AMD64, this is enough to get the vectorizer going. On 32-bit x86, you might also need -msse or -msse2, since the SSE instruction sets are typically not enabled by default there. For added fun, you can throw in -fdump-tree-vect -ftree-vectorizer-verbose=8 and check out the .vect file for a detailed explanation of what the compiler is doing when it analyzes loops.
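A minimal sketch of the kind of loop the vectorizer is meant to handle. This is an illustration, not the test code mentioned in this post; the file name, function names, array names and the size N are all made up. The compile line in the comment uses only the flags described above.

```c
/* vect_demo.c (hypothetical file name)
 * Build, using the flags from the post:
 *   gcc -O2 -ftree-vectorize -msse2 -ftree-vectorizer-verbose=8 -c vect_demo.c
 */
#include <assert.h>
#include <stddef.h>

#define N 1024  /* arbitrary size, a multiple of the 4 floats per SSE register */

/* Distinct file-scope arrays: sizes are known at compile time and the
 * three arrays cannot overlap. */
static float a[N], b[N], c[N];

/* Fill the inputs with values that are exactly representable as floats. */
void fill_inputs(void)
{
    size_t i;
    for (i = 0; i < N; i++) {
        b[i] = (float)i;
        c[i] = 2.0f * (float)i;
    }
}

/* The loop of interest: independent iterations, unit stride, one operation. */
void add_arrays(void)
{
    size_t i;
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* Read back one result so a caller can check the loop. */
float result_at(size_t i)
{
    assert(i < N);
    return a[i];
}
```

Calling fill_inputs() and then add_arrays() from a small main leaves a[i] equal to 3*i, which gives an easy way to confirm that the vectorized and non-vectorized builds agree.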
I’ll spare you crappy benchmarks for now. My test code is basically the simplest array loop possible, and totally meaningless. I’m hoping to turn this compiler loose on our FORTRAN code and see what it does with our array loops.
(Update) One thing I will say about performance: With SSE2 on the Opteron, the speedup from vectorizing simple loops appears to be linear in the number of values you can pack into one 128-bit SIMD register. So, for 16-bit shorts, the speedup is 8x, for 32-bit floats it is 4x, and for 64-bit doubles it is 2x. This makes sense, but I was surprised to see it work out almost exactly. A little hunting in the gcc manual showed why, at least for the floating point cases: on AMD64, gcc uses the SSE registers by default even for scalar floating point math, filling only one slot and wasting the rest of the register. The vectorized loop runs on the same registers with every slot in use.
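The arithmetic behind those numbers is just register width divided by element size. A small sketch, with a made-up helper name and constant, assuming one 128-bit SSE register:

```c
#include <assert.h>
#include <stddef.h>

#define SSE_REGISTER_BYTES 16  /* one 128-bit SSE register */

/* Number of elements of a given size that fit in one SSE register.
 * Under the observation above, this is also the best-case speedup. */
size_t lanes_per_register(size_t element_bytes)
{
    assert(element_bytes > 0 && element_bytes <= SSE_REGISTER_BYTES);
    return SSE_REGISTER_BYTES / element_bytes;
}
```

For 2-byte shorts this gives 8, for 4-byte floats 4, and for 8-byte doubles 2, matching the speedups listed above.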
