Low-level performance optimization hacks

Big, big disclaimer: Nowadays, the compiler already applies most low-level optimization techniques for you (if you turn optimization on, e.g. with GCC's -O3 flag, that is). I'm happy to discuss why these hacks work, but will mercilessly ridicule anyone who applies Duff's device without measurements showing that it is actually faster in their code. :-)

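Duff's device: There was a time when you could unroll a loop by hand by wrapping a do-while loop around the cases of a switch statement, so that the switch jumps into the middle of the unrolled body to take care of the leftover iterations. I'll spare you the ugliness here; have a look at the Wikipedia page for the source code.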
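If you'd rather not click through, the idea looks roughly like the sketch below. This is not Duff's original code: his version wrote every value to a single memory-mapped output register, while this variant copies bytes between two buffers. The name duff_copy and the unroll factor of 8 are my own choices for illustration.

```c
#include <stddef.h>

/* Copy n bytes from src to dst with the loop body unrolled 8 times.
 * Hypothetical helper for illustration; use memcpy in real code. */
void duff_copy(char *dst, const char *src, size_t n)
{
    if (n == 0)
        return;                      /* the do-while below always runs at least once */

    size_t passes = (n + 7) / 8;     /* trips through the unrolled body, rounded up */

    switch (n % 8) {                 /* jump into the middle of the body for the remainder */
    case 0: do { *dst++ = *src++;    /* deliberate fall-through on every case */
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

For n = 10 the switch enters at case 2, copies two bytes, then goes around the loop once more for the remaining eight. Whether this beats a plain loop or memcpy on your machine is exactly what you have to measure.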

Fast inverse square root: This is incredible. You can approximate the inverse square root of a 32-bit floating-point number by reinterpreting its bits as an integer and doing some fast integer operations involving a fixed magic constant. (I came across this via this post about a similar method for calculating reciprocals of doubles.)
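To make the trick concrete, here is a minimal sketch. It assumes float is a 32-bit IEEE 754 number and that the input is positive and finite. The constant 0x5f3759df is the one made famous by the Quake III Arena source; the function name fast_rsqrt is mine, and I use memcpy for the bit reinterpretation instead of the pointer cast in the well-known version, because the cast is undefined behaviour in modern C.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, finite x.
 * Assumes float is a 32-bit IEEE 754 value. */
float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);     /* look at the float's bits as an integer */
    bits = 0x5f3759dfu - (bits >> 1);   /* magic constant minus half the bit pattern */
    memcpy(&y, &bits, sizeof y);        /* back to float: a rough first guess */

    y = y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson step to refine the guess */
    return y;
}
```

The integer step alone gives a guess that is off by a few percent; the single Newton-Raphson step brings the error down to a fraction of a percent. For example, fast_rsqrt(4.0f) comes out at about 0.499 instead of 0.5. On current hardware a dedicated instruction or plain 1.0f / sqrtf(x) may well be faster, so the disclaimer at the top applies here too.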

I guess the second of these hacks is harder for the compiler to do on its own...