If you care about SIMD, you should try the Intel SPMD Program Compiler, ISPC (http://ispc.github.io/). Yes, it is a different language, but the results are also very different from what normal C compilers produce. For example, the next thing an experienced SIMD programmer would do is unroll the loop to process not just 2 doubles (SSE), 4 doubles (AVX), or 8 doubles (AVX-512) per iteration, but a full cache line, and then reduce the partial results to a single sum. ISPC, on the other hand, does this automatically.
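To make the "unroll, then reduce" idea concrete, here is a minimal sketch in plain C of what one would write by hand. It assumes a 64-byte cache line (8 doubles), and `sum_unrolled` is a hypothetical name used only for illustration; this is not ISPC output and not code from any particular project.

```c
#include <stddef.h>

/* Sum n doubles with 8 independent accumulators.
   Assumption: a cache line is 64 bytes, i.e. 8 doubles per iteration. */
double sum_unrolled(const double *a, size_t n)
{
    double acc[8] = {0.0};   /* one partial sum per lane */
    size_t i = 0;

    /* Main loop: consume 8 doubles per iteration, one per accumulator. */
    for (; i + 8 <= n; i += 8)
        for (size_t k = 0; k < 8; ++k)
            acc[k] += a[i + k];

    /* Reduce the 8 partial sums to a single value. */
    double total = 0.0;
    for (size_t k = 0; k < 8; ++k)
        total += acc[k];

    /* Tail: handle the remaining n % 8 elements. */
    for (; i < n; ++i)
        total += a[i];

    return total;
}
```

The independent accumulators are what give the compiler room to vectorize without being told it may reorder floating-point additions. The catch is that the summation order differs from a simple sequential loop, so the rounding of the result can differ slightly.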
Thanks, I'll check this out. I'm very interested in x64 SIMD. This is the sort of thing I've been doing with it: http://arxiv.org/abs/1401.6399
Although I'm talking up Intel's compiler, and do think it usually beats GCC, my actual experience is that I've largely given up on writing anything performant in C (much less C++). If you give a compiler an inch, it will figure out a way to hang itself. I'm having much better luck writing the whole function in inline assembly rather than with intrinsics.
Your advice is not bad, but you should definitely try ISPC before rewriting everything in assembly. I'm currently optimizing the Cycles raytracer (without implementing an actual SIMD-packet raytracer), and I'm a little tired of the avalanche effect in modern compilers too. The unusual thing about Cycles is that its code can be compiled by CUDA, OpenCL and normal C++ compilers. GPU and CPU architectures are very different, so SIMD optimization is not trivial at all. Simply reordering operations can lead to a completely different stack map and a 10% performance drop. And you will never know about it until you test with every supported compiler (i.e. gcc, clang and msvc).