The math is very simple: register spill is the same, GCC generates 7 vaddsd and 9 vmulsd. ICC generates 7 vaddsd and 15 vmulsd. addsd latency is 3, mulsd latency is 5. Intel CPUs have multiple add and multiply units, but this code keeps them all busy, so ICC should be about 30% slower with this assembly.
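To spell out that back-of-envelope estimate, here is a minimal Python sketch. It assumes the worst case, where every instruction waits on the previous one so the latencies simply add up; the name `naive_cycles` is made up for illustration. Under that assumption GCC's sequence costs 66 cycles and ICC's costs 96, so GCC's takes roughly 30% fewer cycles (equivalently, ICC's takes about 45% more).

```python
# Latencies quoted above, in cycles.
ADD_LATENCY = 3  # addsd
MUL_LATENCY = 5  # mulsd


def naive_cycles(adds, muls):
    """Sum of latencies, assuming one fully serial dependency chain.

    This is an assumption made for the estimate, not a model of a real
    out-of-order core.
    """
    return adds * ADD_LATENCY + muls * MUL_LATENCY


gcc = naive_cycles(adds=7, muls=9)    # 7 vaddsd, 9 vmulsd
icc = naive_cycles(adds=7, muls=15)   # 7 vaddsd, 15 vmulsd

print("GCC:", gcc)                    # GCC: 66
print("ICC:", icc)                    # ICC: 96
print("ICC / GCC:", icc / gcc)
```

This only counts instructions and latencies. It ignores instruction ordering and execution ports, so treat the result as a rough upper bound rather than a prediction.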
> Maybe ICPC is inlining things
Check this. It looks like it generates vectorized code. I don't have the Intel compiler, so all I can do here is read the assembly.
> having looked at a fair amount of generated SIMD x64 assembly
There are no SIMD in any of your asm files. All these sin1-sin7 functions accept only one double and return only one result. They work with 128-bit operands, but only the low 64 bits contain the values being operated on.
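One quick way to check a listing for this is to tally scalar (`sd`/`ss`) against packed (`pd`/`ps`) arithmetic mnemonics. The Python sketch below is only an illustration: the sample listing is hypothetical and not taken from the posted asm files, `count_fp_ops` is a made-up helper, and the pattern covers only plain add/sub/mul/div, so FMA and other instructions are ignored.

```python
import re
from collections import Counter

# Hypothetical listing for illustration only (AT&T syntax, as objdump prints it).
SAMPLE = """\
vmulsd %xmm0,%xmm0,%xmm1
vaddsd %xmm0,%xmm1,%xmm2
vmulsd %xmm2,%xmm1,%xmm3
vmulpd %ymm0,%ymm0,%ymm1
"""

# Optional VEX "v" prefix, an operation, then a scalar or packed suffix.
MNEMONIC = re.compile(r"\bv?(add|sub|mul|div)(sd|ss|pd|ps)\b")


def count_fp_ops(listing):
    """Count (operation, 'scalar' | 'packed') pairs found in a disassembly."""
    counts = Counter()
    for match in MNEMONIC.finditer(listing):
        op, suffix = match.groups()
        kind = "scalar" if suffix in ("sd", "ss") else "packed"
        counts[(op, kind)] += 1
    return counts


counts = count_fp_ops(SAMPLE)
for (op, kind), n in sorted(counts.items()):
    print(op, kind, n)
```

If only the scalar rows show up, the function is using the low lane of the xmm registers and is not vectorized.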
> The math is very simple: register spill is the same, GCC generates 7 vaddsd and 9 vmulsd. ICC generates 7 vaddsd and 15 vmulsd. addsd latency is 3, mulsd latency is 5.
I strongly disagree that it's simple: figuring out the actual performance on a real processor without running the code is far from simple. In particular, the ordering of the instructions can be crucial. Yes, it can be done, but it takes a lot more than counting instructions. Intel's IACA (free) is the most useful tool I've found for this; otherwise I just spend a lot of time with Agner Fog's execution port info.
> There are no SIMD in any of your asm files.
You are absolutely correct. I put the objdump into pastebin, but did not actually look at this code. I just presumed it was vectorized from seeing the large jump in performance with '-mavx'.
Looking closer, it looks like Intel is using a dispatch function, and has a separate vectorized version that it uses when it can. I need to go to sleep, but I'll post the whole objdump so you can inspect: http://pastebin.com/B3EBi0Lq
And if you send me an email (my address is in my profile), I'll send you a binary to play with. Or if you tell me where to upload it, I can do so.