It's interesting that your clang and my clang give different results, even though we're using the same version. I suspect it's down to differing CPU microarchitectures (i.e. my CPU is probably a different model from yours).
I originally did put loop alignment in my asm version, but I took it out because it was actually ever so slightly slower on mine. Make of that what you will.
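I think a big difference is the flags. At least for g++, if you don't specify -march=native -mtune=native you're likely to take a performance hit, because the compiler then targets a generic baseline x86-64 CPU rather than yours. How big the hit is depends on which instruction-set extensions your CPU supports and whether the code can make use of them.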
The sorttest.zip Makefile has only the following flags specified: -O3 --std=c++11
Whereas bjourne has: -O3 --std=c++11 -fomit-frame-pointer -march=native -mtune=native
I might rerun your tests with bjourne's additions!
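One way to see what the extra flags actually change is to ask the compiler which instruction-set extensions it was allowed to use. The following is only a minimal sketch: the function name enabled_isa_extensions is made up for illustration, and the macros it checks are the ones GCC and Clang predefine, so other compilers may report nothing.

```cpp
#include <string>

// Returns a space-separated list of the x86 instruction-set extensions the
// compiler was allowed to use for this translation unit. The macros checked
// are predefined by GCC and Clang; the function name is hypothetical.
std::string enabled_isa_extensions() {
    std::string out;
#ifdef __SSE2__
    out += "sse2 ";
#endif
#ifdef __SSE4_2__
    out += "sse4.2 ";
#endif
#ifdef __AVX__
    out += "avx ";
#endif
#ifdef __AVX2__
    out += "avx2 ";
#endif
#ifdef __BMI2__
    out += "bmi2 ";
#endif
    if (out.empty()) {
        out = "(none of the checked extensions)";
    }
    return out;
}
```

Printing the result from a build with -O3 --std=c++11 and again from a build with -march=native -mtune=native added should show whether the second build is allowed to use anything the first is not. On a CPU without the newer extensions the two lists may well be identical.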
That's very likely. Mine is an AMD Phenom(tm) II X6 1090T. Though I changed your code a little so that the intro looks like this:
sortRoutine:
    ; rdi = items
    ; esi = count
    push rbp ; <- stack alignment push
sortRoutine_start:
    cmp esi, 2
    jb done
    dec esi
The "cmp esi, 2; jb done; dec esi" corresponds to your "sub rdx, 1; jbe done". That improves it on my machine to 63 ms/loop. If you are interested I can put it online somewhere.
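To spell out why the two entry sequences are interchangeable, here is a small C++ model of the flag logic. It is a sketch under assumptions: the function names are invented, and both counts are modelled as 32-bit unsigned values even though one version uses esi and the other rdx.

```cpp
#include <cstdint>

// Variant A: "cmp esi, 2; jb done".
// jb is an unsigned "below", so the branch is taken when count < 2.
bool skips_cmp_jb(std::uint32_t count) {
    return count < 2;
}

// Variant B: "sub rdx, 1; jbe done".
// jbe is taken when the subtraction borrowed (CF=1, count was 0)
// or produced zero (ZF=1, count was 1).
bool skips_sub_jbe(std::uint32_t count) {
    bool carry = count < 1;            // borrow out of count - 1
    std::uint32_t result = count - 1;  // wraps for 0, well defined for unsigned
    bool zero = (result == 0);
    return carry || zero;
}
```

Both functions return true only for counts of 0 and 1, so either sequence skips the sort for trivially short inputs and falls through with count - 1 in the register otherwise.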
This shouldn't be needed. 8-byte alignment is fine for the CPU itself. The purpose of 16-byte alignment is to make 16-byte-aligned stack allocations possible.
I also checked the equivalent AMD manual, and it doesn't seem to mention RSP alignment at all, except to say that "some calling conventions may require ...".
The CPU doesn't care. It only matters when you call functions which allocate 16B objects on the stack.* This function calls only itself and pushes only 8B words on the stack so it's fine with 8B alignment.
* Some functions generated by C compilers do and they segfault if you call them with wrong alignment. Ask me how I know.
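For illustration, this is the kind of 16B stack object the footnote is about. It is a sketch only; Block16, misalignment16 and local_block_is_aligned are names made up for this example. A compiler that trusts the calling convention's alignment at function entry is free to access such an object with aligned 16-byte loads and stores, which is where a call from hand-written asm with RSP off by 8 can go wrong.

```cpp
#include <cstdint>

// A 16-byte object of the kind a compiler may place on the stack.
struct alignas(16) Block16 {
    unsigned char bytes[16];
};

// Offset of an address from the previous 16-byte boundary (0 means aligned).
std::uintptr_t misalignment16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16;
}

// Places a Block16 on the stack and reports whether it is 16-byte aligned.
bool local_block_is_aligned() {
    Block16 local{};
    return misalignment16(&local) == 0;
}
```

When called normally from compiled code this should always report true. A routine like the sort above, which only calls itself and only pushes 8B words, never creates such an object and so has nothing to misalign.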
edit:
OK, so I downloaded this code. Results:
as-is: 78111us
push rbp: 73093us
sub rsp,8: 72332us
sub rax,8: 72222us
Seems to be a matter of instruction alignment, nothing to do with the stack.
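To put those numbers in proportion, the relative improvement over the as-is run can be worked out with a one-line helper (speedup_percent is just a name picked for this sketch):

```cpp
// Relative improvement of variant_us over baseline_us, in percent.
double speedup_percent(double baseline_us, double variant_us) {
    return (baseline_us - variant_us) / baseline_us * 100.0;
}
```

Fed the timings above with 78111us as the baseline, this gives roughly 6.4% for push rbp, 7.4% for sub rsp,8 and 7.5% for sub rax,8. The last two differ by only 110us, even though sub rax,8 never touches the stack pointer, which is consistent with the effect coming from where the following instructions land rather than from stack alignment.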