Yep, I misread it also: I saw j = 2 * i (which would do evens and then odds when NUMBER is odd, or evens then evens again if NUMBER is even).
For what it actually is - powers of 2 mod NUMBER - most reads should miss the cache once NUMBER is large. So the first example only has to go to main memory every 64th index, while the second has to on almost every read. I think that agrees with what you're saying. It also explains why it's ~5x slower, which had seemed too large under my earlier misreading.
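For reference, here's roughly what I understand the two access patterns to be (a hypothetical reconstruction, not the original benchmark - the array size and contents are my guesses):

```python
NUMBER = 1_000_003          # arbitrary non-power-of-two size, so the
                            # power-of-2 index sequence doesn't collapse to 0
data = list(range(NUMBER))

# First example: sequential walk. A main-memory read happens only once per
# cache line's worth of elements; the rest hit cache.
total_seq = 0
for i in range(NUMBER):
    total_seq += data[i]

# Second example: index walks powers of 2 mod NUMBER. The stride doubles
# each step, so once it exceeds a cache line almost every read misses.
total_pow = 0
j = 1
for _ in range(NUMBER):
    j = (j * 2) % NUMBER
    total_pow += data[j]
```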
>So the first example has to read from main memory only every 64th index
Wouldn't it be 8 per cache line (if the elements are 64-bit ints and each cache line is 64 bytes - with 32-bit ints it'd be 16)? I'm also assuming the prefetcher pulls in a larger chunk of the array across multiple lines. Is that not how it works?
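The per-line count just depends on element size (a C `int` is typically 32 bits, so 8 per line only holds for 64-bit elements):

```python
CACHE_LINE_BYTES = 64  # typical on x86

# Elements per 64-byte cache line for the two plausible element sizes:
for name, size_bytes in [("64-bit element", 8), ("32-bit int", 4)]:
    print(name, "->", CACHE_LINE_BYTES // size_bytes, "per line")
# 64-bit element -> 8 per line
# 32-bit int -> 16 per line
```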
But I think there's a more fundamental issue here, which is that the amount measured, 68 million bytes in a second, is what, ~65MB? Did he just reduce the array size until it completed in a second? A very significant chunk of that is going to fit in L3 cache (8MB on an i7), so even a genuinely random access pattern would understate the problem, because the data is still contiguous.
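Doing the arithmetic on the numbers quoted (the 8MB L3 figure being the i7 one mentioned above):

```python
bytes_per_sec = 68_000_000            # figure quoted above
l3_bytes = 8 * 1024 ** 2              # 8MB L3 on the i7 mentioned

print(bytes_per_sec / 1_000_000)                  # 68.0 decimal MB
print(round(bytes_per_sec / 1024 ** 2, 1))        # 64.8 binary MiB
print(round(100 * l3_bytes / bytes_per_sec, 1))   # 12.3 (% resident in L3)
```

So on the order of an eighth of the working set sits in L3 alone, before even counting L1/L2.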
Which seems kinda dumb to me, since the real-world problem you're likely to run into is when your data is stored non-contiguously because it's scattered across multiple different structs/objects, making it impossible to utilise the cache to any significant degree. In that situation (very common under OO or interpreted languages) I'd expect a way more dramatic slowdown.
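To illustrate the layout difference (a toy Python sketch, not a benchmark - `array.array` really does pack values into one contiguous buffer, while each object lives in its own heap allocation, so traversing the latter is a pointer chase with little spatial locality):

```python
import array

N = 100_000

# Contiguous: values packed into one buffer - sequential access can use
# full cache lines.
packed = array.array('q', range(N))

# Scattered: each value boxed in its own heap object, as is typical under
# OO or interpreted languages - every access follows a pointer to a
# separate allocation.
class Point:
    __slots__ = ('value',)
    def __init__(self, v):
        self.value = v

scattered = [Point(v) for v in range(N)]

# Same data either way; only the memory layout differs.
assert sum(packed) == sum(p.value for p in scattered)
```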
>>> main(20)
2 4 8 16 12 4 8 16 12 4 8 16 12 4 8 16 12 4 8 16
It's not even hitting odd indexes. Over half the array will be garbage at the end. I guess that would count as out-of-order though.
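Easy to check which slots the sequence ever touches for NUMBER = 20:

```python
NUMBER = 20
visited = set()
j = 1
for _ in range(NUMBER):
    j = (j * 2) % NUMBER   # powers of 2 mod NUMBER
    visited.add(j)
print(sorted(visited))     # [2, 4, 8, 12, 16] - 15 of 20 slots never written
```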