Yes, the author ironically doesn't actually understand what's happening in the pipeline for the processor they're writing about.
It is a superscalar processor using reservation stations and Tomasulo's algorithm, so a RAW hazard doesn't stall the pipeline. But instructions affected by RAW hazards will still slow down execution if they sit on the critical path. So paying close attention to instruction-level dependencies produces better results either way, since you can increase the ILP available in the instruction sequence. This helps a lot on code that might also run on in-order pipelines (a lot of mobile and embedded chips still use them).
That's just nit-picking the nomenclature. The author is simplifying things greatly, but he's essentially correct - the naive version with fewer operations ends up slower because there is a long chain of dependencies.
With out-of-order execution this isn't really a stall, because non-dependent instructions can still execute and retire alongside it using renamed registers, but it still delays progress of the dependent chain the same way an in-order pipeline would. You might say: its progress is stalled.
No. A read-after-write hazard has not stalled a CPU in the last 10 years. So the article will mislead all the beginner architects out there. A clear explanation of dependencies would be much more appropriate.
There are CPUs that have been stalled by RAWs and appeared in the last 10 years, and they are here to stay.
That being said, I think the downvotes on your post are unfair. The author of the post is writing about CPUs that are indeed unlikely to ever be stalled by RAWs.
Sadly, this is only true on proper desktop CPUs. For an example of a CPU that does horribly on RAW hazards, look no further than the Xenon CPU in the Xbox 360.
If you need a result to perform the next operation, you have something stronger than a RAW hazard. A RAW hazard is not necessarily guaranteed to stall, whereas a true dependency on the result, regardless of which register it's supposed to land in, can't be optimized away - it leaves no opportunities for register renaming and instruction reordering to work. It's probably going a little far to say that the author doesn't understand the pipeline. I thought they were over-simplifying the pipeline stages, but the potential for register renaming and instruction reordering doesn't fundamentally change on AMD vs Intel's x86 CPUs, while it does change depending on how you write your C. The results speak for themselves.
> but the potential for register renaming and instruction reordering doesn't fundamentally change on AMD vs Intel's x86 CPUs
People are writing code to run on a lot more than x86_64 these days. The author's way of optimizing is actually more correct for an in-order pipeline than an out-of-order one and that is probably the correct approach for multi-architecture code.
But the author's approach is not particularly well informed by knowledge of what's actually happening in the pipeline of a modern x86 processor. Nor, in this case, does it particularly matter. Which is why the author's assertion that knowing about the pipeline will help with code optimization is somewhat ironic, since they are modeling their optimization on a completely different pipeline than the one the processor actually implements.
And more interestingly, that's probably the correct way to do it for code that you expect to run on more than one processor generation, or certainly for code that runs on more than one processor architecture.
He points out this is a simplification several times:

"For the sake of simplicity, imagine the CPU’s pipeline depth is 3. At a given time, it can fetch, execute and finish one instruction"

and later:

"I don’t know whether this scheduling is optimal for the (incorrect) assumption of a 3-stage pipeline, but it does look pretty good. Also, loading a0, a1 etc. from memory hasn't been covered for the sake of simplicity."
This wouldn't be anywhere near as readable an article if it covered the gory details of multiple decode, dispatch and register renaming. The same analysis still applies reasonably well to an out-of-order pipeline.
The simplification the author points out is arbitrarily picking a pipeline depth of three, not that they simplified the entire concept of what the pipeline is actually doing.
As I've stated multiple times, it's a totally valid way to model it, but it is also useful and important to know it is not how the processor actually works.
To be clear, are you trying to point out that their charts look like in-order modelling on an out-of-order CPU?
Sequential dependencies, especially those in looping calculations, turn out-of-order execution into in-order execution. Past a certain degree of dependency there's simply no way to reorder the execution. I don't know what the word for it is, but an engineer will sooner compress a movie to one bit than reorder certain kinds of sequential dependencies. There's nothing left to do out-of-order because every candidate instruction is waiting for something in-order.
This isn't even a loop. But yes, the critical path and its dependencies are the main factor in determining the run speed of the instruction sequence, unless your processor is crazy enough to value predict. ;)
(Value prediction has never been successfully implemented, but a few papers have been written on it. There are actually some interesting possibilities that essentially involve speculating on a value, forking an SMT thread with it, and finding out later whether or not to squash it.)
Use objdump and just look for the function names. Autovectorization is about the only curve-ball I would expect, but I haven't seen compilers being especially smart about the pipeline or SIMD.