Yes, the author ironically doesn't actually understand what's happening in the pipeline for the processor they're writing about.
It is a superscalar processor using reservation stations and Tomasulo's algorithm, so a RAW hazard doesn't stall the pipeline. But instructions affected by RAW hazards will still slow down execution if they sit on the critical path. So paying close attention to instruction-level dependencies produces better results either way, since you can increase the ILP available in the instruction sequence. This helps a lot on code that might also run on in-order pipelines (a lot of mobile and embedded chips still use them).
That's just nit-picking the nomenclature. The author is simplifying things greatly, but he's essentially correct - the naive version with fewer operations ends up slower because there is a long chain of dependencies.
With out-of-order execution this isn't really a stall, because non-dependent instructions can still execute and retire alongside it using renamed registers, but it still delays progress of the dependent chain the same way an in-order pipeline would. You might say: its progress is stalled.
No. A read-after-write hazard has not stalled a CPU in the last 10 years. So the article will mislead all the beginner architects out there. A clear explanation of dependencies would be much more appropriate.
There are CPUs that have been stalled by RAWs and appeared in the last 10 years, and they are here to stay.
That being said, I think the downvotes on your post are unfair. The author of the post is writing about CPUs that are indeed unlikely to ever be stalled by RAWs.
Sadly, this is only true on proper desktop CPUs. For an example of a CPU that does horribly on RAW hazards, look no further than the Xenon CPU in the Xbox 360.
If you need a result to perform the next operation, you have something stronger than a RAW hazard. A RAW hazard is not necessarily guaranteed to stall, whereas a true dependency on the result, regardless of which register it's supposed to land in, can't be optimized away - it leaves no opportunities for register renaming and instruction reordering to work. It's probably going a little far to say that the author doesn't understand the pipeline. I thought they were over-simplifying the pipeline stages, but the potential for register renaming and instruction reordering doesn't fundamentally change on AMD vs Intel's x86 CPUs, while it does change depending on how you write your C. The results speak for themselves.
> but the potential for register renaming and instruction reordering doesn't fundamentally change on AMD vs Intel's x86 CPUs
People are writing code to run on a lot more than x86_64 these days. The author's way of optimizing is actually more correct for an in-order pipeline than an out-of-order one and that is probably the correct approach for multi-architecture code.
But the author's approach is not particularly well informed by knowledge of what's actually happening in the pipeline of a modern x86 processor. Nor, in this case, does it particularly matter. Which is why the author's assertion that knowing about the pipeline will help with code optimization is somewhat ironic, since they are modeling their optimization on a completely different pipeline than the one the processor actually implements.
And more interestingly, that's probably the correct way to do it for code that you expect to run on more than one processor generation, or certainly for code that runs on more than one processor architecture.
He points out this is a simplification several times:

"For the sake of simplicity, imagine the CPU’s pipeline depth is 3. At a given time, it can fetch, execute and finish one instruction"

and later:

"I don’t know whether this scheduling is optimal for the (incorrect) assumption of a 3-stage pipeline, but it does look pretty good. Also, loading a0, a1 etc. from memory hasn't been covered for the sake of simplicity."
This wouldn't be anywhere near as readable an article if it covered the gory details of multiple decode, dispatch and register renaming. The same analysis still applies reasonably well to an out-of-order pipeline.
The simplification the author points out is arbitrarily picking a pipeline depth of three, not that they simplified the entire concept of what the pipeline is actually doing.
As I've stated multiple times, it's a totally valid way to model it, but it is also useful and important to know it is not how the processor actually works.
To be clear, are you trying to point out that their charts look like in-order modelling on an out-of-order CPU?
Sequential dependencies, especially those in looping calculations, turn out-of-order execution into in-order execution. Past a certain degree of dependency there's simply no way to reorder the execution. I don't know what the word for it is, but an engineer will sooner compress a movie to one bit than reorder certain kinds of sequential dependencies. There's nothing left to do out-of-order because every candidate instruction is waiting for something in-order.
This isn't even a loop. But yes, the critical path and its dependencies are the main factor in determining the run speed of the instruction sequence, unless your processor is crazy enough to value predict. ;)
(Value prediction has never been successfully implemented, but a few papers have been written on it. There are actually some interesting possibilities that essentially involve speculating on a value, forking an SMT thread with it, and finding out later whether or not to squash it.)
Use objdump and just look for the function names. Autovectorization is about the only curve-ball I would expect, but I haven't seen compilers being especially smart about the pipeline or SIMD.