clamchowder · on Dec 1, 2024

Yes, I tested on CCD1 (the non-vcache CCD) on both BIOS versions.

clamchowder · on Dec 1, 2024

Seems like no one ever reads the byline anymore

clamchowder · on Aug 11, 2024

Some notes: 1. Consider whether M1's 8-wide decoder could hit the 5+ GHz clock speeds that Intel Golden Cove's decoder can. More complex logic with more delays is harder to clock up. Of course M1 may be held back by another critical path, but it's interesting that no one has managed to get an 8-wide Arm decoder running at the clock speeds that Zen 3/4 and Golden Cove can.

A715's slides say the L1 icache gains uop cache features including caching fusion cases. Likely it's a predecode scheme much like AMD K10, just more aggressive with what's in the predecode stage. Arm has been doing predecode (moving some stages to the L1i fill path rather than the hotter L1i hit path) to mitigate decode costs for a long time. Mitigating decode costs again with a uop cache never made much sense especially considering their low clock speeds. Picking one solution or the other is a good move, as Intel/AMD have done. Arm picked predecode for A715.

2. The paper does not say 22% of core power is in the decoders. It does say core power is ~22% of package power. Wrong figure? Also, can you determine if the decoder power situation is different on Arm cores? I haven't seen any studies on that.

3. Multiple decode blocks don't carry a throughput penalty once the load balancing is done right, which Gracemont did. And you have to massively unroll a loop to screw up Tremont anyway. Conversely, decode blocks may lose less throughput with branchy code. Consider that decode slots after a taken branch are wasted, and clustered decode gets around that. Intel stated they preferred 3x3 over 2x4 for that reason.

4. "uops used by ARM are extremely close to the original instructions" It's the same on x86, micro-op count is nearly equal to instruction count. It's helpful to gather data to substantiate your conclusions. For example, on Zen 4 and libx264 video encoding, there's ~4.7% more micro-ops than instructions. Neoverse V2 retires ~19.3% more micro-ops than instructions in the same workload. Ofc it varies by workload. It's even possible to get negative micro-op expansion on both architectures if you hit branch fusion cases enough.

8. You also have to tell your ARM compiler which of the dozen or so ISA extension levels you want to target (see https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#inde...). It's not one option by any means. Not sure what you mean by "peephole heuristic optimizations", but people certainly micro-optimize for both arm and x86. For arm, see https://github.com/dotnet/runtime/pull/106191/files as an example. Of course optimizations will vary for different ISAs and microarchitectures. x86 is more widely used in performance critical applications and so there's been more research on optimizing for x86 architectures, but that doesn't mean Arm's cores won't benefit from similar optimization attention should they be pressed into a performance critical role.

neonsunset · on Aug 11, 2024

> Not sure what you mean by "peephole heuristic optimizations"

Post-emit or within-emit stage optimization where a sequence of instructions is replaced with a more efficient shorter variant.

Think replacing pairs of ldr and str with ldp and stp, changing ldr and increment with ldr with post-index addressing mode, replacing address calculation before atomic load with atomic load with addressing mode (I think it was in ARMv8.3-a?).

The "heuristic" here might be possibly related to additional analysis when doing such optimizations.

For example, the previously mentioned ldr, ldr -> ldp (or stp) optimization is not always a win. During work on .NET 9, there was a change[0] that improved load and store reordering to make it more likely that simple consecutive loads and stores are merged on ARM64. However, this change caused regressions in various hot paths because, for example, a previously matched ldr w0, [addr], ldr w1, [addr+4] -> modify w0 -> str w0, [addr] sequence got replaced with ldp w0, w1, [addr] -> modify w0 -> str w0, [addr].

Turns out this kind of merging defeated store forwarding on Firestorm (and newer) as well as other ARM cores. The regression was subsequently fixed[1], but I think the parent comment author may have had scenarios like these in mind.

[0]: https://github.com/dotnet/runtime/pull/92768

[1]: https://github.com/dotnet/runtime/pull/105695
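
To make the ldr, ldr -> ldp case concrete, here is a toy Python sketch of a peephole pass with one heuristic bolted on. The tuple-based instruction format, the names `peephole_ldp` and `writes_back_soon`, and the rule "don't merge when a store back to the first slot follows shortly" are all assumptions made up for illustration; this is not how the .NET JIT implements the fix.

```python
# Toy instruction format (an assumption for illustration):
#   ("ldr", dst, base, offset)   ("str", src, base, offset)
#   ("ldp", dst1, dst2, base, offset)   ("op", text)

def writes_back_soon(instrs, start, base, offset, window=3):
    """True if one of the next `window` instructions stores to [base, offset]."""
    for ins in instrs[start:start + window]:
        if ins[0] == "str" and ins[2] == base and ins[3] == offset:
            return True
    return False

def peephole_ldp(instrs, size=4, guard_stores=True):
    """Merge adjacent 32-bit loads from consecutive addresses into one ldp."""
    out = []
    i = 0
    while i < len(instrs):
        a = instrs[i]
        b = instrs[i + 1] if i + 1 < len(instrs) else None
        mergeable = (
            b is not None
            and a[0] == "ldr" and b[0] == "ldr"
            and a[2] == b[2]            # same base register
            and b[3] == a[3] + size     # consecutive slots
            and a[1] != b[1]            # distinct destinations
            and a[1] != a[2]            # first load must not clobber the base
        )
        # Hypothetical heuristic: keep the loads separate if the first slot
        # is written back right away, as in the hot path described above.
        if mergeable and guard_stores and writes_back_soon(instrs, i + 2, a[2], a[3]):
            mergeable = False
        if mergeable:
            out.append(("ldp", a[1], b[1], a[2], a[3]))
            i += 2
        else:
            out.append(a)
            i += 1
    return out

hot_path = [
    ("ldr", "w0", "x2", 0),
    ("ldr", "w1", "x2", 4),
    ("op", "add w0, w0, #1"),
    ("str", "w0", "x2", 0),
]
naive = peephole_ldp(hot_path, guard_stores=False)   # merges into ldp
guarded = peephole_ldp(hot_path, guard_stores=True)  # leaves the two ldr alone
```

The pattern match itself is the "find & replace" part; the extra lookahead is the "heuristic" part, and choosing when it should fire is where the real analysis goes.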

hajile · on Aug 11, 2024

1. Why would you WANT to hit 5+GHz when the downsides of exponential power take over? High clocks aren't a feature -- they are a cope.

AMD/Intel maintain I-cache and maintain a uop cache kept in sync. Using a tiny part to pre-decode is different from a massive uop cache working as far in advance as possible in the hopes that your loops will keep you busy enough that your tiny 4-wide decoder doesn't become overwhelmed.

2. The float workload was always BS because you can't run nothing but floats. The integer workload had 22.1w total core power and 4.8w power for the decoder. 4.8/22.1 is 21.7%. Even the 1.8w float case is 8% of total core power. The only other argument would be that the study is wrong and 4.8w isn't actually just decoder power.

3. We're talking about worst cases here. Nothing stops ARM cores from creating a "work pool" of upcoming branches in priority order for them to decode if they run out of stuff on the main branch. This is the best of both worlds where you can be faster on the main branch AND still do the same branchy code trick too.

4. This is the tail wagging the dog (and something else if your numbers are correct). Complex x86 instructions have garbage performance, so they are avoided by the compiler. The problem is that you can't GUARANTEE those instructions will NEVER be used, so the mere specter of them forces complex algorithms all over the place where ARM can do more simple things.

In any case, your numbers raise a VERY interesting question about x86 being RISC under the hood.

Consider this. Say that we have 1024 bytes of ARM code (256 instructions). x86 is around 15% smaller (about 870 bytes) and with the longer 4.25 byte instruction average, x86 should have around 205 instructions. If ARM is generating 19.3% more uops than instructions, we have about 305 uops. x86 with just 4.7% more has 215 uops (the difference is way outside any margin of error).

If both are doing the same work, x86 uops must be in the range of 30% more complex. Given the limits of what an ALU can accomplish, we can say with certainty that x86 uops are doing SOMETHING that isn't the RISC they claim to be doing. Perhaps one could claim that x86 is doing some more sophisticated instructions in hardware, but that's a claim that would need to be substantiated (I don't know what ISA instructions you have that give a 15% advantage being done in hardware, but aren't already in the ARM ISA and I don't see ARM refusing to add circuitry for current instructions to the ALU if it could reduce uops by 15% either).

8. https://en.wikipedia.org/wiki/Peephole_optimization

The final optimization stage is basically heuristic find & replace. There could in theory be a mathematically provable "best instruction selection", but finding it would require trying every possible combination, which isn't feasible as long as P≠NP holds true.

My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though. Instead, you take your basic, short instruction and pad it with useless bytes. Add a couple useless bytes to a bunch of instructions and you now have the right length to push the function over to the cache boundary without adding any NOPs.

But the issues go deeper. When do you use a REX prefix? You may want it so you can use 16 registers, but it also increases code size. REX2 with APX is going to increase this issue further where you must juggle when to use 8, 16, or 32 registers and when you should prefer the long REX2 because it has 3-register instructions. All kinds of weird tradeoffs exist throughout the system. Because the compilers optimize for the CPU and the CPU optimizes for the compiler, you can wind up in very weird places.

In an ISA like ARM, there isn't any code density weirdness to consider. In fact, there's very little weirdness at all. Write it the intuitive way and you're pretty much guaranteed to get good performance. Total time to work on the compiler is a zero-sum game given the limited number of experts. If you have to deal with these kinds of heuristic headaches, there's something else you can't be working on.
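
The back-of-the-envelope estimate in point 4 above can be reproduced with a few lines of Python. Every input is one of the figures quoted in this thread (the 15% code size gap and the 4.25 byte average x86 instruction length are this comment's own assumptions, not measurements), and the variable names are made up for the sketch.

```python
arm_bytes = 1024
arm_insn_bytes = 4           # fixed-length AArch64 encoding
x86_size_ratio = 0.85        # "around 15% smaller" (assumed in the comment)
x86_avg_insn_bytes = 4.25    # average x86 instruction length (assumed in the comment)
arm_uop_expansion = 1.193    # Neoverse V2 libx264 figure quoted upthread
x86_uop_expansion = 1.047    # Zen 4 libx264 figure quoted upthread

arm_insns = arm_bytes / arm_insn_bytes
x86_bytes = arm_bytes * x86_size_ratio
x86_insns = x86_bytes / x86_avg_insn_bytes
arm_uops = arm_insns * arm_uop_expansion
x86_uops = x86_insns * x86_uop_expansion

# Fraction of micro-ops x86 saves relative to Arm for the "same" code.
x86_fewer_uops = 1 - x86_uops / arm_uops

print(f"ARM: {arm_insns:.0f} instructions, {arm_uops:.0f} uops")
print(f"x86: {x86_insns:.0f} instructions, {x86_uops:.0f} uops")
print(f"x86 uses {x86_fewer_uops:.1%} fewer uops")
```

Under these inputs x86 ends up with roughly 30% fewer micro-ops, which is where the "30%" figure comes from. The result is only as good as the assumed size ratio and instruction length.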

dzaima · on Aug 11, 2024

> My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though.

I'd call that more neat than absurd.

> You may want it so you can use 16 registers, but it also increases code size.

RISC-V has the exact same issue, some compressed instructions having only 3 bits for operand registers. And on x86 for 64-bit-operand instructions you need the REX prefix always anyways. And it's not that hard to pretty reasonably solve - just assign registers by their use count.

Peephole optimizations specifically here are basically irrelevant. Much of the complexity for x86 comes from just register allocation around destructive operations (though, that said, that does have rather wide-ranging implications). Other than that, there's really not much difference; all have the same general problems of moving instructions together for fusing, reordering to reduce register pressure vs putting parallelizable instructions nearer, rotating loops to reduce branches, branches vs branchless.
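
As a rough illustration of "assign registers by their use count", here is a toy Python sketch. It assumes 32-bit operations (where eax..edi need no REX prefix and r8d..r15d do), ignores live ranges, spilling and fixed-register instructions entirely, and the names `assign_by_use_count` and `rex_operand_uses` are hypothetical.

```python
from collections import Counter

# For 32-bit operand instructions these encode without a REX prefix
# (esp/ebp left out of the toy pool).
LOW_REGS = ["eax", "ecx", "edx", "ebx", "esi", "edi"]
# These always need a REX prefix.
HIGH_REGS = ["r8d", "r9d", "r10d", "r11d", "r12d", "r13d", "r14d", "r15d"]

def assign_by_use_count(uses):
    """uses: list of virtual register names, one entry per operand use.
    The most-used virtual registers get the REX-free registers first."""
    ranked = [v for v, _ in Counter(uses).most_common()]
    pool = LOW_REGS + HIGH_REGS
    if len(ranked) > len(pool):
        raise ValueError("toy allocator does not spill")
    return dict(zip(ranked, pool))

def rex_operand_uses(uses, assignment):
    """Operand uses that land in a REX-only register. This is only a proxy:
    one instruction needs at most one REX prefix however many operands need it."""
    return sum(1 for v in uses if assignment[v] in HIGH_REGS)

use_counts = {"v0": 1, "v1": 1, "v2": 9, "v3": 7, "v4": 5, "v5": 4, "v6": 3, "v7": 2}
uses = [v for v, n in use_counts.items() for _ in range(n)]

by_count = assign_by_use_count(uses)
in_order = dict(zip(use_counts, LOW_REGS + HIGH_REGS))  # first-come, first-served

print(rex_operand_uses(uses, in_order), "->", rex_operand_uses(uses, by_count))
```

In this made-up example, ranking by use count pushes only the two single-use values into REX-only registers, instead of the two values that happened to be allocated last.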

hajile · on Aug 11, 2024

RISC-V has a different version of this issue that is pretty straight-forward. Preferring 2-register operations is already done to save register space. The only real extra is preferring the 8 registers C uses for math. After this, it's all just compression.

x86 has a multitude of other factors than just compression. This is especially true with standard vs REX instructions because most of the original 8 registers have specific purposes and instructions that depend on them for these (eg, Accumulator instructions with A register, Mul/div using A+D, shift uses C, etc). It's a problem a lot harder than simple compression.

Just as cracking an alphanumeric password is exponentially harder than a same-length password with numbers only, solving for all the x86 complications and exceptions is also exponentially harder.

dzaima · on Aug 11, 2024

If anything, I'd say x86's fixed operands make register allocation easier! Don't have to register-allocate that which you can't. (ok, it might end up worse if you need some additional 'mov's. And in my experience more 'mov's is exactly what compilers often do.)

And, right, RISC-V even has the problem of being two-operand for some compressed instructions. So the same register allocation code that's gone towards x86 can still help RISC-V (and vice versa)! On RISC-V, failure means 2→4 bytes on a compressed instruction, and on x86 it means +3 bytes of a 'mov'. (granted, the additional REX prefix cost is separate on x86, while included in decompression on RISC-V)

hajile · on Aug 11, 2024

With 16 registers, you can't just avoid a register because it has a special use. Instead, you must work to efficiently schedule around that special use.

Lack of special GPRs means you can rename with impunity (this will change slightly with the load/store pair extension). Having 31 true GPRs rather than 8 GPRs + 8 special GPRs also gives a lot of freedom to compilers.

dzaima · on Aug 11, 2024

Function arguments and return values already are effectively special use, and should frequently be on par if not much more frequent than the couple x86 instructions with fixed registers.

Both clang and gcc support calls with differing calling conventions within one function, which ends up effectively identical to fixed-register instructions (i.e. an x86 'imul r64' can be done via a pseudo-function where the return values are in rdx & rax, an input is in rax, and everything else is non-volatile; and the dynamically-choosable input can be allocated separately). And '__asm__()' can do mixed fixed and non-fixed registers anyway.

hajile · on Aug 11, 2024

Unlike x86, none of this is strictly necessary. As long as you put things back as expected, you may use all the registers however you like.

dzaima · on Aug 11, 2024

The option of not needing any fixed register usage would apply to, what, optimizing compilers without support for function calls (at least via passing arguments/results via registers)? That's a very tiny niche to use as an argument for having simplified compiler behavior.

And good register allocation is still pretty important on RISC-V - using more registers, besides leading to less compressed instruction usage, means more non-volatile register spilling/restoring in function prologue/epilogue, which on current compilers (esp. clang) happens at the start & end of functions, even in paths that don't need the registers.

That said, yes, RISC-V still indeed has much saner baseline behavior here and allows for simpler basic register allocation, but for non-trivial compilers the actual set of useful optimizations isn't that different.

hajile · on Aug 11, 2024

Not just simpler basic allocation. There are fewer hazards to account for as well. The process on RISC-V should be shorter, faster, and with less risk that the chosen heuristics are bad in an edge case.

clamchowder · on Aug 11, 2024

1. Performance. Also Arm implemented instruction cache coherency too.

Predecode/uop cache are both means to the same end, mitigating decode power. AMD and Intel have used both (though not on the same core). Arm has used both, including both on the same core for quite a few generations.

And a uop cache is just a cache. It's also big enough on current generations to cache more than just loops, to the point where it covers a majority of the instruction stream. Not sure where the misunderstanding of the uop cache "working as far in advance as possible" comes from. Unless you're talking about the BPU running ahead and prefetching into it? Which it does for L1i, and L2 as well?

2. "you can't run nothing but floats" they didn't do that in the paper, they did D += A[j] + B[j] ∗ C[j]. Something like matrix multiplication comes to mind, and that's not exactly a rare workload considering some ML stuff these days.

But also, has a study been done on Arm cores? For all we know they could spend similar power budgets on decode, or more. I could say an Arm core uses 99% of its power budget on decode, and be just as right as you are (they probably don't, my point is you don't have concrete data on both Arm and x86 decode power, which would be necessary for a productive discussion on the subject)

3. You're describing letting the BPU run ahead, which everyone has been doing for the past 15 years or so. Losing fetch bandwidth past a taken branch is a different thing.

4. Not sure where you're going. You started by suggesting Arm has less micro-op expansion than x86, and I provided a counterexample. Now you're talking about avoiding complex instructions, which a) compilers do on both architectures, they'll avoid stuff like division, and b) humans don't in cases where complex instructions are beneficial, see Linux kernel using rep movsb (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...), and Arm introducing similar complex instructions (https://community.arm.com/arm-community-blogs/b/architecture...)

Also "complex" x86 instructions aren't avoided in the video encoding workload. On x86 it takes ~16.5T instructions to finish the workload, and ~19.9T on Arm (and ~23.8T micro-ops on Neoverse V2). If "complex" means more work per instruction, then x86 used more complex instructions, right?

8. You can use a variable length NOP on x86, or multiple NOPs on Arm to align function calls to cacheline boundaries. What's the difference? Isn't the latter worse if you need to move by more than 4 bytes, since you have multiple NOPs (and thus multiple uops, which you think is the case but isn't always true, as some x86 and some Arm CPUs can fuse NOP pairs)?

But seriously, do try gathering some data to see if cacheline alignment matters. A lot of x86/Arm cores that do micro-op caching don't seem to care if a function (or branch target) is aligned to the start of a cacheline. Golden Cove's return predictor does appear to track targets at cacheline granularity, but that's a special case. Earlier Intel and pretty much all AMD cores don't seem to care, nor do the Arm ones I've tested.

Anyway, you're making a lot of unsubstantiated guesses on "weirdness" without anything to suggest it has any effect. I don't think this is the right approach. Instead of "tail wagging the dog" or whatever, I suggest a data-based approach where you conduct experiments on some x86/Arm CPUs, and analyze some x86/Arm programs. I guess the analogy is, tell the dog to do something and see how it behaves? Then draw conclusions off that?
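
As a small example of working from the data, the counts quoted in point 4 above can be compared directly. This Python sketch assumes the ~16.5T x86 instruction count comes from the same Zen 4 libx264 run as the ~4.7% micro-op expansion figure mentioned earlier in the thread; all inputs are rounded, so the outputs are approximate.

```python
x86_insns_T = 16.5          # instructions to finish the workload on x86 (trillions)
arm_insns_T = 19.9          # instructions on Neoverse V2 (trillions)
arm_uops_T = 23.8           # micro-ops on Neoverse V2 (trillions)
x86_uop_expansion = 1.047   # ~4.7% more micro-ops than instructions (assumed same run)

x86_uops_T = x86_insns_T * x86_uop_expansion

arm_extra_insns = arm_insns_T / x86_insns_T - 1   # Arm instructions vs x86
arm_extra_uops = arm_uops_T / x86_uops_T - 1      # Arm micro-ops vs x86
arm_expansion = arm_uops_T / arm_insns_T - 1      # Arm micro-ops vs Arm instructions

print(f"x86 micro-ops: ~{x86_uops_T:.1f}T")
print(f"Arm executes {arm_extra_insns:.0%} more instructions than x86")
print(f"Arm executes {arm_extra_uops:.0%} more micro-ops than x86")
print(f"Arm micro-op expansion: {arm_expansion:.1%}")
```

With these rounded inputs, Arm needs about 21% more instructions and about 38% more micro-ops for the same encode, so both the instructions and the micro-ops do more work on the x86 side in this workload.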

hajile · on Aug 11, 2024

1. The biggest chip market is laptops and getting 15% better performance for 80% more power (like we saw with X Elite recently) isn't worth doing outside the marketing win of a halo product (a big reason why almost everyone is using slower X Elite variants). The most profitable (per-chip) market is servers. They also prefer lower clocks and better perf/watt because even with the high chip costs, the energy will wind up costing them more over the chip's lifespan. There's also a real cost to adding extra pipeline stages. Tejas/Jayhawk cores are Intel's cancelled examples of this.

L1 cache is "free" in that you can fill it with simple data moves. uop cache requires actual work to decode and store elements for use in addition to moving the data. As to working ahead, you already covered this yourself. If you have a nearly 1-to-1 instruction-to-uop ratio, having just 4 decoders (eg, zen4) is a problem because you can execute a lot more than just 4 instructions on the backend. 6-wide Zen4 means you use 50% more instructions than you decode per clock. You make up for this in loops, but that means while you're executing your current loop, you must be maxing out the decoders to speculatively fill the rest of the uop cache before the loop finishes. If the loop finishes and you don't have the next bunch of instructions decoded, you have a multi-cycle delay coming down the pipeline.

2. I'd LOVE to see a similar study of current ARM chips, but I think the answer here is pretty simple to deduce. ARM's slide says "4x smaller decoders vs A710" despite adding a 5th decoder. They claim 20% reduction in power at the same performance and the biggest change is the decoder. As x86 decode is absolutely more complex than aarch32, we can only deduce that switching from x86 to aarch64 would be an even more massive reduction. If we assume an identical 75% reduction in decoder power, we'd move the Haswell decoder from 4.8w down to 1.2w, reducing total core power from 22.1w to 18.5w, or a ~16% overall reduction in power. This isn't too far from the power numbers claimed by ARM.

4. This was a tangent. I was talking about uops rather than the ISA. Intel claims to be simple RISC internally just like ARM, but if Intel is using nearly 30% fewer uops to do the same work, their "RISC" backend is way more complex than they're admitting.

8. I believe aligning functions to cacheline boundaries is a default flag at higher optimization levels. I'm pretty sure that they did the analysis before enabling this by default. x86 NOP flexibility is superior to ARM (as is its ability to avoid them entirely), but the cause is the weirdness of the x86 ISA and I think it's an overall net negative.

Loads of x86 instructions are microcode only. Use one and it'll be thousands of cycles. They remain in microcode because nobody uses them, so why even try to optimize and they aren't used because they are dog slow. How would you collect data about this? Nothing will ever change unless someone pours in millions of dollars in man-hours into attempting to speed it up, but why would anyone want to do that?

Optimizing for a local maximum rather than a global maximum happens all over technology and it happens exactly because of the data-driven approach you are talking about. Look for the hot code and optimize it without regard that there may be a better architecture you could be using instead. Many successes relied on an intuitive hunch.

ISA history has a ton of examples. iAPX432 super-CISC, the RISC movement, branch delay slots, register windows, EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All of these were attempts to find new maxima with greater or lesser degrees of success. When you look into these, pretty much NONE of them had any real data to drive them because there wasn't any data until they'd actually started work.
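
The decoder power arithmetic from point 2 above, spelled out as a Python sketch. The 22.1w and 4.8w figures are the integer-workload numbers quoted earlier in the thread; treating "4x smaller decoders" as a 75% cut in decoder power is this comment's assumption, not something the slide states.

```python
core_power_w = 22.1          # total core power, integer workload
decoder_power_w = 4.8        # decoder power, integer workload
assumed_decoder_cut = 0.75   # assumption: 4x smaller decoder => 75% less decoder power

decoder_share = decoder_power_w / core_power_w
new_decoder_w = decoder_power_w * (1 - assumed_decoder_cut)
new_core_w = core_power_w - (decoder_power_w - new_decoder_w)
core_saving = 1 - new_core_w / core_power_w

print(f"decoder share of core power: {decoder_share:.1%}")
print(f"decoder power after cut:     {new_decoder_w:.1f} W")
print(f"core power after cut:        {new_core_w:.1f} W")
print(f"overall core power saving:   {core_saving:.1%}")
```

The ~16% figure moves in direct proportion to `assumed_decoder_cut`, so the conclusion depends on how well the area figure maps to power.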

clamchowder · on Aug 11, 2024

1. Yeah I agree, both X Elite and many Intel/AMD chips clock well past their efficiency sweet spot at stock. There is a cost to extra pipeline stages, but no one is designing anything like Tejas/Jayhawk, or even earlier P4 variants these days. Also P4 had worse problems (like not being able to cancel bogus ops until retirement) than just a long pipeline.

Arm's predecoded L1i cache is not "free" and can't be filled with simple data moves. You need predecode logic to translate raw instruction bytes into an intermediate format. If Arm expanded predecode to handle fusion cases in A715, that predecode logic is likely more complex than in prior generations.

2. Size/area is different from power consumption. Also the decoder is far from the only change. The BTBs were changed from 2 to 3 level, and that can help efficiency (could make a smaller L2 BTB with similar latency, while a slower third level keeps capacity up). TLBs are bigger, probably reducing page walks. Remember page walks are memory accesses and the paper earlier showed data transfers count for a large percentage of dynamic power.

4. IMO no one is really RISC or CISC these days

8. Sure you can align the function or not. I don't think it matters except in rare corner cases on very old cores. Not sure why you think it's an overall net negative. "feeling weird" does not make for solid analysis.

Most x86 instructions are not microcode only. Again, check your data with performance counters. Microcoded instructions are in the extreme minority. Maybe microcoded instructions were more common in 1978 with the 8086, but a few things have changed between then and now. Also microcoded instructions do not cost thousands of cycles, have you checked? e.g. a gather is ~22 micro ops on Haswell (from https://uops.info/table.html ), and Golden Cove does it in 5-7 uops.

ISA history has a lot of failed examples where people tried to lean on the ISA to simplify the core architecture. EPIC/VLIW, branch delay slots, and register windows have all died off. Mill is a dumb idea and never went anywhere. Everyone has converged on big OoO machines for a reason, even though doing OoO execution is really complex.

If you're interested in cases where ISA does matter, look at GPUs. VLIW had some success there (AMD Terascale, the HD 2xxx to 6xxx generations). Static instruction scheduling is used in Nvidia GPUs since Kepler. In CPUs ISA really doesn't matter unless you do something that actively makes an OoO implementation harder, like register windows or predication.

clamchowder · on April 10, 2024

iGPUs like the ones in PHX/MTL have to go into handhelds and ultrabooks, so they're going to be power and thermally limited before 2-4 MB of cache + LPDDR5 becomes a major bottleneck.

Now if you can give the iGPU a 80W power budget instead of maybe 15W, that's a different story. But at that point you're competing with discrete GPUs that can show up with GDDR6, and maybe an iGPU doesn't make so much sense anymore.

clamchowder · on April 10, 2024

I simply meant it was ambitious compared to prior Intel iGPUs, especially stuff like Skylake GT2 where you could be playing at 720P low and still not get 30 FPS.

The chiplet strategy is kind of ambitious too because there's power overhead and they're targeting battery powered devices with Meteor Lake.

Eh, writing is hard. I never liked English class anyway

nsteel · on April 11, 2024

Re the power overhead. They probably get some power back by using smaller nodes on some chiplets than they'd otherwise be able to afford if doing the whole design with the same process. And they are betting on the benefits of backside-power delivery.

gautamcgoel · on April 10, 2024

Don't let the haters get you down, I love your articles.

clamchowder · on April 10, 2024

Oh I don't mind the discussion here at all, I'm just occasionally puzzled at things I thought I was pretty direct about.

Honestly though I don't like writing. Finding stuff out about hardware is fun. Weaving it into a coherent article is a chore.

nsteel · on April 11, 2024

I think most people don't appreciate how hard it is for hardware companies with established products to make non-trivial architectural changes. It's risky and it usually snowballs and becomes more risky. Especially when others in the market are already doing similar things, it downplays the change.

I love researching things too but I can't write coherently. I really enjoyed your article, thank you.

clamchowder · on April 10, 2024

(author here) I appreciate the feedback, but I have trouble understanding where you're coming from.

"what exactly makes it ambitious?" I thought I outlined that it was much more powerful than Intel's prior (RPL) iGPUs both in the first paragraph, and in the conclusion. It competes with the powerful iGPUs AMD has been getting into handhelds.

"AMD as afterthoughts" - Can you explain how you got that impression? I opened by noting how AMD's APUs are extremely competitive if not downright dominant in handhelds (Steam Deck and ROG Ally called out as specific examples) and threaten Intel in the laptop scene too.

"4K 120 FPS" - uh no, you're not getting that on an iGPU unless it's a game from 15 years ago. I suggest checking the very wide variety of other reviewers who run game benchmarks on devices like the Steam Deck or ROG Ally. 1080P or 720P 30 FPS is a good target, and you might need medium or low graphics to get there. That's what I mean by compromised gaming. It's not the same experience as say, gaming on a desktop with a midrange discrete card.

"lesser" iGPUs imo aren't a new sweet spot, the sweet spot is just holding on to older cards that still deliver better performance than these iGPUs. For example check Steam's hardware survey (https://store.steampowered.com/hwsurvey/videocard/). There are more people with a GTX 1080 than a RTX 4080. And PC games are optimizing for stable hardware capability. The latest games are usually playable on Pascal.

AtlasBarfed · on April 10, 2024

Basically, NVidia is the bar. Ambitious to me implies at least challenging NVidia. AMD makes competitive GPUs in certain arenas (and for Linux is the main choice due to drivers), but for me no definition of "ambitious" involves even thinking about AMD.

Well, 4k 120 from an iGPU ... you're saying THAT is ambitious? There's the bar!

Historically Intel has about every 5 years started to rumble about getting serious in the discrete markets, and they make some marketing fluff, but nothing even remotely competitive outside the iGPU "meh" range ever comes out.

So if I hear Intel being "ambitious" and then read an article that basically pretends (I'm not accusing you of anything) NVidia doesn't exist, well, seems like a failed premise to me.

I'm pretty negative on Intel over the last decade, you'd think I was a spurned contractor (I'm not, never worked there). Intel is definitely in the "prove it" mode. They've so massively failed/squandered opportunity at smartphone chips, SSDs, memory, graphics, and then finally screwed the pooch in process tech and CPUs. So it's clearly an engineering company that was hijacked by finance MBAs and driven into the ground, and it is HARD for companies to come back from that poison, especially when they had about 30 years of near-unchallenged monopoly dominance in the marketplace.

I didn't want to imply "author sux Lol". The article was pretty in-depth and information dense, but the fact remains that the basic premise is flawed, because the source marketing/press release by Intel has about an 80% chance of being BS or "same story, different half decade".

clamchowder · on April 10, 2024

4K 120..."There's the bar!" - By ambitious I meant Intel's serious about getting competitive gaming performance in the handheld or thin/light laptop category. MTL's iGPU is ambitious compared to older standard Intel iGPUs like the HD 530.

"Nvidia doesn't exist" They don't exist in the iGPU market, unless you count the Nintendo Switch. The Switch doesn't run the same games that Meteor Lake and Phoenix do, and therefore I don't think it's an interesting comparison. But I do have data at https://chipsandcheese.com/2023/12/23/nintendo-switchs-igpu-... if you want to factor in Nvidia. Same with Nvidia's discrete cards or AMD's desktop RDNA 3 variant (with the larger 192 KB vector register file). Neither of those can fit in the same form factors and power envelopes that Meteor Lake and Phoenix compete in.

What Intel source marketing/press release stuff did you take issue with? I'll be honest, I didn't go over their Meteor Lake marketing/press release materials in detail. But if they did claim something crazy and didn't deliver, I can understand the disappointment.

neogodless · on April 10, 2024

https://wccftech.com/gpu-market-rebounds-q2-2023-amd-nvidia-...

> The integrated segment had a total of 48.82 million units shipped worldwide followed by the high-end GPU segment which saw 6.84 million GPU shipments, 2.59 million shipments in the mid-range category, and 1.81 million in the entry-level segment. Workstation GPUs also shipped 1.50 million units.

Integrated graphics may not have the profit margins of dedicated, but in sheer quantity, they dwarf add-in boards.

Retric · on April 10, 2024

But how many of those 48.82 million units are actually used in any meaningful capacity?

Intel is spending a great deal of money manufacturing stuff that’s then utterly wasted on any system with dedicated GPUs or, more commonly, on systems that never use more than a small fraction of that 3D capacity. It just seems like a multi billion dollar waste from a company that’s so used to being a near monopoly it can’t step back from the iGPU trap.

fngjdflmdflg · on April 10, 2024

>Historically Intel has about every 5 years started to rumble about getting serious in the discrete markets, and they make some marketing fluff, but nothing even remotely competitive outside the iGPU "meh" range ever comes out.

I think the key thing you may be missing here is Intel Arc which is Intel's first real dGPU. And now they are using that tech in their iGPUs.

vetinari · on April 10, 2024

> "4K 120 FPS" - uh no, you're not getting that on an iGPU

You are not going to get that from many dGPUs either. You might get that from the high end in the last or previous gen, but not from the midrange or 2-gen-old models.

josephg · on April 10, 2024

It really depends on the game.

I have a 4090 (for work, I swear…). Cyberpunk is smooth, but I don’t get 4k 120fps. But I also play a lot of little indie games. Terraria? Stardew valley? Slay the spire? This stuff doesn’t need a dgpu at all. And I suspect a significant percentage of global gaming hours go into stuff like this now. Games that really push the hardware are expensive. (Well, or badly made). Either way, it’s usually bad for business.

clamchowder · on April 2, 2024

Well for me it was a matter of how much time it took. I spent enough time understanding the Central Processor with all the weird CDC-specific terms. I had to reread several times before I realized Increment instructions did load/store. Oh, and that "bootloader" was called "dead start".

The peripheral processors were indeed a 10-way SMT thing, which they called a barrel processor. Each thread had three registers but they operated in really strange ways, and I just didn't figure that out.
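For anyone unfamiliar with the term, a barrel processor hands the execution slot to each thread in a fixed rotation. A minimal sketch of that rotation, assuming strict round-robin with one slot per cycle (the function name and cycle count are made up for illustration, and this says nothing about the odd register behavior mentioned above):

```python
from collections import Counter

def barrel_schedule(num_threads, num_cycles):
    """Return the thread that owns the execution slot on each cycle.

    Assumes strict round-robin: thread t runs on cycles t, t + N, t + 2N, ...
    """
    return [cycle % num_threads for cycle in range(num_cycles)]

# Ten threads, as with the 6600's peripheral processors.
slots = barrel_schedule(num_threads=10, num_cycles=30)
per_thread = Counter(slots)
print(slots[:12])
print(per_thread)
```

Each thread gets exactly one slot in ten, so any single thread sees a tenth of the shared hardware's throughput regardless of what the others are doing.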

There were other details like bandwidth being tied to bank count, which I also didn't go into.

dbcurtis · on April 2, 2024

For sure. I remember thinking.... "But where are the instructions to load and store X registers?" -- Psych! There aren't any! Just jam an address into an A register. Take that you RISC fan-boys, can you get to less than zero instructions?

clamchowder · on April 1, 2024

I don't know if the CDC 6600 can be considered superscalar. I called it scalar because it can never issue more than one instruction per cycle, and can thus never sustain faster-than-scalar execution.

If you use a different definition of superscalar (just having concurrent operations in flight), then I guess that applies to the CDC 6600? Then it'd also apply to any pipelined core, including stuff like the Intel 486.

dbcurtis · on April 2, 2024

I'm pretty sure the 6600 could issue multiple instructions per clock. A 60 bit word could hold 4 15 bit instructions, or for instructions that required an 18 bit immediate address, those took two 15 bit parcels. If there were no conflicts, I think you could issue multiple instructions in a clock.

For instance, if Ax is the base address, and Bx is the stride, you could do

    A4 = A4 + B4 ; loading X4 by side-effect
    A3 = A3 + B3 ; loading X3 by side-effect
    X1 = X5 + X6 ; do an add of data loaded in the previous unroll of the loop
    A2 = A2 + B2 ; store X2 by side-effect, presumably computed in previous unroll

All 4 of those fit in one 60 bit word. And if you are clever with loop unrolling and instruction timing, you can get a lot of overlap.
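The parcel arithmetic can be sketched as a small packer: a 60-bit word has four 15-bit parcels, and an instruction takes one parcel or two. This is only an illustration. It assumes an instruction may not straddle a word boundary, so a word is closed early when the next instruction does not fit, and `pack_into_words` is a hypothetical helper, not anything from CDC's tooling.

```python
WORD_BITS = 60
PARCEL_BITS = 15
PARCELS_PER_WORD = WORD_BITS // PARCEL_BITS  # four parcels per word

def pack_into_words(instruction_bits):
    """Greedily pack 15-bit and 30-bit instructions into 60-bit words.

    Assumption: instructions never straddle a word boundary, so the current
    word is closed (leftover parcels unused) when the next one does not fit.
    Returns a list of words, each a list of instruction sizes in bits.
    """
    words, current, used = [], [], 0
    for bits in instruction_bits:
        if bits not in (15, 30):
            raise ValueError(f"unsupported instruction size: {bits}")
        parcels = bits // PARCEL_BITS
        if used + parcels > PARCELS_PER_WORD:
            words.append(current)
            current, used = [], 0
        current.append(bits)
        used += parcels
    if current:
        words.append(current)
    return words

# Four register-to-register instructions like the ones above fill one word.
print(pack_into_words([15, 15, 15, 15]))
# A 30-bit instruction that does not fit starts a new word.
print(pack_into_words([15, 15, 15, 30]))
```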

The 6600 had dirt-simple opcodes that took about a week to memorize, but.... that was a long time ago, so sorry I can no longer assemble machine code in my head. Memory fading.....

clamchowder · on April 1, 2024

Be sure to check out the training manuals like http://www.bitsavers.org/pdf/cdc/Tom_Hunter_Scans/6600_CPU_T... or http://www.bitsavers.org/pdf/cdc/Tom_Hunter_Scans/6600_CPU_T... too. I figured out a lot from those.

clamchowder · on April 1, 2024

The original Pentium was the first superscalar x86 CPU. Superscalar just means you can execute faster than scalar (one instruction at a time), and the P5 could theoretically sustain 2 IPC. Of course it was in-order so cache misses would likely prevent you from getting there.

The Pentium Pro was Intel's first out-of-order design. It's also superscalar.

StackOverflow has a really funny post by Andy Glew, who worked on the Pentium Pro, noting that a scalar out-of-order machine could make sense in some corner cases. (https://stackoverflow.com/questions/10074831/what-is-general...)

dbcurtis · on April 2, 2024

If you were an instruction pairing whiz, you could get pretty close to 2 IPC on the Pentium I for certain kernels.
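A toy model of why pairing mattered: an in-order core that issues up to two adjacent instructions per cycle, but only when the second does not depend on the first. This is a deliberately simplified sketch. The real Pentium pairing rules had many more restrictions, and the instruction tuples and function name here are invented for illustration.

```python
def dual_issue_cycles(instructions):
    """Count issue cycles for a simplified in-order, two-wide core.

    Each instruction is (dest, sources). Two adjacent instructions issue
    together only if the second neither reads nor writes the first's
    destination. Latencies and cache misses are ignored.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1
        if i + 1 < len(instructions):
            dest, _ = instructions[i]
            next_dest, next_sources = instructions[i + 1]
            if dest != next_dest and dest not in next_sources:
                i += 2  # both instructions issued this cycle
                continue
        i += 1  # only one instruction issued this cycle
    return cycles

# Independent neighbors pair up; a dependency chain never does.
independent = [("a", ("x",)), ("b", ("y",)), ("c", ("x",)), ("d", ("y",))]
dependent = [("a", ("x",)), ("b", ("a",)), ("c", ("b",)), ("d", ("c",))]

ipc_independent = len(independent) / dual_issue_cycles(independent)
ipc_dependent = len(dependent) / dual_issue_cycles(dependent)
print(ipc_independent, ipc_dependent)
```

In this model, hand-scheduling so that neighbors are independent is the whole difference between 1 and 2 IPC.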

The thing that Seymour had going for him at CDC is that he didn't bother with synchronous exceptions. On x86 (I worked on a late 486, Pentium I, Pentium II and Itanium II) you absolutely must have synchronous exceptions for backward compatibility. Debugging an exception on the 6600 was a hair-tearing exercise, because the PC pointed somewhere near, not at, the instruction that raised, and there may be a swiss-cheese window of instructions that did not complete somewhere before that. Loads of fun for one and all. (I used 6600/7600's while working on Cyber 203/205 at CDC).