The solution to this is not to split, but just follow Qualcomm. Their vision for...

sitkack · on April 30, 2024

> Right now, most devices on the market do not support the C extension

This is not true and easily verifiable.

The C extension is de facto required; the only cores that don't support it are special-purpose soft cores.

C extension in the smallest IP available core https://github.com/olofk/serv?tab=readme-ov-file

Supports M and C extensions https://github.com/YosysHQ/picorv32

Another size-optimized core with C extension support https://github.com/lowrisc/ibex

C extension in the 10 cent microcontroller https://www.wch-ic.com/products/CH32V003.html

This one should get your goat: it implements as much as it can using only compressed instructions https://github.com/gsmecher/minimax

FullyFunctional · on April 30, 2024

The expansion of a 16-bit C insn to 32-bit isn't the problem. That part is trivial. The problem (and it is significant) is for a highly speculative superscalar machine that fetches 16+ instructions at a time but cannot tell the boundary of instructions until they are all decoded. Sure, it can be done, but that doesn't mean that it doesn't cost you in mispredict penalties (AKA IPC) and design/verification complexities that could have gone to performance.

It is also true that burning up the encoding space for C means pain elsewhere. Example: branch and jump offsets are painfully small. So small that all non-toy code needs to use a two-instruction sequence to call (and sometimes more).
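To put numbers on "painfully small": a minimal sketch of the reach of the base ISA's PC-relative control transfers, assuming the standard 13-bit branch and 21-bit JAL byte offsets (bit 0 always zero). The helper names `reach`, `fits`, and `call_sequence` are made up for illustration and are not from any toolchain.

```python
def reach(offset_bits):
    """Byte range of a signed PC-relative offset whose bit 0 is always zero."""
    half = 1 << (offset_bits - 1)
    return -half, half - 2


BRANCH_REACH = reach(13)  # conditional branches: about +/- 4 KiB
JAL_REACH = reach(21)     # jal: about +/- 1 MiB


def fits(offset, limits):
    """True if an even byte offset lies inside the given reach."""
    lo, hi = limits
    return offset % 2 == 0 and lo <= offset <= hi


def call_sequence(offset):
    """Pick a single jal when the target is in range, else the auipc+jalr pair."""
    return "jal" if fits(offset, JAL_REACH) else "auipc+jalr"


print(BRANCH_REACH, JAL_REACH)
print(call_sequence(512 * 1024), call_sequence(4 * 1024 * 1024))
```

A callee more than about 1 MiB away cannot be reached by one `jal`, which is where the two-instruction `auipc`+`jalr` pair comes in.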

These problems don't show up on embedded processors and workloads. They matter for high performance.

camel-cdr · on April 30, 2024

> cannot tell the boundary of instructions until they are all decoded

Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.
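A minimal sketch of that length check, assuming the length encoding from the unprivileged spec as I understand it (low two bits not `11` means 16-bit; otherwise bits [4:2] not `111` means 32-bit; the 48- and 64-bit prefixes follow the spec's longer-length scheme). The function names are made up for illustration.

```python
def rv_insn_length(first_parcel):
    """Instruction length in bytes, judged from the first 16-bit parcel only."""
    if first_parcel & 0b11 != 0b11:
        return 2  # compressed (C extension)
    if first_parcel & 0b11100 != 0b11100:
        return 4  # ordinary 32-bit instruction
    if first_parcel & 0b111111 == 0b011111:
        return 6  # 48-bit encoding
    if first_parcel & 0b1111111 == 0b0111111:
        return 8  # 64-bit encoding
    raise ValueError("longer or reserved encoding")


def find_boundaries(parcels):
    """Walk a list of 16-bit parcels in order; return (parcel_index, length) pairs."""
    boundaries = []
    i = 0
    while i < len(parcels):
        length = rv_insn_length(parcels[i])
        boundaries.append((i, length))
        i += length // 2  # skip the parcels that belong to this instruction
    return boundaries


# c.li a0,0 ; addi a0,x0,0 (two parcels) ; c.li a0,0
print(find_boundaries([0x4501, 0x0513, 0x0000, 0x4501]))
```

The length of any one instruction needs only a few low bits, but where instruction k starts depends on the lengths of all the instructions before it in the fetch group. A wide decoder would presumably compute a candidate length at every parcel in parallel and then select the real starts, which is the extra logic being argued about here.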

> Sure, it can be done, but that doesn't mean that it doesn't cost you in mispredict penalties

What does decoding have to do with mispredict penalties?

> Example: branch and jump offsets are painfully small

Yes, that's what the 48-bit instruction encoding is for. See e.g. what the scalar efficiency SIG is currently working on: https://docs.google.com/spreadsheets/u/0/d/1dQYU7QQ-SnIoXp9v...

inkyoto · on May 1, 2024

> Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.

It is not about decoding, which happens later, it is about 32-bit instructions crossing the L1 cache line boundary in the L1-i cache which happens first.

Instructions are fetched from the L1-i cache in bundles (i.e. cache lines), and the size of the bundle is fixed for a specific CPU model. In all RISC CPUs, the size of a cache line is a multiple of the instruction size (mostly 32 bits). The RISC-V C extension breaks the alignment, which incurs a performance penalty for high performance CPU implementations, but is less significant for smaller, low power implementations where performance is not a concern.

If a 32-bit instruction crosses the cache line boundary, another cache line must be fetched from the L1-i cache before the instruction can be decoded. The performance penalty in such a scenario is prohibitive for a very fast CPU core.
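A small sketch of when that straddling can happen, assuming a 64-byte line (the real size is implementation-specific; `LINE_BYTES` and `crosses_line` are illustrative names):

```python
LINE_BYTES = 64  # assumed cache line size, not taken from any particular core


def crosses_line(addr, length, line_bytes=LINE_BYTES):
    """True if the instruction's first and last bytes fall in different lines."""
    return addr // line_bytes != (addr + length - 1) // line_bytes


# 4-byte instructions on 4-byte alignment (no C extension) never straddle a line.
aligned = [a for a in range(0, 128, 4) if crosses_line(a, 4)]

# With C, a 4-byte instruction may start on any 2-byte boundary.
with_c = [a for a in range(0, 128, 2) if crosses_line(a, 4)]

print(aligned, with_c)
```

Under these assumptions the only straddling case is a 32-bit instruction starting 2 bytes before the end of a line, so one start offset out of 32 per 64-byte line.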

P.S. Even worse if the instruction crosses a page boundary, and the page is not resident in memory.

dzaima · on May 1, 2024

I don't think crossing cache lines is particularly much of a concern? You'll necessarily be fetching the next cache line in the next cycle anyway to decode further instructions (not even an unconditional branch could stop this I'd think), at which point you can just "prepend" the chopped tail of the preceding bundle (and you'd want some inter-bundle communication for fusion regardless).

This does of course delay decoding this one instruction by a cycle, but you already have that for instructions which are fully in the next line anyways (and aligning branch targets at compile time improves both, even if just to a fixed 4 or 8 bytes).

inkyoto · on May 1, 2024

> I don't think crossing cache lines is particularly much of a concern?

It is a concern if a branch prediction has failed, and the current cache line has to be discarded or has been invalidated. If the instruction crosses the cache line boundary, both lines have to be discarded. For a high-performance CPU core, it is a significant and, most importantly, unnecessary performance penalty. It is not a concern for a microcontroller or a low power design, though.

dzaima · on May 1, 2024

Why does an instruction crossing cache lines have anything to do with invalidation/discarding? RISC-V doesn't require instruction cache coherency so the core doesn't have much restriction on behavior if the line was modified, so all restrictions go to just explicit synchronization instructions. And if you have multiple instructions in the pipeline, you'll likely already have instructions from multiple cache lines anyways. I don't understand what "current cache line" even entails in the context of a misprediction, where the entire nature of the problem is that you did not have any idea where to run code from, and thus shouldn't know of any related cache lines.

dzaima · on May 1, 2024

Mispredict penalties == latency of pipeline. Needing to delay decoding/expansion to after figuring out where instructions actually start will necessarily add a delay of some number of gates (whether or not this ends up increasing the mispredict penalty by any cycles of course depends on many things).

That said, the alternative of instruction fission (i.e. that which RISC-V avoids requiring) would add some delay too (I have no clue how these compare though, I'm not a hardware engineer; and RISC-V does benefit from instruction fusion which can similarly add latency, and whose requirement other architectures could decide to try to avoid (though it'd be harder to keep avoiding it as hardware potential improves while old compiled binary blobs stay unchanged), so it's complicated)

camel-cdr · on May 1, 2024

Ah, that makes sense, thanks. I think in the end it all boils down to the ARM and the RV approaches both being fine, with slightly different tradeoffs.

timschmidt · on April 30, 2024

> Qualcomm wants to remove it because it is actively harmful for fast implementations

Qualcomm's "fast implementation" reportedly started out life as an ARM core and has had its decoders replaced. That explanation makes their very different use of the instruction space make much more sense to me than any other. They did the minimum to adapt an existing design. Not the stuff of lasting engineering.

panick21_ · on April 30, 2024

> Right now, most devices on the market do not support the C extension

That's outright false.

And outside of the actual devices, the whole software ecosystem very much uses the C extension.

Qualcomm simply wants to break the standard to make money, that's literally all it is.

> Qualcomm wants to remove it because it is actively harmful for fast implementations

Funny how not a single company other than Qualcomm argues this. Not Ventana, not SiFive, not Esperanto, not Tenstorrent, none of the companies from China.

It's almost, almost as if it's not that big of a deal and Qualcomm simply wants to save money and reuse ARM IP.

> The solution to fragmentation is to just disable the C extension everywhere, but SiFive doesn't want to hear that.

Literally nobody except Qualcomm wants to hear it. It wasn't even a discussion before Qualcomm. All the other companies had plenty of opportunity to bring up issues in all the working groups, and nobody did. Literally not a single company gave a talk, talking about how the C extension was holding them back. In fact most of them were saying the opposite.

FullyFunctional · on April 30, 2024

There is a lot of stuff behind the scenes you don't know. Your statement about "other companies" is completely wrong.

panick21_ · on May 1, 2024

These things are supposed to be discussed in the open; it's an open standards process. So please link me to the official statements by these companies that they are unhappy.

I have watched many discussions and updates by the work-groups, and nobody came forward.

And why are they afraid to come forward? If this is so important then shouldn't there be an effort to convince the community?

So sorry, until I see something other than claims like 'there is a shadow cabal of unhappy companies planning a takeover', I'm not going to buy that this is a widespread movement.

snvzz · on May 2, 2024

> I have watched many discussions and updates by the work-groups, and nobody came forward.

Rivos came forward. They (kindly) told Qualcomm not to put words in their mouth.

Rivos is, of course, totally fine with C.

brucehoult · on May 2, 2024

Yup. Rivos said basically "Please don't interpret our willingness to look at your data, once you provide it, as supporting your claims"

camel-cdr · on April 30, 2024

pray tell

brucehoult · on May 1, 2024

> Right now, most devices on the market do not support the C extension, and any code that tries to be compatible does not use it.

I don't know of ANY commercially-sold RISC-V chips that don't implement the C extension. Even the 10 cent CH32V003 implements C (RV32EC).

> burns 75% of the entire encoding space

In ARMv7 the 2-byte original Thumb instructions burn 87.5% of the 4-byte encoding space (28 out of 32 combinations of the 5 MSBs).

camel-cdr · on April 30, 2024

> most devices on the market do not support the C extension

Name one that doesn't, it's exactly the opposite (for 64-bit).

GeorgeTirebiter · on April 30, 2024

Maybe you've found the solution: RV32 must have the C extension.

RV64 and RV128 must NOT have the C extension.

Problem solved?

camel-cdr · on April 30, 2024

No, I meant that virtually every available 64-bit CPU supports the C extension.