I think we may be playing in slightly different spaces: unlike a JS JIT, Cranelift doesn't have "super fancy escape/type analysis". We're really targeting the core optimizations (GVN, LICM, cprop, RLE, STLF, various algebraic rules) in a fast compile pass. The RLE+GVN interaction was pretty real in our case, as are interactions between the algebraic rewrites and GVN.
You'll note that my main point is that the single fixpoint loop for all of the core rewrites is what we wanted, and what the sea-of-nodes-with-CFG gets us; the egraph (multiple versions of one value) is kind of an aside. One could say: well sure but I could just do a single pass with that fixpoint loop without all the egraph stuff; and, yes, that's what our single rewrite pass is.
Thanks for writing the article, btw. I didn't have a chance to go through the whole thing yet.
Did you have a chance to study Graal's IR? It is a hybrid between sea of nodes and CFG; it can contain some "fixed" nodes that can be wired into basic blocks. It can also be relaxed and have nearly everything floating.
TurboFan's IR was very close to C2's, but it had even more things that could float. E.g. a pure operation could be lowered to a subgraph with internal control. TurboFan's scheduler could move entire floating control islands and attach them to the main CFG skeleton, or remove them entirely.
I'm working on a new IR and I'll be able to share more soon.
> I think we may be playing in slightly different spaces: unlike a JS JIT
My B3 compiler is a direct equivalent of Cranelift.
Sure, I've also written JS JITs. B3 was initially the backend of a JS JIT, but now does other things too. And I've done a lot of LLVM work.
Generally, it's unwise to assume you know what another person has or hasn't done. I'm not making that assumption about you. (I don't even know you, and that doesn't matter for the purposes of this discussion.)
> single fixpoint loop for all of the core rewrites is what we wanted, and what the sea-of-nodes-with-CFG gets us
It just feels so constraining. Eventually you'll want optimizations that can't be expressed with egraphs, and I'm not convinced that your approach saves code or complexity even if you are sticking to egraphable optimizations.
OK, cool. I was assuming "escape analysis and type inference" implied a JS JIT -- straight from your comment, no other assumptions intended. But you've got a lot of interesting experience here and thanks for your thoughts.
We have two: fuel and epochs. Fuel (analogous to "gas" as it appears in many VM platforms) is a deterministic count and epochs check an always-increasing counter in shared memory meant to be periodically bumped by another thread. In either case, hitting the limit can be configured to either async-yield back to the event loop (in async mode) or to trap/terminate the Wasm instance. Both are based on instrumentation, i.e., extra code inserted during Wasm-to-machine-code compilation to do these checks at the start of loops and at the top of each function. Epoch instrumentation is cheaper because it's checking a mostly-read-only value in memory (so usually cached) while fuel loads and stores that value constantly from the VMContext.
(Core Wasmtime maintainer here, and I built our epochs mechanism when I realized we could do better than fuel if one doesn't need the determinism, only periodic yields)
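To make the cost difference concrete, here is a hypothetical sketch of the two check shapes described above (illustrative types only, not Wasmtime's actual generated code or API): fuel is a deterministic per-instance countdown that must be loaded and stored on every check, while an epoch check only loads a counter that another thread bumps periodically.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Fuel: a deterministic countdown. Every check mutates the counter,
// so the generated code loads *and* stores it each time.
struct FuelCheck {
    remaining: u64,
}

impl FuelCheck {
    // Returns true when the limit is hit (yield or trap, per config).
    fn consume(&mut self, cost: u64) -> bool {
        self.remaining = self.remaining.saturating_sub(cost);
        self.remaining == 0
    }
}

// Epochs: compare a mostly-read-only shared counter (bumped by another
// thread) against a deadline. The load usually hits cache, so this is
// cheaper than the fuel load/store pair.
struct EpochCheck<'a> {
    current: &'a AtomicU64,
    deadline: u64,
}

impl<'a> EpochCheck<'a> {
    fn expired(&self) -> bool {
        self.current.load(Ordering::Relaxed) >= self.deadline
    }
}
```

In the real system these checks are emitted as instrumentation at loop headers and function entries, as described above; the sketch only shows why the epoch variant is the cheaper memory-access pattern.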
> It'll be interesting to see what the second non-browser-based WASM runtime to fully support 3.0 will be (I'm guessing wasmtime will be first; ...)
Wasmtime already supports every major feature in the Wasm 3.0 release, I believe. Of the big ones: garbage collection was implemented by my colleague Nick Fitzgerald a few years ago; tail calls by Jamey Sharp and Trevor Elliott last year (with full generality, any signature to any signature, no trampolines required!); and I built our exceptions support which merged last month and is about to go out in Wasmtime 37 in 3 days.
The "3.0" release of the Wasm spec is meant to show progress and provide a shorthand for a level of features, I think, but the individual proposals have been in progress for a long time so all the engine maintainers have known about them, given their feedback, and built their implementations for the most part already.
(Obligatory: I'm a core maintainer of Wasmtime and its compiler Cranelift)
No, it's still behind a flag (and so transitively, exceptions are too, because we built exception objects on top of GC).
Our docs (https://docs.wasmtime.dev/stability-tiers.html) put GC at tier 2 with reason "production quality" and I believe the remaining concerns there are that we want to do a semi-space copying implementation rather than current DRC eventually. Nick could say more. But we're spec-compliant as-is and the question was whether we've implemented these features -- which we have :-)
They're hitting another design point on the compile time vs. code-quality tradeoff curve, which is interesting. They compile 4.27x faster than Cranelift with default (higher quality) regalloc, but Cranelift produces code that runs 1.64x faster (section 6.2.2).
This isn't too surprising to me, as the person who wrote Cranelift's current regalloc (hi!) -- regalloc is super important to run-time perf, so for Wasmtime's use-case at least, we've judged that it's worth the compile time.
TPDE is pretty cool and it's great to see more exploration in compiler architectures!
> That is, if x and y are determined only at runtime for power(x, y) then I don't see what can be optimized.
Yes, the example in Max's post is specifically assuming one wants to generate a specialized version of `power` where `y` is fixed.
To take it back to weval: we can know what the bytecode input to the interpreter is; we provide an intrinsic (part of the "wevaling" request) to indicate that some function argument is a pointer to memory with constant, guaranteed-not-to-change content. That, together with context specialization on PC (another intrinsic), allows us to unroll the interpreter loop and branch-fold it so we get the equivalent of a template method compiler that reconstitutes the CFG embedded in the bytecode.
Thanks, I think I see now. So `y` is the bytecode, in the analogy. Makes sense.
(For me at least a concrete example would have helped, something like showing the specialized output of running on the bytecode for `power` with that interpreter. But maybe that would be too verbose...)
This is indeed a good point and something I want to write about when I eventually do a blog post on weval.
A few counterpoints that I'd offer (and what led me to still take this approach):
- If the target has sub-par debugging infrastructure, it can be easier to debug an interpreter (which is portable) and then apply the semantics-preserving PE. In particular when targeting Wasm outside the browser, there is... not really a good debug experience, anywhere, for that. It was way easier to get an interpreter right by developing on native with gdb/rr/whatever, and then separately ensure weval preserves semantics (which I tested with lockstep differential execution).
- Maintenance: if one is going to have an interpreter and a compiler anyway (and one often wants or needs this, e.g. to handle eval()), it's easier for them both to come from the same source.
- Amortized cost: in the Wasm world we want AOT compilers for many languages eventually; there are interpreter ports with no Wasm backends; developing weval was a one-time cost and we can eventually apply it multiple times.
- If the semantics of the existing interpreter are quite nontrivial, that can push the balance the other way. I designed weval as part of my work on SpiderMonkey; extremely nontrivial interpreter with all sorts of edge cases; replicating that in a from-scratch compiler seemed a far harder path. (It's since been done by someone else and you can find the "wasm32 codegen" patchset in Bugzilla but there are other phasing issues with it from our use-case's PoV; it's not true AOT, it requires codegen at runtime.)
I don't think the tradeoff is always clear and if one is building a language from scratch, and targeting a simple ISA, by all means write a direct compiler! But other interesting use-cases do exist.
The divergent-semantics risk between the interpreter and the compiler is a really big deal. It's genuinely difficult to get a language implementation to behave exactly as specified, even when the spec is "do the same as some other implementation". Treating "compiled code" as specializing the interpreter with respect to the program is a great solution to that, since the bugs in the optimizer/partial-evaluator (they're kind of the same thing) are unlikely to be of the same class as bugs from independent implementations.
Wasm is a really solid target for heroic compiler optimisations. It's relatively precisely specified, user facing semantic diagnostics are in some language front end out of sight, aliasing is limited and syscalls are finite with known semantics. Pretty much because it was designed by compiler people. You've picked a good target for this technique.
One problem with Wasm is that in practice there's not as much optimization work to do as you might expect, as the higher-level compiler which produced it already did a lot of the work. Of course you still need to lower the stack machine + locals to actual spills/fills/etc., but still a chunk of the work is already done for you.
weval author here (thanks Max for the blog post!). Also AMA!
The talk about weval that Max mentions was at NEU and also CMU; the latter was recorded and is here: https://vimeo.com/940568191
I also plan to write up a blog post on weval in depth, plus its application to SpiderMonkey-on-Wasm, in a few months; it's pretty exciting though, currently getting 4-5x speedups on some benchmarks on a decidedly nontrivial interpreter!
This is amazing! I really do look forward to that writeup.
Would investigating the mapping from interpreters down to the assembly that Wasmtime generates be something worth looking into? I got distracted by the Vimeo presentation -- does the weval blog post cover this? How high can the interpreter/compiler tower go?
The approach that Cranelift uses is what we call the "aegraph" (talk I gave about it: slides https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf, video https://vimeo.com/843540328). The basic idea is that eclasses hold sets of operator expressions (think sea-of-node IR), and we keep that alongside the CFG with a "skeleton" of side-effecting ops. We build that, do rewrites, then extract out of it back to a conventional CFG-of-basic-blocks for lowering. The "equivalence" part comes in when doing rewrites: the main difference between an egraph and a conventional sea-of-nodes IR with rewrite rules is that one keeps all representations in the equivalence class around, then chooses the best one later.
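A minimal illustrative sketch of the "keep all representations, choose the best one later" idea (invented types, nothing like Cranelift's actual aegraph data structures): an eclass holds every equivalent form of a value, and extraction picks the cheapest one instead of committing eagerly at rewrite time.

```rust
// Invented, simplified expression forms for one eclass: all of these
// are assumed to compute the same value (e.g. x * 8 vs. x << 3).
#[derive(Debug)]
enum Expr {
    Const(i64),
    Shl(i64, u32), // shift-left: base << amount
    Mul(i64, i64), // multiply
}

// A toy cost model: constants are free, shifts are cheap, multiplies
// are expensive. Real extraction would also account for sharing.
fn cost(e: &Expr) -> u32 {
    match e {
        Expr::Const(_) => 0,
        Expr::Shl(..) => 1,
        Expr::Mul(..) => 3,
    }
}

// Extraction: after all rewrites have populated the eclass, pick the
// minimum-cost representative to lower back into the CFG.
fn extract_best(eclass: &[Expr]) -> &Expr {
    eclass.iter().min_by_key(|e| cost(e)).unwrap()
}
```

The contrast with a conventional sea-of-nodes rewriter is exactly this deferral: a destructive rewrite would replace the multiply with the shift immediately, while the eclass keeps both and lets a later cost-aware pass decide.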
We had to solve a few novel problems in working out how to handle control flow, and we're still polishing off some rough edges (search recent issues in the repo for egraphs); but we're mostly happy how it turned out!
(Disclosure: tech lead of CL for a while; the e-graphs optimizer is "my fault")
(And thanks!)