> "The integration involves modifying the TransformerDecoder module in torchtune to bypass the linear layer computation, allowing the Liger Fused Linear Cross Entropy Loss to handle the forward projection weights. "
Note that this wasn't integrated into PyTorch itself but into torchtune, which is a different project. If you're writing your own training loop you need to use a third-party kernel, e.g. the Liger kernel mentioned in the article, or Cut Cross Entropy (which is much better than the Liger one, although IIRC it has a numeric bug in one of its kernels making the results very slightly off).
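For anyone wondering what these fused kernels actually buy you: a toy pure-Python sketch of the idea (my own names and toy sizes, not the Liger/CCE API). Instead of materializing the full [num_tokens, vocab] logits matrix and then taking the loss over it, you project and reduce one chunk of tokens at a time, so peak memory stays O(chunk × vocab) instead of O(num_tokens × vocab):

```python
import math

def matmul(rows, weight):
    # rows: [n][hidden], weight: [vocab][hidden] -> logits [n][vocab]
    return [[sum(h * w for h, w in zip(row, wrow)) for wrow in weight]
            for row in rows]

def cross_entropy(logits, target):
    # numerically stable log-sum-exp minus the target logit
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def chunked_linear_ce(hidden, weight, targets, chunk=2):
    # mean loss over all tokens, never holding more than `chunk`
    # rows of logits in memory at once
    total = 0.0
    for i in range(0, len(hidden), chunk):
        logits = matmul(hidden[i:i + chunk], weight)
        for row, t in zip(logits, targets[i:i + chunk]):
            total += cross_entropy(row, t)
    return total / len(hidden)
```

A real kernel does this fused on the GPU (and handles the backward pass without the logits too); this only shows the memory argument, which matters because vocab sizes are in the 100k+ range.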
My favorite SNES game (Uncharted Waters 2) is a 2MB ROM.
I think about that every time I send a screenshot. The depth, complexity, and audiovisual beauty of that game stuffed into a space only a few times larger than a capture of my 1440p monitor in 2026.
A solid suggestion, but a big point of porting it to C# is the performance gains, which the CLR would mitigate. I know it'll be faster than running in a browser - where the game will also run - but if you're offering something for "performance", I don't think the time is best spent on making my job of composing the package easier. I think I'd rather try to figure out how to go whole-hog and compile as much of the game into an AOT package as possible. But, for what it's worth, the entire game engine was written in C# and ported into JS for the express purpose of being able to back-port the packaged code into C#. So I'm hoping it's not too onerous to do the native transpilation, either.
> "Well," said Bear. "Leave rabbit thoughts alone in the dark, they build a little rabbit to have them. That's just what thoughts do. They're extremely bad at minding their own business."
> The shift for me was realizing test generation shouldn’t be a one-off step. Tests need to live alongside the codebase so they stay in sync and have more context.
Does the actual test code generated by the agent get persisted to the project?
If not, you have kicked the proverbial can down the road.
Yes gavinray, it gets persisted to the project. It lives alongside the codebase, so any generated test has full context of what is being shipped, which lets the AI models test each feature more accurately and consistently.
In my opinion it would be way cooler if it actually created a real Linux desktop environment instead of only a replica.
Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.
I find things like Claude's C compiler way more interesting where, even though CCC is objectively bad (the code is messy, it generates very bad unoptimized output, etc.), it at least is something cool and shows that with some human guidance it could generate something even better.
I'd say it's almost something of a rite of passage to get taken advantage of if you're young and working in tech startups. Usually this is in the form of abysmally low pay, along with "Sweat Equity":
It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code.
You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
This seems counterintuitive, so I googled it. Apparently, it boils down to instruction cache efficiency and branch prediction, largely. The best content I could find was this post, as well as some scattered comments from Mike Pall of LuaJIT fame:
Interestingly, this is also discussed on a similar blogpost about using Clang's recent-ish [[musttail]] tailcall attribute to improve C++ JSON parsing performance:
Yeah, Clang's musttail and preserve_none make interpreter writing much simpler: just make yourself a guaranteed-tail-call opcode dispatch method (continuation-passing style works a treat here), stitch those together using Copy-and-Patch, and you have yourself a down-and-dirty JIT compiler.
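To make the shape concrete, here's a toy continuation-passing dispatch loop (hypothetical opcodes, sketched in Python since that's easier to show in a comment): each handler does its work and then directly calls the handler for the next opcode instead of going back through a central switch. In C, with [[musttail]], each handler compiles down to a small function ending in a jump, which is exactly the unit Copy-and-Patch stitches together:

```python
def run(code):
    # code is a flat list, e.g. ["PUSH", 2, "PUSH", 3, "ADD", "HALT"]
    stack = []

    def dispatch(pc):
        # in C this call would be a guaranteed tail call (a jump)
        return HANDLERS[code[pc]](pc)

    def op_push(pc):            # PUSH <imm>: push immediate operand
        stack.append(code[pc + 1])
        return dispatch(pc + 2)

    def op_add(pc):             # ADD: pop two, push sum
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
        return dispatch(pc + 1)

    def op_halt(pc):            # HALT: result is top of stack
        return stack.pop()

    HANDLERS = {"PUSH": op_push, "ADD": op_add, "HALT": op_halt}
    return dispatch(0)
```

Python has no tail-call elimination so this grows the call stack; the point is only the structure: one tiny function per opcode, dispatch by jumping handler-to-handler, which is what keeps the hot path branch-predictor- and icache-friendly in C.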
> It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code. You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
It is funny, but (as I’ve already mentioned[1] a few months ago) for serialization(-adjacent) formats in particular the preferential position of bytecode interpreters has been rediscovered again and again.
The earliest example I know about is Microsoft’s MIDL, which started off generating C code for NDR un/marshalling but very soon (ca. 1995) switched to bytecode programs (which Microsoft for some reason called “format strings”; these days there’s also typelib marshalling and WinRT metadata-driven marshalling, the latter completely undocumented, but both data-driven). Bellard’s nonfree ffasn1 also (seemingly) uses bytecode, unlike the main FOSS implementations of ASN.1. Protocol Buffers started off with codegen (burying Google users in de/serialization code) but UPB uses “table-driven”, i.e. bytecode, parsing[2].
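A miniature of the “format string” idea, to make it concrete (my own toy encoding, not MIDL's actual format): instead of generating per-type serialization code, describe each struct as a little program (here, one opcode per field) and run it through a single generic interpreter. One small loop replaces N copies of generated code, which is also what keeps it hot in the instruction cache:

```python
import struct

OPS = {  # opcode -> (struct format, byte size)
    "u8":  ("B", 1),
    "u32": ("<I", 4),
    "f64": ("<d", 8),
}

def marshal(program, values):
    # walk the "format string" and emit one field per opcode
    out = bytearray()
    for op, v in zip(program, values):
        fmt, _ = OPS[op]
        out += struct.pack(fmt, v)
    return bytes(out)

def unmarshal(program, data):
    # same program drives decoding, so encode/decode can never disagree
    values, off = [], 0
    for op in program:
        fmt, size = OPS[op]
        values.append(struct.unpack_from(fmt, data, off)[0])
        off += size
    return values
```

A real implementation adds opcodes for strings, arrays, unions, embedded pointers, etc., but the shape is the same: the type description is data, and there is exactly one interpreter.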
The most interesting chapter in this long history is in my opinion Swift’s bytecode-based value witnesses[3,4]. Swift (uniquely) has support for ABI compatibility with polymorphic value types, so e.g. you can have a field in the middle of your struct whose size and alignment only become known at dynamic linking time. It does this in pretty much the way you expect[5] (and the same way IBM’s SOM did inheritance across ABI boundaries decades ago): each type has a vtable (“value witness”) full of compiler-generated methods like size, alignment, copy, move, etc., which for polymorphic type instances will call the type arguments’ witness methods and compute on the results. Anyways, here too the story is that they started with native codegen, got buried under the generated code, and switched to bytecode instead. (I wonder—are they going to PGO and JIT next, like hyperpb[6] for Protobuf? Also, bytecode-based serde when?)
You could probably rephrase almost any enum dispatch whatsoever as a "bytecode interpreter" of a sort, especially if run recursively to parse over some kind of sequence. If bytecode helps you achieve a more succinct representation of some program code than the native binary representation for your architecture, it makes sense that this could be faster in some cases.
Not in serde itself, but people have been experimenting with serde alternatives that are bytecode based. Nothing stable as far as I know, just experiments.
Another experiment is https://lib.rs/crates/facet which takes a more general approach: derives that generate compile-time introspection metadata tables.
Depending on your definition of "death", I've been there (no heartbeat, stopped breathing for several minutes).
In the time between my last memory and being revived in the ambulance, there was no experience/qualia. Like a dreamless sleep: you close your eyes, and then you wake up; it's morning, yet it feels like no time has passed.
It's entirely too much to put in a Hacker News comment, but if I had to phrase my beliefs as precisely as possible, it would be something like:
> "Phenomenal consciousness arises when a self-organizing system with survival-contingent valence runs recurrent predictive models over its own sensory and interoceptive states, and those models are grounded in a first-person causal self-tag that distinguishes self-generated state changes from externally caused ones."
I think that our physical senses and mental processes are tools for reacting to valence stimuli. Before an organism can represent "red"/"loud" it must process states as approach/avoid, good/bad, viable/nonviable. There's a formalization of this known as the "Psychophysical Principle of Causality."
Valence isn't attached to representations -- representations are constructed from valence. I.e., you don't first see red and then decide it's threatening. The threat-relevance is the prior, and "red" is a learned compression of a particular pattern of valence signals across sensory channels.
Humans are constantly generating predictions about sensory input, comparing those predictions to actual input, and updating internal models based on prediction errors. Our moment-to-moment conscious experience is our brain's best guess about what's causing its sensory input, while constrained by that input.
This might sound ridiculous, but consider what happens when consuming psychedelics:
As you increase the dose, predictive processing falters and bottom-up errors increase, so the raw sensory input passes through increasingly weaker model-fitting filters. At the extreme, the "self" vanishes and raw valence is all that is left.
I think your idea of consciousness is more like human/animal consciousness. Which is reasonable since that’s all we have to go off of, but I take it to mean any kind of experience, which might arise due to different types of optimisation algorithms and selective pressures.
I’m not sure I agree that everything is valence, unless I’m misunderstanding what you mean by valence. I guess it’s valence in the sense that sensory information is a specific quality with a magnitude.
I don’t think that colours, sounds and textures are somehow made out of pleasure and pain, or fear and desire. That just isn’t my subjective experience of them.
I do think that human consciousness is something like a waking dream, like how we hallucinate lots of our experiences rather than perceiving things verbatim. Perception is an active process much more than most people realise as we can see from various perceptual illusions. But I guess we’re getting more into cognition here.
https://pytorch.org/blog/peak-performance-minimized-memory/
Is this the same thing as you discuss above?