Hacker News | elseless's comments

Sure. You can think of a (simple) traditional CPU as executing instructions in time, one at a time[1] — it fetches an instruction, decodes it, performs an arithmetic/logical operation (or maybe a memory operation), and then the instruction is considered complete.

The Efficient architecture is a CGRA (coarse-grained reconfigurable array), which means that it executes instructions in space instead of time. At compile time, the Efficient compiler looks at a graph made up of all the “unrolled” instructions (and data) in the program, and decides how to map it all spatially onto the hardware units. Of course, the graph may not all fit onto the hardware at once, in which case it must also be split up to run in batches over time. But the key difference is that there’s this sort of spatial unrolling that goes on.

This means that a lot of the work of fetching and decoding instructions and data can be eliminated, which is good. However, it also means that the program must be mostly, if not completely, static, meaning there’s a very limited ability for data-dependent branching, looping, etc. to occur compared to a CPU. So even if the compiler claims to support C++/Rust/etc., it probably does not support, e.g., pointers or dynamically-allocated objects as we usually think of them.

[1] Most modern CPUs don’t actually execute instructions one-at-a-time — that’s just an abstraction to make programming them easier. Under the hood, even in a single-core CPU, there is all sorts of reordering and concurrent execution going on, mostly to hide the fact that memory is much slower to access than on-chip registers and caches.
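As a rough sketch of what "executing in space" plus batching could look like, here is a toy placement routine in Python. Everything here (the op list, the grid, the function names) is invented for illustration; it is not the actual Efficient toolchain.

```python
def place_on_fabric(ops, rows, cols):
    """Toy spatial mapping: assign each op of a topologically-sorted
    dataflow graph to a (row, col) processing element.  When the ops
    outnumber the PEs, the remainder spills into later configurations
    that run sequentially after a reconfiguration step."""
    capacity = rows * cols
    configs = []
    for base in range(0, len(ops), capacity):
        chunk = ops[base:base + capacity]
        # Each op gets a fixed grid coordinate for this configuration.
        placement = {op: divmod(i, cols) for i, op in enumerate(chunk)}
        configs.append(placement)
    return configs
```

A real mapper would place connected ops near each other to keep routing short; this sketch only shows the capacity/batching trade-off.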


Pointers and dynamic objects are probably fine given the ability to do indirect loads, which I assume they have (Side note: I have built b-trees on FPGAs before, and these kinds of data structures are smaller than you think). It's actually pure code size that is the problem here rather than specific capabilities, as long as the hardware supports those instructions.

Instead of assembly instructions taking time in these architectures, they take space. You will have a capacity of 1000-100000 instructions (including all the branches you might take), and then the chip is full. To get past that limit, you have to store state to RAM and then reconfigure the array to continue computing.


Agree that code size is a significant potential issue, and that going out to memory to reprogram the fabric will be costly.

Re: pointers, I should clarify that it’s not the indirection per se that causes problems — it’s the fact that, with (traditional) dynamic memory allocation, the data’s physical location isn’t known ahead of time. It could be cached nearby, or way off in main memory. That makes dataflow operator latencies unpredictable, so you either have to 1. leave a lot more slack in your schedule to tolerate misses, or 2. build some more-complicated logic into each CGRA core to handle the asynchronicity. And with 2., you run the risk that the small, lightweight CGRA slices will effectively just turn into CPU cores.
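To make option 1 concrete, here is a toy static schedule in Python that reserves worst-case latency for every memory op. All the numbers and names are made up for illustration:

```python
def static_schedule(ops, mem_worst_case=40):
    """Toy static schedule: every op fires at a cycle fixed at compile
    time.  An op with unpredictable latency (a memory access) must
    reserve its worst-case latency as slack, delaying everything
    downstream even when the access would actually have hit nearby."""
    t = 0
    start_times = {}
    for name, latency in ops:
        start_times[name] = t
        t += mem_worst_case if latency == "mem" else latency
    return start_times, t  # per-op start cycles, total schedule length
```

With one cheap add sandwiched between two loads, the schedule length is dominated entirely by the reserved memory slack.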


Oh, many embedded architectures don't have a cache hierarchy and instead place dynamic objects in one SRAM. Access latency is constant anywhere you go.


Hmm. You'd be able to trade off time for that space by using more general configurations that you can dynamically map instruction-sequences onto, no?

The mapping wouldn't be as efficient as a bespoke compilation, but it should be able to avoid the configuration swap-outs.

Basically a set of configurations that can be used as an interpreter.


I think that footnote is close to the heart of it: on a modern OoO superscalar processor, there are hundreds of instructions in flight. That means a lot of work is done to maintain their state and ensure that they "fire" when their operands are satisfied. I think that's what this new system is about: a distributed, scalable dataflow-orchestration engine.

I think this still depends very much on the compiler: whether it can assemble "patches" of direct dependencies to put into each of the little processing units. The edges between patches are either high-latency operations (memory) or inter-patch links resulting from partitioning the overall dataflow graph. I suspect it's the NoC addressing that will be most interesting.


> it executes instructions in space instead of time. At compile time, the Efficient compiler looks at a graph made up of all the “unrolled” instructions (and data) in the program, and decides how to map it all spatially onto the hardware units.

Naively that sounds similar to a GPU. Is it?


No? GPUs are just extremely parallel processors built from much wider SIMD cores.


You managed to explain that in a way that even I could understand. Magnificent, thank you.


> meaning there’s a very limited ability for data-dependent branching, looping, etc. to occur compared to a CPU

Not very useful then if I can't do this very basic thing?


Precisely this. And don’t forget about bugs in virtualization layers/drivers — ZFS can very often save your data in those cases, too.


I once managed to use ZFS to detect a bit flip on a machine that did not have ECC RAM. All python programs started crashing in libpython.so on my old desktop one day. I thought it was a bug in ZFS, so I started debugging. I compared the in-memory buffer from ARC with the on-disk buffer for libpython.so and found a bit flip. At the time, accessing a snapshot through .zfs would duplicate the buffer in ARC, which made it really easy to compare the in-memory buffer against the on-disk buffer. I was in shock as I did not expect to ever see one in person. Since then, I always insist on my computers having ECC.
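The comparison itself is straightforward once you have both copies. A minimal sketch of locating flipped bits between two buffers (generic Python, not ZFS/ARC internals):

```python
def find_bit_flips(buf_a, buf_b):
    """Return (byte_offset, bit_index) for every bit that differs
    between two equal-length byte buffers."""
    flips = []
    for offset, (a, b) in enumerate(zip(buf_a, buf_b)):
        diff = a ^ b  # nonzero bits mark disagreements in this byte
        while diff:
            bit = (diff & -diff).bit_length() - 1  # lowest set bit
            flips.append((offset, bit))
            diff &= diff - 1  # clear it and keep scanning
    return flips
```

A single-event upset would show up as exactly one (offset, bit) pair, as in the libpython.so case above.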


In my opinion, the biggest problem with Apple’s external displays is their 60 Hz refresh rate. That’s half of what their own iPhone (!) and MacBook Pro models support, and is a far cry from the 240 Hz (albeit at lower resolutions) displays that are starting to pop up from other manufacturers.


H.T. Kung’s 1982 paper on systolic arrays is the genesis of what are now called TPUs: http://www.eecs.harvard.edu/~htk/publication/1982-kung-why-s...
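To give a flavor of the idea, here is a toy cycle-by-cycle simulation of an output-stationary systolic array in Python — a sketch of the general technique, not Kung's exact formulation or any real TPU. Each PE (i, j) accumulates its own C[i][j] as skewed rows of A and columns of B stream past it:

```python
def systolic_matmul(A, B):
    """Cycle-accurate toy simulation of an N x N output-stationary
    systolic array computing C = A @ B for square matrices.  Rows of A
    flow rightward and columns of B flow downward, each skewed by one
    cycle per row/column, so PE (i, j) sees the operand pair
    (A[i][k], B[k][j]) at cycle i + j + k."""
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for cycle in range(3 * N - 2):  # last pair arrives at cycle 3(N-1)
        for i in range(N):
            for j in range(N):
                k = cycle - i - j
                if 0 <= k < N:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

The appeal is that each PE only ever talks to its neighbors: no global fetch, decode, or register file, which is the same instruction-overhead argument made for the CGRA above.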


It has been reviewed by the author’s peers — it was accepted to ISCA ‘22.


The Landauer principle states that any computation involving information erasure (e.g., taking the hash of blocks) must correspond to some nonzero increase in entropy, which would be expressed here as heat.
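A back-of-the-envelope number, using the standard E = k_B · T · ln 2 per erased bit (the 300 K figure is just a room-temperature assumption):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact, 2019 SI)

def landauer_limit(bits_erased, temp_kelvin=300.0):
    """Minimum heat in joules dissipated by irreversibly erasing
    `bits_erased` bits at the given temperature."""
    return bits_erased * K_B * temp_kelvin * math.log(2)
```

At room temperature this works out to roughly 2.9e-21 J per bit — vastly below what real hash hardware dissipates, so Landauer sets a floor in principle, not the practical cost.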


Proof-of-stake can never achieve the same trust model as proof-of-work. Proof-of-space-and-time can.

See Andrew Poelstra ‘15: https://nakamotoinstitute.org/static/docs/on-stake-and-conse...


This script is really nice: https://github.com/Nyr/openvpn-install

It sets up systemd and iptables, generates all the certs and keys, and wraps them up into tidy per-client .ovpn files.


I found this out the hard way a month ago when I updated my hackintosh from 10.13 to 10.14.

Never assume, even several weeks after a macOS release, that working Nvidia drivers will be available!


You live on the edge... I am on a legit Mac and I'm still on HS for at least another month or two, just to be safe.


ARM needs to get some skin in the accelerator game before RISC-V et al. commoditize its cash cow.

NVDLA is fairly permissively licensed (free for commercial use), but of course Nvidia will steer the greater ecosystem around it. Perhaps ARM can be the Red Hat to NVDLA's Linux, or something like that. Still seems a bit strange to me.


>> ARM needs to get some skin in the accelerator game before RISC-V et. al. commoditize its cash cow.

For their own sake, sure. I look forward to RISC-V taking over a large part of the world ;-) The Esperanto chip looks so promising, even without proprietary extensions.

