mshockwave's comments

One thing I observed is that RVV code is usually slower in QEMU

LLVM now has another way to implement RTTI using the `CastInfo` trait instead of `classof`: https://llvm.org/doxygen/structllvm_1_1CastInfo.html

But it's really just an implementation difference, the idea is still to have a lightweight RTTI.


how did it do regalloc before instruction selection? How do you select the correct register class without knowing which instruction you're gonna use?


> I don’t know many good reasons for extrusive linked lists

for one, its iterators won't be invalidated when elements are inserted or removed elsewhere in the list


That depends on which array & extrusive linked list classes you’re talking about. Let me put it another way: in three decades of professional coding in scientific computing, video games, film vfx, web programming, and GPU driver and hardware development, I’ve never had to reach for an extrusive linked list for work. I’ve only ever used them for learning, teaching, and toy projects.


Is it normal to spend 10 minutes on tuning nowadays? Do we need to spend another 10 minutes whenever the code changes?


You mean autotuning? I think 10 minutes is pretty normal; torch.compile(mode='max-autotune') can take much longer than that for large models.


Add to that, it can be done just once by developers before distribution, for each major hardware target. The configs are saved, then the right one is selected on the client side.


It's likely that the Swift compiler is using LLVM's lit (https://llvm.org/docs/CommandGuide/lit.html), which is implemented in Python, as the test driver


Python and lit are used heavily to build and test the compiler, but only for building it; you don't need them to download and use the built toolchain. The Python dependency is more about its use in LLDB.
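For context, a lit test is just a text file whose `RUN:` lines lit shells out to. A hypothetical sketch in the Swift test suite's style (the `%target-swift-frontend` and `%s` substitutions are lit's; this exact test is made up):

```
// RUN: %target-swift-frontend -emit-silgen %s | %FileCheck %s

func answer() -> Int { return 42 }
// CHECK: sil {{.*}}answer
```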


> In the end, programs will want probably to stay conservative and will implement only the core ISA

Unlikely; as pointed out in sibling comments, the core ISA is too limited. What might prevail is profiles, specifically profiles for application processors like RVA22U64 and RVA23U64, of which the latter makes a lot more sense IMHO.


Come on, the point was to stick to the core ISA as much as possible.

I had to clarify the obvious: if a program doesn't need more than conservative usage of the ISA to run at reasonable speed, no hardcore hardware change should be investigated.

Additionally, the 'adding new machine instructions' fanboys tend to forget about machine-instruction fusion (they probably want their names in the extension specifications), which has to be investigated first; and often, in such niche cases, it may not be the CPU to think about but specialized ASIC blocks and/or FPGAs.


yes, it has been done for at least a decade if not more

> Even more of a wild idea is to pair up two cores and have them work together this way

I don't think that'll be profitable, because...

> When you have a core that would have been idle anyway

...you'll just schedule in another process. A modern OS rarely runs short of runnable tasks


The article is easy to follow, but I think the author missed the point: branchless programming (a subset of the better-known constant-time programming) is almost exclusively used in cryptography nowadays. As the article's benchmarks show, modern branch predictors have easily achieved over 95%, if not 99%, accuracy for about a decade now


yes, the short answer is LLVM uses RegPressureTracker (https://llvm.org/doxygen/classllvm_1_1RegPressureTracker.htm...) to do all those calculations. Slightly longer answer: I should probably be more specific that in most cases, the Machine Scheduler cares more about the register pressure _delta_ caused by a single instruction, traversing either bottom-up or top-down. In that case it's easier to make an estimate when some of the other instructions haven't been scheduled yet.


