
Sometimes (and very often in some scenarios/industries, e.g. HPC for graphics and simulation, with indices for things like points, vertices, primvars, voxels, etc.) you also want good efficiency in the size of the datatype for memory/cache performance reasons, because you're storing millions of them and need random addressing (so you can't really bit-pack to, say, 36 bits, at least not without overhead from moving away from native types, which are really needed for maximum speed without any branching).

Losing half the range to make them signed when you only care about positive values 95% of the time (and in the rare case when you do any modulo on top of them you can cast, or write wrappers for that), is just a bad trade-off.

Yes, you've still then only doubled the range to 2^32, and you'll still hit it at some point, but that extra byte can make a lot of difference from a memory/cache efficiency standpoint without jumping to 64-bit.

So uint32_t is very often a good sweet spot for size: int32_t is sometimes too small, and (u)int64_t is generally not needed and too wasteful.
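To make the memory argument concrete, here's a back-of-the-envelope sketch (the 100M count is hypothetical, just to illustrate the 2x footprint difference):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Hypothetical: an index buffer referencing ~100M points/vertices.
        const size_t count = 100000000;
        // 32-bit indices: 4 bytes each, range up to ~4.29 billion.
        printf("uint32_t indices: %zu MiB\n", count * sizeof(uint32_t) / (1024 * 1024));
        // 64-bit indices double the footprint (and halve effective cache capacity).
        printf("uint64_t indices: %zu MiB\n", count * sizeof(uint64_t) / (1024 * 1024));
        return 0;
    }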


> HPC for graphics and simulation with indices

Those are not sizes of data structures.

> Losing half the range

It's not part of the range of sizes you can actually use with any typical data structure.

> Losing half the range to make them signed when you only care about positive values 95% of the time is just a bad trade-off.

It's the right choice for sizes in the standard library (in C++) or in standard-ish/popular libraries in C. And, again, it's the wrong type: for example, even if you only care about positive values, their difference is not necessarily positive.
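A minimal example of that difference pitfall (nothing project-specific, just the wrap-around):

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint32_t a = 3, b = 7;
        // Both values are positive, but their difference isn't:
        // unsigned arithmetic wraps, so a - b is a huge positive number.
        printf("a - b = %u\n", a - b);                // 4294967292
        printf("as signed: %d\n", (int32_t)(a - b));  // -4 (after a cast)
        return 0;
    }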


As I generally believed in Moore's law, i.e. accepted the notion that transistor counts grew exponentially, I was surprised at how long the difference between a 2 GiB and a 3 GiB address space stayed relevant in practice.

In theory, it should have been at most a year. In practice, the Windows XP /3GB boot switch (which gives user mode 3 GiB of virtual address space and the kernel 1 GiB, instead of the usual 2 and 2) stayed relevant for many years.
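A quick way to see which split a 32-bit process actually got (a small Win32 sketch; the process only sees the extra gigabyte if it was also linked /LARGEADDRESSAWARE):

    #include <windows.h>
    #include <cstdio>

    int main() {
        MEMORYSTATUSEX ms = { sizeof(ms) };
        GlobalMemoryStatusEx(&ms);
        // On 32-bit Windows this reports ~2 GiB normally, and ~3 GiB when
        // booted with /3GB *and* the EXE is linked /LARGEADDRESSAWARE.
        printf("user virtual address space: %llu MiB\n",
               ms.ullTotalVirtual / (1024 * 1024));
        return 0;
    }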


If 64-bit was an easy option for you, the transition didn't wait for the /3GB switch - it typically happened at about 1 GB of RAM, and yeah, the window wasn't very long, just as you'd imagine from Moore's law.

So the /3GB switch is for people who were stuck on the wrong hardware for a variety of reasons, and its timing reflects how long those people stayed trapped rather than how long it took for this to become a bad idea (it was a bad idea before it even shipped, but it was necessary).

Linux had some more extreme splits, including a 3.5:0.5 split and a nasty 4:4 split (in which all the userspace addresses are invalidated when in kernel space, ugh), and it's for the same reason: these aren't customers who chose not to go 64-bit, they're customers who can't yet, and who will pay $$$$ to keep what they're doing working for just a while longer anyway despite that.
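For what it's worth, mainline 32-bit x86 still exposes some of these splits as Kconfig options (the 4:4 split was an out-of-tree/vendor patch, if I remember right); an illustrative selection:

    # 32-bit x86 Kconfig (illustrative)
    CONFIG_VMSPLIT_3G=y    # the default 3 GiB user / 1 GiB kernel split
    # CONFIG_VMSPLIT_2G    # 2 GiB / 2 GiB
    # CONFIG_VMSPLIT_1G    # 1 GiB user / 3 GiB kernel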


In terms of Microsoft's end, or in general, with the number of new repos and pushes/commits from other people vibe-coding?

> Even the latest CPUs have a 2:1 fp64:fp32 performance ratio

Not completely - for basic operations (and ignoring byte size for things like cache hit ratios and memory bandwidth), if you look at, say, Agner Fog's optimisation PDFs of instruction latencies, the SSE/AVX latency for basic add/sub/mul/div (yes, even divides these days) is almost always the same for float and double on the most recent AMD/Intel CPUs (and execution ports can normally handle both now).

Where it differs is gather/scatter and some shuffle instructions (a larger size to work on), and maths routines like transcendentals - sqrt(), sin(), etc. - where the backing algorithms (whether on the processor in some cases, or in libm or equivalent) obviously have to do more work (often more iterations of refinement) to calculate the value to the greater f64 precision.
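A tiny sketch of what "same latency, different width" means in practice (assuming AVX 256-bit registers):

    #include <immintrin.h>

    // Same instruction class and (on recent cores) the same latency,
    // but the float version processes 8 lanes per op vs 4 for double.
    __m256  add_f32(__m256 a, __m256 b)   { return _mm256_add_ps(a, b); } // 8 x f32
    __m256d add_f64(__m256d a, __m256d b) { return _mm256_add_pd(a, b); } // 4 x f64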


> the latency between float and double is almost always the same on the most recent AMD/Intel CPUs

If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with noticeable performance difference.

https://gcc.godbolt.org/z/7155YKTrK
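e.g. on a single-precision-only FPU like the Cortex-M4F's, floats map to one FPU instruction while doubles fall back to soft-float library calls (build flags are illustrative):

    // Built with something like:
    //   arm-none-eabi-g++ -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -O2
    float  add_f32(float a, float b)   { return a + b; } // one vadd.f32 (hardware)
    double add_f64(double a, double b) { return a + b; } // call to __aeabi_dadd (software)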


> ... if you look at (say Agner Fog's optimisation PDFs of instruction latency) ...

That... doesn't seem true? At least for most architectures I looked at?

While it's true that ADDPS and ADDPD have the same latency (using the Zen 4 example at least), the double variant only calculates 4 fp64 values compared to the single-precision's 8 fp32. Which was my point: if each double-precision instruction processes a smaller number of inputs, it needs to be lower latency to keep the same operation rate.

And DIV also has a significantly lower throughput for fp64 vs fp32 on Zen 4 - 5 clk/op vs 3 - while also processing half the values?

Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has a lower throughput) - but then you're already leaving significant peak flops on the table, so I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited, you don't need to think about performance" - which has always been the case.

So yes, they do at least have a 2:1 difference in throughput on Zen 4 - even higher for DIV.
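A crude way to see the throughput side (not latency) for yourself - a toy probe that relies on the compiler auto-vectorizing the loop, so the numbers are only indicative:

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Toy throughput probe: elements divided per second for f32 vs f64.
    template <typename T>
    double divElemsPerSec(size_t n) {
        std::vector<T> a(n, T(1.5)), b(n, T(1.00001)), c(n);
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < n; ++i) c[i] = a[i] / b[i];
        auto t1 = std::chrono::steady_clock::now();
        volatile T sink = c[n / 2]; (void)sink; // keep the loop from being elided
        return n / std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const size_t n = size_t(1) << 26;
        printf("f32 div: %.3g elems/s\n", divElemsPerSec<float>(n));
        printf("f64 div: %.3g elems/s\n", divElemsPerSec<double>(n));
        return 0;
    }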


Well, maybe not all, admittedly - and I didn't look at AVX2/512 - but it looks like `_mm_div_ps` and `_mm_div_pd` are identical for divide at the 128-bit (4-wide float) level, for the basics.

Obviously, the wider you go, the more constrained you are on infrastructure and how many ports there are.

My point was more that it's very often the expensive transcendentals where the performance difference is felt between f32 and f64.


This depends largely on your operations. There's a lot of performance-critical code that doesn't vectorize smoothly, and for those operations, 64-bit is just as fast.


Yes, if you're not FP ALU limited (which is likely the case if not vectorized), or data-cache/bandwidth/thermally limited from the increased cost of fp64, then it doesn't matter - but as I said, that's true of every performance aspect that "doesn't matter".

That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".


No it doesn't - it allows a single writer and concurrent READs at the same time.


Thanks! Even I run SQLite in "production" (is it production if you have no visitors?) with WAL mode enabled, but I had to work around concurrent writes, so I was really confused. I may have misunderstood the comments.


Writes are super fast in SQLite even if they are not concurrent.

If you were seeing errors due to concurrent writes, you should adjust the busy_timeout setting.
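Rough sketch of both settings via the C API (error handling elided, filename hypothetical):

    #include <sqlite3.h>

    int main() {
        sqlite3 *db = nullptr;
        sqlite3_open("app.db", &db);

        // WAL: readers keep going while a single writer writes.
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", nullptr, nullptr, nullptr);

        // Instead of failing immediately with SQLITE_BUSY when another
        // connection holds the write lock, wait up to 5000 ms for it.
        sqlite3_busy_timeout(db, 5000); // same as: PRAGMA busy_timeout=5000

        sqlite3_close(db);
        return 0;
    }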


Thanks, I'll have a look. For now I just have a sane retry strategy. Not that I have any traffic, mind you :-)))


One reason is that Low Probability of Intercept radars (and transmitters/datalinks) do exist, and are very difficult (but not impossible) to identify and locate.


$2 split between Iran and Oman...


They're discussing how to manage SD cards, and Houston wants them to sign in and out (by initialing in MS OneNote!) every time they change windows.


Exactly the same situation with me in terms of gmail address (although my names are less common).

I get so many other $MY_NAME emails, including bills (including multiple credit cards and things like Afterpay), deliveries, medical details/reports, family communications, etc, etc.

And it's very clear that quite a few online services blatantly don't verify email addresses; they just assume the address is valid and let the person start using it.


Possibly a combination of moving infrastructure to Azure, and also a significant increase in the number of PRs and commits due to vibe-coding?



It comes down to how "coherent" the rays are, and how much effort (compute) you want to put into sorting them into batches of rays.

With "primary" ray-tracing (i.e. camera rays, rays from surfaces to area lights), it's quite easy to batch them up and run SIMD operations on them.

But once you start doing global illumination, with rays bouncing off surfaces in all directions (and with complex materials, with multiple BSDF lobes, where lobes can be chosen stochastically), you start having to put a LOT of effort into sorting and batching rays such that they all (within a batch) hit the same objects or are going in roughly the same direction.
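One of the simpler sorting tricks, as a minimal sketch (real sorters key on much more - origin, material/lobe ID, etc.; names here are hypothetical):

    #include <array>
    #include <cstdint>
    #include <vector>

    struct Ray { float ox, oy, oz, dx, dy, dz; };

    // Bucket secondary rays by direction octant (the signs of the direction
    // components), so each batch travels roughly the same way and tends to
    // visit similar BVH nodes - much friendlier to SIMD traversal.
    std::array<std::vector<Ray>, 8> binByOctant(const std::vector<Ray>& rays) {
        std::array<std::vector<Ray>, 8> bins;
        for (const Ray& r : rays) {
            uint32_t key = (r.dx < 0.0f ? 1u : 0u)
                         | (r.dy < 0.0f ? 2u : 0u)
                         | (r.dz < 0.0f ? 4u : 0u);
            bins[key].push_back(r);
        }
        return bins; // trace each bin as one coherent batch
    }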

