>> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references
It's the collection of tricks to minimize all sorts of cache misses (L1, L2, TLB, page miss, etc.), improve register reuse, leverage SIMD instructions, transpose one of the matrices if it provides better spatial locality, etc.
The trick is indeed to somehow imagine how the CPU works with the Lx caches and keep as much info in them as possible. So it's not only about exploiting fancy instructions, but also thinking in engineering terms. Most software written in higher-level languages cannot use L1/L2 effectively, which is why algorithms with otherwise similar (asymptotic) complexity end up consistently slower.
It's quite unfavorable on modern hardware. A Sapphire Rapids core can do two separate 32-wide half-precision FMAs (vfmadd132ph, [1]) per clock, which is 128 FLOPs/cycle. It is not possible to achieve that kind of throughput with an 8-bit LUT and accumulation; even just a shuffle with vpshufb is too slow.
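The 128 FLOPs/cycle figure can be sanity-checked from the register width and port count (a minimal arithmetic sketch; the port and width numbers are the ones stated above):

```python
# Sanity check of the 128 FLOPs/cycle figure for AVX-512 FP16 FMA
# (vfmadd132ph) on a Sapphire Rapids core, per the comment above.
fma_ports = 2               # two independent FMA pipes per core
register_bits = 512         # zmm register width
lanes = register_bits // 16 # 32 half-precision values per register
ops_per_fma = 2             # one multiply + one add per lane

flops_per_cycle = fma_ports * lanes * ops_per_fma
print(flops_per_cycle)  # 128
```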
That's absolutely wild. If vpshufb alone were enough, the throughput would be the same in terms of values, because there are twice as many values per register and you get to retire half as many instructions, but combining the two inputs and applying a 256-entry LUT takes a bunch more instructions :(
Fair point. It might help if the system is DRAM bandwidth limited, so reducing the data size helps even though individual operations take multiple instructions. But that is not the situation with today's hardware.
I don't like that question because it asks for recollection of a name, as opposed to taking the theorem "when X is true, then Y is true" and changing the question into the form "when X is true, ____???".
Worst case I've seen of this was when I was in 9th grade and our geometry teacher required us to memorize the chapter and section names of theorems in the book when proving. For example, in our proofs about triangles, we had to write "theorem 12.5" or else we wouldn't get credit on the test, and here 12.5 was the chapter and section number in the particular textbook, which is an utterly useless piece of info.
Of course, the name Brauer is not nearly as useless as a chapter number, but still, being familiar with math history probably shouldn't be a hard requirement for being a professional mathematician.
I think his generals are easier than Tao's. I wonder how many "average" PhD candidates worldwide can answer the questions in Tao's generals satisfactorily without difficulties. Many of them just seem highly research-oriented.
I think the mathematical concept that you are looking for is that of the dual space. Essentially if you have a vector space V, you can construct a dual space V* where the elements of the dual space are functions taking elements of V to the underlying field F, and under certain conditions these spaces are isomorphic (the same) - so there is a 1:1 correspondence between elements of the vector space and the functions in the dual space.
One way to view this formula is to use the fact that the Beta distribution is a conjugate prior for the binomial distribution.
Essentially if you have a Beta(a, b) prior then your prior mean is a/(a+b) and after observing n samples from a Bernoulli distribution that are all positive, your posterior is Beta(a+n, b) with posterior mean (a+n)/(a+n+b). So in your example you effectively have a Beta(0, x) prior and x (“suspicious”/“gullible”) is directly interpreted as the strength of your prior!
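The conjugate update described above is simple enough to write out directly (a minimal sketch in plain Python; the function name is mine):

```python
def beta_posterior_mean(a, b, n_pos, n_neg=0):
    """Posterior mean of a Beta(a, b) prior after observing
    n_pos positive and n_neg negative Bernoulli samples."""
    return (a + n_pos) / (a + n_pos + b + n_neg)

# Beta(a, b) prior, n all-positive observations -> (a + n) / (a + n + b)
print(beta_posterior_mean(1, 1, 8))   # uniform prior, 8 positives: 9/10
# The "skeptical" Beta(0, x) prior from above gives n / (n + x):
x = 5
print(beta_posterior_mean(0, x, 10))  # 10/15
```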
Yeah, that's a lot of jargon associated with Bayesian statistics, but at its root the idea is simple: how to merge the information you have before observing some data (a.k.a. the prior) with the new information you just observed, to obtain updated information (a.k.a. the posterior) that includes both what you believed initially and the new evidence.
The probability machinery (Bayes' rule) is a principled way to do this, and in the case of count data (number of positive reviews for the cafe) works out to a simple fraction n/(n+x).
Define:
x = parameter of how skeptical you are in general about the quality of cafes (large x = very skeptical),
m = number of positive reviews for the cafe,
p = (m+1) / (m+1+x) your belief (expressed as a probability) that the cafe is good after hearing m positive reviews about it.
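Plugging numbers into the definitions above shows how x acts as the strength of your skepticism (a tiny illustration; the values are arbitrary):

```python
def belief(m, x):
    # p = (m + 1) / (m + 1 + x): belief the cafe is good after
    # m positive reviews, with skepticism parameter x
    return (m + 1) / (m + 1 + x)

# A skeptic (x = 10) needs many more reviews to be convinced
# than a gullible reader (x = 1):
for m in (0, 3, 30):
    print(m, belief(m, 1), belief(m, 10))
```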
Learning about the binomial and the beta distribution would help you see where the formula comes from. People really like Bayesian machinery, because it has a logical/consistent feel: i.e. rather than coming up with some formula out of thin air, you derive the formula based on general rules about reasoning under uncertainty + updating beliefs.
> Can this way to view the formula be expressed without the terms
You're asking "Can this way of viewing the formula in terms of Bayesian probability be expressed without any of the machinery of Bayesian probability?".
Also, in case anyone is interested, the uninformative Jeffreys prior for this in Bayesian statistics (meaning it does not assume anything and is invariant to certain transformations of the inputs) is Beta(0.5, 0.5). Thus the initial guess is 0.5, and it evolves from there from the data.
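Under the Jeffreys prior Beta(0.5, 0.5) the same posterior-mean formula applies; a minimal sketch (function name is mine) of how the estimate starts at 0.5 and then follows the data:

```python
def jeffreys_estimate(n_pos, n_total):
    # Posterior mean under the Jeffreys prior Beta(0.5, 0.5):
    # (0.5 + n_pos) / (0.5 + n_pos + 0.5 + (n_total - n_pos))
    return (0.5 + n_pos) / (1.0 + n_total)

print(jeffreys_estimate(0, 0))  # 0.5 — the initial guess before any data
print(jeffreys_estimate(4, 5))  # 0.75 — pulled toward the observed 4/5
```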
No, I was not. While Oliver certainly killed more, the reformation and crackdown that Thomas led claimed enough victims that I'm comfortable calling it mass murder.
- https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184...
might be of interest