THE humble distributive law, in its simplest form states that... this leads to a large family of fast algorithms, including Viterbi's algorithm and the fast Fourier transform (FFT).
Two extremely influential papers appeared back to back in the IEEE Transactions on Information Theory. This is one of them.
Interesting. Of course many computations can be expressed as a graph, but in the case of the bipartite graph we perform belief propagation on to decode LDPC codes, where is the optimization from the distributive property? The parity-check matrix would typically be constructed so that there are few subexpressions to factor out, to maximize the error-correcting properties.
I agree both FFT and belief propagation can be expressed as message passing algorithms.
It shows up in pushing parentheses in and pulling common terms out of an expression that is a sum (over all possible assignments) of products of terms.
Doing the summation the naive way is exponential in the number of variables. The goal is to do this efficiently by exploiting the distributive property and any symmetry, much like in the FFT case.
This can be done efficiently, for example, when the graph is a tree. (Even if it isn't, one can pretend that it is. Surprisingly, that often works very well, but that's a different topic entirely.)
There might be an additional optimization in having the writer also cache its write index on the same cache line as its cached copy of the read index. This way the writer would only do writes to the write-index cache line. The hardware might be able to optimize this. I wonder how it interacts with UMWAIT on the latest cores.
Reading its own values is never a problem: the writeIndex cache line will switch from E (or M) to S (or O), but it is never evicted, so reads from it never miss. For frequent writes the read will be fulfilled by the store buffer anyway.
I think it will be invalidated due to an RFO when the reader reads the write index. Only when multiple readers read the same cache line without any intervening write will the RFO heuristic be disabled.
This works much better on my POWER9 system. I have two CPUs in my machine with the same core count but different cache configurations according to hwloc and this showed that they really are different.
This is definitely an option you want to be using when using ASan or LSan. You may also want to consider additionally using -momit-leaf-frame-pointer to skip frame pointers only in leaf functions while keeping frame pointers for non-leaf functions. This can make small leaf functions significantly shorter, limiting some of the negative impact of using -fno-omit-frame-pointer alone.
Sometimes even -fno-omit-frame-pointer won't help, for example if the stack is being unwound through a system library that was built without frame pointers. In that case you can switch to the slow unwinder: set the environment variable `ASAN_OPTIONS=fast_unwind_on_malloc=0` when running your program. But note that this will make most programs run significantly slower, so you probably want to use it only when you really need it and not as the default setting for all runs.
You're incorrect: garbage collection would be the biggest problem for that use case. You have to be really careful even with your C/C++ code, warming up the branch predictor between packets, etc. See my article for some pitfalls: https://rigtorp.se/virtual-memory/
Taking a lock only needs to be an acquire operation plus a compiler barrier with respect to other lock operations. Using seq_cst or acq_rel semantics is stronger than needed. From my reading and discussions with people from WG21, the current argument for why taking a lock only requires acquire semantics is that a compiler optimization that transforms a non-deadlocking program into a potentially deadlocking program is not allowed. There's an interesting Twitter thread where we discuss this, but I can't find it anymore :(.
That is an amazing thread. The fact that C++ apparently allows optimizing
#include <stdio.h>

int stop = 1;

void maybeStop() {
    if (stop)
        for (;;);
}

int main() {
    printf("hello, ");
    maybeStop();
    printf("world\n");
}
into
int main() {
    printf("hello, world\n");
}
(as Clang does today) does not inspire confidence about disallowing moving the loop in the other example. If the compiler is allowed to assume that this loop terminates, why not the lock loop?
Maybe there is a reason, but none of this inspires confidence.
The standard says that a thread must eventually terminate, perform an atomic operation, or perform I/O. So the while(lock.exchange(true)); loop is different.
Also keep in mind that C++11 specifies std::mutex::lock() to have acquire semantics and unlock() to have release semantics on the lock object. For std::mutex to actually work, the reordering of m1.unlock(); m2.lock(); into m2.lock(); m1.unlock(); must be disallowed. But since m1 and m2 are separate objects, m1.unlock() has no happens-before relationship with m2.lock(). This seems to be a problem in the C++11 memory model. The argument I have heard from some WG21 people is that there is no problem, since transforming a well-formed terminating program into a non-terminating program is not allowed. I can't find the wording in the C++ standard that asserts this. But oh well, it works right now on gcc/llvm/msvc.
I don't remember the exact wording, but the standard explicitly makes an exception to the always-terminates assumption for loops that access atomic variables or have side effects (i.e. volatile accesses or I/O).