A server with 100K TCP connections isn't much different, and you might be surprised (or maybe not) by how extensively dynamic data structures are used in the TCP stack implementation. I'm not sure all of them can be made 'cache-friendly,' and embedding all layers together would make the code unmaintainable. Intensive access to external memory is inevitable when accessing a large enough number of unrelated states.
I am using perf. On Graviton, the event to look for is LLC-load-misses, which counts last-level cache (LLC) load misses (i.e., external memory accesses). The command "perf record -e LLC-load-misses -t $(pgrep valkey-server)" samples LLC misses on the valkey-server main thread and attributes them to the instructions that triggered them. Please note that LLC-load-misses events cannot be collected on instances that use only a subset of the processor's cores. "perf list" shows the events that can be collected on your machine.
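For anyone who wants to repeat the measurement, here is a rough sketch of how the steps could be wrapped in a small script. The function names, the llc.data output file, and the 30-second sampling window are my own choices for illustration, not part of any tool. It assumes a single valkey-server process and that perf is installed with permission to profile it.

```shell
#!/bin/sh
# Sketch only: sample LLC load misses on the valkey-server main thread.
# EVENT comes from the discussion above; everything else is illustrative.
EVENT=LLC-load-misses

# Succeeds if the event name appears in `perf list` output read from stdin.
event_listed() {
    grep -qw -- "$1"
}

record_llc_misses() {
    # pgrep -o picks the oldest matching process; on Linux the main
    # thread's TID is the same as the process PID.
    pid=$(pgrep -o valkey-server) || {
        echo "valkey-server is not running" >&2
        return 1
    }
    perf list 2>/dev/null | event_listed "$EVENT" || {
        echo "$EVENT cannot be collected on this machine" >&2
        return 1
    }
    # Sample for 30 seconds (assumed window), then list the hottest symbols.
    perf record -e "$EVENT" -t "$pid" -o llc.data -- sleep 30
    perf report -i llc.data --sort symbol --stdio | head -n 40
}
```

Source the script and call record_llc_misses while the server is under load. If the availability check fails, you are probably on an instance size that does not expose the event.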