Because of noise in general, "best case" has always seemed like the best metric to me. Over a large number of runs, you're likely to hit a near-"perfect" measurement on a microbenchmark.
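A minimal sketch of that idea: run the code many times and keep the minimum. Here `workload` is a hypothetical stand-in for whatever you're measuring (and note a real harness would also guard against the compiler optimizing the call away):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    // Hypothetical function under test -- replace with your own code.
    static void workload() { /* ... */ }

    int main() {
        using clock = std::chrono::steady_clock;
        auto best = clock::duration::max();
        // Many repetitions: the minimum converges toward the
        // least-noisy ("best case") measurement.
        for (int i = 0; i < 10000; ++i) {
            auto start = clock::now();
            workload();
            best = std::min(best, clock::now() - start);
        }
        std::printf("best case: %lld ns\n",
            static_cast<long long>(
                std::chrono::duration_cast<std::chrono::nanoseconds>(best).count()));
    }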
Otherwise, for an "adaptive" number of runs until enough time has been spent to have some "confidence" in the measurement, I've been fairly happy with: https://github.com/google/benchmark/
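For reference, registering a benchmark with that library looks like this (essentially the hello-world from its README); the library picks the iteration count adaptively until the timing stabilizes:

    #include <benchmark/benchmark.h>
    #include <string>

    // The library runs the loop body as many times as needed
    // for a statistically stable measurement.
    static void BM_StringCreation(benchmark::State& state) {
        for (auto _ : state) {
            std::string empty_string;
        }
    }
    BENCHMARK(BM_StringCreation);

    BENCHMARK_MAIN();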