


Came across this post, which gives good insight into the 4 golden signals for top-level health tracking: https://blog.netsil.com/the-4-golden-signals-of-api-health-a...

One thing of note in the graph is the tracking of response size. This would be very useful for 200 responses with "Error" in the text, because the response size would drop drastically below a normal successful payload size.
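A rough sketch of that size-drop check, assuming a rolling window over recent 200-response body sizes (the class name, window, and threshold are all made up for illustration, not from the article):

```python
from collections import deque

class ResponseSizeMonitor:
    """Flag 200 responses whose body is drastically smaller than the recent norm.

    A 200 whose body just says "Error" is far smaller than a real payload,
    so a sharp drop in size is a cheap error signal even when the status
    code lies.
    """

    def __init__(self, window=100, drop_ratio=0.2):
        self.sizes = deque(maxlen=window)   # recent healthy 200 body sizes
        self.drop_ratio = drop_ratio        # flag if below 20% of average

    def observe(self, status, body_size):
        suspicious = False
        if self.sizes and status == 200:
            avg = sum(self.sizes) / len(self.sizes)
            # Flag if this body is far below the rolling average size.
            suspicious = body_size < self.drop_ratio * avg
        if status == 200 and not suspicious:
            self.sizes.append(body_size)    # only learn from plausible responses
        return suspicious
```

Real payloads vary, so in practice you'd alert on the rate of suspicious responses rather than any single one.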

In addition to Latency, Error Rates, Throughput and Saturation, folks like Brendan Gregg @ Netflix have recommended tracking capacity.


Are you a plant? It must just be coincidence that the second post in the series is titled "measuring capacity." :) https://honeycomb.io/blog/2017/01/instrumentation-measuring-...

(bias alert - I work on Honeycomb)


we are all learning from the same folks ahead of us it seems :)

I agree with other comments though: the devil is in the details of how to actually set up these "golden signals" so that they are useful and don't just drown everyone in packet-level nonsense.


TCP retransmission rate looks like a useful metric for monitoring the health of a service. One way to obtain it is by analyzing service interactions, as mentioned in the blog. Tracing could be another way to find that info. I am curious how code-instrumented monitoring solutions get that information. (PS: I work for Netsil)


By default you can only get that per-kernel, from /proc/net/snmp. BPF may allow something more granular.

The other way of approaching it is to look for the additional latency it causes, which you can spot on a per-service basis.


Additional latency could be an indicator, but there's no guarantee that it is because of retransmissions?


If you look at your latency histogram and see a bump around 200ms above normal (which was the default minimum retransmission timeout a few years back, anyway), it's probably retransmits.


Got it.


You can get retransmit counts from 'sar' on Linux.
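e.g. `sar -n ETCP` reports a retrans/s column. A small sketch that pulls that column out of sar's text output (the column layout is assumed from sysstat's ETCP report; your sar version may differ):

```python
def parse_sar_etcp_retrans(sar_output):
    """Pull the retrans/s column out of `sar -n ETCP` text output.

    Returns a list of per-interval retransmission rates. Locates the
    column by name from the header line, then reads it from each
    timestamped data line.
    """
    rates = []
    col = None
    for line in sar_output.splitlines():
        fields = line.split()
        if "retrans/s" in fields:
            col = fields.index("retrans/s")   # header line names the columns
        elif col is not None and fields and fields[0][:1].isdigit():
            try:
                rates.append(float(fields[col]))
            except (IndexError, ValueError):
                pass                          # skip malformed lines
    return rates
```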


I see. But it looks like it is per-host, and there is no way to get it for a particular service running on the host.


Right now we maintain a few select percentiles from the latency distribution over a 1-minute period. We plan to maintain latency histograms, which will allow you to look at the latency distribution over arbitrary time intervals.
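For anyone curious why histograms enable this: fixed-bucket histograms from adjacent windows merge by just adding counts, and percentiles can then be approximated over any span. A toy sketch (the bucket bounds are my own choice, not Netsil's scheme):

```python
import bisect

# Fixed exponential-ish bucket upper bounds in ms (illustrative only).
BOUNDS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]

def to_histogram(latencies_ms):
    """Bucket one window of latencies into counts per bound."""
    counts = [0] * (len(BOUNDS) + 1)        # last bucket = overflow
    for x in latencies_ms:
        counts[bisect.bisect_left(BOUNDS, x)] += 1
    return counts

def merge(h1, h2):
    """Histograms over disjoint windows add bucket-wise; this is what
    makes percentiles over arbitrary time intervals possible."""
    return [a + b for a, b in zip(h1, h2)]

def percentile(hist, p):
    """Approximate the p-th percentile as the upper bound of the bucket
    where the cumulative count crosses p percent of the total."""
    target = p / 100 * sum(hist)
    cum = 0
    for i, c in enumerate(hist):
        cum += c
        if cum >= target:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")
    return float("inf")
```

The tradeoff versus stored percentiles: bounded error from the bucket widths, in exchange for mergeability (you can't average two p99s, but you can add two histograms).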


Any information on pricing?


Netsil AOC is priced by the number of vCPUs or cores that you would be monitoring. You can reach out to us at hello@netsil.com for the exact price quote based on your needs.

