Who am I?
- Studied Computer Engineering
- Did web development for a couple of years
- Moved to systems administration for a couple of years
- Had a run at build and automation engineering
- Landed in SRE, <3 it
- 135 million daily transactions
- 4.7 billion daily API calls most weeks (55k/s average, 100k/s peak)
- 2.5 terabytes of log data per day
- 250,000 time series data points per second
- 14k Nagios checks
You can roll your own instrumentation.

    import time
    from contextlib import contextmanager

    @contextmanager
    def op(what):
        # Time the enclosed block and add the elapsed seconds to a per-op total.
        start = time.time()
        yield
        increment('hitcount.total_s', value=(time.time() - start), tags=["op:" + what])

    while True:
        with op('receive'):
            req = queue.pop()
        with op('compute_route'):
            route = compute_route(req)
        with op('update'):
            db.execute('''
                UPDATE hitcount
                SET hits = hits + 1
                WHERE route = ?
            ''', (route, ))
        with op('finish'):
            req.finish()

(from: https://honeycomb.io/blog/2017/01/instrumentation-measuring-capacity-through-utilization/)
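The slide's snippet relies on an external `increment`, a queue, and a database. As a minimal self-contained sketch of the same timing idea, the version below swaps in a hypothetical in-memory `totals` dictionary for the metrics client and adds a `try/finally` so that time is still recorded when the block raises. The `utilization` helper and the `totals` store are assumptions made for illustration, not part of the original talk.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a real metrics client.
totals = defaultdict(float)


def increment(name, value=1.0, tags=()):
    """Add `value` to the running total for this metric name and tag set."""
    totals[(name, tuple(sorted(tags)))] += value


@contextmanager
def op(what):
    """Time the enclosed block and record elapsed seconds under `op:<what>`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on both success and failure, so failed operations are counted too.
        elapsed = time.perf_counter() - start
        increment('hitcount.total_s', value=elapsed, tags=["op:" + what])


def utilization(window_s):
    """Fraction of a wall-clock window spent in each operation (illustrative helper)."""
    return {tags[0]: total / window_s for (_, tags), total in totals.items()}
```

Wrapping a step in `with op('receive'):` and later calling `utilization(60)` would give the share of the last minute spent in each step, provided the totals were reset at the start of that minute.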
Aspects help instrumentation.

    @Controller
    public class MyController {
        @RequestMapping("/")
        @TimeMethod(name = "app_duration_seconds", help = "Some helpful info here")
        public Object handleMain() {
            // Do something
        }
    }

    from prometheus_client import Counter, Histogram

    c = Counter('request_failure_total', 'Description of counter')
    h = Histogram('request_latency_seconds', 'Description of histogram')

    @c.count_exceptions()
    @h.time()
    def businessFunction():
        # Do something
        pass
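To make it clearer what decorators like the ones above do around the wrapped function, here is a rough standard-library sketch. `SimpleCounter` and `SimpleTimer` are hypothetical names invented for this illustration; they only mimic the shape of `count_exceptions()` and `time()` and are not how any particular metrics library is implemented.

```python
import functools
from time import perf_counter


class SimpleCounter:
    """Hypothetical counter that counts exceptions raised by decorated functions."""

    def __init__(self):
        self.value = 0

    def count_exceptions(self):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    self.value += 1
                    raise  # re-raise so callers still see the failure
            return wrapper
        return decorator


class SimpleTimer:
    """Hypothetical timer that stores the duration of each decorated call."""

    def __init__(self):
        self.observations = []

    def time(self):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                start = perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    # Record the duration whether the call succeeded or raised.
                    self.observations.append(perf_counter() - start)
            return wrapper
        return decorator


failures = SimpleCounter()
latency = SimpleTimer()


@failures.count_exceptions()
@latency.time()
def business_function(fail=False):
    if fail:
        raise RuntimeError("simulated failure")
    return "ok"
```

The business logic stays free of measurement code; the cross-cutting concern lives entirely in the decorators, which is the point the slide makes about aspects.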