Slide 1

Slide 1 text

Operability in Go Improving operations in Go programs 1

Slide 2

Slide 2 text

Who Am I Ian N Schenck SWE, Infrastructure, Oscar Health @ianschenck (I don’t really tweet) ian.schenck@gmail.com 2

Slide 3

Slide 3 text

Preface I am: A SWE who keeps ending up in SRE. I try to write operable code. This is going to break, how do I make it as easy as possible? 3

Slide 4

Slide 4 text

What are Operations 4

Slide 5

Slide 5 text

1. Plan 2. Analyze 3. Design 4. Implement 5. Document/Test 6. Maintain *Not necessarily in that order Software Design Life Cycle (SDLC) 5

Slide 6

Slide 6 text

Maintenance Even if your software is perfect and entirely bug free, it can still break. - Environments are complex and changing - Hardware can break - Humans are buggy 6

Slide 7

Slide 7 text

Maintenance - Failure When something fails, we have two equal objectives: 1. Fix it 2. Determine what went wrong 7

Slide 8

Slide 8 text

Maintenance - Failure 1. Fix it Depends on the situation. Let’s talk about failing well... 8

Slide 9

Slide 9 text

Failing Well - Fail immediately when unrecoverable errors occur. - Fail the smallest execution unit necessary. - Err on the side of caution - fail as big as you need to (maybe the whole application). 9

Slide 10

Slide 10 text

Failing Well In general, an unhandled/unrecoverable error should panic. It should also give clear and concise information about what led to the panic. 10

Slide 11

Slide 11 text

Failing Well - Panic Applications may panic, which will fail up to a deferred recover() E.g. Panic in an HTTP handler will fail up to the serving goroutine. Panic without recover() terminates. 11

Slide 12

Slide 12 text

Failing Well Panic does give a stack trace, but could use more context. Add context around application panics using logging. 12

Slide 13

Slide 13 text

Maintenance - Failure 2. Determine what went wrong. If you're unable to determine what went wrong, you can't avoid repeating the failure. 13

Slide 14

Slide 14 text

Diagnosing Failure - 5 Whys The vehicle will not start. (the problem) 1. Why? - The battery is dead. 2. Why? - The alternator is not functioning. 3. Why? - The alternator belt has broken. 4. Why? - The alternator belt wore out. 5. Why? - The vehicle was not maintained. (root cause) 14

Slide 15

Slide 15 text

Diagnosing Failure We need (a lot of) information! 15

Slide 16

Slide 16 text

Killing a Stuck Process SIGQUIT (kill -3) a process, get a stack trace: goroutine 1 [IO wait, 5 minutes]: ... net.(*TCPListener).AcceptTCP(0xc820124170, 0xc82005dbe0, 0x0, 0x0) /usr/local/go/src/net/tcpsock_posix.go:254 +0x4d net.(*TCPListener).Accept(0xc820124170, 0x0, 0x0, 0x0, 0x0) ... 16

Slide 17

Slide 17 text

Sources of Information Stack Trace Logs ... 17

Slide 18

Slide 18 text

Logging Provide context, don’t just: log.Println(err) E.g. unexpected EOF Unexpected EOF of what!? 18

Slide 19

Slide 19 text

A Note on Errors Some errors provide context: listen tcp :33712: bind: address already in use “Named” errors (io.ErrUnexpectedEOF) do not: unexpected EOF 19

Slide 20

Slide 20 text

A Note on Errors https://github.com/pkg/errors You can add context to the error with errors.Wrap. You can add context with the logger. 20

Slide 21

Slide 21 text

Logging Context Structured logging adds key-value pairs to your log. https://github.com/sirupsen/logrus log.WithFields(logrus.Fields{ "animal": "walrus", "number": 8, }).Debug("Started observing beach") 21

Slide 22

Slide 22 text

Logging Context Structured logging provides a way to add context in a machine and human consumable format. 22

Slide 23

Slide 23 text

Logging Context Structured loggers can output text or JSON format for easy consumption by logstash/ELK/Splunk. Context can make all the difference... 23

Slide 24

Slide 24 text

Logging Anxiety The anxiety over what to log, when. How much is too much, how much is enough? Let’s set this aside for now. 24

Slide 25

Slide 25 text

Information Other information? Logs (action with context) Environment Flags Stack Trace More? 25

Slide 26

Slide 26 text

Information Logging some of these may work, but perhaps there’s a better way. Logging doesn’t work at all for some cases. E.g. what’s the current stack look like? 26

Slide 27

Slide 27 text

Information What about exposing information outside of logging? Logging describes action with context. expvar - in the standard library. Exposes current state. 27

Slide 28

Slide 28 text

expvar Adds a route to the default ServerMux at /debug/vars as a side effect. Also exposes a handler. 28

Slide 29

Slide 29 text

expvar expvar provides an http handler/endpoint which exposes arbitrary data in JSON format. What kind of data? - cmdline - memstats - And more... 29

Slide 30

Slide 30 text

expvar { "cmdline": [ ".\/expvar_example" ], "memstats": { "Alloc": 136736, "TotalAlloc": 136736, ... } } 30

Slide 31

Slide 31 text

MemStats https://golang.org/pkg/runtime/#MemStats Gives you various stats like: - Allocated bytes (Heap/Sys/Total) - GC statistics - Allocations by size 31

Slide 32

Slide 32 text

expvar Expose various Var types: - Float* - Int* - Map* - String * Atomic operations 32

Slide 33

Slide 33 text

expvar Can expose a variety of “Vars”, but notably there is Publish(Func): func init() { http.HandleFunc("/debug/vars", expvarHandler) Publish("cmdline", Func(cmdline)) Publish("memstats", Func(memstats)) } 33

Slide 34

Slide 34 text

expvar expvar.Func Publish a function that returns interface{}. The returned value is marshalled to JSON. 34

Slide 35

Slide 35 text

expvar func memstats() interface{} { stats := new(runtime.MemStats) runtime.ReadMemStats(stats) return *stats } 35

Slide 36

Slide 36 text

expvar So what can we do with expvar? Expose the environment: expvar.Publish("env", expvar.Func(func () interface{} {return os.Environ()})) 36

Slide 37

Slide 37 text

Exposing Environment Want a map instead of a slice? func publishEnv() interface{} { env := make(map[string]string) for _, line := range os.Environ() { parts := strings.SplitN(line, "=", 2) env[parts[0]] = parts[1] } return redactMap(env) } 37

Slide 38

Slide 38 text

Exposing Secrets You want to filter secret values somehow. Replace the value with a hash so you can compare. 38

Slide 39

Slide 39 text

Exposing Flags Flag values can be very useful for debugging func publishFlags() interface{} { flagMap := make(map[string]interface{}) flag.VisitAll(func(f *flag.Flag) { flagMap[f.Name] = f.Value }) return redactMap(flagMap) } 39

Slide 40

Slide 40 text

Exposing Flags Flag magic values, for two reasons: 1. They can be changed. 2. They show up in expvar. (super useful) 40

Slide 41

Slide 41 text

Exposing a Stack Trace Publish the stack trace: func publishStack() interface{} { buf := make([]byte, 65535) n := runtime.Stack(buf, true) buf = buf[0:n] return string(buf) } 41

Slide 42

Slide 42 text

expvar ALL the things! expvar.Publish("env", expvar.Func(publishEnv)) expvar.Publish("flags", expvar.Func(publishFlags)) expvar.Publish("stack", expvar.Func(publishStack)) … expvar.Publish("my-internal-value", ...) 42

Slide 43

Slide 43 text

Caution Make sure you don’t publish very expensive functions! Don’t make expvar too expensive or turn it into an accidental DoS vector. 43

Slide 44

Slide 44 text

Use Verbose Names Exposing information is great. Make that information verbose/specific: expvar.Publish("jobs", ...) Better: expvar.Publish("discovery-job-cache", ...) 44

Slide 45

Slide 45 text

Use Verbose Names Flag Names: - addr - init-timeout - rpc-retries Better: - status-addr, status.addr - discovery-init-timeout, discovery.init.timeout - foo-rpc-retries, foo.rpc.retries 45

Slide 46

Slide 46 text

Use Verbose Names Environment Variables: - DB_PASSWORD - INIT_TIMEOUT Better: - CLAIMS_DB_PASSWORD - DISCOVERY_INIT_TIMEOUT 46

Slide 47

Slide 47 text

expvar + structlog These two work together. Use expvar for state. Use logging for action. Less anxiety. 47

Slide 48

Slide 48 text

More with HTTP expvar means we’ve already committed to having a port open responding to http requests. What else can we do with this? 48

Slide 49

Slide 49 text

More with HTTP - Health handler - Specialized handlers for libraries (e.g. Vault integration) - Shutdown handlers (quit and abort) - Admin handlers 49

Slide 50

Slide 50 text

Specialized Endpoints Probably only want to do modification/destruction on POST. GET can return a form. 50

Slide 51

Slide 51 text

Specialized Endpoints - Monitoring Prometheus is fantastic (you should use it) Since you already have an internal/status http endpoint, dangle your prometheus metrics off of it. 51

Slide 52

Slide 52 text

Library Developers Provide exported variables for application developers to expose in expvar or logs. Or use expvar and prometheus directly? (But that side effect) 52

Slide 53

Slide 53 text

Recap Think about failure at all times to guide: - Panicking when necessary - Exposing data via expvar - Logging and context - Naming (flags, environment variables, etc.) 53

Slide 54

Slide 54 text

Thanks Come talk to me about operations and go! Questions? 54