Is There An Echo In Here? Signal Analysis for Ops (with notes)

Slide 1

Slide 1 text

Signal Analysis for Ops Is There An Echo In Here? Noah Kantrowitz Tuesday, May 6, 14

Slide 2

Slide 2 text

MATH AHEAD

Upfront warning: this talk is about math. It will introduce you to the basic concepts and techniques of signal processing, in the hope that this will give you tools to apply to your operations life.

Slide 3

Slide 3 text

This is a Metric

With that out of the way, let's start at the beginning. Here is a single metric.

Slide 4

Slide 4 text

Value @ Time

It has a value and a timestamp. Some metrics may have more metadata than that, but a value at a time is pretty much the minimum you can have.

Slide 5

Slide 5 text

Very often we put metrics in graphs. Again we have the value on the y axis and the time on the x axis.

Slide 6

Slide 6 text

metric.wav

So to cut to the chase a bit, this is exactly the same structure as audio data, and many similar signals.

Slide 7

Slide 7 text

Many of you have probably seen an image like this at some point.

Slide 8

Slide 8 text

Or possibly this. But what are these graphs coming from? They don't look like our metric graphs, but we said audio data and metrics are basically the same thing.

Slide 9

Slide 9 text

Frequency Domain

So before we had a graph of value over time, which is what is called the "time domain". Those audio visualizers are different: instead they show value over frequency. This is the frequency domain.

Slide 10

Slide 10 text

Frequency: 0 Hz to 20 Hz

So here is an example frequency domain graph. Across the bottom we have different frequencies, starting at 0 Hz and going up. The y axis is a bit fuzzy; we generally just worry about comparisons instead of using specific absolute values.

Slide 11

Slide 11 text

Value: +0 dB to +50 dB

You might sometimes see the y axis measured in decibels. Decibels are just a compact way to represent a ratio between two values. For power ratios, +10 dB is the same thing as saying times 10, and -20 dB is the same as saying divided by 100.
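To make the decibel arithmetic concrete, here is a minimal sketch. It assumes the power-ratio convention (10 * log10), which is the one that matches the "times 10" and "divided by 100" figures above; amplitude ratios conventionally use 20 * log10 instead. The function names are made up for illustration.

```python
import math

def db_to_power_ratio(db):
    """Convert decibels to a power ratio: ratio = 10 ** (dB / 10)."""
    return 10 ** (db / 10)

def power_ratio_to_db(ratio):
    """Convert a power ratio back to decibels: dB = 10 * log10(ratio)."""
    return 10 * math.log10(ratio)

print(db_to_power_ratio(10))    # +10 dB is "times 10"
print(db_to_power_ratio(-20))   # -20 dB is "divided by 100"
print(power_ratio_to_db(100))   # a factor of 100 is +20 dB
```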

Slide 12

Slide 12 text

Fourier Transform

But we don't have frequency domain information, we have time domain metrics. The Fourier transform is how you convert from time to frequency domain.

Slide 13

Slide 13 text

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx$$

Here is the formal definition, which probably means about as much to you as it does to me, so let's go through it piece by piece.

Slide 14

Slide 14 text

So a super simple graph, a sine wave at 1 Hz.

Slide 15

Slide 15 text

To that let's add another sine wave, this time at 4 Hz and with half the magnitude. This means our second graph is half the height and changing four times faster.

Slide 16

Slide 16 text

Add them together and we get this. Now we are starting to get into the realm of the kind of graphs ops folks look at all day, complex curves without super clear patterns. This is still pretty normal looking, so let's add some more data to this.

Slide 17

Slide 17 text

Okay, that's a graph I could get back from Graphite. From our progression you can maybe still see those original 1 Hz and 4 Hz waves, but if you walked in one day to a load average graph that looked like this you might not recognize that. This is where the Fourier transform comes in. It can take any stream of points and show you what periodic components contributed to it and how much.

Slide 18

Slide 18 text

1 Hz, 4 Hz/2, 10 Hz/4, 16 Hz/4, 2 Hz/10

So let's try it out. Here is the output from a Fourier transform on that noisy graph. Immediately you can see the five periodic components and their relative strengths. I even threw in some random variation on each point, but on the scale of this graph it's invisible. Now we can start to see why this kind of analysis, even without further processing, can be useful. How many of you would have missed that little 2 Hz part in the original graph?
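If you want to reproduce something like that graph yourself, here is a rough sketch using NumPy. The five frequency and amplitude pairs come from the slide labels; the sample rate, the duration, and the noise level are assumptions picked so that every component lands exactly on a frequency bucket.

```python
import numpy as np

sample_rate = 64   # samples per second (assumed)
duration = 4       # seconds of data (assumed)
t = np.arange(sample_rate * duration) / sample_rate

# (frequency in Hz, amplitude) pairs taken from the slide labels
components = [(1, 1.0), (4, 0.5), (10, 0.25), (16, 0.25), (2, 0.1)]
signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in components)

# A little random variation on each point (noise level is an assumption)
rng = np.random.default_rng(0)
signal = signal + rng.normal(0, 0.01, t.size)

# Real-input FFT, with bucket magnitudes scaled back to sine amplitudes
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=1 / sample_rate)
amplitudes = 2 * np.abs(spectrum) / t.size

# Keep only the buckets that clearly stand out from the noise floor
peaks = {float(f): round(float(a), 2)
         for f, a in zip(freqs, amplitudes) if a > 0.05}
print(peaks)
```

The small 2 Hz component shows up as its own bucket with roughly a tenth of the height of the 1 Hz one, even though it is hard to pick out by eye in the time domain.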

Slide 19

Slide 19 text

For the more visual in the audience, here you can see all five sine waves split out against the Fourier transform.

Slide 20

Slide 20 text

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx$$

So now we know what a Fourier transform is. What about that scary looking equation? Those infinities are not usually a good sign when trying to write code.

Slide 21

Slide 21 text

DFT, DTFT

So instead we use the discrete Fourier transform (DFT), a close cousin of the discrete-time Fourier transform (DTFT). Rather than functions, these operate on samples, and the DFT in particular takes a finite array of them. If ops folks have anything it is arrays of numbers.

Slide 22

Slide 22 text

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-\frac{2\pi i}{N} n k}$$

Scary equation take two. This is a bit easier to wrap our brains around. We plug in each value of k from 0 to N-1 and we get our transformed data. If you are familiar with sigmas you might have noticed the downside though: this is going to be slow, since for every frequency we want to analyze, we need to do a sum across every data point.
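Written out as code, the DFT sum is just two nested loops. This is a deliberately naive sketch in plain Python (the function name `dft` is just for illustration), meant to show where the n squared cost comes from rather than to be used on real data.

```python
import cmath

def dft(samples):
    """Naive discrete Fourier transform, straight from the definition."""
    n_samples = len(samples)
    result = []
    for k in range(n_samples):            # one pass per output frequency...
        total = 0j
        for n, x in enumerate(samples):   # ...each summing over every sample
            total += x * cmath.exp(-2j * cmath.pi * n * k / n_samples)
        result.append(total)
    return result

print([round(abs(v), 6) for v in dft([1, 0, -1, 0])])  # [0.0, 2.0, 0.0, 2.0]
```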

Slide 23

Slide 23 text

FFT

So instead of the plain DFT, we can use a faster divide-and-conquer algorithm called a fast Fourier transform. These work by analyzing subsets of the data recursively. With an FFT, we can get order n log n instead of n squared.
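One common way to do that divide-and-conquer is the radix-2 Cooley-Tukey scheme: split the samples into even and odd positions, transform each half, and stitch the halves back together. The sketch below is one illustrative variant, assuming the input length is a power of two. In practice you would call a library FFT rather than write your own.

```python
import cmath

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])   # transform of the even-indexed samples
    odd = fft(samples[1::2])    # transform of the odd-indexed samples
    result = [0j] * n
    for k in range(n // 2):
        # Rotate the odd half into place, then combine with the even half
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result

print([round(abs(v), 6) for v in fft([1, 0, -1, 0])])
```

Each level of recursion does a linear amount of work and there are log n levels, which is where the n log n comes from.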

Slide 24

Slide 24 text

IFT

And one last acronym for you, inverse Fourier transform. This lets us reverse the process, going from the frequency domain back to the time domain by adding up all the sine waves.

Slide 25

Slide 25 text

Low-pass

So okay, now we have some new tools, converting from time domain to frequency domain, and then back again. What are we going to do with them? Let's look at a simple filter called a low-pass. This means that it only allows signals below a certain frequency, removing the higher frequencies.

Slide 26

Slide 26 text

So let's take another graph that will probably look unpleasantly familiar to many of you. We have a simple alert threshold, but something is flapping (to use the Nagios term) so the pager is going off constantly. This sucks and generates lots of excess noise during incidents. Nagios includes a feature called flap detection that tries to suppress these, but it is unpleasant to say the least.

Slide 27

Slide 27 text

Here is the same data in the frequency domain. You can see that we have clearly crossed the alerting threshold, but that it is only the high frequency component that is doing so. Let's take that yellow line as a filtering point and see how we can transform this data.

Slide 28

Slide 28 text

First we discard everything over our filtering point.

Slide 29

Slide 29 text

And then we use an inverse Fourier transform to get back to the time domain. We can now see our normal background variance is doing just fine, well under the alerting threshold. So now we can alert on both the frequency domain and the filtered time domain, so we get just the one initial alert. Or maybe we want different thresholds for short-lived vs. long-lived events. Once you have these tools to play with, you can use them to build all kinds of pipelines.
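The whole low-pass pipeline (transform, discard everything above the filtering point, transform back) fits in a few lines of NumPy. This is a sketch on made-up data: the sample rate, the cutoff, the threshold, and the shape of the "flapping" metric are all assumptions, and `low_pass` is a hypothetical helper name.

```python
import numpy as np

def low_pass(values, sample_rate, cutoff_hz):
    """Zero every frequency bucket above cutoff_hz, then invert the FFT."""
    spectrum = np.fft.rfft(values)
    freqs = np.fft.rfftfreq(len(values), d=1 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(values))

sample_rate = 100                                     # samples per second (assumed)
t = np.arange(200) / sample_rate
baseline = 0.6 + 0.1 * np.sin(2 * np.pi * 0.5 * t)    # slow background variance
flapping = 0.5 * np.sin(2 * np.pi * 20 * t)           # fast flapping component
metric = baseline + flapping

threshold = 0.9
filtered = low_pass(metric, sample_rate, cutoff_hz=5)

# The raw metric crosses the threshold; the filtered one stays under it
print(bool((metric > threshold).any()), bool((filtered > threshold).any()))
```

A hard cutoff like this is the bluntest possible filter, but it is enough to separate the flapping from the background here.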

Slide 30

Slide 30 text

High-pass, Band-pass

So that's the basics of a low-pass filter, but you can use the same concepts to change which frequencies you alter in different ways. A high-pass filter allows only high frequencies, and a band-pass filter allows only frequencies in a specific interval.
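All three filters are the same trick with a different mask over the frequency buckets. Here is a sketch of a single helper that covers them; `band_filter` and its parameters are names invented for this example, not part of NumPy.

```python
import numpy as np

def band_filter(values, sample_rate, low_hz=0.0, high_hz=None):
    """Keep only frequencies in [low_hz, high_hz]; zero everything else."""
    spectrum = np.fft.rfft(values)
    freqs = np.fft.rfftfreq(len(values), d=1 / sample_rate)
    keep = freqs >= low_hz
    if high_hz is not None:
        keep &= freqs <= high_hz
    return np.fft.irfft(np.where(keep, spectrum, 0), n=len(values))

sample_rate = 64                        # samples per second (assumed)
t = np.arange(128) / sample_rate
signal = sum(np.sin(2 * np.pi * f * t) for f in (1, 4, 10))

low = band_filter(signal, sample_rate, high_hz=2)             # low-pass
high = band_filter(signal, sample_rate, low_hz=5)             # high-pass
band = band_filter(signal, sample_rate, low_hz=3, high_hz=5)  # band-pass
```

With this test signal, `low` comes back as just the 1 Hz wave, `high` as the 10 Hz wave, and `band` as the 4 Hz wave.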

Slide 31

Slide 31 text

Windowing

Next signal analysis concept: window functions. A window function is a mask you can multiply the signal by to analyze just one piece at a time. In ops, this is more or less a given from the start. We can only store a finite (and often quite small) number of points, so there will be some horizon before which we can't see. Unfortunately this comes at a cost: spectral leakage. This is the distortion of the computed frequency domain due to lack of information.

Slide 32

Slide 32 text

The simplest window is a rectangle. In fact any finite time constraint on our data can be taken as a rectangular window, though usually the noise is at such low frequencies we ignore it. Rectangular windows have the lowest levels of leakage as measured by total noise, but they are often not the best option for analyzing smaller time slices, as the leakage they do create is broader.

Slide 33

Slide 33 text

To make this a bit clearer, here is the frequency domain of a 7 Hz sine wave with a big rectangular window. It looks exactly like we would expect, a sharp spike right at 7 Hz and within a rounding error of 0 everywhere else. Any noise introduced is so tiny we can't see it.

Slide 34

Slide 34 text

Here is the same wave with a three-quarters-of-a-second window instead. Now instead of a sharp peak we get a gradual taper; some of our 7 Hz signal has leaked into adjacent buckets in the Fourier transform.

Slide 35

Slide 35 text

So here is a spectral leakage diagram. There are two important bits you need to look at. First we have the main lobe in the center. Here we can see the nice part about the rectangular window: the main lobe is very tall, meaning low signal loss, and very narrow, meaning low leakage between frequencies. Everything other than the main lobe is called a side lobe, and here we see the problem with rectangular windows: the side lobes are numerous and also very tall, meaning that our noise frequencies aren't being suppressed.

Slide 36

Slide 36 text

Another simple option is a triangular window. It is better than a rectangle, but not by a whole lot.

Slide 37

Slide 37 text

So to compare to the last leakage graph, we can see our main lobe is bigger but not by a ton, and our side lobes are reduced, but again not by much.

Slide 38

Slide 38 text

A pretty common general-purpose window function is the Blackman-Harris curve. This is a 4-term Blackman-Harris, which is a balance between the main lobe and side lobes.

Slide 39

Slide 39 text

As compared to the previous two, you can see the side lobes are very suppressed, while the main lobe is much bigger than before. So this means that our main frequency gets, to use a technical term, a bit mushy, but our noise frequencies will be much less visible.

Slide 40

Slide 40 text

To make this a bit clearer, here is our same 7 Hz signal with a Blackman-Harris window, and our original rectangular window as the dotted line. The peak has been lowered and it bleeds a bit more into neighboring buckets, but the buckets further away from the peak drop to zero much faster. This can help in a lot of situations when you are looking for small peaks in a multi-frequency signal, though you risk nearby frequencies getting lost in each other. This isn't usually a huge problem, but pick your window functions to match your analysis. Check out the Wikipedia article on window functions for an excellent overview of tons of window functions.
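You can see the same trade-off numerically. The sketch below builds a three-quarter-second slice of a 7 Hz sine wave and compares how much energy ends up far away from the peak with and without a window. The 4-term Blackman-Harris coefficients are the commonly published ones; the sample rate and the "far away" cutoff of 30 Hz are assumptions for this example.

```python
import numpy as np

def blackman_harris(n):
    """Symmetric 4-term Blackman-Harris window of length n."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    x = 2 * np.pi * np.arange(n) / (n - 1)
    return a0 - a1 * np.cos(x) + a2 * np.cos(2 * x) - a3 * np.cos(3 * x)

sample_rate = 256                      # samples per second (assumed)
n = int(0.75 * sample_rate)            # three quarters of a second
t = np.arange(n) / sample_rate
wave = np.sin(2 * np.pi * 7 * t)       # 5.25 cycles, so it will leak

freqs = np.fft.rfftfreq(n, d=1 / sample_rate)
rect = np.abs(np.fft.rfft(wave))                        # rectangular window
bh = np.abs(np.fft.rfft(wave * blackman_harris(n)))     # Blackman-Harris

# Largest bucket well away from the peak, relative to the peak itself
far = freqs > 30
rect_leak = rect[far].max() / rect.max()
bh_leak = bh[far].max() / bh.max()
print(rect_leak, bh_leak)
```

The windowed version should show far less leakage in the distant buckets, at the price of a wider, lower peak around 7 Hz.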

Slide 41

Slide 41 text

NumPy

So now you want to go forth and use these ideas. The best one-stop shop is the Python library NumPy. It offers not just Fourier transforms, but many, many numeric analysis tools, as well as integration with other scientific computing libraries and plotting tools.

Slide 42

Slide 42 text

FFTW

FFTW is also an option, and has bindings in most languages, though of highly varying quality levels.

Slide 43

Slide 43 text

go-dsp

go-dsp is another relatively minimalist option for Go.

Slide 44

Slide 44 text

Ruby

Unfortunately Ruby, the workhorse of many Ops teams, has no good tooling for scientific or mathematical computing, so you are mostly on your own there.

Slide 45

Slide 45 text

Go forth and find the signal

Slide 46

Slide 46 text

Thank You

Slide 47

Slide 47 text

Bonus Round

Some extra content if there is time.

Slide 48

Slide 48 text

DCT

The discrete cosine transform (DCT) is closely related to the Fourier transform and is widely used for data compression. The overall idea is simple: convert to the frequency domain, pick out the most important spectral components (usually this means discarding unhelpfully high frequencies), and then store only those. When you want to recreate the original signal, just run an inverse DCT, which for our discussion is pretty much the same as an inverse FFT. The advantage of all of this is that the transformed data takes up significantly less space than the original data. This is literally the difference between bitmaps and JPEGs, or wave files and MP3s.
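As a toy version of that idea, here is a sketch that builds an orthonormal DCT-II matrix by hand, keeps only the lowest-frequency coefficients, and reconstructs from them. The helper names, the test metric, and the choice to keep 32 of 256 coefficients are all assumptions for illustration; real codecs also quantize and entropy-code the coefficients, which is skipped here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]          # frequency index (rows)
    i = np.arange(n)[None, :]          # sample index (columns)
    m = np.sqrt(2 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    m[0] /= np.sqrt(2)                 # rescale the constant row
    return m

def compress(values, keep):
    """Return the first `keep` (lowest-frequency) DCT coefficients."""
    return (dct_matrix(len(values)) @ values)[:keep]

def decompress(coeffs, n):
    """Zero-pad the stored coefficients and run the inverse DCT."""
    padded = np.zeros(n)
    padded[:len(coeffs)] = coeffs
    return dct_matrix(n).T @ padded

t = np.linspace(0, 1, 256)
metric = 0.5 + 0.3 * np.sin(2 * np.pi * 2 * t) + 0.1 * t   # smooth made-up metric
stored = compress(metric, keep=32)                          # 8x fewer numbers
restored = decompress(stored, len(metric))
print(len(stored), float(np.max(np.abs(restored - metric))))
```

How many coefficients you can drop depends entirely on how smooth the metric is; spiky data needs more of them.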

Slide 49

Slide 49 text

Wavelets

Wavelet transforms and the discrete wavelet transform in particular are similar to DCTs in application, though very different internally. They have many advantages with small, aperiodic elements and so might be more apt for compressing metrics data. They are currently at the heart of most top-of-the-line image and video compression schemes. As with DCTs, anyone looking to store large amounts of metrics data should look at these techniques.

Slide 50

Slide 50 text

Noise Gate

In a totally different direction, let's look at noise gates. These are commonly used in audio processing to reduce noise and crosstalk in a signal.

Slide 51

Slide 51 text

The idea is pretty simple: instead of a single threshold, have two. The alert starts when you pass the upper threshold, but only stops when you drop below the lower threshold. In the example we are smoothing from four alerts firing to just one.
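A two-threshold gate is only a few lines of plain Python. This is a sketch with made-up load values and thresholds, chosen so a single threshold fires four times while the gate fires once; `noise_gate` and `count_alerts` are hypothetical helper names.

```python
def noise_gate(values, upper, lower):
    """Return the alert state per point: trips above upper, clears below lower."""
    alerting = False
    states = []
    for value in values:
        if not alerting and value > upper:
            alerting = True        # crossed the upper threshold: start alerting
        elif alerting and value < lower:
            alerting = False       # dropped under the lower threshold: clear
        states.append(alerting)
    return states

def count_alerts(states):
    """Count transitions from not-alerting to alerting."""
    return sum(1 for prev, cur in zip([False] + states, states)
               if cur and not prev)

load = [0.5, 0.95, 0.85, 0.95, 0.85, 0.95, 0.85, 0.95, 0.6, 0.5]
single = [v > 0.9 for v in load]                  # one plain threshold
gated = noise_gate(load, upper=0.9, lower=0.7)    # two thresholds

print(count_alerts(single), count_alerts(gated))  # 4 1
```

The only state the gate carries between points is whether it is currently alerting.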

Slide 52

Slide 52 text

Hysteresis

This general category of techniques is called hysteresis. This is a fancy word for any process where the output depends on both the input at the current time and the past value of the output. In the case of our noise gate, this is basically just a simple form of memory inside the algorithm.

Slide 53

Slide 53 text

Control Theory

And finally, if all this piqued your interest in math again, another topic to check out is control theory. Signal analysis can help you determine anomalies or find patterns in your metrics; control theory can help you turn that around into changes to your systems. The simplest example is going from "My CPU usage is too high" to "I need to launch new servers". The trick is usually knowing how many servers to launch and when.

Slide 54

Slide 54 text

PID Control

And if you only learn one control algorithm, check out the PID loop controller. Or maybe I'll just have to come back next year and do a talk on it.

Slide 55

Slide 55 text

Thank You