Slide 1

Analytics in AWS

Patrick Eaton, PhD
[email protected]
@PatrickREaton

Slide 2

Stackdriver at a Glance

Stackdriver's intelligent monitoring service helps SaaS providers spend more time on Dev and less on Ops

● Founded by cloud/infrastructure industry veterans (Microsoft, VMware, EMC, Endeca, Red Hat) with deep systems and DevOps expertise
● Team of 15, based in Downtown Boston
● Public beta underway -- see the web site

Slide 3

Analytics at Stackdriver

Goal: Deliver more value for less effort

Examples:
● Monitor and report on infrastructure in terms that make sense to your team
● Detect how pieces of the infrastructure relate
● Display lots of data at one time
● Suggest policies to alert you of possible problems
● Identify unusual application behavior

This talk: three broad analytics techniques and examples

Slide 4

Technique: Mine the Metadata

● AWS provides heaps of metadata about your infrastructure
● Metadata is accessible via Amazon's APIs
● Use it! Simple analysis of the metadata can produce big wins
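As a minimal sketch of "simple analysis, big wins": the record layout and sample data below are hypothetical; in practice the metadata would come from an AWS describe call (e.g. via boto) rather than literals.

```python
# Sketch: a trivially simple analysis over instance metadata that has
# already been fetched from the AWS APIs. The record shape below is an
# assumption for illustration, not the actual API response format.

def summarize_metadata(instances):
    """Count instances per security group."""
    counts = {}
    for inst in instances:
        for group in inst.get("security_groups", []):
            counts[group] = counts.get(group, 0) + 1
    return counts

instances = [
    {"id": "i-01", "security_groups": ["web", "ssh"]},
    {"id": "i-02", "security_groups": ["web"]},
    {"id": "i-03", "security_groups": ["db"]},
]
print(summarize_metadata(instances))  # {'web': 2, 'ssh': 1, 'db': 1}
```

Even a one-pass count like this already tells a customer how their fleet is partitioned, with no agent installed.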

Slide 5

Using Metadata - Identify App Groups

● We make suggestions based on instance names, tags, security groups, etc.
● Customer gets more relevant monitoring
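One way the suggestion step could be sketched — the "role" tag and sample data are assumptions for illustration, not Stackdriver's actual heuristics:

```python
# Sketch: suggesting app groups from a shared tag value.
# The tag name and instance records here are hypothetical.

from collections import defaultdict

def suggest_groups(instances, tag="role"):
    """Group instance ids by the value of a shared tag."""
    groups = defaultdict(list)
    for inst in instances:
        value = inst.get("tags", {}).get(tag)
        if value is not None:
            groups[value].append(inst["id"])
    return dict(groups)

instances = [
    {"id": "i-01", "tags": {"role": "web"}},
    {"id": "i-02", "tags": {"role": "web"}},
    {"id": "i-03", "tags": {"role": "worker"}},
]
print(suggest_groups(instances))
# {'web': ['i-01', 'i-02'], 'worker': ['i-03']}
```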

Slide 6

Using Metadata - Find Relationships

● We make explicit the relationships in the customer's infrastructure
  ○ load balancers and backing instances
  ○ instances and security groups
  ○ volumes and their host
  ○ snapshots and their volumes
● Customer sees an architectural view that matches their mental model
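The first relationship on the list can be sketched as a join over metadata — the record shapes are assumptions; real data would come from the ELB and EC2 describe calls:

```python
# Sketch: making load-balancer -> backing-instance relationships
# explicit. Record shapes are hypothetical simplifications of the
# metadata returned by the ELB and EC2 APIs.

def backing_instances(load_balancers, instances):
    """Map each load balancer name to the instances that back it."""
    known = {inst["id"] for inst in instances}
    return {
        lb["name"]: [i for i in lb["instance_ids"] if i in known]
        for lb in load_balancers
    }

lbs = [{"name": "lb-web", "instance_ids": ["i-01", "i-02"]}]
insts = [{"id": "i-01"}, {"id": "i-02"}, {"id": "i-03"}]
print(backing_instances(lbs, insts))  # {'lb-web': ['i-01', 'i-02']}
```

The same join pattern works for instance/security-group, volume/host, and snapshot/volume edges; together those edges form the architectural view.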

Slide 7

Technique - Pre-compute Summarizations

● Modern UIs are data rich
● Data summarization is key for effective presentation and UI performance
● Identify commonly-used summarizations and pre-compute them (outside the critical path)

Slide 8

Summarizing - Roll-up Across Time

We summarize data for a resource across time
● Compute roll-ups for various functions: avg, max, percentile, etc.

Customer can see historical trends quickly

[Diagram: a 30-minute roll-up condenses the raw time series ~30x]
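The time roll-up can be sketched as bucketing by window start — a minimal version showing only two of the aggregate functions, not the production job:

```python
# Sketch: rolling a raw series up into fixed-width time buckets.
# Points are (unix_timestamp, value) pairs; width=1800 gives the
# 30-minute roll-up from the diagram. Only avg and max are shown.

def rollup(points, width=1800):
    """Bucket points into fixed windows and aggregate each bucket."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % width, []).append(value)
    return {
        start: {"avg": sum(vals) / len(vals), "max": max(vals)}
        for start, vals in buckets.items()
    }

points = [(0, 1.0), (60, 3.0), (1800, 5.0)]
print(rollup(points))
# {0: {'avg': 2.0, 'max': 3.0}, 1800: {'avg': 5.0, 'max': 5.0}}
```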

Slide 9

Summarizing - Roll-up Across Resources

We summarize behavior of multiple resources
● Compute roll-ups for various functions: avg, max, percentile, etc.

Customer can view cluster performance at a glance

[Diagram: 15 resources rolled up into a single cluster-level series]
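The cross-resource roll-up is the same idea turned sideways: aggregate aligned series pointwise across instances instead of across time. A minimal sketch (avg only; assumes the series are already time-aligned):

```python
# Sketch: rolling up one metric across many resources so a cluster
# can be viewed as a single series. Assumes aligned timestamps.

def cross_resource_rollup(series_by_resource):
    """Average aligned series pointwise across resources."""
    columns = zip(*series_by_resource.values())
    return [sum(col) / len(col) for col in columns]

cpu = {
    "i-01": [10.0, 20.0, 30.0],
    "i-02": [30.0, 40.0, 50.0],
}
print(cross_resource_rollup(cpu))  # [20.0, 30.0, 40.0]
```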

Slide 10

Tools for Summarization

Tools
● Hadoop and AWS Elastic Map/Reduce (EMR)
● Python and mrjob
  ○ https://github.com/Yelp/mrjob
● Data from S3 archives; data to Cassandra

Usage
● EMR clusters of 14 c1.mediums
● Start jobs every 6 hours

Slide 11

mrjob Word Count Example

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()

● Define map() and reduce() functions in standard Python
● Python generators pass data back to Hadoop
● mrjob handles set-up and configuration of job

$ python mrjob_wc.py readme.txt > counts             # local
$ python mrjob_wc.py readme.txt -r emr > counts      # EMR
$ python mrjob_wc.py readme.txt -r hadoop > counts   # Hadoop

Slide 12

Summarization Algorithm

● Phase 0
  ○ Map
    ■ Read archives from S3
    ■ Route data point to a time bucket for aggregation
      ● key = resource::metric::time::granularity
  ○ Reduce
    ■ Aggregate data into a single data point
    ■ We compute 8 functions (min, max, avg, med...)
● Phase 1
  ○ Map
    ■ Route aggregated point to a data series
      ● key = resource::metric::granularity::aggr_fn
  ○ Reduce
    ■ Write series of data to Cassandra
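Phase 0 can be sketched without Hadoop. This is a plain-Python simulation of the map and reduce steps (with only 3 of the 8 aggregate functions), not the production mrjob code:

```python
# Sketch: Phase 0 of the summarization. The map step routes each raw
# point to a time bucket keyed resource::metric::time::granularity;
# the reduce step collapses each bucket into aggregate values.

from collections import defaultdict

def phase0_map(resource, metric, points, granularity=300):
    for ts, value in points:
        bucket = ts - ts % granularity
        yield "%s::%s::%d::%d" % (resource, metric, bucket, granularity), value

def phase0_reduce(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {
        key: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for key, v in grouped.items()
    }

points = [(0, 1.0), (100, 3.0), (300, 8.0)]
result = phase0_reduce(phase0_map("i-01", "cpu", points))
print(result["i-01::cpu::0::300"])  # {'min': 1.0, 'max': 3.0, 'avg': 2.0}
```

Phase 1 then regroups these aggregated points under resource::metric::granularity::aggr_fn keys so each reducer writes one complete series to Cassandra.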

Slide 13

Summarization Algorithm Illustrated

[Diagram: raw data points at t, t + δ, t + 2δ, ... flow through Map-0 and Red-0 into 5-minute min/max/avg roll-ups, which Map-1 then routes into 15-minute roll-up series]

Slide 14

Technique - Analyze the Data

Find trends, anomalies, oddities, correlations, relationships, incident signatures, best practice violations, etc. in the data

We analyze data from the customer environment and highlight discoveries

The customer learns about potential problems before they happen

Slide 15

Analyzing - Suggest Alert Policies

The product supports threshold-based alerts. We analyze historical data and suggest policies to catch future issues.

Approach: analyze data series looking for typically consistent performance with statistically significant outliers
● Use mrjob in local mode - easy programming model for smaller data sets
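The approach could be sketched as follows; the 3-sigma rule here is an illustrative choice, not Stackdriver's actual heuristic:

```python
# Sketch: suggesting a threshold alert from historical data. A series
# with typically consistent behavior but significant outliers gets a
# suggested threshold a few standard deviations above the mean.
# The sigmas=3.0 default is a hypothetical choice for illustration.

import math

def suggest_threshold(values, sigmas=3.0):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    threshold = mean + sigmas * std
    if not any(v > threshold for v in values):
        return None  # no statistically significant outliers to catch
    return threshold

# 99 consistent samples plus one spike -> a policy is suggested,
# with the threshold well below the spike itself.
values = [10.0] * 99 + [90.0]
threshold = suggest_threshold(values)
print(threshold is not None and threshold < 90.0)  # True
```

Run per series under mrjob's local mode, this stays a simple map/reduce-shaped computation even though the data fits on one machine.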

Slide 16

Analyzing - Identify Anomalies

A more sophisticated version of the previous technique

Currently using R for much of this analysis

Using rpy2 to translate to/from Python (http://rpy.sourceforge.net/rpy2.html)

Slide 17

Example of rpy2

import rpy2.robjects as robjects

robjects.r('''
find_decomposition <- function(data_series) {
    ts <- ts(data_series, start=c(1, 1), frequency=12)
    decomp <- decompose(ts, type="additive")
    decomp  # return the decomposition to the caller
}
''')

def decompose(series):
    # Hand the series to R, then pull each component back into Python
    decompose_fn = robjects.r['find_decomposition']
    decomp = decompose_fn(robjects.FloatVector(series))
    components = ('seasonal', 'trend', 'random')
    return dict((c, list(decomp.rx2(c))) for c in components)

Slide 18

Thank you!

Yes, we are hiring!

Patrick Eaton - [email protected] - @PatrickREaton