Slide 1

Slide 1 text

ARC202: Real-World Real-Time Analytics
Gustavo Arjones | @arjones | CTO, Socialmetrix
Sebastian Montini | @sebamontini | Solutions Architect, Socialmetrix
November 13, 2014 | Las Vegas, NV
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Slide 2

Slide 2 text

• SaaS company since 2008
• Social media analytics: tracks and measures the activity of brands and personalities, providing information for market research and brand comparison
• Multilanguage technology (English, Portuguese, and Spanish)
• Leader in Latin America, with operations in 5 countries and customers in Latin America and the US
• One of 34 companies worldwide in the Twitter Certified Program

Slide 3

Slide 3 text

Our customers

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Share of topics
Which conversations are my brand and my competitors’ brands driving?

Ranking  Brand 1                     Brand 2                     Brand 3
         Q2            Q3            Q2            Q3            Q2            Q3
1°       Flavor        Breakfast     Flavor        Flavor        Advertising   Flavor
2°       Healthy       Flavor        Packaging     Brand I love  Flavor        Breakfast
3°       Components    Components    Healthy       Packaging     Healthy       Healthy
4°       Advertising   Healthy       Components    Addiction     Components    Advertising
5°       Enquiries     Desire        Prices        Consumption   Prices        Components
TOTAL    1.401         8.189         463           5.519         1.081         2.445

Slide 7

Slide 7 text

smx.io/reinvent #reinvent

Slide 8

Slide 8 text

Challenges

Slide 9

Slide 9 text

Challenges: Variety
• Different data sources
• Different APIs
• SLA
• Method (pull or push)
• Rate limits, backoff strategy
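The slide names rate limits and a backoff strategy but does not show one. A minimal sketch of one common approach, exponential backoff with full jitter, is given here; the `RateLimited` exception, the function names, and the default delays are all hypothetical and not taken from the talk.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a data-source client raises when it is throttled."""


def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=random.random):
    """Yield exponentially growing delays with full jitter, capped at `cap` seconds."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def fetch_with_backoff(fetch, sleep=time.sleep, **kwargs):
    """Call `fetch()` until it succeeds or the retry budget is spent."""
    for delay in backoff_delays(**kwargs):
        try:
            return fetch()
        except RateLimited:
            sleep(delay)  # wait before the next attempt
    return fetch()  # final attempt; let any error propagate to the caller
```

Passing `sleep` and `rng` as parameters keeps the retry loop testable without real waiting. A pull-based crawler would wrap each API call in `fetch_with_backoff`; a push-based source would not need it.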

Slide 10

Slide 10 text

Challenges: Velocity
• Updates every second
• Top users and top hashtags every minute
• Post-event analysis runs as a batch job over the complete dataset
• Spikes of 20,000+ tweets per minute
[Chart annotations: Last TV Debate, Results Announced]
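As an illustration of the "top hashtags every minute" requirement, the sketch below groups events into one-minute tumbling windows and ranks hashtags per window. It is an in-memory toy, not the speakers' pipeline; the event shape `(epoch_seconds, hashtag)` and the function name are assumptions.

```python
from collections import Counter, defaultdict


def top_hashtags_per_minute(events, k=3):
    """Group (epoch_seconds, hashtag) pairs into one-minute tumbling windows
    and return the k most frequent hashtags for each window."""
    windows = defaultdict(Counter)
    for ts, hashtag in events:
        window_start = int(ts) // 60 * 60  # floor to the start of the minute
        windows[window_start][hashtag.lower()] += 1  # case-insensitive count
    return {start: counts.most_common(k) for start, counts in sorted(windows.items())}


events = [(0, "#debate"), (10, "#Debate"), (20, "#vote"),
          (61, "#vote"), (119, "#vote"), (65, "#results")]
print(top_hashtags_per_minute(events, k=1))
```

At 20,000+ tweets per minute the same windowing logic would run inside a stream processor instead of a Python dict, but the per-window counting idea is unchanged.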

Slide 11

Slide 11 text

Challenges: Meaning
• Disambiguation
• Data enrichment
  – Demographics
  – Sentiment
  – Influencers
• Human analysis
[Slide examples: PAN, Orange Telecom, Oi Telecom, Hi!]

Slide 12

Slide 12 text

Challenges: Alert and report
• Clear and understandable UI
• Slice and dice for business users (not BI experts)
• Real-time alerts for anomalies
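The talk does not say how anomalies are detected. One simple possibility, sketched here under that assumption, is to compare each per-minute count with a rolling baseline and flag values several standard deviations above it. The class name, window size, and threshold are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class SpikeDetector:
    """Flag counts that sit far above a rolling baseline (simple z-score rule)."""

    def __init__(self, window=30, threshold=3.0, min_history=5):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_history = min_history

    def observe(self, count):
        """Return True when `count` is far above the recent baseline."""
        is_spike = False
        if len(self.history) >= self.min_history:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            # A floor of 1.0 guards against division by a zero spread.
            is_spike = (count - baseline) / max(spread, 1.0) > self.threshold
        self.history.append(count)
        return is_spike


detector = SpikeDetector(window=10)
flags = [detector.observe(c) for c in [100, 110, 90, 105, 95, 100, 20000]]
```

Note that the spike itself enters the history and raises the baseline, so a sustained burst stops alerting after a few windows. Whether that is desirable depends on the alerting policy.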

Slide 13

Slide 13 text

Architecture evolution

Slide 14

Slide 14 text

Drivers for architecture evolution
• More customers, bigger customers
• Add new features
• Keep costs under control

Slide 15

Slide 15 text

Architecture evolution
[Chart: active customers per architecture iteration, #1 to #4]

Slide 16

Slide 16 text

Architecture: 1st iteration
What we needed:
• Complete data isolation
• Trying different solutions/offerings

Slide 17

Slide 17 text

Architecture: 1st iteration
What we did:
• All-in-one approach
• Multi-instance architecture
• Simple vertical scalability
• MySQL performance tuning

Slide 18

Slide 18 text

Architecture: 1st iteration
What we've learned:
• Multi-instance is harder to administer, but it limits the impact of instability on customers
• Vertical scalability: poor resource management
• MySQL schema changes translate into downtime

Slide 19

Slide 19 text

Architecture: 2nd iteration
What we needed:
• Separation of responsibilities (crawling, processing)
• Horizontal scalability
• Fast provisioning
• Cost reduction

Slide 20

Slide 20 text

Architecture: 2nd iteration
What we changed:
• Migrated to AWS
• RabbitMQ (single node)
• Replaced MySQL with Amazon RDS
• AWS CloudFormation
• Auto Scaling groups

Slide 21

Slide 21 text

Architecture: 2nd iteration
What we've learned:
• PIOPS →
• Tuning the Auto Scaling policies can be hard
• AWS CloudFormation: great for migration, not enough for daily ops

Slide 22

Slide 22 text

Architecture: 3rd iteration
What we needed:
• Deliver new features (NRT, more complex analytics)
• Scale fast
• Be resilient against failure
• Add and improve data sources
• Keep costs under control (always)

Slide 23

Slide 23 text

Architecture: 3rd iteration
What we changed:
• Apache Storm
• RabbitMQ HA
• Amazon Elastic MapReduce (Hadoop/Hive)
• AWS CloudFormation + Chef
• Amazon Glacier + Amazon S3 lifecycle policies
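To make the "Amazon Glacier + Amazon S3 lifecycle policies" item concrete, here is a sketch of a lifecycle configuration that moves objects to Glacier after a set number of days. The prefix, the 30-day threshold, the bucket name, and the helper function are illustrative assumptions; the talk does not give the actual policy.

```python
def glacier_lifecycle(prefix, archive_after_days, expire_after_days=None):
    """Build an S3 lifecycle configuration that transitions objects under
    `prefix` to the GLACIER storage class after `archive_after_days` days."""
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    rule = {
        "ID": f"archive-{rule_name}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}  # optional deletion
    return {"Rules": [rule]}


config = glacier_lifecycle("raw/tweets/", archive_after_days=30)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Keeping the policy as plain data makes it easy to store next to the CloudFormation and Chef definitions and to review changes to retention rules.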

Slide 24

Slide 24 text

Architecture: 3rd iteration
What we've learned:
• Spot Instances + Reserved Instances
• Hive = SQL → SQL scripts are hard to test
• Bulk upserts on Amazon RDS can be expensive (PIOPS)
• Amazon DynamoDB is great, but expensive (for our use case)

Slide 25

Slide 25 text

Dashboard

Slide 26

Slide 26 text

Architecture: 4th iteration
What we needed:
• Monitor millions of social media profiles
• Make data accessible (exploration, PoC)
• Improve UI response times
• Test our data pipelines
• Reprocessing (faster)

Slide 27

Slide 27 text

Architecture: 4th iteration
What we changed:
• Cassandra (DSE)
• MongoDB MMS
• Apache Spark

Slide 28

Slide 28 text

Architecture: 4th iteration
What we've learned:
• Leverage the AWS ecosystem
• DataStax AMI + OpsCenter integration
• MongoDB MMS: automation magic!
• Apache Spark unit testing + Amazon EC2 launch scripts
• Amazon EMR doesn’t have the latest stable versions

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Architecture evolution
[Chart: active customers and costs per architecture iteration, #1 to #4]

Slide 31

Slide 31 text

Lessons learned

Slide 32

Slide 32 text

Lessons learned
• Automate from day one (CloudFormation + Chef)
• Monitor system activity and understand your data patterns, e.g. with Logstash (ELK)
• Always have a source of truth (Amazon S3 + Glacier)
• Make your source of truth searchable

Slide 33

Slide 33 text

Lessons learned (II)
• Approximation is a good thing: HyperLogLog (HLL), Count-Min Sketch (CMS), Bloom filters
• Write your pipelines with reprocessing needs in mind
• Avoid framework explosion at all costs
• The AWS ecosystem allows rapid prototyping
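To show what "approximation" buys, here is a small Bloom filter, one of the three structures named on the slide. It answers "have I seen this item?" in fixed memory, with possible false positives but no false negatives. The sizes, the hashing scheme (salted SHA-256), and the example keys are assumptions chosen for readability, not the speakers' implementation.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership sketch: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive one bit position per hash from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
for i in range(100):
    seen.add(f"user-{i}")
```

The filter above uses about 1 KB no matter how long the keys are. HyperLogLog and Count-Min Sketch make the same kind of trade for distinct counts and frequency counts respectively.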

Slide 34

Slide 34 text

Socialmetrix NextGen 2015

Slide 35

Slide 35 text

Architecture evolution
[Chart: active customers per architecture iteration, #1 to #4]

Slide 36

Slide 36 text

Architecture NextGen
• Reduce moving parts
• Apache Spark as the central processing framework
  – Real time (micro-batch)
  – Batch processing
• Kafka (message broker)
• Cassandra (time-series storage)
• Elasticsearch (content indexer)

Slide 37

Slide 37 text

To infinity … and beyond!
Architecture evolution
[Chart: active customers per architecture iteration, #1 to #4 and NextGen]

Slide 38

Slide 38 text

Feedback and Q&A
Gustavo Arjones, CTO | @arjones | [email protected]
Sebastian Montini, Solutions Architect | @sebamontini | [email protected]
Let’s talk at the Venetian, Titian Hallway

Slide 39

Slide 39 text

Thank you!
Please give us your feedback on this presentation
ARC202: Real-World Real-Time Analytics
Join the conversation on Twitter with #reinvent