ARC202 Real World, Realtime Analytics - AWS Re:Invent 2014

Sebastian Montini

October 03, 2014

Transcript

  1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. November 13, 2014 | Las Vegas, NV. ARC202 Real-World Real-Time Analytics. Gustavo Arjones (@arjones), CTO, Socialmetrix. Sebastian Montini (@sebamontini), Solutions Architect, Socialmetrix.
  2. • SaaS company since 2008 • Social media analytics: track and measure the activity of brands and personalities, providing information for market research and brand comparison • Multilanguage technology (English, Portuguese, and Spanish) • Leader in Latin America, with operations in 5 countries and customers in Latin America and the US • One of 34 companies in the Twitter Certified Program worldwide
  3. Share of topics: Which conversations are my brand and my competitors’ brands driving?

    Ranking  Brand 1 Q2   Brand 1 Q3   Brand 2 Q2   Brand 2 Q3    Brand 3 Q2   Brand 3 Q3
    1°       Flavor       Breakfast    Flavor       Flavor        Advertising  Flavor
    2°       Healthy      Flavor       Packaging    Brand I love  Flavor       Breakfast
    3°       Components   Components   Healthy      Packaging     Healthy      Healthy
    4°       Advertising  Healthy      Components   Addiction     Components   Advertising
    5°       Enquiries    Desire       Prices       Consumption   Prices       Components
    TOTAL    1.401        8.189        463          5.519         1.081        2.445
  4. Challenges: Variety • Different data sources • Different APIs • SLA • Method (pull or push) • Rate-limit, backoff strategy
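The "rate-limit, backoff strategy" bullet can be illustrated with a minimal sketch of exponential backoff with jitter. This is not the speakers' crawler code: `RateLimitError`, `call_with_backoff`, and all parameter defaults are hypothetical names and values chosen for illustration, and `fetch` stands in for any pull-style API call.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error a data-source client raises when it is rate limited."""


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fetch(); on RateLimitError, wait and retry with exponential backoff.

    The wait before retry n is drawn uniformly from [0, min(max_delay,
    base_delay * 2**n)] ("full jitter"), so many crawlers hitting the same
    API do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production the default `time.sleep` applies.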
  5. Challenges: Velocity • Updates every second • Top users, top hashtags each minute • Post-event analyses run as batch jobs over the complete dataset • Spikes of 20,000+ tweets per minute (chart annotations: Last TV Debate, Results Announced)
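As a rough illustration of "top hashtags each minute", the sketch below groups events into one-minute tumbling windows and keeps the k most frequent hashtags per window. It is an in-memory toy, not the Storm or Spark pipeline described later in the talk; the function name and the `(epoch_seconds, hashtag)` event shape are assumptions.

```python
from collections import Counter, defaultdict


def top_hashtags_per_minute(events, k=3):
    """Return {minute_start: [(hashtag, count), ...]} with the top k per minute.

    events is an iterable of (epoch_seconds, hashtag) pairs (assumed shape).
    Hashtags are lowercased so "#AWS" and "#aws" count as the same tag.
    """
    windows = defaultdict(Counter)
    for ts, tag in events:
        minute_start = int(ts) - int(ts) % 60  # start of the 60-second window
        windows[minute_start][tag.lower()] += 1
    return {minute: counts.most_common(k)
            for minute, counts in sorted(windows.items())}
```

Exact counting like this needs memory proportional to the number of distinct hashtags per window, which is fine for a sketch but is the kind of cost that motivates approximate structures at higher volumes.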
  6. Challenges: Meaning • Disambiguation • Data enrichment – Demographics – Sentiment – Influencers • Human analysis (examples shown: PAN, Orange Telecom, Oi Telecom, Hi!)
  7. Challenges: Alert and report • Clear and understandable UI • Slice-and-dice for business users (not BI experts) • Real-time alerts for anomalies
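One simple way to approach "real-time alerts for anomalies" is a rolling z-score over per-minute counts: flag a minute whose count sits far above the recent mean. The talk does not say which method Socialmetrix uses, so this is only a sketch under that assumption; `SpikeDetector`, the window size, and the threshold are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class SpikeDetector:
    """Flag counts that sit far above the rolling mean of recent counts."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # most recent per-minute counts
        self.threshold = threshold           # z-score above which we alert
        self.min_samples = min_samples       # warm-up before alerting

    def observe(self, count):
        """Record one count; return True if it looks like an upward spike."""
        is_spike = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            # Floor sigma so a perfectly flat history does not divide by zero.
            sigma = max(pstdev(self.history), 1e-9)
            is_spike = (count - mu) / sigma > self.threshold
        self.history.append(count)
        return is_spike
```

For example, feeding counts around 100 per minute and then a 20,000-tweet minute returns `True` only for the last observation.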
  8. Architecture: 1st iteration. What we did: • All-in-one approach • Multi-instance architecture • Simple vertical scalability • MySQL performance tuning
  9. Architecture: 1st iteration. What we've learned: • Multi-instance is harder to administer, but minimizes the impact of instability on customers • Vertical scalability: poor resource management • MySQL schema changes translate into downtime
  10. Architecture: 2nd iteration. What we needed: • Separation of responsibilities (crawling, processing) • Horizontal scalability • Fast provisioning • Cost reduction
  11. Architecture: 2nd iteration. What we changed: • Migrated to AWS • RabbitMQ (single node) • Replaced MySQL with Amazon RDS • AWS CloudFormation • Auto Scaling groups
  12. Architecture: 2nd iteration. What we've learned: • PIOPS → • Tuning the Auto Scaling policies can be hard • AWS CloudFormation: great for migration, not enough for daily ops
  13. Architecture: 3rd iteration. What we needed: • Deliver new features (NRT, more complex analytics) • Scale fast • Be resilient against failure • Add and improve data sources • Keep costs under control (always)
  14. Architecture: 3rd iteration. What we changed: • Apache Storm • RabbitMQ HA • Amazon Elastic MapReduce (Hadoop/Hive) • AWS CloudFormation + Chef • Amazon Glacier + Amazon S3 lifecycle policies
  15. Architecture: 3rd iteration. What we've learned: • Spot Instances + Reserved Instances • Hive = SQL → SQL scripts are hard to test • Bulk upserts on Amazon RDS can be expensive (PIOPS) • Amazon DynamoDB is great, but expensive (for our use case)
  16. Architecture: 4th iteration. What we needed: • Monitor millions of social media profiles • Make data accessible (exploration, PoC) • Improve UI response times • Test our data pipelines • Reprocess faster
  17. Architecture: 4th iteration. What we've learned: • Leverage AWS ecosystem • DataStax AMI + OpsCenter integration • MongoDB MMS: automation magic! • Apache Spark unit testing + Amazon EC2 launch scripts • Amazon EMR doesn’t have the latest stable versions
  18. Architecture evolution [chart: active customers and costs across iterations #1 to #4]
  19. Lessons learned • Automate from day 1 (CloudFormation + Chef) • Monitor system activity and understand your data patterns, e.g. with Logstash (ELK) • Always have a Source of Truth (Amazon S3 + Glacier) • Make your Source of Truth searchable
  20. Lessons learned (II) • Approximation is a good thing: HLL (HyperLogLog), CMS (Count-Min Sketch), Bloom filters • Write your pipelines considering reprocessing needs • Avoid framework explosion at all costs • AWS ecosystem allows rapid prototyping
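To make the "approximation is a good thing" point concrete, here is a minimal Count-Min Sketch (the CMS of the slide) in pure Python. It is a teaching sketch under assumed defaults, not the implementation used in the talk: width, depth, and the SHA-256 based row hashing are illustrative choices. The structure uses fixed memory and may overestimate a count because of hash collisions, but it never underestimates.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates are always >= the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One column per row, derived by hashing the row number with the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))
```

A typical use would be counting hashtag mentions in a stream: call `add(tag)` per event and `estimate(tag)` when rendering a ranking, accepting a small overcount in exchange for bounded memory.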
  21. Architecture: NextGen • Reduce moving parts • Apache Spark as central processing framework – Real-time (micro-batch) – Batch processing • Kafka (message broker) • Cassandra (time-series storage) • Elasticsearch (content indexer)
  22. To infinity … and beyond! Architecture evolution [chart: active customers across iterations #1 to #4 and NextGen]
  23. Feedback and Q&A. Gustavo Arjones, CTO: @arjones | [email protected]. Sebastian Montini, Solutions Architect: @sebamontini | [email protected]. Let’s talk at Venetian, Titian Hallway.
  24. Please give us your feedback on this presentation. Join the conversation on Twitter with #reinvent. ARC202: Real-World Real-Time Analytics. Thank you!