The Book on Genesis Media: Analyzing Data... and It Was Good!

Elastic Co
November 17, 2015

Rob Schwartzberg leads Genesis Media’s technology team, overseeing the development of the company’s data platform and ad technology. He has worked in software development across multiple industries for a decade. He has a broad background in systems architecture and engineering, with a focus on SaaS applications and highly scalable data warehousing.

Elastic{ON} Tour | New York City | November 17, 2015

Transcript

  1. The Book on Genesis Media: Analyzing Data…and It Was Good!
    Robert Schwartzberg, VP Technology | Genesis Media
  2. Topics
    • Growth of a startup
    • Logging use cases
    • Analytics use cases
    • The Architecture
    • Lessons learned along the way
  3. Genesis Media: Who We Are
    • Online Advertising Startup
    • Specializing in “viewable” video ads
    • Our value is our data
    • Contextual and Behavioral Information
    • 100% Amazon EC2
  4. Genesis Media: Growth
    Year   Publisher Websites
    2012   0
    2013   ~50
    2014   ~100
    2015   ~2000
  5. Pixels… Pixels Everywhere!
    • Tracking events across the web
    • Every event is tracked
    • Used to feed systems of record
  6. Pixels… Pixels Everywhere!
    • Event-based “pixels” (1x1 transparent images) as a data exchange format
    • http://yourdomain.com/1x1pixel.png?fname=john&lname=smith
    • Billions of transactions per day
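
To make the pixel idea concrete, here is a minimal Python sketch of how a pixel request like the one on the slide could be turned into an event record. The function name `parse_pixel_request` and the `_path` field are hypothetical, chosen for illustration; the deck does not show how Genesis Media actually parses its pixels.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pixel_request(url):
    """Turn a tracking-pixel URL into a flat event dict (illustrative only)."""
    parts = urlsplit(url)
    # parse_qs returns a list of values per key; keep the first value of each.
    params = parse_qs(parts.query, keep_blank_values=True)
    event = {key: values[0] for key, values in params.items()}
    # Hypothetical extra field: remember which pixel path was requested.
    event["_path"] = parts.path
    return event


event = parse_pixel_request(
    "http://yourdomain.com/1x1pixel.png?fname=john&lname=smith"
)
print(event)
```

Every query-string parameter becomes a field, which is why a pixel can serve as a loosely structured data exchange format: new fields need no schema change on the sending side.
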
  7. In The Beginning…
    • It was ugly, really ugly.
    • Straight to Apache
    • Commandline Fu (grep | awk | sed | wc -l)
    • Manual reporting took all day
  8. Enter Splunk
    • Gave near real-time reporting
    • Robust, idiosyncratic query language
    • Cost scales with data volume
    • No more Startup license!
  9. Logging Alternatives
    • Continue with Splunk
    • Graylog
    • Homegrown
    • ELK
      ▪ Open Source
      ▪ Operationally Sound
      ▪ Easy to use (Kibana)
      ▪ Great community
      ▪ Flexible!
  10. ELK To The Rescue (Phase 1)
    • Started small
    • 100% on AWS EC2 (AWS Plugin)
    • 500 events/sec
    • 3 Elasticsearch Servers (r3.large) on v1.2.x
    • 1 Kibana Server v3.1
    • Weekly Indices
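
As a rough illustration of the "Weekly Indices" bullet, the sketch below derives a per-week index name from an event timestamp. The `logstash-` prefix and the `year.week` naming pattern are assumptions for the example; the deck does not state which naming scheme was used.

```python
from datetime import datetime


def weekly_index_name(timestamp, prefix="logstash"):
    """Return an index name bucketed by ISO year and ISO week (assumed scheme)."""
    iso_year, iso_week, _weekday = timestamp.isocalendar()
    return f"{prefix}-{iso_year}.{iso_week:02d}"


# The talk date falls in ISO week 47 of 2015.
print(weekly_index_name(datetime(2015, 11, 17)))
# ISO weeks can cross calendar years: 2014-12-29 belongs to week 1 of 2015.
print(weekly_index_name(datetime(2014, 12, 29)))
```

Using the ISO year together with the ISO week keeps days around New Year in a single index instead of splitting one week across two names.
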
  11. Insights – Real-time Reporting
    • Available for non-technical departments
    • Client Performance Metrics
    • Search unknown third party event data (http://yourdomain.com/1x1pixel.png?event=thirdPartyAdStart)
    • Help Resolve Discrepancies
  12. (no text on this slide)

  13. ELK To The Rescue (Phase 2)
    • Added Kafka Message Queue
    • Logstash Kafka Plugin (Thanks, joekiller!)
    • Custom Index Mappings
    • Horizontal and Vertical scaling (r3.xlarge)
    • 1500 messages/sec
  14. Data Science
    • Began exploratory analytics
      ▪ Added Hadoop cluster
    • Main consumers
    • They broke…everything.
      ▪ OOM Errors
      ▪ Red Cluster States
  15. Pitfalls
    • Index Frequency
    • Sharding
    • Index Mappings
      ▪ What happens with http://yourdomain.com?type=foo
      ▪ Analyzed Strings
      ▪ Versioned
    • Disks
      ▪ Watch your watermarks!
    • Too many shards?
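
The "Analyzed Strings" pitfall can be sketched as follows. In Elasticsearch 1.x (the versions named in this deck), string fields are analyzed by default, so a value is tokenized and cannot be aggregated as one exact term. One common remedy is an index template with a dynamic template that maps new string fields as `not_analyzed`. The template name, index pattern, and the idea of carrying a version number in the template name are hypothetical choices for this example, not the speaker's actual mappings.

```python
import json


def build_string_template(index_pattern, version):
    """Build an ES 1.x-style index template body (illustrative sketch).

    Returns (template_name, body). Putting the version in the name is an
    assumed convention for keeping mapping changes traceable.
    """
    name = f"events_v{version}"
    body = {
        "template": index_pattern,  # which new indices the template applies to
        "mappings": {
            "_default_": {
                "dynamic_templates": [
                    {
                        "strings_not_analyzed": {
                            "match_mapping_type": "string",
                            # Store each string as one exact, untokenized term.
                            "mapping": {"type": "string", "index": "not_analyzed"},
                        }
                    }
                ]
            }
        },
    }
    return name, body


name, body = build_string_template("events-*", version=2)
print(name)
print(json.dumps(body, indent=2))
```

A template only affects indices created after it is installed, which is one reason index frequency and mapping changes interact: a fix takes effect at the next index rollover.
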
  16. Mo Publishers, Mo Data…
    • 25 Billion Documents
    • 5000 messages/sec, ~20 fields per event
    • Shard Tuning
    • Logstash – c4.xlarge (Processor Bound)
    • Elasticsearch – r3.2xlarge
    • Daily Snapshots
    • Upgraded to ES v1.7.1, Kibana v4.1.2
    • Used by Ad Operations, Data Science, Technology, Publisher Operations, BI, etc…
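
For a sense of scale, the figures on this slide can be combined in a back-of-envelope calculation. It assumes a constant 5000 messages/sec, which is a simplification: the earlier slides show the rate growing from 500 to 1500 to 5000, so the result is only an order-of-magnitude illustration and not a statement about actual retention.

```python
MESSAGES_PER_SECOND = 5000        # from the slide
FIELDS_PER_EVENT = 20             # "~20 fields per event"
TOTAL_DOCUMENTS = 25_000_000_000  # "25 Billion Documents"
SECONDS_PER_DAY = 24 * 60 * 60

# Documents indexed per day at a constant rate (assumption).
docs_per_day = MESSAGES_PER_SECOND * SECONDS_PER_DAY
# Individual field values written per day.
field_values_per_day = docs_per_day * FIELDS_PER_EVENT
# Days of constant-rate ingest needed to accumulate the stated total.
days_to_reach_total = TOTAL_DOCUMENTS / docs_per_day

print(f"{docs_per_day:,} documents/day")
print(f"{field_values_per_day:,} field values/day")
print(f"{days_to_reach_total:.1f} days to reach {TOTAL_DOCUMENTS:,} documents")
```

At this rate a single day adds roughly 432 million documents, which helps explain why shard tuning and index frequency appear as recurring themes.
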
  17. Publisher Performance

  18. ELK in the Stack
    • Not just for operational logging and text search
    • Where does ELK fit in with your Lambda Architecture?
      ▪ Speed Layer
      ▪ Hadoop Connectors
  19. Architecture

  20. ELK Exposed
    • Low latency (50 ms roundtrip)
    • Performance tuning
    • Hourly Indexes
    • Exposes proprietary metrics
      ▪ URL Level Targeting
      ▪ Attention Metrics
      ▪ Contextual Scoring
      ▪ Engagement Data
    • Used heavily in ad decisioning
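
One plausible reason hourly indexes help with low-latency lookups is that a query for recent data only has to touch a handful of small indices. The sketch below lists the hourly index names that cover a time window. The `events` prefix and the `YYYY.MM.DD.HH` naming pattern are assumptions for illustration; the deck does not describe the real naming or query path.

```python
from datetime import datetime, timedelta


def hourly_indices(start, end, prefix="events"):
    """List hourly index names covering [start, end] (assumed naming scheme)."""
    # Round the start down to the top of its hour.
    current = start.replace(minute=0, second=0, microsecond=0)
    names = []
    while current <= end:
        names.append(prefix + current.strftime("-%Y.%m.%d.%H"))
        current += timedelta(hours=1)
    return names


window = hourly_indices(
    datetime(2015, 11, 17, 13, 40),
    datetime(2015, 11, 17, 15, 5),
)
print(window)
```

A caller could pass such a list as the target of a search request so that older indices are never consulted.
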
  21. Future Plans
    • Further performance tuning
      ▪ Time series data storage
    • Better load balancing (HAProxy) from Logstash -> ES
    • ETL to/from Hadoop
    • Watcher
    • Operational Logging
      ▪ System and Application Logs
      ▪ Kafka Offsets
  22. Case Study – Dynamic Skip
    • How many people skip ads?
    • How is skip length determined?
    • What if you could adjust skip time to fit the environment?
    • What environmental signals are relevant?
  23. Case Study – Dynamic Skip

  24. Thanks!

  25. www.elastic.co