Elastic Stack: Acquisition of social media data at a greater scale

Elastic Co
July 07, 2016

http://www.meetup.com/Chicago-Elastic-Fantastics/events/231694268/

Acquisition of social media data with the Elastic Stack at a greater scale.

You don't truly understand a technology unless you are able to break it and fix it. In this talk I will share my experience of breaking Elasticsearch while indexing massive amounts of social media data from Facebook and Twitter. We will discuss the main challenges faced and the lessons learned along my journey with the Elastic Stack while staying on the edge of the hardware limits.

Miroslav Mihaylov is an experimental physicist turned full stack developer with 2 years of experience in Elasticsearch. He has used the Elastic Stack extensively over the past year as part of an ongoing research effort in the field of Social Network Analysis and Text Data Mining.


Transcript

  1. Summary • Life before the ELK stack • Computational Needs And Challenges • Cluster Configuration And ELK Deployment • Graph Theory in 30 Seconds • Social Media Data From Twitter • Logstash Input Twitter Configuration • Single Thread Logstash Input Twitter • Multiple Logstash Instances • Twitter Data Rates Outcome • Lessons Learned • Data from Facebook Graph API • Facebook Graph API Data • Challenges and approach
  2. Life before the ELK stack • About Me • Data-intensive projects from the past • Elasticsearch to the rescue • Current endeavors and approach
  3. Computational Needs And Challenges • The Data • Acquisition • Storage • Data Interface • Browsing with Kibana • Exporting data subset • Programmatic access • This talk is about the data acquisition stage
  4. Cluster Configuration And ELK Deployment • Master node: 2x E5 2680, 64GB, 5TB • 10 slave nodes: i5-3470, 8GB, 0.5TB • Each node has a secondary external interface connected to the internet • Deployment with Puppet: https://forge.puppet.com/elasticsearch https://github.com/elastic/puppet-elasticsearch
  5. Graph Theory in 30 Seconds • Collection of vertices connected by edges • Graphs from social media data • users as vertices • users as edges
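One possible reading of the two graph views on this slide can be sketched in Python. The sample `interactions` list, the `project` helper, and the interpretation itself (users linked through a shared post, or posts linked through a shared user) are assumptions made for illustration; the slide does not spell out how the graphs were built.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: (user, post) pairs, e.g. a user commenting on a post.
interactions = [
    ("alice", "post1"),
    ("bob", "post1"),
    ("bob", "post2"),
    ("carol", "post2"),
]


def project(pairs, vertex_index):
    """Build an undirected graph whose vertices are the items found at
    vertex_index in each pair. Two vertices are connected when they share
    the other item of the pair; each edge maps to the set of shared items."""
    groups = defaultdict(set)
    for pair in pairs:
        vertex = pair[vertex_index]
        shared = pair[1 - vertex_index]
        groups[shared].add(vertex)

    edges = defaultdict(set)
    for shared, vertices in groups.items():
        for a, b in combinations(sorted(vertices), 2):
            edges[(a, b)].add(shared)
    return dict(edges)


# Users as vertices: two users are linked if they interacted with the same post.
users_as_vertices = project(interactions, 0)

# Users as edges: two posts are linked by each user who interacted with both.
users_as_edges = project(interactions, 1)
```

With the sample data, `users_as_vertices` links alice to bob and bob to carol, while `users_as_edges` links post1 to post2 through bob.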
  6. Social Media Data From Twitter • Register Twitter Developer Account • Get your access keys and tokens at https://apps.twitter.com/ • Twitter Streaming APIs https://dev.twitter.com/streaming/overview • Query by keywords and/or key phrases • Query by locations (lon, lat pairs specifying a set of bounding boxes) • Logstash Input Twitter plugin https://www.elastic.co/guide/en/logstash/2.3/plugins-inputs-twitter.html
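The location query mentioned above takes a comma-separated list of longitude, latitude pairs, with the south-west corner of each bounding box given first. A minimal Python sketch of building such a string follows; the `format_locations` helper is hypothetical and not part of any Twitter or Logstash library, and the San Francisco box is only an example value.

```python
def format_locations(boxes):
    """Turn bounding boxes into a comma-separated locations string.

    boxes: iterable of (west_lon, south_lat, east_lon, north_lat) tuples.
    Each box contributes "west,south,east,north" (south-west corner first).
    """
    parts = []
    for west, south, east, north in boxes:
        if not (-180 <= west <= 180 and -180 <= east <= 180):
            raise ValueError("longitude out of range")
        if not (-90 <= south <= north <= 90):
            raise ValueError("latitude out of range or corners swapped")
        parts.extend([west, south, east, north])
    return ",".join(str(p) for p in parts)


# Example box roughly covering San Francisco.
san_francisco = (-122.75, 36.8, -121.75, 37.8)
locations = format_locations([san_francisco])
```

Several boxes can be passed at once; they are simply concatenated in order.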
  7. Logstash Input Twitter Configuration

    input {
      twitter {
        # Add your data from https://apps.twitter.com/
        consumer_key => "WZyKcoN0jiQxAnF5R6QPkw"
        consumer_secret => "qnJBPpvLaNEg69Wb0o8ghJHlkKtBteyGzjG4fi7Esco"
        oauth_token => "1234567890-PysraHELjOy8D5t2KwOIn9IsgvtdttD89KY0etP"
        oauth_token_secret => "vsIrghbQRxfAuSTg5oKm1Hy7iS61aTpcEX19e6JNCo"
        keywords => ["ladygaga", "katyperry", "shakira", "rihanna", "KimKardashian", "Cristiano"]
        full_tweet => true
      }
    }
    output {
      stdout { codec => dots }
      elasticsearch {
        # Cluster nodes
        hosts => ["10.0.0.7", "10.0.0.8", "10.0.0.9", "10.0.0.10"]
        # Rotate indices daily
        index => "logstash-twitter-celebs-%{+YYYY-MM-dd}"
        # Use tweet id for the document id
        document_id => "%{id_str}"
      }
    }
  8. Single Thread Logstash Input Twitter • “2016 Super Bowl 50” • 20 keywords, single Logstash instance • Acquisition rate cannot exceed 2.9K tweets per minute
  9. Multiple Logstash Instances • Same set of keywords • (Diagram: each instance tracks Keywords 1 to 4; the outputs are combined) • No significant increase in the data collection rate
  10. Multiple Logstash Instances • Keywords spread across instances • (Diagram: Keywords 1 to 4 are split across the instances; the outputs are combined) • Collection rate scales linearly with the number of nodes*
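Spreading keywords across instances, as opposed to giving every instance the same list, can be sketched as a simple round-robin split. The `spread_keywords` helper below is hypothetical; the slides do not say how the keywords were actually divided. The keyword list is taken from the configuration slide.

```python
def spread_keywords(keywords, instances):
    """Split keywords round-robin into one disjoint list per instance."""
    if instances < 1:
        raise ValueError("need at least one instance")
    # Drop duplicates while keeping the original order, so that no keyword
    # is tracked by two instances.
    unique = list(dict.fromkeys(keywords))
    return [unique[i::instances] for i in range(instances)]


keywords = ["ladygaga", "katyperry", "shakira", "rihanna", "KimKardashian", "Cristiano"]
per_instance = spread_keywords(keywords, 4)
# per_instance[0] == ["ladygaga", "KimKardashian"]
# per_instance[1] == ["katyperry", "Cristiano"]
# per_instance[2] == ["shakira"]
# per_instance[3] == ["rihanna"]
```

Each sub-list would then go into the `keywords` setting of a separate Logstash instance. A round-robin split ignores how busy each keyword is, so balancing by expected volume may work better in practice.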
  11. Multiple Logstash Instances Outcome • Collection rate from 4 Logstash instances peaking at 12K tweets per minute • Indexing approximately 10M documents per day occupying 100GB HD space
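The figures on these slides can be cross-checked with a little arithmetic. The sketch below assumes that the 100GB is the space taken by one day of documents and that GB means 10^9 bytes; neither is stated on the slide. It also compares the 4-instance peak with the single-instance limit from the earlier slide, which was measured on a different keyword set.

```python
PEAK_PER_MINUTE = 12_000       # 4 Logstash instances, peak rate
SINGLE_INSTANCE_CAP = 2_900    # single-instance limit from the earlier slide
DOCS_PER_DAY = 10_000_000      # approximate daily indexing volume
BYTES_PER_DAY = 100 * 10**9    # assumption: 100GB per day, decimal gigabytes

# Ratio of the 4-instance peak to the single-instance cap (about 4.1).
speedup = PEAK_PER_MINUTE / SINGLE_INSTANCE_CAP

# Average rate implied by 10M documents per day (about 6.9K per minute).
avg_per_minute = DOCS_PER_DAY / (24 * 60)

# Daily volume if the peak rate were sustained all day: 17,280,000.
peak_day = PEAK_PER_MINUTE * 24 * 60

# Average on-disk footprint per indexed document: 10,000 bytes.
bytes_per_doc = BYTES_PER_DAY / DOCS_PER_DAY
```

Under these assumptions the average rate is a bit over half of the peak, each indexed tweet costs roughly 10KB on disk, and the speed-up of about 4.1 is in line with the linear scaling reported for 4 instances.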
  12. Lessons Learned • Elasticsearch: The Definitive Guide with Sense Editor https://www.elastic.co • Dynamic Templates and Logstash Templates https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/v0.2.4/lib/logstash/outputs/elasticsearch/elasticsearch-template.json • Curator: Python command line tool for tending your ES indices (optimize, snapshot, close, delete indices) • Keep your index small • Increase ES_HEAP_SIZE • Elasticsearch is intended for a multi-node environment
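With daily index rotation, keeping indices small largely comes down to picking out the indices older than some retention window and then closing or deleting them. The sketch below shows only that selection step in plain Python. It is not Curator's API; the `indices_older_than` helper and the sample index names are hypothetical, and the prefix is the one used in the Logstash output shown earlier.

```python
from datetime import date, datetime, timedelta

PREFIX = "logstash-twitter-celebs-"  # index prefix from the Logstash output


def indices_older_than(names, days, today):
    """Return the daily indices whose date suffix is more than `days` old."""
    cutoff = today - timedelta(days=days)
    old = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # not one of our daily indices
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y-%m-%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            old.append(name)
    return sorted(old)


# Hypothetical index listing.
names = [
    "logstash-twitter-celebs-2016-02-01",
    "logstash-twitter-celebs-2016-02-07",
    "logstash-twitter-celebs-latest",
    ".kibana",
]
to_delete = indices_older_than(names, days=3, today=date(2016, 2, 8))
# to_delete == ["logstash-twitter-celebs-2016-02-01"]
```

Passing `today` explicitly keeps the function deterministic and easy to test.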
  13. Data from Facebook Graph API • Graph API https://developers.facebook.com/docs/graph-api/ • Custom PHP command line application • About the Graph API • Nodes, edges and fields • Node: everything which has a unique ID (wall post, comment, user) • Edge: the thing that connects the nodes • Request to endpoint /{public-user-id}/feed • Returns up to 250 of the events from the {public-user} timeline, each of which is one of the following types: status, link, photo, video, event, music, note • Each item contains the latest 25 users from the comments and likes edges • Secondary requests are needed to obtain the full list of likes and comments, which can reach 100K and occasionally exceed 1M per event
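The secondary requests for the full list of likes or comments amount to following the paging links in each Graph API response, which carries the items under `data` and, when more pages exist, a `paging.next` URL. The talk used a custom PHP application; the Python sketch below is a stand-in for illustration only. The `collect_edge` helper, the injected `fetch_json` callable, and the fake pages are all hypothetical.

```python
def collect_edge(first_url, fetch_json, max_items=None):
    """Follow paging links until the edge is exhausted.

    fetch_json: any callable that takes a URL and returns the decoded JSON
    response as a dict (for example a thin wrapper around an HTTP client).
    max_items: optional cap, useful when an edge holds 100K+ entries.
    """
    items = []
    url = first_url
    while url:
        page = fetch_json(url)
        items.extend(page.get("data", []))
        if max_items is not None and len(items) >= max_items:
            return items[:max_items]
        # A missing "next" link means this was the last page.
        url = page.get("paging", {}).get("next")
    return items


# Fake two-page response keyed by URL, standing in for real HTTP calls.
fake_pages = {
    "page1": {"data": [{"id": "u1"}, {"id": "u2"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"id": "u3"}], "paging": {}},
}
likes = collect_edge("page1", fake_pages.__getitem__)
# likes now holds the three items u1, u2, u3 in order.
```

Injecting `fetch_json` keeps the paging logic independent of the HTTP client, so rate limiting and retries can be handled in the caller.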
  14. Figures and Challenges • Acquiring the full data for 1K+ Facebook brands, resulting in over 60 million unique user interactions • The indices occupy over 4TB of hard drive space • Some challenges • Inconsistent JSON structure leading to mapping conflicts -> explicitly request the fields • Daily index rotation is not sufficient, because some of the {public-user} indices can exceed 10GB with the spikes in comments and likes -> further tuning of the number of shards is needed