Elastic Stack: Acquisition of social media data at a greater scale

Acquisition of Social Media Data at a Greater Scale
Miroslav Mihaylov [email protected]
Elastic meetup, Chicago, July 7th, 2016

Summary
• Life before the ELK stack
• Computational Needs And Challenges
• Cluster Configuration And ELK Deployment
• Graph Theory in 30 Seconds
• Social Media Data From Twitter
• Logstash Input Twitter Configuration
• Single Thread Logstash Input Twitter
• Multiple Logstash Instances
• Twitter Data Rates Outcome
• Lessons Learned
• Data from Facebook Graph API
• Facebook Graph API Data
• Challenges and approach

Life before the ELK stack
• About Me
• Data intensive projects from the past
• Elasticsearch to the rescue
• Current endeavors and approach

Computational Needs And Challenges
The Data
• Acquisition
• Storage
Data Interface
• Browsing with Kibana
• Exporting data subsets
• Programmatic access
This talk is about the data acquisition stage.

Cluster Configuration And ELK Deployment
• Master node: 2x E5-2680, 64GB RAM, 5TB disk
• 10 slave nodes: i5-3470, 8GB RAM, 0.5TB disk each
• Each node has a secondary external interface connected to the internet
Deployment with Puppet
https://forge.puppet.com/elasticsearch
https://github.com/elastic/puppet-elasticsearch

Graph Theory in 30 Seconds
A graph is a collection of vertices connected by edges.
Graphs from social media data
• users as vertices
• users as edges
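The two constructions can be sketched as follows. This is only an illustration under stated assumptions: the simplified tweet records, both function names, and the reading of "users as edges" (hashtags as vertices, with a user linking two hashtags used in the same tweet) are hypothetical and not taken from the talk.

```python
from itertools import combinations

# Hypothetical, simplified tweet records (not the real Twitter payload).
TWEETS = [
    {"user": "alice", "mentions": ["bob"], "hashtags": ["sb50", "nfl"]},
    {"user": "bob", "mentions": ["alice", "carol"], "hashtags": ["sb50"]},
    {"user": "carol", "mentions": [], "hashtags": ["nfl", "halftime"]},
]


def users_as_vertices(tweets):
    """Directed edges (author, mentioned_user): users are the vertices."""
    edges = set()
    for tweet in tweets:
        for mentioned in tweet["mentions"]:
            edges.add((tweet["user"], mentioned))
    return edges


def users_as_edges(tweets):
    """Edges (hashtag_a, hashtag_b, user): hashtags are the vertices and
    the user who put both hashtags in one tweet labels the edge."""
    edges = set()
    for tweet in tweets:
        for a, b in combinations(sorted(set(tweet["hashtags"])), 2):
            edges.add((a, b, tweet["user"]))
    return edges
```

Either edge set can then be loaded into a graph library or indexed as one document per edge.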

Social Media Data From Twitter
• Register a Twitter Developer Account
• Get your access keys and tokens at https://apps.twitter.com/
• Twitter Streaming APIs https://dev.twitter.com/streaming/overview
• Query by keywords and/or key phrases
• Query by locations (lon, lat pairs specifying a set of bounding boxes)
• Logstash Input Twitter plugin https://www.elastic.co/guide/en/logstash/2.3/plugins-inputs-twitter.html

Logstash Input Twitter Configuration

    input {
      twitter {
        # Add your data from https://apps.twitter.com/
        consumer_key => "WZyKcoN0jiQxAnF5R6QPkw"
        consumer_secret => "qnJBPpvLaNEg69Wb0o8ghJHlkKtBteyGzjG4fi7Esco"
        oauth_token => "1234567890-PysraHELjOy8D5t2KwOIn9IsgvtdttD89KY0etP"
        oauth_token_secret => "vsIrghbQRxfAuSTg5oKm1Hy7iS61aTpcEX19e6JNCo"
        keywords => ["ladygaga", "katyperry", "shakira", "rihanna", "KimKardashian", "Cristiano"]
        full_tweet => true
      }
    }
    output {
      stdout { codec => dots }
      elasticsearch {
        # Cluster nodes
        hosts => ["10.0.0.7", "10.0.0.8", "10.0.0.9", "10.0.0.10"]
        # Rotate indices daily
        index => "logstash-twitter-celebs-%{+YYYY-MM-dd}"
        # Use tweet id for the document id
        document_id => "%{id_str}"
      }
    }

Single Tweet Data

Single Thread Logstash Input Twitter
• “2016 Super Bowl 50”: 20 keywords, single Logstash instance
Acquisition rate cannot exceed 2.9K tweets per minute

Multiple Logstash Instances
[Diagram: Keyword 1 to Keyword 4 assigned to every instance, feeding a Combined Data Output]
• Same set of keywords
No significant increase in the data collection rate

Multiple Logstash Instances
[Diagram: Keyword 1 to Keyword 4 divided among the instances, feeding a Combined Data Output]
• Keywords spread across instances
Collection rate scales linearly with the number of nodes*
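Spreading the keywords can be sketched as a round-robin split, with one group per Logstash instance. This is a minimal illustration, not the speaker's tooling: `partition_keywords` and `keywords_setting` are hypothetical helper names, and the rendered line simply mirrors the `keywords` setting of the twitter input shown earlier.

```python
def partition_keywords(keywords, instances):
    """Round-robin split so every keyword lands on exactly one instance."""
    if instances < 1:
        raise ValueError("instances must be >= 1")
    groups = [[] for _ in range(instances)]
    for position, keyword in enumerate(keywords):
        groups[position % instances].append(keyword)
    return groups


def keywords_setting(group):
    """Render the `keywords => [...]` line for one instance's twitter input."""
    quoted = ", ".join('"{}"'.format(keyword) for keyword in group)
    return "keywords => [{}]".format(quoted)


if __name__ == "__main__":
    celebs = ["ladygaga", "katyperry", "shakira",
              "rihanna", "KimKardashian", "Cristiano"]
    for number, group in enumerate(partition_keywords(celebs, 4), start=1):
        print("instance {}: {}".format(number, keywords_setting(group)))
```

Round-robin keeps the groups within one keyword of each other in size. It does not balance tweet volume per keyword, which would need measured rates.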

Multiple Logstash Instances Outcome
• Collection rate from 4 Logstash instances peaking at 12K tweets per minute
• Indexing approximately 10M documents per day, occupying 100GB of disk space
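A back-of-the-envelope check of these figures, as a small sketch. It assumes decimal gigabytes and reuses the 2.9K tweets per minute single-instance ceiling quoted earlier. The function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def avg_doc_bytes(total_bytes, docs):
    """Average on-disk size of one indexed document."""
    return total_bytes / docs


def avg_rate_per_minute(docs_per_day):
    """Average ingest rate implied by a daily document count."""
    return docs_per_day / MINUTES_PER_DAY


def docs_per_day_at(rate_per_minute):
    """Daily document count if a per-minute rate were sustained all day."""
    return rate_per_minute * MINUTES_PER_DAY


if __name__ == "__main__":
    # Assumption: 100GB means 100 * 10**9 bytes.
    print(avg_doc_bytes(100 * 10**9, 10_000_000))
    print(avg_rate_per_minute(10_000_000))
    print(docs_per_day_at(12_000))
    print(4 * 2_900)
```

Under these assumptions each document takes about 10KB on disk, and the daily average is roughly 6.9K tweets per minute, well below the 12K peak. Four instances at the single-instance ceiling give 11.6K per minute, which is consistent with the observed peak.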

Outcome

Cluster

Lessons Learned
• Elasticsearch: The Definitive Guide with Sense Editor https://www.elastic.co
• Dynamic Templates and Logstash Templates https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/v0.2.4/lib/logstash/outputs/elasticsearch/elasticsearch-template.json
• Curator: Python command line tool for tending your ES indices (Optimize, Snapshot, Close, Delete indices)
• Keep your index small
• Increase ES_HEAP_SIZE
• Elasticsearch is intended for a multi-node environment
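To make the age-based housekeeping concrete, here is a minimal sketch of picking out daily indices older than a retention window from their names alone. It does not use Curator or talk to Elasticsearch. The prefix matches the index pattern from the Logstash configuration earlier, and the 30-day window and function names are assumptions for illustration.

```python
from datetime import date, datetime, timedelta

# Matches index => "logstash-twitter-celebs-%{+YYYY-MM-dd}" from the config.
PREFIX = "logstash-twitter-celebs-"


def index_date(name, prefix=PREFIX):
    """Return the date encoded in a daily index name, or None if it does not match."""
    if not name.startswith(prefix):
        return None
    try:
        return datetime.strptime(name[len(prefix):], "%Y-%m-%d").date()
    except ValueError:
        return None


def indices_older_than(names, days, today):
    """Names of daily indices whose date is more than `days` before `today`."""
    cutoff = today - timedelta(days=days)
    selected = []
    for name in names:
        day = index_date(name)
        if day is not None and day < cutoff:
            selected.append(name)
    return sorted(selected)


if __name__ == "__main__":
    names = [
        "logstash-twitter-celebs-2016-06-01",
        "logstash-twitter-celebs-2016-07-01",
        ".kibana",
    ]
    # Hypothetical retention window of 30 days.
    print(indices_older_than(names, 30, date(2016, 7, 7)))
```

The resulting list is what would then be handed to a close, snapshot, or delete step.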

Data from Facebook Graph API
• Graph API https://developers.facebook.com/docs/graph-api/
• Custom PHP command line application
About the Graph API
• Nodes, edges and fields
• Node: everything that has a unique ID (wall post, comment, user)
• Edge: the thing that connects the nodes
• Request to endpoint /{public-user-id}/feed
• Returns up to 250 of the events from the {public-user} timeline, each of which is one of the following types: status, link, photo, video, event, music, note
• Each item contains the latest 25 users from the comments and likes edges.
• A secondary request is needed to obtain the full list of likes and comments, which can reach 100K and occasionally exceed 1M per event.
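The secondary requests amount to walking a paged edge. Graph API responses carry their items under `data` and a link to the next page under `paging.next`. The sketch below follows that link until it disappears. It is written in Python for illustration only (the talk's application was PHP), and `iter_edge` and the injected `fetch` callable are hypothetical names. `fetch` stands for any function that performs the HTTP GET and returns the decoded JSON object.

```python
def iter_edge(first_url, fetch):
    """Yield every item of a paged edge, following `paging.next` until it is absent.

    `fetch` is an injected callable: url -> decoded JSON dict. Keeping it
    injectable means the walk can be exercised without network access.
    """
    url = first_url
    while url:
        page = fetch(url)
        for item in page.get("data", []):
            yield item
        url = page.get("paging", {}).get("next")


if __name__ == "__main__":
    # Canned pages standing in for real responses (hypothetical URLs and ids).
    pages = {
        "page-1": {"data": [{"id": "u1"}, {"id": "u2"}], "paging": {"next": "page-2"}},
        "page-2": {"data": [{"id": "u3"}], "paging": {}},
    }
    print([item["id"] for item in iter_edge("page-1", pages.__getitem__)])
```

Because the generator yields items as pages arrive, an edge with a million likes never has to be held in memory at once.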

FB Data Rates
[Charts: single public user timeline posts per hour; unique user interactions per day]

Figures and Challenges
• Acquiring the full data for 1K+ Facebook brands, resulting in over 60 million unique user interactions
• The indices occupy over 4TB of hard drive space.
Some challenges
• Inconsistent JSON structure leading to mapping conflicts -> explicitly request the fields.
• Daily index rotation is not sufficient: some {public-user} indices can exceed 10GB with the spikes in comments and likes -> further tuning of the number of shards is needed.
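Explicitly requesting fields means passing a comma-separated `fields` parameter on the feed request, so every returned item has the same shape. A minimal sketch of building such a URL follows. The helper name, the chosen field list, and the placeholder token are assumptions for illustration, and Python is used here although the talk's application was PHP.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

GRAPH_ROOT = "https://graph.facebook.com"


def feed_url(public_user_id, fields, access_token, limit=None):
    """Build a /{public-user-id}/feed request that names its fields explicitly."""
    params = {"fields": ",".join(fields), "access_token": access_token}
    if limit is not None:
        params["limit"] = str(limit)
    return "{}/{}/feed?{}".format(GRAPH_ROOT, public_user_id, urlencode(params))


if __name__ == "__main__":
    # Illustrative field list and placeholder token, not the speaker's values.
    url = feed_url("somebrand", ["id", "message", "created_time"], "TOKEN", limit=250)
    print(url)
    print(parse_qs(urlsplit(url).query)["fields"])
```

Pinning the field list keeps the document structure stable across items, which is what avoids the mapping conflicts when they are indexed.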

Thank you
