
PHP At The Firehose Scale

Here's one to give the PHP bashers a well-deserved black eye! Twitter is one of the world's best-known social media sites, handling over 500 million public tweets a day (that's around 6,000 tweets a second). Together, these tweets are delivered as a 'firehose' of data to select partners, who in turn deliver it on to their customers. DataSift is one of Twitter's firehose partners, and when someone presses 'Send' in their Twitter client, we aim to get that tweet into the hands of our customers in about 1 second. PHP plays several key roles in making that possible. Come along and hear Stuart explain just how.

Presented at @phpukconference in February 2014.

Stuart Herbert

February 21, 2014

Transcript

  1. Contents: (1) What Is A Firehose? (2) Receiving The Firehose (3) Delivering The Filtered Hose (4) PHP In Our Overall Architecture (5) Summary: Why PHP
  6. Scraping (verb): your robot downloads a webpage or website, and tries to make sense of the marked-up content
  9. Firehose (noun): a live stream of events that you consume
  13. Firehoses Have Sharp Peaks: “New Tweets per second (TPS) record: 143,199 TPS.” August 3rd, 2013. Source: Twitter Engineering Blog
  14. Firehoses Are Relentless: “Typical day: more than 500 million Tweets sent; average 5,700 TPS.” August 3rd, 2013. Source: Twitter Engineering Blog
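
As a quick sanity check on the two quoted figures, the arithmetic can be sketched in PHP. The variable names are illustrative only; the numbers are the ones quoted from the Twitter Engineering Blog, and nothing here comes from DataSift's own code.

```php
<?php
// Figures quoted on the slides (source: Twitter Engineering Blog).
$tweetsPerDay = 500000000;  // "more than 500 million Tweets" on a typical day
$averageTps   = 5700;       // quoted average tweets per second
$peakTps      = 143199;     // quoted record tweets per second

$secondsPerDay = 24 * 60 * 60;                           // 86,400 seconds
$impliedTps    = intdiv($tweetsPerDay, $secondsPerDay);  // whole tweets per second
$peakRatio     = round($peakTps / $averageTps, 1);       // peak relative to the average

echo "Implied average: {$impliedTps} TPS\n";
echo "Peak is roughly {$peakRatio}x the quoted average\n";
```

The implied average of 5,787 TPS is consistent with the quoted 5,700 (and with "around 6,000"), while the record peak sits about 25 times higher, which is why a consumer has to be sized for bursts rather than for the daily mean.
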
  15. 100%: amount of incoming data that goes through our PHP code
  16. Receiving Firehoses [diagram; labels: 1 … N, Goblin InputTasks, Ogre (twice), Interaction Assembly, Links Augmentation, Other Augmentation, Dispatch]
  22. PHP JobQueue [diagram; labels: Supervisord, Manager, Workers (three), Ogre (twice), Config; arrows: Starts (twice), Monitors, Loads, Read From, Write To]
  28. Delivering The Filtered Hose [diagram; labels: 1 … N, Access Control + Producer, Kafka, Push Scheduler, Push Delivery (three); arrows: Writes To (twice), Reads From, Starts, Discovers]
  35. > 1 Firehose: amount of outgoing data delivered by our PHP code every second
  38. > 2 Firehoses: peak amount of outgoing data delivered by our PHP code
  39. DataSift Architecture 2.4 [diagram, credited to @lorenzoalberton and @datasift. Labels include: Input Streams (Twitter, Facebook, Wikipedia, Reddit, LexisNexis, WordPress, IntenseDebate, NewsCred, BoardReader, SinaWeibo, TencentWeibo, RenRen, Bit.ly, Tumblr); Managed Sources (Instagram, Facebook and Google+ connectors); Goblin Head and Goblin Tail; Ogre; Interaction Generation; Augmentation Pipeline (Links Resolution + OpenGraph + Twitter Cards + Metadata, Demographics, Trends Analysis, Sentiment Analysis, Named Entities, Topics Analysis, Language Detection, Klout Score + Profile); CSDL Compiler, Validator, Normaliser; Pickle Filtering Engine (Prism, Pickle node shards, Tardis); Kafka, Redis, Zookeeper; Archiver and HDFS; Historics, Recording and Push Schedulers; PUSH Producer and PUSH Delivery to HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, PostgreSQL, MySQL, MongoDB, CouchDB; WebSockets and HTTP Streaming; WEB + API]
  40. Same diagram, with a subset of labels repeated in the extraction (apparently a highlighted overlay), including Augmentation Pipeline, Interaction Generation, Ogre, Buffered Streams, PUSH Scheduler and PUSH Delivery
  41. Same diagram, with a larger subset repeated, adding among others Input Streams, WEB + API, Goblin Tail and Stream Splitter/Joiner
  42. Same diagram as slide 39
  43. Data Ingestion, Assembly, Augmentation [same diagram]
  44. Real-Time Product [same diagram]
  45. Historics Product [same diagram]
  46. • DataSift grew out of TweetMeme
      • Primary or secondary language of all staff
      • We already knew how to optimise and scale it
  49. • Doesn’t crash, doesn’t segfault, doesn’t leak memory, doesn’t have unpredictable GC
      • Doesn’t wake our Ops team up in the middle of the night
      • NodeJS only other engine that comes close in our experience so far
  52. Unstructured Data: plain old PHP objects are superb for handling arbitrary data
  53. • json_decode() creates an object tree without requiring any sort of input schema
      • We don’t need typed objects for #bigdata processing
      • Key for scaling our data sources, and ultimately scaling the business
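
The schema-free behaviour of json_decode() described on this slide can be illustrated with a minimal sketch. The payload and its field names (`id`, `user`, `langs`, `extra`, `geo`) are invented for illustration and are not taken from any real feed or from DataSift's code.

```php
<?php
// Hypothetical payload: the field names below are made up for illustration.
$json = '{"id": 42, "user": {"name": "alice", "langs": ["en", "fr"]}, "extra": {"anything": true}}';

// With no second argument, json_decode() returns nested stdClass objects
// (JSON arrays become PHP arrays). No schema or class definitions are needed.
$item = json_decode($json);

if ($item === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

echo get_class($item), "\n";        // stdClass
echo $item->user->name, "\n";       // alice
echo $item->user->langs[1], "\n";   // fr

// Fields nobody anticipated are still present, and absent ones can be probed safely.
var_dump(isset($item->extra->anything));  // bool(true)
var_dump(isset($item->geo->lat));         // bool(false)
```

Because nothing in the code names the payload's shape up front, a new data source with different fields would decode the same way; only the code that reads specific properties needs to know about them.
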
  56. String Handling
      Strings are binary data if you’re just copying them around
  57. • Incoming data is UTF-8 encoded
      • We don’t use PHP to process the data
      • C++ / JVM used when we need to process UTF-8 data
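
The "strings are binary data" point can be shown with a small sketch. The sample text is made up, php://memory stands in for whatever socket or queue would carry the data in practice, and the "\u{...}" escapes assume PHP 7.0 or later. The code only copies bytes and never inspects the encoding.

```php
<?php
// A UTF-8 string with multi-byte characters: "café" followed by a coffee cup.
$tweet = "caf\u{00E9} \u{2615}";

// strlen() counts bytes, not characters: this is 9 bytes for 6 characters.
$bytes = strlen($tweet);

// Copying and concatenating do not look at the encoding at all.
$copy = $tweet;
$framed = $copy . "\n";

// Round-trip through a stream; php://memory stands in for a socket or queue.
$stream = fopen('php://memory', 'w+');
fwrite($stream, $framed);
rewind($stream);
$received = rtrim(fgets($stream), "\n");
fclose($stream);

// The bytes come back unchanged.
var_dump($received === $tweet);
```

Anything that has to understand characters rather than bytes (case folding, tokenising, truncating to a character count) is a different job, which matches the slide's note that C++ or the JVM handles actual UTF-8 processing.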
  60. Connectivity
      PHP ships with robust, well-maintained built-in support for talking to most things
  61. • What’s bundled works really well
      • Greatly contributes to reliability in production
      • We end up patching all non-bundled extensions
  64. Firehose (noun): a live stream of events that you consume
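
As a rough sketch of what consuming a firehose means in code, the generator below yields one decoded event per line from any readable stream. The newline-delimited JSON framing, the consumeEvents() name and the event fields are assumptions made for this example, not details from the talk. php://memory stands in for a live connection, which in practice would not end.

```php
<?php
/**
 * Yield one decoded event per line from a readable stream.
 * Newline-delimited JSON is an assumption made for this sketch.
 */
function consumeEvents($stream): Generator
{
    while (($line = fgets($stream)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        $event = json_decode($line);
        if ($event === null) {
            continue; // skip lines that do not decode (a literal null is skipped too)
        }
        yield $event;
    }
}

// php://memory stands in for a live network connection.
$stream = fopen('php://memory', 'w+');
fwrite($stream, "{\"id\": 1}\n\n{\"id\": 2}\nnot json\n{\"id\": 3}\n");
rewind($stream);

$ids = [];
foreach (consumeEvents($stream) as $event) {
    $ids[] = $event->id; // events are handled one at a time as they arrive
}
fclose($stream);

echo implode(',', $ids), PHP_EOL;
```

The generator keeps only one event in memory at a time, so the consumer's footprint does not grow with the length of the stream.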
  65. Receiving Firehoses
      [Diagram, steps numbered 1 to 7; labels: Goblin, InputTasks, Ogre, Interaction Assembly, Links Augmentation, Other Augmentation, Dispatch]
  66. Delivering The Filtered Hose
      [Diagram, steps numbered 1 to 7; labels: Access Control + Producer, Kafka, Push Scheduler, Push Delivery (x3); arrows: Writes To, Reads From, Starts, Discovers]
  67. PHP JobQueue
      [Diagram; labels: Supervisord, Manager, Workers (x3), Ogre (x2), Config; arrows: Starts, Monitors, Read From, Write To, Loads]
  68. Flickr URL
      [Diagram: “DataSift Architecture 2.4”, credited to @lorenzoalberton and @datasift. Section labels recoverable from the image: Input Streams (Twitter, Facebook, Wikipedia, Reddit, LexisNexis, WordPress, IntenseDebate, NewsCred, BoardReader, SinaWeibo, TencentWeibo, RenRen, Bit.ly, Tumblr, Instagram, Google+), Managed Sources, Data ingestion + Augmentation, Augmentation Pipeline, Filtering Engine, Historics, WEB + API, PUSH Scheduler / Producer / Delivery (HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, PostgreSQL, MySQL, MongoDB, CouchDB), Monitoring, Billing Pipeline]
  69. • our history
      • reliability
      • handling of unstructured data
      • binary strings by default
      • talks to everything
      • pragmatic design philosophy