PHP At The Firehose Scale

Here's one to give the PHP bashers a well-deserved black eye! Twitter is one of the world's best-known social media sites, handling over 500 million public tweets a day (that's around 6,000 tweets a second). Together, these tweets are delivered as a 'firehose' of data to select partners, who in turn deliver it on to their customers. DataSift is one of Twitter's firehose partners, and when someone presses 'Send' in their Twitter client, we aim to get that tweet into the hands of our customers in about 1 second. PHP plays several key roles in making that possible. Come along and hear Stuart explain just how.

Presented at @phpukconference in February 2014.

Stuart Herbert

February 21, 2014

Transcript

  1. PHP At The Firehose Scale
  2. Introductions
  3. Using PHP at #webscale is a well-understood problem
  4. #bigdata means handling data at firehose scale
  5. #bigdata is the business of society
  6. #bigdata handling is an old problem rediscovered
  7. PHP has two key roles in processing #bigdata
  8. Contents: (1) What Is A Firehose? (2) Receiving The Firehose (3) Delivering The Filtered Hose (4) PHP In Our Overall Architecture (5) Summary: Why PHP
  13. What is a firehose?
  14. Who Has Scraped Content? http://flic.kr/p/6FZzoF
  15. Scraping (verb): your robot downloads a webpage or website, and tries to make sense of the marked-up content
  18. Most Famous Website Scraper
  19. That’s Really Hard Work http://flic.kr/p/7xa3pR
  20. Firehose (noun): a live stream of events that you consume
  24. Firehoses Have Sharp Peaks: “New Tweets per second (TPS) record: 143,199 TPS.” August 3rd, 2013. Source: Twitter Engineering Blog
  25. Firehoses Are Relentless: “Typical day: more than 500 million Tweets sent; average 5,700 TPS.” August 3rd, 2013. Source: Twitter Engineering Blog
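The peak and average figures quoted on these two slides can be related with a little arithmetic. The snippet below is only a back-of-the-envelope sketch; the variable names are illustrative, and it assumes tweets are spread evenly over a 24-hour day, which real traffic is not.

```php
<?php
// Figures as quoted on the slides.
$tweetsPerDay     = 500000000; // "more than 500 million Tweets sent"
$quotedAverageTps = 5700;      // "average 5,700 TPS"
$recordPeakTps    = 143199;    // record Tweets per second

$secondsPerDay = 24 * 60 * 60; // 86,400

// Average rate implied by the daily total (assumes an even spread).
$impliedAverageTps = $tweetsPerDay / $secondsPerDay;

// How far the record peak sits above the quoted average.
$peakToAverage = $recordPeakTps / $quotedAverageTps;

printf("Implied average: %.0f TPS\n", $impliedAverageTps);
printf("Peak is about %.0fx the average\n", $peakToAverage);
```

The daily total works out at roughly 5,787 tweets a second, in line with the quoted average, while the record peak is about 25 times higher. That gap is the "sharp peaks" point: capacity has to be sized for the spike, not the mean.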
  26. What A Firehose Feels Like http://flic.kr/p/6DqetD
  27. 100%: Amount of incoming data that goes through our PHP code
  28. Receiving The Firehose
  29. Receiving Firehoses [diagram with numbered callouts 1 to 7; labels: InputTasks, Goblin, Ogre, Interaction Assembly, Links Augmentation, Other Augmentation, Dispatch]
  35. PHP JobQueue Workers [diagram; labels: Supervisord, Manager, Workers, Ogre, Config; arrows: Starts, Monitors, Loads, Read From, Write To]
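The diagram labels on this slide (Workers, Read From, Write To) suggest long-running PHP workers that take messages from one queue and hand results to another. The transcript does not show any of that code, so the following is a hypothetical sketch only: `runWorker()` and the tagging handler are invented for illustration, `SplQueue` stands in for whatever network queue a real worker would talk to, and the loop drains its input and returns rather than blocking for more work. It assumes PHP 7 or later.

```php
<?php
// Hypothetical worker loop; names and queue types are illustrative only.
function runWorker(SplQueue $input, SplQueue $output, callable $handler): int
{
    $processed = 0;
    while (!$input->isEmpty()) {
        $job = $input->dequeue(); // oldest message first
        try {
            $output->enqueue($handler($job));
            $processed++;
        } catch (Throwable $e) {
            // One bad message should not take the worker down.
            error_log('Dropped job: ' . $e->getMessage());
        }
    }
    return $processed;
}

// In-memory stand-ins for the upstream and downstream queues.
$in = new SplQueue();
$in->enqueue('{"id":1,"text":"hello"}');
$in->enqueue('{"id":2,"text":"world"}');
$out = new SplQueue();

// The handler just tags each message; real work would go in its place.
$count = runWorker($in, $out, function (string $raw): string {
    $msg = json_decode($raw);
    $msg->seen = true;
    return json_encode($msg);
});
```

Keeping the loop this small is what makes an external supervisor practical: if a worker dies, it can simply be started again and carry on from the queue.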
  41. Delivering The Filtered Hose
  42. Delivering The Filtered Hose [diagram with numbered callouts 1 to 7; labels: Access Control + Producer, Push Scheduler, Kafka, Push Delivery (three instances); arrows: Writes To, Reads From, Starts, Discovers]
  49. > 1 Firehose: Amount of outgoing data delivered by our PHP code every second
  52. > 2 Firehoses: Peak amount of outgoing data delivered by our PHP code
  55. PHP in our overall architecture
  56. DataSift Architecture 2.4 [full-platform architecture diagram, credited on the slide to @lorenzoalberton; individual component labels omitted here]
  57. DataSift Architecture 2.4 [same diagram; the transcript repeats a subset of labels, including the Augmentation Pipeline, Ogre, PUSH Scheduler and PUSH Delivery]
  58. DataSift Architecture 2.4 [same diagram; the transcript repeats a larger subset of labels, including Input Streams, Goblin Tail, WEB + API and the Billing Pipeline]
  60. Data Ingestion, Assembly, Augmentation [same architecture diagram]
  61. Real-Time Product [same architecture diagram]
  62. Historics Product [same architecture diagram]
  63. Why do we use PHP?
  64. Our History: We’d used PHP before
  65. • DataSift grew out of TweetMeme
      • Primary or secondary language of all staff
      • We already knew how to optimise and scale it
  68. It Works: PHP Engine Is Incredibly Reliable
  69. • Doesn’t crash, doesn’t segfault, doesn’t leak memory, doesn’t have unpredictable GC
      • Doesn’t wake our Ops team up in the middle of the night
      • NodeJS only other engine that comes close in our experience so far
  72. Unstructured Data: plain old PHP objects are superb for handling arbitrary data
  73. • json_decode() creates an object tree without requiring any sort of input schema
      • We don’t need typed objects for #bigdata processing
      • Key for scaling our data sources, and ultimately scaling the business
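The json_decode() point can be illustrated with a minimal sketch. The payload and its field names are invented for illustration and are not DataSift's actual interaction format; only the built-in json_decode(), json_encode() and stdClass behaviour is relied on.

```php
<?php
// Hypothetical payload: every field name here is invented for illustration.
$json = '{"interaction":{"type":"twitter","content":"hello world"},'
      . '"twitter":{"retweet_count":3,"user":{"screen_name":"example"}}}';

// With no second argument, json_decode() returns nested stdClass objects.
$item = json_decode($json);
if ($item === null) {
    throw new RuntimeException('invalid JSON: ' . json_last_error_msg());
}

// Fields are plain properties; no class or schema was declared anywhere.
$type = $item->interaction->type;

// Optional fields are probed with isset() rather than modelled up front.
$hasGeo = isset($item->interaction->geo);

// New data can be attached to the tree without changing any type.
$item->language = new stdClass();
$item->language->tag = 'en';

$out = json_encode($item);
echo $type, ' ', var_export($hasGeo, true), ' ', $out, PHP_EOL;
```

Because nothing about the payload's shape is declared in advance, a new data source with different fields would pass through the same code path unchanged, which is one plausible reading of the "scaling our data sources" bullet.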
  76. String Handling: strings are binary data if you’re just copying them around
  77. • Incoming data is UTF-8 encoded
      • We don’t use PHP to process the data
      • C++ / JVM used when we need to process UTF-8 data
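A small sketch of what "strings are binary data" means in practice. The length-prefixed frame is a made-up wire format chosen only to show byte-level copying; it is not taken from the talk.

```php
<?php
// "café" encoded as UTF-8: 4 characters, 5 bytes.
$payload = "caf\xC3\xA9";

// strlen() counts bytes, which is what matters when moving data around.
$byteLength = strlen($payload);

// Hypothetical wire format: 4-byte big-endian length, then the raw bytes.
$frame = pack('N', $byteLength) . $payload;

// Reading it back needs no knowledge of the encoding.
$header = unpack('Nlength', substr($frame, 0, 4));
$copy = substr($frame, 4, $header['length']);

var_dump($byteLength, $copy === $payload);
```

Nothing here inspects individual characters, so the UTF-8 content passes through byte for byte; character-aware work is the part the slides say is left to C++ or the JVM.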
  80. Connectivity: PHP ships with robust, well-maintained built-in support for talking to most things
  81. • What’s bundled works really well
      • Greatly contributes to reliability in production
      • We end up patching all non-bundled extensions
  84. Philosophy: the “share nothing” architecture is uncommon in other dev communities
  85. In Summary

  86. Firehose (noun): a live stream of events that you consume
  87. Receiving Firehoses [diagram labels: inputs 1 … N; Goblin InputTasks; Interaction Assembly; Links Augmentation; Other Augmentation; Dispatch; Ogre]
  88. Delivering The Filtered Hose [diagram labels: streams 1 … N; Access Control + Producer; Kafka; Push Scheduler; Push Delivery (three instances); arrows: Writes To, Reads From, Starts, Discovers]
  89. PHP JobQueue [diagram labels: Supervisord; Manager; Workers; Config; Ogre; arrows: Starts, Monitors, Loads, Read From, Write To]
  90. [Diagram: DataSift Architecture 2.4, credited on the slide to @lorenzoalberton and @datasift. Labelled areas include Input Streams, Managed Sources (Push/Pull), Data ingestion + Augmentation (Goblin Head, Goblin Tail, Interaction Generation, Ogre), the Augmentation Pipeline, the Pickle Filtering Engine and its node shards, Kafka queues, Historics (HDFS archive), Real-time Streams (WebSockets, HTTP Streaming), PUSH Scheduler, PUSH Producer and PUSH Delivery, and Exports and Analytics. Sources named on the diagram: Twitter, Facebook, Wikipedia, Reddit, LexisNexis, WordPress, IntenseDebate, NewsCred, BoardReader, SinaWeibo, TencentWeibo, RenRen, Bit.ly, Tumblr. Destinations named after PUSH Delivery: HTTP(S) POST, (S)FTP, Amazon S3, DynamoDB, PostgreSQL, MySQL, MongoDB, CouchDB. Google BigQuery and IBM Cognos are also named.]
  91. • our history
      • reliability
      • handling of unstructured data
      • binary strings by default
      • talks to everything
      • pragmatic design philosophy
  97. Thank You. Any Questions?