Processing events at scale

Processing (near) real-time data streams usually turns out to be very hard. Events arrive fast: hundreds, thousands, or tens of thousands per second. The logic behind each event can be complex, time-consuming, or both, so executing it inside the HTTP request-response flow is often not the best approach. Fortunately, there are several ways to support event processing in our applications.

In this talk, I would like to introduce some basic concepts behind distributing event processing across server clusters. I will briefly cover queue systems, using RabbitMQ as the example, where you can store and route messages between producers and consumers, and distributed real-time computation systems such as Apache Storm, where you can build complex topologies and process up to a million tuples per second per node. Technology is important, but what seems even more important is moving the center of gravity of event processing from the HTTP request-response flow to a separate layer that can be scaled out as far as needed.
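To make the idea of a separate processing layer concrete, here is a minimal sketch in plain Java. It is not RabbitMQ code: an in-memory `BlockingQueue` stands in for the broker, and `OffloadSketch`, `handleRequest`, `startWorker`, and `run` are hypothetical names chosen for illustration. The "request handler" only enqueues the event, while a pool of consumer threads does the heavy work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: an in-memory queue stands in for the broker, and all names are hypothetical.
class OffloadSketch {
    private static final String STOP = "__stop__"; // sentinel telling a worker to exit

    // The "request handler" only enqueues the event, so it can respond right away.
    static void handleRequest(BlockingQueue<String> queue, String event) {
        queue.add(event); // the queue is unbounded, so add() does not block
    }

    // The separate processing layer: each worker drains the queue until it sees STOP.
    static Thread startWorker(BlockingQueue<String> queue, List<String> processed) {
        Thread worker = new Thread(() -> {
            try {
                for (String event = queue.take(); !event.equals(STOP); event = queue.take()) {
                    processed.add(event.toUpperCase()); // placeholder for the heavy logic
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        return worker;
    }

    // Pushes all events through `workers` consumers and returns the processed results.
    static List<String> run(List<String> events, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) threads.add(startWorker(queue, processed));
        for (String event : events) handleRequest(queue, event);
        for (int i = 0; i < workers; i++) queue.add(STOP); // one sentinel per worker
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> out = run(List.of("tweet-1", "tweet-2", "tweet-3"), 2);
        System.out.println(out.size() + " events processed outside the request path");
    }
}
```

With more than one worker the order of results is not deterministic, which is usually acceptable for this kind of background work. In a real deployment the queue would live in a broker process, so producers and consumers could run and scale on different machines.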

We will also talk about storing data stream events. Sharded SQL databases and Hadoop-based tools are good, but there are dedicated tools, such as Druid, that can store and aggregate billions of events.
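As a rough illustration of what "aggregating events" can mean for storage, the sketch below pre-aggregates raw events into per-minute counts per event type, so that far fewer rows need to be stored and queried. This is plain Java, not Druid's API; the `Event` shape, the one-minute granularity, and the `"<bucketStartMillis>|<type>"` key format are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a hand-rolled time-bucket rollup, with hypothetical names.
class RollupSketch {
    // Hypothetical event shape: epoch milliseconds plus a single dimension.
    record Event(long timestampMillis, String type) {}

    private static final long MINUTE_MILLIS = 60_000L;

    // Counts events per (minute bucket, type); keys look like "<bucketStartMillis>|<type>".
    static Map<String, Long> rollupByMinute(List<Event> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event e : events) {
            // Truncate the timestamp down to the start of its minute.
            long bucket = Math.floorDiv(e.timestampMillis(), MINUTE_MILLIS) * MINUTE_MILLIS;
            counts.merge(bucket + "|" + e.type(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(1_000L, "click"),
            new Event(59_999L, "click"),
            new Event(60_000L, "click"),
            new Event(61_000L, "view"));
        System.out.println(rollupByMinute(events));
    }
}
```

The trade-off is that individual events are no longer recoverable after the rollup; only the aggregates at the chosen granularity remain.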

Mariusz Gil

January 30, 2015

Transcript

  1. processing events at SCALE, Mariusz Gil

  2. WROCŁAW, POLAND

  3. None
  4. None
  5. processing events at SCALE: Who is working on an application which

    is running on more than X servers?
  6. SCALE SCALE SCALE SCALE at

  7. None
  8. user browser &

  9. user browser & data processing, rendering request, routing

  10. faster is better slower is unusable

  11. requests sometimes are heavy… …too heavy

  12. requests sometimes are heavy… …too heavy

  13. event processing logic should be moved out from request-response loop

  14. None
  15. RabbitMQ is a platform to send and receive messages

  16. producer consumer Firstly

  17. producer consumer1 consumer2 Basic QoS settings

  18. producer consumer1 consumer2

  19. producer consumer1 consumer2

  20. None
  21. <?php
      namespace Acme\DemoBundle\Controller;

      use Symfony\Bundle\FrameworkBundle\Controller\Controller;

      class TweetController extends Controller
      {
          public function newTweetAction()
          {
              // ...
              // EXAMPLE AND VERY NAIVE IMPLEMENTATION
              $form->handleRequest($request);
              if ($form->isValid()) {
                  $this->get('tweet_feed_producer')->publish(array(
                      'user' => $user,
                      'tweet' => 'Lorem ipsum dolor sit amet...'
                  ));
              }
              // ...
          }
      }
  22. <?php
      namespace Acme\DemoBundle\Consumer;

      use OldSound\RabbitMqBundle\RabbitMq\ConsumerInterface;
      use PhpAmqpLib\Message\AMQPMessage;

      class TweetFeedsConsumer implements ConsumerInterface
      {
          public function execute(AMQPMessage $msg)
          {
              // ...
              // EXAMPLE AND VERY NAIVE IMPLEMENTATION
              $friends = $user->getFriends();
              foreach ($friends as $friend) {
                  $friend->getFeed()->push($tweet);
              }
              return true;
          }
      }
  23. How to know if our consumers layer is efficient or

    not?
  24. producer consumer producer consumer producer consumer Directed Acyclic Graphs

  25. None
  26. Apache Storm is a distributed realtime computation system

  27. doing for realtime processing what Hadoop did for batch processing

  28. written in Clojure, but language agnostic

  29. use cases: realtime analytics, online machine learning, continuous computation, distributed RPC

    Storm's small set of primitives satisfy a stunning number of use cases.
  30. stream: unbounded sequence of tuples

  31. spout source of streams Joke about TCP and UDP connections

  32. bolt process input stream and produce new one

  33. bolt process input stream and produce new one

  34. Topology is a network of spouts and bolts Directed Acyclic

    Multi-Graphs
  35. Infrastructure level: nimbus, supervisors, workers, Apache ZooKeeper

  36. public class RandomSentenceSpout extends BaseRichSpout {
          SpoutOutputCollector _collector;
          Random _rand;

          @Override
          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              _collector = collector;
              _rand = new Random();
          }

          @Override
          public void nextTuple() {
              Utils.sleep(100);
              String[] sentences = new String[] {
                  "the cow jumped over the moon",
                  "an apple a day keeps the doctor away",
                  "four score and seven years ago",
                  "snow white and the seven dwarfs",
                  "i am at two with nature"};
              String sentence = sentences[_rand.nextInt(sentences.length)];
              _collector.emit(new Values(sentence));
          }

          @Override
          public void ack(Object id) { }

          @Override
          public void fail(Object id) { }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }
  37. public static class WordCount extends BaseBasicBolt {
          Map<String, Integer> counts = new HashMap<String, Integer>();

          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              String word = tuple.getString(0);
              Integer count = counts.get(word);
              if (count == null) count = 0;
              count++;
              counts.put(word, count);
              collector.emit(new Values(word, count));
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word", "count"));
          }
      }
  38. public class WordCountTopology {
          public static void main(String[] args) throws Exception {
              TopologyBuilder builder = new TopologyBuilder();
              builder.setSpout("spout", new RandomSentenceSpout(), 5);
              builder.setBolt("split", new SplitSentence(), 8)
                     .shuffleGrouping("spout");
              builder.setBolt("count", new WordCount(), 12)
                     .fieldsGrouping("split", new Fields("word"));

              Config conf = new Config();
              conf.setDebug(true);

              if (args != null && args.length > 0) {
                  conf.setNumWorkers(3);
                  StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
              } else {
                  conf.setMaxTaskParallelism(3);
                  LocalCluster cluster = new LocalCluster();
                  cluster.submitTopology("word-count", conf, builder.createTopology());
                  Thread.sleep(10000);
                  cluster.shutdown();
              }
          }
      }
  39. Trident: high-level abstraction for realtime processing

  40. FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
          new Values("the cow jumped over the moon"),
          new Values("the man went to the store and bought some candy"),
          new Values("four score and seven years ago"),
          new Values("how many apples can you eat"));
      spout.setCycle(true);

      TridentTopology topology = new TridentTopology();
      TridentState wordCounts = topology.newStream("spout1", spout)
          .each(new Fields("sentence"), new Split(), new Fields("word"))
          .groupBy(new Fields("word"))
          .persistentAggregate(
              new MemoryMapState.Factory(),
              new Count(),
              new Fields("count")
          ).parallelismHint(6);
  41. Where are my events after all?

  42. NOWHERE unfortunately…

  43. None
  44. +

  45. +

  46. None
  47. DO o

  48. DON'T overengineering: Redis pub/sub

  49. @mariuszgil

  50. THANKS

  51. ( it depends )
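
The WordCountTopology slide attaches the count bolt with `fieldsGrouping("split", new Fields("word"))`. The point of that grouping is that tuples with the same value of the grouping field always reach the same task, so each task can keep its own in-memory counter. The sketch below illustrates the idea in plain Java without Storm; hashing the word modulo the number of tasks is an assumption made for illustration, not a statement about Storm's internals, and `FieldsGroupingSketch`, `taskFor`, and `countWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: simulates key-based routing to parallel counters, with hypothetical names.
class FieldsGroupingSketch {
    // Picks a task index for a key; equal keys always map to the same task.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Splits sentences into words, routes each word to one task, and counts per task.
    static List<Map<String, Integer>> countWords(List<String> sentences, int numTasks) {
        List<Map<String, Integer>> perTask = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) perTask.add(new HashMap<>());
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                // Each task owns a disjoint set of words, so no counter is shared.
                perTask.get(taskFor(word, numTasks)).merge(word, 1, Integer::sum);
            }
        }
        return perTask;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
            "the cow jumped over the moon",
            "snow white and the seven dwarfs");
        List<Map<String, Integer>> perTask = countWords(sentences, 3);
        for (int i = 0; i < perTask.size(); i++) {
            System.out.println("task " + i + " -> " + perTask.get(i));
        }
    }
}
```

By contrast, a shuffle-style grouping spreads tuples evenly without regard to their content, which suits stateless steps such as splitting sentences but would scatter the count for a single word across several tasks.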