
SMACK my data up!



Presentation "SMACK my data up!" given at SnowCamp.io 2018 by Logan Hauspie (@lhauspie) and Manuel Verriez (@mverriez).


Manuel Verriez

January 26, 2018

Transcript

  1. “ I want to capture and store all the tweets that interest me!
    ...but I don't really know yet what I'm going to do with them...
  2. #blockchain #javascript #java #scala #deeplearning #keras #kubernetes #openshift #data #prometheus

    #grafana #oauth #iot #flink #apache #dataviz #continuousdelivery #elixir #angular #microservices #azure #aws #gcp #paxos #devtools #openfaas #zenibar #datascience #EE4J #virtualization #fnproject #geodata #bigdata #beanvalidation #golang #spark #mesos #akka #cassandra #kafka #chatbot #AI #laravel #vuejs #containers #nodejs #jenkins #elasticsearch #symfony #http2 #kotlin #dataprivacy #degooglisation #graphql #restful #snowcamp #grenoble
  3. Cassandra data model example: keyspace “Social_Network_Store”, table “Tweets”.
    Row “tweet_1”: column “message” = “content”, column “hashtags” = “#1, #2”, column “location” = “45.18, 5.68”
    Row “tweet_2”: column “message” = “coucou”, column “hashtags” = “#X, #Y”, column “likes” = 98
  4. "tweet":{ "id":951092652749021184, "createdAt":1515593133000, "lang":"en", "user":{...}, "text":"[...] #digital #trends", "hashtagEntities": [

    {"start":14, "end":22, "text":"digital"}, {"start":23, "end":30, "text":"trends"} ] } CREATE TABLE tweets.tweets ( statusId bigint, hashtags set<text>, createdAt timestamp, message text, author text, picture text, links set<text>, PRIMARY KEY (statusId) );
  5. Diagram: a Cassandra data center drawn as a token ring (nodes at tokens 0, 5, 10 ... 55);
    the rows (“tweet_5”, “#Tag”, 2017-01-25), (“tweet_12”, “#TT”, 2017-01-24),
    (“tweet_15”, “#HT”, 2017-01-27) and (“tweet_24”, “#Tag”, 2017-01-26) are distributed
    across the nodes according to their partition key.
  6. Diagram: nodes 10, 15 and 20 of the ring, with the rows from the previous slide each
    appearing on more than one node (replication across the cluster).
  7. CREATE TABLE tweets.tweets_by_hashtag (
      hashtag text, statusId bigint, hashtags set<text>, createdAt timestamp,
      message text, author text, picture text, links set<text>,
      PRIMARY KEY (hashtag, createdAt, statusId)
    ) WITH CLUSTERING ORDER BY (createdAt DESC);
    Partition key: hashtag. Clustering keys: createdAt, statusId.
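    A minimal sketch (assuming the same DataStax 3.x session setup as in the earlier sketch) of the query this table is designed for: the partition key selects one hashtag and the clustering order returns the newest tweets first:

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("tweets")

    // Rows of the "snowcamp" partition come back in CLUSTERING ORDER, i.e. newest first.
    val rows = session.execute(
      "SELECT createdAt, message FROM tweets_by_hashtag WHERE hashtag = 'snowcamp' LIMIT 10")
    rows.asScala.foreach(r => println(s"${r.getTimestamp("createdAt")} ${r.getString("message")}"))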
  8. Diagram: the Scala application subscribes to the Twitter API (statuses/filter), which pushes
    tweets to it; an Akka actor (Tweet Reader) receives each Tweet message and forwards it to
    Akka actor(s) (Tweet Writer) that write it into the Cassandra storage.
  9. Plain Java concurrency: shared state, threads, locks, global/static variables, Runnable,
    synchronized, wait, notify...
    Akka: actors that only interact by exchanging messages.
  10. // A pool of 5 writer actors behind a round-robin router, plus one receiver actor.
    val tweetsWriter = system.actorOf(
      Props(new TweetToCassandraWriterActor(cluster, config.keyspace))
        .withRouter(new RoundRobinPool(5)), "router")
    val tweetReceiver = system.actorOf(Props(new TweetsReceiverActor(tweetsWriter)))
    // Start scanning: the receiver expects a Seq of query terms (see its receive on the next slide).
    tweetReceiver ! Seq("snowcamp")
  11. override def receive: Receive = {
      case query: Seq[String] => twitterStream.filter(query: _*)
      case StopScan =>
        println("Actor is going to stop itself...")
        tweetWriter ! StopScan
        context.stop(self)
    }
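    The slides only show the reader side; a minimal sketch of what the writer actor from slide 10 could look like, assuming a hypothetical Tweet case class, the StopScan message above and the DataStax 3.x driver API (field names follow the tweets.tweets table):

    import akka.actor.Actor
    import com.datastax.driver.core.{Cluster, Session}
    import scala.collection.JavaConverters._

    // Hypothetical message types, used for illustration only.
    case class Tweet(statusId: Long, hashtags: Set[String], createdAt: java.util.Date, message: String)
    case object StopScan

    class TweetToCassandraWriterActor(cluster: Cluster, keyspace: String) extends Actor {
      private val session: Session = cluster.connect(keyspace)
      private val insert = session.prepare(
        "INSERT INTO tweets (statusId, hashtags, createdAt, message) VALUES (?, ?, ?, ?)")

      override def receive: Receive = {
        case t: Tweet =>
          // One write per incoming tweet; the round-robin router spreads the load over the pool.
          session.execute(insert.bind(
            java.lang.Long.valueOf(t.statusId), t.hashtags.asJava, t.createdAt, t.message))
        case StopScan =>
          context.stop(self)
      }
    }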
  12. “ I would like to stop being driven by the push and have a back-pressure mechanism:
    handle fast producers and slow consumers...
  13. “ I would like to persist the raw incoming messages, so that they can be replayed
    when new needs come up or when an existing bug gets fixed ;)
  14. Diagram: a Kafka cluster (Broker 1 and Broker 2, coordinated by Zookeeper) hosting
    test_topic; Producer 1 and Producer 2 publish to it, consumers G1_C1 and G1_C2 read it
    as consumer group GROUP1, and consumer G2_C1 reads it as consumer group GROUP2.
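    A minimal sketch (assuming the plain Kafka Java client; the local broker address and the "tweets-raw" topic name are assumptions) of publishing the raw tweet JSON so it can be replayed later:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The raw JSON is kept as the record value; the broker persists it in the topic's log.
    val rawTweetJson = """{"id": 951092652749021184, "text": "[...] #digital #trends"}"""
    producer.send(new ProducerRecord[String, String]("tweets-raw", rawTweetJson))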
  15. Diagram: a topic partition as an append-only log of offsets 0 to 15; Producer 1 and
    Producer 2 append at the end, while Group 1 reads at offset 7, Consumer 2 at offset 11
    and Consumer 3 at offset 13, each keeping its own position in the log.
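    A minimal sketch (same assumed broker and topic as above) of a consumer whose position in the log is tracked by its group.id; offsets are auto-committed by default:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "GROUP1") // each consumer group keeps its own committed offset
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("tweets-raw"))
    while (true) {
      // poll returns the records located after this group's current offset
      val records = consumer.poll(1000L)
      records.asScala.foreach(r => println(s"offset=${r.offset()} value=${r.value()}"))
    }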
  16. “ I would like to get recommendations on which tags to follow
    ... based on the ones I am already interested in
  17. Diagram: Spark architecture: the Driver Program hosts the Spark Context; the data source
    is split into partitions #1, #2, #3 ... #n, which are processed by the Executors running
    on the Worker Nodes.
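    A minimal sketch (using a local master purely for illustration) of the pieces named on the slide: the driver program creates the Spark Context, and an RDD is split into partitions that executor tasks on the worker nodes process in parallel:

    import org.apache.spark.{SparkConf, SparkContext}

    // Driver program: owns the SparkContext.
    val conf = new SparkConf().setAppName("smack-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The data source is split into 4 partitions, each handled by an executor task.
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    println(rdd.getNumPartitions) // 4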
  18. Batch, “Big Data”: e.g. relationships between hashtags, computed with Spark ML.
    Streaming, “Fast Data”: e.g. trending hashtags, computed with Spark Streaming.
  19. import org.apache.spark.mllib.fpm.FPGrowth

    // Mine frequent itemsets of hashtags, then derive association rules above a confidence threshold.
    val fpg = new FPGrowth()
      .setMinSupport(0.02)
      .setNumPartitions(10)
    val model = fpg.run(tweetsDS)

    val minConfidence = 0.8
    val associationRules = model.generateAssociationRules(minConfidence)
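    A minimal sketch (assuming an RDD of the hypothetical Tweet case class above, here called tweetsRdd) of how the tweetsDS input could be built: FPGrowth expects one "transaction" per tweet with no duplicate items, so each tweet becomes its distinct set of hashtags:

    import org.apache.spark.rdd.RDD

    // One transaction per tweet: the distinct, lower-cased hashtags it contains.
    def toTransactions(tweets: RDD[Tweet]): RDD[Array[String]] =
      tweets
        .map(_.hashtags.map(_.toLowerCase).toArray)
        .filter(_.nonEmpty)

    // val tweetsDS = toTransactions(tweetsRdd)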
  20. [ethereum] => [bitcoin], 0.8405797101449275
    [ethereum] => [blockchain], 0.9130434782608695
    [ethereum,blockchain] => [bitcoin], 0.8888888888888888
    [ethereum,cryptocurrency] => [bitcoin], 1.0
    [ethereum,cryptocurrency] => [blockchain], 0.9787234042553191
    [fintech] => [ai], 0.8170731707317073
    [datascientist] => [datascience], 0.9921875
    [datascientist,bigdata,datascience] => [iot], 0.825
    [bigdata,datascience] => [datascientist], 0.8602150537634409
    [bitcoin,cryptocurrency] => [blockchain], 0.8333333333333334
  21. stream
      .map(record => JsonMethods.parse(record.value()).extract[Tweet])
      ...
      .map((_, 1))
      .reduceByKeyAndWindow(_ + _, Duration(30000))
      .foreachRDD(rdd => {
        ...
        val toSend = rdd.sortBy(_._2, false)
          .map(rec => "{\"hashtag\":\"" + rec._1 + "\", \"count\" : " + rec._2 + "}")
          .take(10).mkString(",")
        // Publish the top-10 trending hashtags of the window back to Kafka.
        val producerRecord = new ProducerRecord[Integer, String]("tweets-trends", "[" + toSend + "]")
        kafkaProducer.send(producerRecord)
      })
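    The slides do not show how the stream itself is created; a minimal sketch, assuming the spark-streaming-kafka-0-10 integration, an existing StreamingContext ssc, and the same local broker and "tweets-raw" topic name as above:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "group.id"           -> "spark-trends",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer])

    // DStream of ConsumerRecord[String, String]; record.value() is the raw tweet JSON.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("tweets-raw"), kafkaParams))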
  22. “ OK, so 3 Cassandra servers + 3 Kafka + 3 Zookeeper + a few Akka nodes, and how
    many for Spark...? How am I going to manage all of that?