SMACK my data up!

Presentation "SMACK my data up!" given at SnowCamp.io 2018 by Logan Hauspie (@lhauspie) and Manuel Verriez (@mverriez).

Manuel Verriez

January 26, 2018

Transcript

  1. “ I want to capture and store every tweet that interests me! ...but I don't really know yet what I'm going to do with them...
  2. #blockchain #javascript #java #scala #deeplearning #keras #kubernetes #openshift #data #prometheus

    #grafana #oauth #iot #flink #apache #dataviz #continuousdelivery #elixir #angular #microservices #azure #aws #gcp #paxos #devtools #openfaas #zenibar #datascience #EE4J #virtualization #fnproject #geodata #bigdata #beanvalidation #golang #spark #mesos #akka #cassandra #kafka #chatbot #AI #laravel #vuejs #containers #nodejs #jenkins #elasticsearch #symfony #http2 #kotlin #dataprivacy #degooglisation #graphql #restful #snowcamp #grenoble
  3. [Diagram: Cassandra data model] Keyspace “Social_Network_Store” holds the “Tweets” table; row “tweet_1” has columns “message” = “content”, “hashtags” = “#1, #2”, “location” = “45.18, 5.68”; row “tweet_2” has columns “message” = “coucou”, “hashtags” = “#X, #Y”, “likes” = 98.
  4. "tweet": { "id": 951092652749021184, "createdAt": 1515593133000, "lang": "en", "user": {...}, "text": "[...] #digital #trends", "hashtagEntities": [ {"start":14, "end":22, "text":"digital"}, {"start":23, "end":30, "text":"trends"} ] }
     CREATE TABLE tweets.tweets ( statusId bigint, hashtags set<text>, createdAt timestamp, message text, author text, picture text, links set<text>, PRIMARY KEY (statusId) );
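    To make the mapping concrete, here is a minimal sketch of writing one such tweet into tweets.tweets with the DataStax Java driver (3.x) from Scala; the contact point and the subset of columns are assumptions made for illustration, not code from the deck:

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    // Sketch only: connect to a local node and insert one tweet (subset of the columns above).
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("tweets")

    val insert = session.prepare(
      "INSERT INTO tweets (statusId, createdAt, message, hashtags) VALUES (?, ?, ?, ?)")

    session.execute(insert.bind(
      java.lang.Long.valueOf(951092652749021184L),   // statusId  <- tweet.id
      new java.util.Date(1515593133000L),            // createdAt <- tweet.createdAt
      "[...] #digital #trends",                      // message   <- tweet.text
      Set("digital", "trends").asJava))              // hashtags  <- hashtagEntities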
  5. [Diagram: a Cassandra data center drawn as a token ring, with nodes at tokens 0 to 55 in steps of 5; sample rows “tweet_5” (“#Tag”, 2017-01-25), “tweet_12” (“#TT”, 2017-01-24), “tweet_15” (“#HT”, 2017-01-27) and “tweet_24” (“#Tag”, 2017-01-26) are spread over the ring according to their partition key.]
  6. [Diagram: zoom on the nodes at tokens 10, 15 and 20; the same rows (“tweet_5”, “#Tag”, 2017-01-25 / “tweet_12”, “#TT”, 2017-01-24 / “tweet_15”, “#HT”, 2017-01-27 / “tweet_24”, “#Tag”, 2017-01-26) appear on several nodes, i.e. each row is replicated.]
  7. CREATE TABLE tweets.tweets_by_hashtag ( hashtag text, statusId bigint, hashtags set<text>, createdAt timestamp, message text, author text, picture text, links set<text>, PRIMARY KEY (hashtag, createdAt, statusId) ) WITH CLUSTERING ORDER BY (createdAt DESC);
     Partition key: hashtag. Clustering key: createdAt, statusId.
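    That key layout serves one query: the most recent tweets for a given hashtag. A minimal read sketch (session is the one from the earlier sketch; the LIMIT is an arbitrary choice):

    import scala.collection.JavaConverters._

    // One hashtag = one partition, already sorted by createdAt DESC,
    // so the newest tweets come back first without any client-side sort.
    val rows = session.execute(
      "SELECT statusId, createdAt, message FROM tweets.tweets_by_hashtag WHERE hashtag = ? LIMIT 10",
      "snowcamp")
    rows.asScala.foreach(row => println(s"${row.getTimestamp("createdAt")} ${row.getString("message")}"))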
  8. [Diagram: the application (Scala) subscribes to the Twitter API (statuses/filter); tweets are pushed to an Akka actor “Tweet Reader”, which sends each Tweet as a message to the Akka “Tweet Writer” actor(s), which write into the storage layer (Cassandra).]
  9. [Diagram: plain Java concurrency vs Akka. Java: shared state, threads, locks, global/static variables, Runnable, synchronized / wait / notify. Akka: actors that communicate only by exchanging messages.]
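    To make the contrast concrete, here is a minimal illustrative actor (not from the deck): its state is private and only touched while it processes one message at a time, so no lock is needed anywhere.

    import akka.actor.{Actor, ActorSystem, Props}

    // State lives inside the actor; the outside world can only send it messages.
    class HashtagCounter extends Actor {
      private var counts = Map.empty[String, Int].withDefaultValue(0)

      override def receive: Receive = {
        case hashtag: String => counts += hashtag -> (counts(hashtag) + 1)
      }
    }

    val system  = ActorSystem("demo")
    val counter = system.actorOf(Props[HashtagCounter], "counter")
    counter ! "#snowcamp"   // no threads or locks to manage by hand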
  10. val tweetsWriter = system.actorOf(
        Props(new TweetToCassandraWriterActor(cluster, config.keyspace))
          .withRouter(new RoundRobinPool(5)), "router")
      val tweetReceiver = system.actorOf(Props(new TweetsReceiverActor(tweetsWriter)))
      tweetReceiver ! Seq("snowcamp")
  11. override def receive: Receive = {
        case query: Seq[String] => twitterStream.filter(query: _*)
        case StopScan =>
          println("Actor is going to stop itself...")
          tweetWriter ! StopScan
          context.stop(self)
      }
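      The writer side is not shown on the slides; a plausible sketch of what TweetToCassandraWriterActor might look like (the Tweet shape, the StopScan object and the CQL are assumptions made for illustration):

      import akka.actor.Actor
      import com.datastax.driver.core.Cluster
      import scala.collection.JavaConverters._

      case class Tweet(id: Long, createdAt: Long, text: String, hashtags: Set[String]) // assumed shape
      case object StopScan                                                              // control message

      // Hypothetical writer actor: one prepared statement, one INSERT per Tweet message.
      class TweetToCassandraWriterActor(cluster: Cluster, keyspace: String) extends Actor {
        private val session = cluster.connect(keyspace)
        private val insert = session.prepare(
          "INSERT INTO tweets (statusId, createdAt, message, hashtags) VALUES (?, ?, ?, ?)")

        override def receive: Receive = {
          case t: Tweet =>
            session.execute(insert.bind(
              java.lang.Long.valueOf(t.id),
              new java.util.Date(t.createdAt),
              t.text,
              t.hashtags.asJava))
          case StopScan =>
            session.close()
            context.stop(self)
        }
      }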
  12. “ I'd like to stop being on the receiving end of the “push” and have a “back-pressure” mechanism instead. Handle fast producers and slow consumers...
  13. “ I'd like to persist the raw incoming messages, so they can be replayed when new needs show up or when an existing bug gets fixed ;)
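    Kafka, introduced on the next slides, is the deck's answer: publish the raw tweet JSON to a topic and let every consumer read it at its own pace. A minimal producer sketch (broker address, topic name and serializers are assumptions):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",   "org.apache.kafka.common.serialization.IntegerSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[Integer, String](props)

    // Persist the raw JSON exactly as received, so it can be replayed later.
    def persistRaw(rawJson: String): Unit =
      producer.send(new ProducerRecord[Integer, String]("tweets", rawJson))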
  14. [Diagram: a Kafka cluster coordinated by Zookeeper, with Broker 1 and Broker 2 hosting test_topic; Producer 1 and Producer 2 write to the topic, consumer group GROUP1 (consumers G1_C1, G1_C2) and group GROUP2 (consumer G2_C1) read from it.]
  15. [Diagram: a topic partition with messages at offsets 0 to 15, appended by Producer 1 and Producer 2; Group 1 is at offset 7, Consumer 2 at offset 11 and Consumer 3 at offset 13, each tracking its own position.]
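    On the read side, each group keeps its own offset, which is exactly what the diagram shows. A minimal consumer sketch for a group named "group-1" (broker address, topic and group name are assumptions; assumes a recent kafka-clients version):

    import java.time.Duration
    import java.util.{Collections, Properties}
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "group-1")
    props.put("key.deserializer",   "org.apache.kafka.common.serialization.IntegerDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[Integer, String](props)
    consumer.subscribe(Collections.singletonList("tweets"))

    // Each poll resumes from this group's committed offset.
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.asScala.foreach(r => println(s"offset=${r.offset()} value=${r.value()}"))
    }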
  16. “ I'd like recommendations on which hashtags to follow... based on the ones I already care about.
  17. [Diagram: Spark architecture. The Driver Program holds the Spark Context; the data source is split into Partition #1, #2, #3, ..., #n, and each partition is processed by an Executor running on a Worker Node.]
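    Here the data source is Cassandra. A bootstrap sketch that reads the tweets table as an RDD, assuming the spark-cassandra-connector is on the classpath (application name and host are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .setAppName("smack-tweets")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Each Cassandra token range becomes a Spark partition, processed by an executor.
    val tweets = sc.cassandraTable("tweets", "tweets")
    println(tweets.count())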
  18. Batch: “Big Data”, e.g. relationships between hashtags with Spark ML. Streaming: “Fast Data”, e.g. computing trends with Spark Streaming.
  19. val fpg = new FPGrowth()
        .setMinSupport(0.02)
        .setNumPartitions(10)
      val model = fpg.run(tweetsDS)
      val minConfidence = 0.8
      val associationRules = model.generateAssociationRules(minConfidence)
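      FPGrowth.run expects an RDD of transactions; here a transaction is the set of hashtags of one tweet. A sketch of how tweetsDS could be built from the tweets RDD of the earlier sketch, and of printing the rules shown on the next slide (the column name and lower-casing are assumptions):

      import org.apache.spark.mllib.fpm.FPGrowth   // import assumed by the snippet above

      // One "transaction" per tweet: its distinct, lower-cased hashtags.
      val tweetsDS = tweets
        .map(row => row.getSet[String]("hashtags").map(_.toLowerCase).toArray)
        .filter(_.nonEmpty)
        .cache()

      // Printing the rules produces the kind of output shown on the next slide.
      associationRules.collect().foreach { rule =>
        println(s"[${rule.antecedent.mkString(",")}] => [${rule.consequent.mkString(",")}], ${rule.confidence}")
      }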
  20. [ethereum] => [bitcoin], 0.8405797101449275
      [ethereum] => [blockchain], 0.9130434782608695
      [ethereum,blockchain] => [bitcoin], 0.8888888888888888
      [ethereum,cryptocurrency] => [bitcoin], 1.0
      [ethereum,cryptocurrency] => [blockchain], 0.9787234042553191
      [fintech] => [ai], 0.8170731707317073
      [datascientist] => [datascience], 0.9921875
      [datascientist,bigdata,datascience] => [iot], 0.825
      [bigdata,datascience] => [datascientist], 0.8602150537634409
      [bitcoin,cryptocurrency] => [blockchain], 0.8333333333333334
  21. stream
        .map(record => JsonMethods.parse(record.value()).extract[Tweet])
        ...
        .map((_, 1))
        .reduceByKeyAndWindow(_ + _, Duration(30000))
        .foreachRDD(rdd => {
          ...
          val toSend = rdd.sortBy(_._2, false)
            .map(rec => "{\"hashtag\":\"" + rec._1 + "\", \"count\" : " + rec._2 + "}")
            .take(10).mkString(",")
          val producerRecord = new ProducerRecord[Integer, String]("tweets-trends", "[" + toSend + "]")
          kafkaProducer.send(producerRecord)
        })
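      The stream value used above is a Kafka direct stream; a sketch of the wiring, assuming spark-streaming-kafka-0-10 (conf is the SparkConf from the earlier sketch; batch interval, topic, group id and broker address are placeholders):

      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka010.KafkaUtils
      import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
      import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

      val ssc = new StreamingContext(conf, Seconds(10))

      val kafkaParams = Map[String, Object](
        "bootstrap.servers"  -> "localhost:9092",
        "group.id"           -> "trends",
        "key.deserializer"   -> classOf[org.apache.kafka.common.serialization.IntegerDeserializer],
        "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer])

      val stream = KafkaUtils.createDirectStream[Integer, String](
        ssc, PreferConsistent, Subscribe[Integer, String](Array("tweets"), kafkaParams))

      // ... the transformations from the snippet above go here, then:
      ssc.start()
      ssc.awaitTermination()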
  22. “ OK, so 3 Cassandra servers + 3 Kafka + 3 Zookeeper + a few Akka nodes, and how many for Spark...? How am I going to manage all of that?