SMACK my data up!

Presentation "SMACK my data up!" given at SnowCamp.io 2018 by Logan Hauspie (@lhauspie) and Manuel Verriez (@mverriez).

Manuel Verriez

January 26, 2018

Transcript

  1. “ I want to capture and store every tweet that interests me! ...but I don't really know yet what I'm going to do with them...
  2. #blockchain #javascript #java #scala #deeplearning #keras #kubernetes #openshift #data #prometheus

    #grafana #oauth #iot #flink #apache #dataviz #continuousdelivery #elixir #angular #microservices #azure #aws #gcp #paxos #devtools #openfaas #zenibar #datascience #EE4J #virtualization #fnproject #geodata #bigdata #beanvalidation #golang #spark #mesos #akka #cassandra #kafka #chatbot #AI #laravel #vuejs #containers #nodejs #jenkins #elasticsearch #symfony #http2 #kotlin #dataprivacy #degooglisation #graphql #restful #snowcamp #grenoble
  3. [Diagram: Cassandra data model] Keyspace “Social_Network_Store” holds the “Tweets” table; row “tweet_1” has columns “message” = “content”, “hashtags” = “#1, #2”, “location” = “45.18, 5.68”; row “tweet_2” has columns “message” = “coucou”, “hashtags” = “#X, #Y”, “likes” = 98.
  4. "tweet": { "id": 951092652749021184, "createdAt": 1515593133000, "lang": "en", "user": {...}, "text": "[...] #digital #trends", "hashtagEntities": [ {"start":14, "end":22, "text":"digital"}, {"start":23, "end":30, "text":"trends"} ] }
     CREATE TABLE tweets.tweets ( statusId bigint, hashtags set<text>, createdAt timestamp, message text, author text, picture text, links set<text>, PRIMARY KEY (statusId) );
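    To make the mapping concrete, here is a minimal sketch of writing one such tweet into tweets.tweets with the DataStax Java driver (3.x) from Scala; the contact point and the subset of columns are assumptions made for illustration, not code from the deck:

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    // Sketch only: connect to a local node and insert one tweet (subset of the columns above).
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("tweets")

    val insert = session.prepare(
      "INSERT INTO tweets (statusId, createdAt, message, hashtags) VALUES (?, ?, ?, ?)")

    session.execute(insert.bind(
      java.lang.Long.valueOf(951092652749021184L),   // statusId  <- tweet.id
      new java.util.Date(1515593133000L),            // createdAt <- tweet.createdAt
      "[...] #digital #trends",                      // message   <- tweet.text
      Set("digital", "trends").asJava))              // hashtags  <- hashtagEntities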
  5. [Diagram: a Cassandra data center drawn as a token ring, with nodes at tokens 0 to 55 in steps of 5; sample rows “tweet_5” (“#Tag”, 2017-01-25), “tweet_12” (“#TT”, 2017-01-24), “tweet_15” (“#HT”, 2017-01-27) and “tweet_24” (“#Tag”, 2017-01-26) are spread over the ring according to their partition key.]
  6. [Diagram: zoom on the nodes at tokens 10, 15 and 20; the same rows (“tweet_5”, “#Tag”, 2017-01-25 / “tweet_12”, “#TT”, 2017-01-24 / “tweet_15”, “#HT”, 2017-01-27 / “tweet_24”, “#Tag”, 2017-01-26) appear on several nodes, i.e. each row is replicated.]
  7. CREATE TABLE tweets.tweets_by_hashtag ( hashtag text, statusId bigint, hashtags set<text>, createdAt timestamp, message text, author text, picture text, links set<text>, PRIMARY KEY (hashtag, createdAt, statusId) ) WITH CLUSTERING ORDER BY (createdAt DESC);
     Partition key: hashtag. Clustering key: createdAt, statusId.
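    That key layout serves one query: the most recent tweets for a given hashtag. A minimal read sketch (session is the one from the earlier sketch; the LIMIT is an arbitrary choice):

    import scala.collection.JavaConverters._

    // One hashtag = one partition, already sorted by createdAt DESC,
    // so the newest tweets come back first without any client-side sort.
    val rows = session.execute(
      "SELECT statusId, createdAt, message FROM tweets.tweets_by_hashtag WHERE hashtag = ? LIMIT 10",
      "snowcamp")
    rows.asScala.foreach(row => println(s"${row.getTimestamp("createdAt")} ${row.getString("message")}"))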
  8. [Diagram: the application (Scala) subscribes to the Twitter API (statuses/filter); tweets are pushed to an Akka actor “Tweet Reader”, which sends each Tweet as a message to the Akka “Tweet Writer” actor(s), which write into the storage layer (Cassandra).]
  9. [Diagram: plain Java concurrency vs Akka. Java: shared state, threads, locks, global/static variables, Runnable, synchronized / wait / notify. Akka: actors that communicate only by exchanging messages.]
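    To make the contrast concrete, here is a minimal illustrative actor (not from the deck): its state is private and only touched while it processes one message at a time, so no lock is needed anywhere.

    import akka.actor.{Actor, ActorSystem, Props}

    // State lives inside the actor; the outside world can only send it messages.
    class HashtagCounter extends Actor {
      private var counts = Map.empty[String, Int].withDefaultValue(0)

      override def receive: Receive = {
        case hashtag: String => counts += hashtag -> (counts(hashtag) + 1)
      }
    }

    val system  = ActorSystem("demo")
    val counter = system.actorOf(Props[HashtagCounter], "counter")
    counter ! "#snowcamp"   // no threads or locks to manage by hand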
  10. val tweetsWriter = system.actorOf(
        Props(new TweetToCassandraWriterActor(cluster, config.keyspace))
          .withRouter(new RoundRobinPool(5)), "router")
      val tweetReceiver = system.actorOf(Props(new TweetsReceiverActor(tweetsWriter)))
      tweetReceiver ! Seq("snowcamp")
  11. override def receive: Receive = {
        case query: Seq[String] => twitterStream.filter(query: _*)
        case StopScan =>
          println("Actor is going to stop itself...")
          tweetWriter ! StopScan
          context.stop(self)
      }
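      The writer side is not shown on the slides; a plausible sketch of what TweetToCassandraWriterActor might look like (the Tweet shape, the StopScan object and the CQL are assumptions made for illustration):

      import akka.actor.Actor
      import com.datastax.driver.core.Cluster
      import scala.collection.JavaConverters._

      case class Tweet(id: Long, createdAt: Long, text: String, hashtags: Set[String]) // assumed shape
      case object StopScan                                                              // control message

      // Hypothetical writer actor: one prepared statement, one INSERT per Tweet message.
      class TweetToCassandraWriterActor(cluster: Cluster, keyspace: String) extends Actor {
        private val session = cluster.connect(keyspace)
        private val insert = session.prepare(
          "INSERT INTO tweets (statusId, createdAt, message, hashtags) VALUES (?, ?, ?, ?)")

        override def receive: Receive = {
          case t: Tweet =>
            session.execute(insert.bind(
              java.lang.Long.valueOf(t.id),
              new java.util.Date(t.createdAt),
              t.text,
              t.hashtags.asJava))
          case StopScan =>
            session.close()
            context.stop(self)
        }
      }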
  12. “ I'd like to stop being on the receiving end of the “push” and have a “back-pressure” mechanism instead. Handle fast producers and slow consumers...
  13. “ I'd like to persist the raw incoming messages, so they can be replayed when new needs show up or when an existing bug gets fixed ;)
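    Kafka, introduced on the next slides, is the deck's answer: publish the raw tweet JSON to a topic and let every consumer read it at its own pace. A minimal producer sketch (broker address, topic name and serializers are assumptions):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",   "org.apache.kafka.common.serialization.IntegerSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[Integer, String](props)

    // Persist the raw JSON exactly as received, so it can be replayed later.
    def persistRaw(rawJson: String): Unit =
      producer.send(new ProducerRecord[Integer, String]("tweets", rawJson))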
  14. [Diagram: a Kafka cluster coordinated by Zookeeper, with Broker 1 and Broker 2 hosting test_topic; Producer 1 and Producer 2 write to the topic, consumer group GROUP1 (consumers G1_C1, G1_C2) and group GROUP2 (consumer G2_C1) read from it.]
  15. [Diagram: a topic partition with messages at offsets 0 to 15, appended by Producer 1 and Producer 2; Group 1 is at offset 7, Consumer 2 at offset 11 and Consumer 3 at offset 13, each tracking its own position.]
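    On the read side, each group keeps its own offset, which is exactly what the diagram shows. A minimal consumer sketch for a group named "group-1" (broker address, topic and group name are assumptions; assumes a recent kafka-clients version):

    import java.time.Duration
    import java.util.{Collections, Properties}
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "group-1")
    props.put("key.deserializer",   "org.apache.kafka.common.serialization.IntegerDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[Integer, String](props)
    consumer.subscribe(Collections.singletonList("tweets"))

    // Each poll resumes from this group's committed offset.
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.asScala.foreach(r => println(s"offset=${r.offset()} value=${r.value()}"))
    }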
  16. “ I'd like recommendations on which hashtags to follow... based on the ones I already care about.
  17. [Diagram: Spark architecture. The Driver Program holds the Spark Context; the data source is split into Partition #1, #2, #3, ..., #n, and each partition is processed by an Executor running on a Worker Node.]
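    Here the data source is Cassandra. A bootstrap sketch that reads the tweets table as an RDD, assuming the spark-cassandra-connector is on the classpath (application name and host are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .setAppName("smack-tweets")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Each Cassandra token range becomes a Spark partition, processed by an executor.
    val tweets = sc.cassandraTable("tweets", "tweets")
    println(tweets.count())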
  18. Batch: “Big Data”, e.g. relationships between hashtags with Spark ML. Streaming: “Fast Data”, e.g. computing trends with Spark Streaming.
  19. val fpg = new FPGrowth()
        .setMinSupport(0.02)
        .setNumPartitions(10)
      val model = fpg.run(tweetsDS)
      val minConfidence = 0.8
      val associationRules = model.generateAssociationRules(minConfidence)
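      FPGrowth.run expects an RDD of transactions; here a transaction is the set of hashtags of one tweet. A sketch of how tweetsDS could be built from the tweets RDD of the earlier sketch, and of printing the rules shown on the next slide (the column name and lower-casing are assumptions):

      import org.apache.spark.mllib.fpm.FPGrowth   // import assumed by the snippet above

      // One "transaction" per tweet: its distinct, lower-cased hashtags.
      val tweetsDS = tweets
        .map(row => row.getSet[String]("hashtags").map(_.toLowerCase).toArray)
        .filter(_.nonEmpty)
        .cache()

      // Printing the rules produces the kind of output shown on the next slide.
      associationRules.collect().foreach { rule =>
        println(s"[${rule.antecedent.mkString(",")}] => [${rule.consequent.mkString(",")}], ${rule.confidence}")
      }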
  20. [ethereum] => [bitcoin], 0.8405797101449275
      [ethereum] => [blockchain], 0.9130434782608695
      [ethereum,blockchain] => [bitcoin], 0.8888888888888888
      [ethereum,cryptocurrency] => [bitcoin], 1.0
      [ethereum,cryptocurrency] => [blockchain], 0.9787234042553191
      [fintech] => [ai], 0.8170731707317073
      [datascientist] => [datascience], 0.9921875
      [datascientist,bigdata,datascience] => [iot], 0.825
      [bigdata,datascience] => [datascientist], 0.8602150537634409
      [bitcoin,cryptocurrency] => [blockchain], 0.8333333333333334
  21. stream
        .map(record => JsonMethods.parse(record.value()).extract[Tweet])
        ...
        .map((_, 1))
        .reduceByKeyAndWindow(_ + _, Duration(30000))
        .foreachRDD(rdd => {
          ...
          val toSend = rdd.sortBy(_._2, false)
            .map(rec => "{\"hashtag\":\"" + rec._1 + "\", \"count\" : " + rec._2 + "}")
            .take(10).mkString(",")
          val producerRecord = new ProducerRecord[Integer, String]("tweets-trends", "[" + toSend + "]")
          kafkaProducer.send(producerRecord)
        })
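      The stream value used above is a Kafka direct stream; a sketch of the wiring, assuming spark-streaming-kafka-0-10 (conf is the SparkConf from the earlier sketch; batch interval, topic, group id and broker address are placeholders):

      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka010.KafkaUtils
      import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
      import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

      val ssc = new StreamingContext(conf, Seconds(10))

      val kafkaParams = Map[String, Object](
        "bootstrap.servers"  -> "localhost:9092",
        "group.id"           -> "trends",
        "key.deserializer"   -> classOf[org.apache.kafka.common.serialization.IntegerDeserializer],
        "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer])

      val stream = KafkaUtils.createDirectStream[Integer, String](
        ssc, PreferConsistent, Subscribe[Integer, String](Array("tweets"), kafkaParams))

      // ... the transformations from the snippet above go here, then:
      ssc.start()
      ssc.awaitTermination()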
  22. “ OK, so 3 Cassandra servers + 3 Kafka + 3 Zookeeper + a few Akka nodes, and how many for Spark...? How am I going to manage all of that?