Slide 1

Getting Started with Data Analysis with PHP and Apache Spark / yuuki takezawa / PHP Conference 2018

Slide 2

Profile • Yuuki Takezawa / ytake • CTO, istyle Inc. • PHP, Hack, Go, Scala • Apache Hadoop, Apache Spark, Apache Kafka • twitter https://twitter.com/ex_takezawa • facebook https://www.facebook.com/yuuki.takezawa • github https://github.com/ytake

Slide 3

No content

Slide 4

No content

Slide 5

Agenda • Apache Spark Introduction • For Web Application • Apache Spark + Elasticsearch + MLlib

Slide 6

Laravel Spark?

Slide 7

Apache Spark?

Slide 8

Apache Spark • A framework for running distributed processing over many kinds of data • Java, Python, Scala: the API is the same whichever language you choose • Spark processing itself cannot be implemented in PHP

Slide 9

Apache Spark • Spark SQL: join data from many sources, made fast through distributed processing • Spark MLlib: machine learning • Spark GraphX: graph processing • Spark Streaming: stream processing / micro-batching

Slide 10

Example of the areas each component covers. Source: "Spark Streaming overview and test scenarios", Think IT. https://thinkit.co.jp/article/9958

Slide 11

When to use Apache Spark • The data involved is large (stored in Hadoop, Cassandra, and the like) • You need near-real-time processing over large data (recommendations, for example) • You have already built (or plan to build) a Hadoop-like environment • Databases are split up, as in microservices, and you want to create something new out of them

Slide 12

Not the same thing as a database

Slide 13

Apache Spark with DataSource

Slide 14

Apache Spark with Data • Apache Spark itself runs even without a Hadoop environment • Two patterns combine: use it as a query engine that the application queries directly, and run batch-like jobs that write the various processing results to databases suited to the application

Slide 15

Apache Spark with Data

protected def sparkReadJdbcDatabase(spark: SparkSession): DataFrame = {
  spark.read.format("jdbc").options(
    Map("url" -> connect, "dbtable" -> targetTable(), "driver" -> jdbcDriver))
    .load
}

protected def execute(spark: SparkSession, df: DataFrame): Unit = {
  df.count match {
    case 0L => spark.stop
    case _ =>
      SparkInsertExecutor.saveToCassandra(
        df,
        config.cassandraKeyspace(),
        config.cassandraHosts(),
        Map("contentTable" -> config.cassandraContentTable())
      )
  }
}

Slide 16

Spark Streaming

Slide 17

Spark Streaming • A combination that is easy to adopt in a web application: Spark Streaming + Kafka / Kinesis • Some message or event triggers the analysis job, which then writes to a database • Stream-capable

Slide 18

DStream / Batch. Source: "Spark Streaming", THIRD EYES. https://thirdeyedata.io/spark-streaming/

Slide 19

Web Application Event Sourcing + CQRS

Slide 20

CQRS. Source: "A few myths about CQRS", Ouarzy's Blog. http://www.ouarzy.com/2016/10/02/a-few-myths-about-cqrs/

Slide 21

Event Sourcing / Command example

Slide 22

Event Sourcing / Command example: the application sends an event

Slide 23

Event Sourcing / Command example: receive the message and transform the values

Slide 24

Event Sourcing / Command example: write to an RDBMS, a NoSQL store, or whatever fits the use case

Slide 25

Event Sourcing / Command example: other applications read only the database and know nothing about how the values were produced

Slide 26

Event Sourcing / Command example

Slide 27

{ "contents": "answers", "identifier": 7854395, "operator": "9999", "operatorType": "SYSTEM", "tags": [ 48353 ] }

Slide 28

val schema = StructType(
  StructField("contents", StringType, true) ::
  StructField("identifier", IntegerType, true) ::
  StructField("operator", StringType, true) ::
  StructField("operatorType", StringType, true) ::
  StructField("tags", ArrayType(IntegerType, true), true) :: Nil)

Slide 29

About the runtime environment

Slide 30

Cluster Management • Spark Standalone • Apache Mesos • Hadoop YARN (Yet Another Resource Negotiator) • Kubernetes

Slide 31

Apache Spark: "I want to run this kind of job!"

Slide 32

Apache Spark: the resources the job needs are allocated

Slide 33

Apache Spark: the job is executed

Slide 34

Apache Spark with PHP • PHP cannot drive Spark Streaming and the like • Send values in via Kafka or similar, and implement the Spark-side processing in another language • You can query tables created with Spark SQL and use Spark as a distributed query engine (much as you would use Presto or Hive)

Slide 35

Apache Spark with PHP: "I want to query the distributed query engine!" • Apache Thrift • composer require apache/thrift • start thriftserver • beeline

Slide 36

A concrete example of introducing it into an application

Slide 37

Creating a Scalable Recommender with Apache Spark & Elasticsearch

Slide 38

Flow • Fetch data from CSV files and databases (review scores, movie data, and so on) • Preprocess with Apache Spark so the separate datasets can be combined and processed together • After preprocessing, write to Elasticsearch and train • Store the trained model in Elasticsearch

Slide 39

Load Data

PATH_TO_DATA = "../data/ml-latest-small"
ratings = spark.read.csv(PATH_TO_DATA + "/ratings.csv", header=True, inferSchema=True)
ratings.cache()
print("Number of ratings: %i" % ratings.count())
print("Sample of ratings:")
ratings.show(5)

Slide 40

Load Data

Number of ratings: 100836
Sample of ratings:
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows

Slide 41

Load Data • Shape data that was never built for analytics in the first place • Do not add analytics-only columns to the database itself • Access logs alone are not enough • With large data, if you must run against an RDBMS, always split the work off to a dedicated database server (a database still in use by the application can otherwise suffer an unexpected incident)

Slide 42

Load Data

Raw movie data:
+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
+-------+----------------------------------+-------------------------------------------+
only showing top 5 rows

Movie title and release year are stored together; genres are pipe-delimited.

Slide 43

Load Data

import re

def extract_year_fn(title):
    result = re.search("\(\d{4}\)", title)
    try:
        if result:
            group = result.group()
            year = group[1:-1]
            start_pos = result.start()
            title = title[:start_pos-1]
            return (title, year)
        else:
            return (title, 1970)
    except:
        print(title)

extract_year = udf(extract_year_fn,\
    StructType([StructField("title", StringType(), True),\
                StructField("release_date", StringType(), True)]))
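The extraction logic inside the UDF is plain Python and can be sanity-checked without Spark. A minimal sketch of the same regex (here the fallback year is returned as the string "1970" for type consistency; the slide returns the int 1970):

```python
import re

def extract_year(title):
    # pull a trailing "(YYYY)" off the title, as the slide's UDF does
    result = re.search(r"\(\d{4}\)", title)
    if result:
        year = result.group()[1:-1]          # strip the parentheses
        return (title[: result.start() - 1], year)
    return (title, "1970")                   # fallback when no year is present

print(extract_year("Toy Story (1995)"))   # ('Toy Story', '1995')
print(extract_year("Hamlet"))             # ('Hamlet', '1970')
```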

Slide 44

Load Data

link_data = spark.read.csv(PATH_TO_DATA + "/links.csv", header=True, inferSchema=True)

# join movies with links to get TMDB id
movie_data = movies.join(link_data, movies.movieId == link_data.movieId)\
    .select(movies.movieId, movies.title, movies.release_date, movies.genres, link_data.tmdbId)

num_movies = movie_data.count()

print("Cleaned movie data with tmdbId links:")
movie_data.show(5, truncate=False)

Slide 45

Load Data

Cleaned movie data with tmdbId links:
+-------+---------------------------+------------+-------------------------------------------------+------+
|movieId|title                      |release_date|genres                                           |tmdbId|
+-------+---------------------------+------------+-------------------------------------------------+------+
|1      |Toy Story                  |1995        |[adventure, animation, children, comedy, fantasy]|862   |
|2      |Jumanji                    |1995        |[adventure, children, fantasy]                   |8844  |
|3      |Grumpier Old Men           |1995        |[comedy, romance]                                |15602 |
|4      |Waiting to Exhale          |1995        |[comedy, drama, romance]                         |31357 |
|5      |Father of the Bride Part II|1995        |[comedy]                                         |11862 |
+-------+---------------------------+------------+-------------------------------------------------+------+
only showing top 5 rows

The data is split apart and joined with other data sources.
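Between the raw and cleaned tables, the pipe-delimited genres string becomes a lowercased array. The core transformation is equivalent to this plain-Python sketch (in Spark it would use column functions such as split and lower):

```python
raw_genres = "Adventure|Animation|Children|Comedy|Fantasy"

# lowercase, then split on the pipe delimiter
genres = raw_genres.lower().split("|")
print(genres)   # ['adventure', 'animation', 'children', 'comedy', 'fantasy']
```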

Slide 46

Machine Learning • Uses ALS (Alternating Least Squares), trained with Spark MLlib

Slide 47

fitting

from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          regParam=0.01, rank=20, seed=12)
model = als.fit(ratings_from_es)
model.userFactors.show(5)
model.itemFactors.show(5)

Slide 48

fitting

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[-1.2723869, -1.0...|
| 20|[-0.28433216, -0....|
| 30|[-0.23606953, 0.6...|
| 40|[0.31492335, -0.0...|
| 50|[0.006268716, 0.0...|
+---+--------------------+
only showing top 5 rows

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[0.07222334, 0.37...|
| 20|[-0.40369913, 0.5...|
| 30|[-0.65819603, -0....|
| 40|[-0.2619177, 0.49...|
| 50|[-0.46155798, 0.1...|
+---+--------------------+
only showing top 5 rows
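Each row above is a latent factor vector; ALS predicts a rating as the dot product of a user vector and an item vector. A toy illustration with made-up 3-dimensional factors (the real ones have rank=20 entries):

```python
# hypothetical factor vectors, not taken from the actual model
user_factor = [0.31, -0.05, 1.2]
item_factor = [0.07, 0.37, 0.5]

# predicted rating = u . v
predicted = sum(u * v for u, v in zip(user_factor, item_factor))
print(round(predicted, 4))   # 0.6032
```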

Slide 49

Elasticsearch Query "query": { "function_score": { "query" : { "query_string": { "query": q } }, "script_score": { "script": { "inline": "payload_vector_score", "lang": "native", "params": { "field": "@model.factor", "vector": query_vec, "cosine" : cosine } } }, "boost_mode": "replace" } }

Slide 50

Example { "took": 60, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 111188, "max_score": 1.0, "hits": [ { "_index": "demo", "_type": "ratings", "_id": "AWeY8pNfmDlwHiSe3lD3", "_score": 1.0, "_source": { "userId": 338, "movieId": 190215, "rating": 1.5, "timestamp": 1530148477000 } } ] } }

Slide 51

Apache Spark with Data: preprocessing based on various data sources; storing the data and the model. ALS from Application.

Slide 52

example

Slide 53

Summary

Slide 54

Summary • Understand how the machinery really works • Engineers, too, can contribute strongly to the service • There is more than one way to do this: take up the challenge that fits your application and its requirements • https://github.com/IBM/elasticsearch-spark-recommender