
PHPとApache Sparkで始めるデータ解析処理 / php-with-apache-spark

yuuki takezawa
December 11, 2018

Transcript

  1. Profile
     • Yuuki Takezawa / ytake
     • CTO, istyle Inc.
     • PHP, Hack, Go, Scala
     • Apache Hadoop, Apache Spark, Apache Kafka
     • twitter: https://twitter.com/ex_takezawa
     • facebook: https://www.facebook.com/yuuki.takezawa
     • github: https://github.com/ytake
  2. Apache Spark
     • Spark SQL: joins data from many different sources; fast through distributed processing
     • Spark MLlib: machine learning
     • Spark GraphX: graph processing
     • Spark Streaming: stream processing / micro-batching
  3. When to use Apache Spark
     • The data you handle is large (alongside Hadoop, Cassandra, etc.)
     • You need near-real-time processing over large data (recommendations, etc.)
     • You have already built (or plan to build) a Hadoop-like environment
     • Your databases are split up, e.g. across microservices, and you want to
       combine them to create something new
  4. Apache Spark with Data
     • Apache Spark itself runs without a Hadoop environment
     • Query-engine use: applications query Spark directly and fetch results
     • Batch-like use: combinations of jobs that write their various processing
       results to databases suited to each application
  5. Apache Spark with Data

     protected def sparkReadJdbcDatabase(spark: SparkSession): DataFrame = {
       spark.read.format("jdbc").options(
         Map(
           "url" -> connect,
           "dbtable" -> targetTable(),
           "driver" -> jdbcDriver))
         .load
     }

     protected def execute(spark: SparkSession, df: DataFrame): Unit = {
       df.count match {
         case 0L => spark.stop
         case _ =>
           SparkInsertExecutor.saveToCassandra(
             df,
             config.cassandraKeyspace(),
             config.cassandraHosts(),
             Map("contentTable" -> config.cassandraContentTable())
           )
       }
     }
  6. Spark Streaming
     • A combination that is easy to adopt in web applications:
       Spark Streaming + Kafka / Kinesis
     • Triggered by a message or event, run the analysis and write the
       result to a database
     • Stream-ready
  7. val schema = StructType(
       StructField("contents", StringType, true) ::
       StructField("identifier", IntegerType, true) ::
       StructField("operator", StringType, true) ::
       StructField("operatorType", StringType, true) ::
       StructField("tags", ArrayType(IntegerType, true), true) :: Nil)
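The StructType above describes the events arriving on the stream. A JSON message matching it might look like the following; the message content here is a hypothetical example (not from the deck), checked against the schema's field types with plain Python:

```python
import json

# Hypothetical event matching the StructType on the slide above:
# contents: String, identifier: Integer, operator: String,
# operatorType: String, tags: Array[Integer]
raw = '''{
  "contents": "new review posted",
  "identifier": 1001,
  "operator": "ytake",
  "operatorType": "user",
  "tags": [3, 7, 42]
}'''

event = json.loads(raw)

# Verify each field against the schema's declared types before processing
assert isinstance(event["contents"], str)
assert isinstance(event["identifier"], int)
assert isinstance(event["tags"], list)
assert all(isinstance(t, int) for t in event["tags"])
print(event["identifier"], event["tags"])
```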
  8. Cluster Management
     • Spark Standalone
     • Apache Mesos
     • Hadoop YARN (Yet Another Resource Negotiator)
     • Kubernetes
  9. Apache Spark with PHP
     • PHP cannot drive Spark Streaming and similar APIs directly
     • Send values in via Kafka or the like; implement the Spark side in
       another language
     • Querying tables created with Spark SQL, i.e. using Spark as a
       distributed query engine, is possible; it can be used much like
       Presto or Hive
  10. Load Data

      PATH_TO_DATA = "../data/ml-latest-small"
      ratings = spark.read.csv(PATH_TO_DATA + "/ratings.csv",
                               header=True, inferSchema=True)
      ratings.cache()
      print("Number of ratings: %i" % ratings.count())
      print("Sample of ratings:")
      ratings.show(5)
  11. Load Data

      Number of ratings: 100836
      Sample of ratings:
      +------+-------+------+---------+
      |userId|movieId|rating|timestamp|
      +------+-------+------+---------+
      |     1|      1|   4.0|964982703|
      |     1|      3|   4.0|964981247|
      |     1|      6|   4.0|964982224|
      |     1|     47|   5.0|964983815|
      |     1|     50|   5.0|964982931|
      +------+-------+------+---------+
      only showing top 5 rows
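The `inferSchema=True` option in the load step is why these columns come back as integers and doubles rather than strings. Conceptually, for one ratings row, the inference amounts to (plain-Python illustration, not Spark's actual implementation):

```python
import csv
import io

# One header plus one row in the shape of ratings.csv
sample = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"
reader = csv.DictReader(io.StringIO(sample))
row = next(reader)  # every value is a string at this point

# The types schema inference arrives at for this file:
# integer ids, a double rating, a long timestamp
typed = {
    "userId": int(row["userId"]),
    "movieId": int(row["movieId"]),
    "rating": float(row["rating"]),
    "timestamp": int(row["timestamp"]),
}
print(typed)
```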
  12. Load Data
      • Reshape data that was not designed for analysis use
      • Do not add analysis-only columns to the database itself
      • Access logs alone, and the like, are not enough
      • With large data, if you must run against an RDBMS, always split it
        out onto a dedicated database server
        * if the application is also using that database, you invite
          unexpected incidents
  13. Load Data

      Raw movie data:
      +-------+----------------------------------+-------------------------------------------+
      |movieId|title                             |genres                                     |
      +-------+----------------------------------+-------------------------------------------+
      |1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
      |2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
      |3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
      |4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
      |5      |Father of the Bride Part II (1995)|Comedy                                     |
      +-------+----------------------------------+-------------------------------------------+
      only showing top 5 rows

      The movie title and release year are stuck together, and the genres
      are pipe-separated.
  14. Load Data

      import re
      from pyspark.sql.functions import udf
      from pyspark.sql.types import StructType, StructField, StringType

      def extract_year_fn(title):
          result = re.search(r"\(\d{4}\)", title)
          try:
              if result:
                  group = result.group()
                  year = group[1:-1]
                  start_pos = result.start()
                  title = title[:start_pos - 1]
                  return (title, year)
              else:
                  return (title, "1970")
          except:
              print(title)

      extract_year = udf(extract_year_fn,
                         StructType([StructField("title", StringType(), True),
                                     StructField("release_date", StringType(), True)]))
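The function behind the UDF is plain Python, so it can be exercised without a Spark session. A simplified local copy of the same logic, applied to a sample title:

```python
import re

def extract_year_fn(title):
    # Split "Title (YYYY)" into ("Title", "YYYY");
    # fall back to 1970 when no year is present
    result = re.search(r"\(\d{4}\)", title)
    if result:
        year = result.group()[1:-1]          # strip the parentheses
        return (title[:result.start() - 1], year)
    return (title, "1970")

print(extract_year_fn("Toy Story (1995)"))   # ('Toy Story', '1995')
```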
  15. Load Data

      link_data = spark.read.csv(PATH_TO_DATA + "/links.csv",
                                 header=True, inferSchema=True)

      # join movies with links to get TMDB id
      movie_data = movies.join(link_data, movies.movieId == link_data.movieId)\
          .select(movies.movieId, movies.title, movies.release_date,
                  movies.genres, link_data.tmdbId)

      num_movies = movie_data.count()

      print("Cleaned movie data with tmdbId links:")
      movie_data.show(5, truncate=False)
  16. Load Data

      Cleaned movie data with tmdbId links:
      +-------+---------------------------+------------+-------------------------------------------------+------+
      |movieId|title                      |release_date|genres                                           |tmdbId|
      +-------+---------------------------+------------+-------------------------------------------------+------+
      |1      |Toy Story                  |1995        |[adventure, animation, children, comedy, fantasy]|862   |
      |2      |Jumanji                    |1995        |[adventure, children, fantasy]                   |8844  |
      |3      |Grumpier Old Men           |1995        |[comedy, romance]                                |15602 |
      |4      |Waiting to Exhale          |1995        |[comedy, drama, romance]                         |31357 |
      |5      |Father of the Bride Part II|1995        |[comedy]                                         |11862 |
      +-------+---------------------------+------------+-------------------------------------------------+------+
      only showing top 5 rows

      The data has been split into columns and joined with another data source.
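The join above matches movies and links on movieId. The same operation in miniature, on plain Python dicts with sample values taken from the output shown:

```python
# Miniature version of the movies-links join on movieId
movies = [
    {"movieId": 1, "title": "Toy Story", "release_date": "1995"},
    {"movieId": 2, "title": "Jumanji", "release_date": "1995"},
]
links = [
    {"movieId": 1, "tmdbId": 862},
    {"movieId": 2, "tmdbId": 8844},
]

# Index links by the join key, then attach tmdbId to each matching movie
links_by_id = {row["movieId"]: row for row in links}
movie_data = [
    {**m, "tmdbId": links_by_id[m["movieId"]]["tmdbId"]}
    for m in movies if m["movieId"] in links_by_id
]
print(movie_data[0])
```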
  17. fitting

      from pyspark.ml.recommendation import ALS
      from pyspark.sql.functions import col

      als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                regParam=0.01, rank=20, seed=12)
      model = als.fit(ratings_from_es)
      model.userFactors.show(5)
      model.itemFactors.show(5)
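ALS factorizes the ratings matrix into user and item factor vectors of length `rank` (20 above), and a predicted rating is the dot product of the two. A minimal sketch with made-up 3-dimensional factors (real ones have 20 entries):

```python
# Hypothetical factor vectors; real ALS factors have rank=20 entries
user_factors = {10: [-1.27, -1.05, 0.33]}
item_factors = {10: [0.07, 0.37, -0.21]}

def predict(user_id, movie_id):
    # ALS prediction: dot product of user and item factor vectors
    u = user_factors[user_id]
    v = item_factors[movie_id]
    return sum(a * b for a, b in zip(u, v))

score = predict(10, 10)
print(round(score, 4))  # -0.5467
```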
  18. fitting

      +---+--------------------+
      | id|            features|
      +---+--------------------+
      | 10|[-1.2723869, -1.0...|
      | 20|[-0.28433216, -0....|
      | 30|[-0.23606953, 0.6...|
      | 40|[0.31492335, -0.0...|
      | 50|[0.006268716, 0.0...|
      +---+--------------------+
      only showing top 5 rows

      +---+--------------------+
      | id|            features|
      +---+--------------------+
      | 10|[0.07222334, 0.37...|
      | 20|[-0.40369913, 0.5...|
      | 30|[-0.65819603, -0....|
      | 40|[-0.2619177, 0.49...|
      | 50|[-0.46155798, 0.1...|
      +---+--------------------+
      only showing top 5 rows
  19. Elasticsearch Query

      "query": {
        "function_score": {
          "query": {
            "query_string": { "query": q }
          },
          "script_score": {
            "script": {
              "inline": "payload_vector_score",
              "lang": "native",
              "params": {
                "field": "@model.factor",
                "vector": query_vec,
                "cosine": cosine
              }
            }
          },
          "boost_mode": "replace"
        }
      }
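The script score here ranks documents by the similarity between the query factor vector and each document's stored `@model.factor`; with `cosine` enabled that is cosine similarity. The same computation in plain Python, with hypothetical short vectors:

```python
import math

def cosine_similarity(a, b):
    # cosine = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

# Hypothetical query and document factor vectors
query_vec = [1.0, 0.0, 1.0]
doc_vec = [0.5, 0.0, 0.5]
print(round(cosine_similarity(query_vec, doc_vec), 6))  # 1.0 (same direction)
```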
  20. Example

      {
        "took": 60,
        "timed_out": false,
        "_shards": { "total": 5, "successful": 5, "failed": 0 },
        "hits": {
          "total": 111188,
          "max_score": 1.0,
          "hits": [
            {
              "_index": "demo",
              "_type": "ratings",
              "_id": "AWeY8pNfmDlwHiSe3lD3",
              "_score": 1.0,
              "_source": {
                "userId": 338,
                "movieId": 190215,
                "rating": 1.5,
                "timestamp": 1530148477000
              }
            }
          ]
        }
      }
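A client application (PHP or otherwise) consumes this response by walking `hits.hits[]` and reading each `_source`. In Python, against a trimmed copy of the response above:

```python
import json

# Trimmed copy of the Elasticsearch response shown above
response = json.loads('''{
  "hits": {
    "total": 111188,
    "hits": [
      {"_id": "AWeY8pNfmDlwHiSe3lD3",
       "_score": 1.0,
       "_source": {"userId": 338, "movieId": 190215,
                   "rating": 1.5, "timestamp": 1530148477000}}
    ]
  }
}''')

# Pull the rating documents out of the hit list
ratings = [hit["_source"] for hit in response["hits"]["hits"]]
print(ratings[0]["movieId"])  # 190215
```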