Getting Started with Data Analysis Using PHP and Apache Spark (PHPとApache Sparkで始めるデータ解析処理) / php-with-apache-spark

yuuki takezawa
December 11, 2018


Transcript

  1. Getting Started with Data Analysis Using PHP and Apache Spark / yuuki takezawa / PHP Conference 2018

  2. Profile • 竹澤 有貴 (Yuuki Takezawa) / ytake • CTO, istyle Inc. •
     PHP, Hack, Go, Scala • Apache Hadoop, Apache Spark, Apache Kafka
     • twitter https://twitter.com/ex_takezawa • facebook https://www.facebook.com/yuuki.takezawa • github https://github.com/ytake
  3. None
  4. None
  5. Agenda • Apache Spark Introduction • For Web Application • Apache Spark + Elasticsearch + MLlib
  6. Laravel Spark?

  7. Apache Spark?

  8. Apache Spark • A framework for running distributed processing
     over all kinds of data • Java, Python, Scala:
     the API is the same whichever language you choose
     • Spark jobs themselves cannot be implemented in PHP
  9. Apache Spark • Spark SQL:
     join disparate data, fast thanks to distributed processing • Spark MLlib:
     machine learning • Spark GraphX:
     graph processing • Spark Streaming:
     stream processing / micro-batching
  10. Example of the areas each component covers. Source: "Spark Streamingの概要と検証シナリオ", Think IT,
      https://thinkit.co.jp/article/9958

  11. When to use Apache Spark • The data you handle is large
      (Hadoop, Cassandra, etc.) • You need near-real-time processing over large data
      (recommendations, etc.)
      • You already run (or plan to build) a Hadoop-like environment • Your databases are split up across microservices
      and you want to build something new out of them
  12. It is not the same thing as a database

  13. Apache Spark with DataSource

  14. Apache Spark with Data • Apache Spark itself runs without a Hadoop environment • A combination of two styles:
      a query-engine style, where applications query it directly, and
      batch-like jobs that write the various processing results
      to a database suited to the application
  15. Apache Spark with Data

      protected def sparkReadJdbcDatabase(spark: SparkSession): DataFrame = {
        spark.read.format("jdbc").options(
          Map("url" -> connect,
              "dbtable" -> targetTable(),
              "driver" -> jdbcDriver))
          .load
      }

      protected def execute(spark: SparkSession, df: DataFrame): Unit = {
        df.count match {
          case 0L => spark.stop
          case _ =>
            SparkInsertExecutor.saveToCassandra(
              df,
              config.cassandraKeyspace(),
              config.cassandraHosts(),
              Map("contentTable" -> config.cassandraContentTable())
            )
        }
      }
  16. Spark Streaming

  17. Spark Streaming • A combination that is easy to adopt in web applications:
      Spark Streaming + Kafka / Kinesis
      • Run analysis jobs triggered by some message or event
      and write the results to a database • Stream support
  18. DStream / Batch. Source: "Spark Streaming", THIRD EYES,
      https://thirdeyedata.io/spark-streaming/

  19. Web Application Event Sourcing + CQRS

  20. CQRS. Source: "A few myths about CQRS", Ouarzy's Blog,
      http://www.ouarzy.com/2016/10/02/a-few-myths-about-cqrs/
  21. Event Sourcing / Command example

  22. Event Sourcing / Command example: the application sends events

  23. Event Sourcing / Command example: receive the messages and transform the values

  24. Event Sourcing / Command example: write to RDBMS, NoSQL, etc., whichever fits the use case

  25. Event Sourcing / Command example: other applications read only the database and know nothing about how the values were transformed

  26. Event Sourcing / Command example

  27. {
        "contents": "answers",
        "identifier": 7854395,
        "operator": "9999",
        "operatorType": "SYSTEM",
        "tags": [48353]
      }
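The next slide defines a Spark schema for events of this shape. As a quick stdlib-only illustration (this snippet is not from the talk), the sample event can be parsed and checked against the same field types:

```python
import json

# Sample command event from the slide above.
event = """
{
  "contents": "answers",
  "identifier": 7854395,
  "operator": "9999",
  "operatorType": "SYSTEM",
  "tags": [48353]
}
"""

parsed = json.loads(event)

# The field types mirror the Spark schema: contents/operator/operatorType
# are strings, identifier is an integer, tags is an array of integers.
assert isinstance(parsed["contents"], str)
assert isinstance(parsed["identifier"], int)
assert all(isinstance(tag, int) for tag in parsed["tags"])
print(parsed["identifier"])  # -> 7854395
```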
  28. val schema = StructType(
        StructField("contents", StringType, true) ::
        StructField("identifier", IntegerType, true) ::
        StructField("operator", StringType, true) ::
        StructField("operatorType", StringType, true) ::
        StructField("tags", ArrayType(IntegerType, true), true) :: Nil)
  29. About the runtime environment

  30. Cluster Management • Spark Standalone • Apache Mesos • Hadoop YARN (Yet Another Resource Negotiator) • Kubernetes
  31. Apache Spark: I want to run this kind of job!

  32. Apache Spark: resources are allocated for the job

  33. Apache Spark: the job is executed

  34. Apache Spark with PHP • PHP cannot drive Spark Streaming and the like • Send values in via Kafka etc.
      and implement the Spark jobs themselves in another language
      • Query tables created with Spark SQL • Using Spark as a distributed query engine is possible,
      much the same way as Presto, Hive, and so on
  35. Apache Spark with PHP: I want to query the distributed query engine! Apache Thrift / composer require apache/thrift / start thrift server / beeline

  36. A concrete example of adopting it in an application

  37. Creating a Scalable Recommender with Apache Spark & Elasticsearch

  38. Flow • Fetch data from CSV files and databases
      (review scores, movie data, etc.) • Preprocess with Apache Spark
      so the individual datasets can be combined • After preprocessing, write to Elasticsearch and train
      • Store the trained model in Elasticsearch
  39. Load Data

      PATH_TO_DATA = "../data/ml-latest-small"
      ratings = spark.read.csv(PATH_TO_DATA + "/ratings.csv",
                               header=True, inferSchema=True)
      ratings.cache()
      print("Number of ratings: %i" % ratings.count())
      print("Sample of ratings:")
      ratings.show(5)
  40. Load Data

      Number of ratings: 100836
      Sample of ratings:
      +------+-------+------+---------+
      |userId|movieId|rating|timestamp|
      +------+-------+------+---------+
      |     1|      1|   4.0|964982703|
      |     1|      3|   4.0|964981247|
      |     1|      6|   4.0|964982224|
      |     1|     47|   5.0|964983815|
      |     1|     50|   5.0|964982931|
      +------+-------+------+---------+
      only showing top 5 rows
  41. Load Data • Shape data that was never built for analytics • Do not add analytics-only columns to the production database itself • Access logs alone are not enough • With large data,
      if you have to run against an RDBMS,
      always move it onto a dedicated database server first
      (running against one the application is using invites unexpected accidents)
  42. Load Data

      Raw movie data:
      +-------+----------------------------------+-------------------------------------------+
      |movieId|title                             |genres                                     |
      +-------+----------------------------------+-------------------------------------------+
      |1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
      |2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
      |3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
      |4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
      |5      |Father of the Bride Part II (1995)|Comedy                                     |
      +-------+----------------------------------+-------------------------------------------+
      only showing top 5 rows

      The movie title and release year share one column; genres are |-delimited.
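The genres arrive as a single |-delimited string, while the cleaned data a few slides later shows them as a lowercase array. A plain-Python sketch of that transformation (the helper name is made up; the deck does the real work in Spark):

```python
def split_genres(genres):
    """Turn a |-delimited genre string into a lowercase list."""
    return [genre.lower() for genre in genres.split("|")]

print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
# -> ['adventure', 'animation', 'children', 'comedy', 'fantasy']
```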
  43. Load Data

      import re
      from pyspark.sql.functions import udf
      from pyspark.sql.types import StructType, StructField, StringType

      def extract_year_fn(title):
          result = re.search(r"\(\d{4}\)", title)
          try:
              if result:
                  group = result.group()
                  year = group[1:-1]
                  start_pos = result.start()
                  title = title[:start_pos - 1]
                  return (title, year)
              else:
                  # no year in the title; fall back to a default, as a string
                  # to match the declared StringType
                  return (title, "1970")
          except:
              print(title)

      extract_year = udf(extract_year_fn,
                         StructType([StructField("title", StringType(), True),
                                     StructField("release_date", StringType(), True)]))
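Because the year-extraction function is ordinary Python, its parsing logic can be exercised without Spark. A trimmed local copy, for illustration only:

```python
import re

# Trimmed local copy of the slide's extract_year_fn, runnable without Spark.
def extract_year_fn(title):
    result = re.search(r"\(\d{4}\)", title)
    if result:
        year = result.group()[1:-1]              # strip the parentheses
        return (title[:result.start() - 1], year)
    return (title, "1970")                       # fallback when no year is present

print(extract_year_fn("Toy Story (1995)"))       # -> ('Toy Story', '1995')
```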
  44. Load Data

      link_data = spark.read.csv(PATH_TO_DATA + "/links.csv",
                                 header=True, inferSchema=True)

      # join movies with links to get TMDB id
      movie_data = movies.join(link_data, movies.movieId == link_data.movieId)\
          .select(movies.movieId, movies.title, movies.release_date,
                  movies.genres, link_data.tmdbId)

      num_movies = movie_data.count()

      print("Cleaned movie data with tmdbId links:")
      movie_data.show(5, truncate=False)
  45. Load Data

      Cleaned movie data with tmdbId links:
      +-------+---------------------------+------------+-------------------------------------------------+------+
      |movieId|title                      |release_date|genres                                           |tmdbId|
      +-------+---------------------------+------------+-------------------------------------------------+------+
      |1      |Toy Story                  |1995        |[adventure, animation, children, comedy, fantasy]|862   |
      |2      |Jumanji                    |1995        |[adventure, children, fantasy]                   |8844  |
      |3      |Grumpier Old Men           |1995        |[comedy, romance]                                |15602 |
      |4      |Waiting to Exhale          |1995        |[comedy, drama, romance]                         |31357 |
      |5      |Father of the Bride Part II|1995        |[comedy]                                         |11862 |
      +-------+---------------------------+------------+-------------------------------------------------+------+
      only showing top 5 rows

      The title column has been split, and the data joined with another data source.
  46. Machine learning • Use ALS (Alternating Least Squares)
      • Train with Spark MLlib
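To illustrate what alternating least squares actually does, here is a toy rank-1 ALS loop in plain Python. The ratings matrix, factor names, and rank are all invented for illustration; the real training on the next slide uses Spark MLlib's ALS:

```python
# Toy rank-1 ALS: approximate R[i][j] by u[i] * v[j], alternately solving
# for u with v fixed and for v with u fixed (each step is a closed-form
# least-squares solution, so the loss never increases).
R = [
    [5.0, 4.0, 1.0],   # ratings by user 0
    [4.0, 5.0, 1.0],   # ratings by user 1
    [1.0, 1.0, 5.0],   # ratings by user 2
]
m, n = len(R), len(R[0])

def loss(u, v):
    """Sum of squared reconstruction errors for predictions u[i] * v[j]."""
    return sum((R[i][j] - u[i] * v[j]) ** 2 for i in range(m) for j in range(n))

u = [1.0] * m
v = [1.0] * n
initial = loss(u, v)

for _ in range(10):
    # Fix v, solve for each u[i] in closed form.
    u = [sum(R[i][j] * v[j] for j in range(n)) / sum(x * x for x in v)
         for i in range(m)]
    # Fix u, solve for each v[j] in closed form.
    v = [sum(R[i][j] * u[i] for i in range(m)) / sum(x * x for x in u)
         for j in range(n)]

print(initial, loss(u, v))  # the loss only decreases across iterations
```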

  47. fitting

      from pyspark.ml.recommendation import ALS
      from pyspark.sql.functions import col

      als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                regParam=0.01, rank=20, seed=12)
      model = als.fit(ratings_from_es)
      model.userFactors.show(5)
      model.itemFactors.show(5)
  48. fitting

      +---+--------------------+
      | id|            features|
      +---+--------------------+
      | 10|[-1.2723869, -1.0...|
      | 20|[-0.28433216, -0....|
      | 30|[-0.23606953, 0.6...|
      | 40|[0.31492335, -0.0...|
      | 50|[0.006268716, 0.0...|
      +---+--------------------+
      only showing top 5 rows

      +---+--------------------+
      | id|            features|
      +---+--------------------+
      | 10|[0.07222334, 0.37...|
      | 20|[-0.40369913, 0.5...|
      | 30|[-0.65819603, -0....|
      | 40|[-0.2619177, 0.49...|
      | 50|[-0.46155798, 0.1...|
      +---+--------------------+
      only showing top 5 rows
  49. Elasticsearch Query

      "query": {
        "function_score": {
          "query": {
            "query_string": { "query": q }
          },
          "script_score": {
            "script": {
              "inline": "payload_vector_score",
              "lang": "native",
              "params": {
                "field": "@model.factor",
                "vector": query_vec,
                "cosine": cosine
              }
            }
          },
          "boost_mode": "replace"
        }
      }
  50. Example

      {
        "took": 60,
        "timed_out": false,
        "_shards": { "total": 5, "successful": 5, "failed": 0 },
        "hits": {
          "total": 111188,
          "max_score": 1.0,
          "hits": [
            {
              "_index": "demo",
              "_type": "ratings",
              "_id": "AWeY8pNfmDlwHiSe3lD3",
              "_score": 1.0,
              "_source": {
                "userId": 338,
                "movieId": 190215,
                "rating": 1.5,
                "timestamp": 1530148477000
              }
            }
          ]
        }
      }
  51. Apache Spark with Data: preprocess from a variety of data sources, then persist the data and the model. ALS from Application

  52. example

  53. Summary

  54. Summary • Understand correctly how the machinery works • Engineers, too,
      can contribute strongly to the service • There is more than one approach:
      take on the one that fits
      your application and requirements
      • https://github.com/IBM/elasticsearch-spark-recommender