
Getting Started with Data Analysis Using PHP and Apache Spark / php-with-apache-spark

yuuki takezawa
December 11, 2018

Transcript

  1. Getting Started with Data Analysis
    Using PHP and Apache Spark
    yuuki takezawa
    PHP Conference 2018

  2. Profile
    • Yuuki Takezawa / ytake
    • CTO, istyle Inc.
    • PHP, Hack, Go, Scala
    • Apache Hadoop, Apache Spark, Apache Kafka
    • twitter https://twitter.com/ex_takezawa
    • facebook https://www.facebook.com/yuuki.takezawa
    • github https://github.com/ytake

  3. (image-only slide)

  4. (image-only slide)

  5. Agenda
    • Apache Spark Introduction
    • For Web Application
    • Apache Spark + Elasticsearch + MLlib


  6. Laravel Spark?


  7. Apache Spark?


  8. Apache Spark
    • A framework for running distributed processing over a wide variety of data
    • Java, Python, Scala: the API is the same whichever language you choose
    • Spark processing itself cannot be implemented in PHP

  9. Apache Spark
    • Spark SQL: joins data from various sources and speeds it up with distributed processing
    • Spark MLlib: machine learning
    • Spark GraphX: graph processing
    • Spark Streaming: stream processing / micro-batches

  10. Example of what each component covers
    Source: "Spark Streamingの概要と検証シナリオ", Think IT, https://thinkit.co.jp/article/9958

  11. When to use Apache Spark
    • The data you handle is large (stored in Hadoop, Cassandra, etc.)
    • You need near-real-time processing over large data, such as recommendations
    • You already have (or plan to build) a Hadoop-like environment
    • Your databases are split across microservices and you want to combine them to create something new

  12. It is not the same thing as a database

  13. Apache Spark with DataSource


  14. Apache Spark with Data
    • Apache Spark itself can run without a Hadoop environment
    • Two usage patterns can be combined: a query engine that applications query directly, and batch-like jobs that write the results of various processing to databases suited to each application

  15. Apache Spark with Data
    // Read the source table over JDBC into a DataFrame
    protected def sparkReadJdbcDatabase(spark: SparkSession): DataFrame = {
      spark.read.format("jdbc").options(
        Map("url" -> connect,
          "dbtable" -> targetTable(),
          "driver" -> jdbcDriver))
        .load
    }

    // Write the result to Cassandra, or stop if there is nothing to write
    protected def execute(spark: SparkSession, df: DataFrame): Unit = {
      df.count match {
        case 0L => spark.stop
        case _ => SparkInsertExecutor.saveToCassandra(
          df,
          config.cassandraKeyspace(), config.cassandraHosts(),
          Map("contentTable" -> config.cassandraContentTable())
        )
      }
    }

  16. Spark Streaming


  17. Spark Streaming
    • A combination that is easy to adopt in web applications: Spark Streaming + Kafka / Kinesis (see the sketch below)
    • Use a message or event as the trigger to run analysis and write the result to a database
    • Stream support
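    A minimal Scala sketch of that combination, reading an event topic with Spark Streaming's Kafka direct stream. The broker address, topic name, group id, and the println placeholder are assumptions for illustration only:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object EventStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("event-analytics")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "localhost:9092", // assumption: local broker
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "spark-analytics",
          "auto.offset.reset" -> "latest"
        )

        // subscribe to the topic the application publishes its events to (assumed name)
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        // each micro-batch: take the JSON payloads and hand them to the analysis / write step
        stream.map(_.value).foreachRDD { rdd =>
          rdd.foreach(json => println(json)) // placeholder for the real analysis and database write
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }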

  18. DStream / Batch
    Source: "Spark Streaming", THIRD EYES, https://thirdeyedata.io/spark-streaming/

  19. Web Application
    Event Sourcing + CQRS


  20. CQRS
    Source: "A few myths about CQRS", Ouarzy's Blog, http://www.ouarzy.com/2016/10/02/a-few-myths-about-cqrs/

  21. Event Sourcing / Command example

  22. Event Sourcing / Command example
    The application sends events

  23. Event Sourcing / Command example
    Receive the messages and transform the values, etc.

  24. Event Sourcing / Command example
    Write to RDBMS, NoSQL, etc., depending on the use case

  25. Event Sourcing / Command example
    Other applications read only the database;
    they know nothing about how the data was transformed

  26. Event Sourcing / Command example

  27. {
      "contents": "answers",
      "identifier": 7854395,
      "operator": "9999",
      "operatorType": "SYSTEM",
      "tags": [
        48353
      ]
    }

  28. val schema = StructType(
    StructField("contents", StringType, true) ::
    StructField("identifier", IntegerType, true) ::
    StructField("operator", StringType, true) ::
    StructField("operatorType", StringType, true) ::
    StructField("tags", ArrayType(IntegerType, true), true) :: Nil)

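    A minimal Scala sketch of how the schema value above can be applied to the event JSON from slide 27; the local SparkSession and the in-memory input string are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("parse-events")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // one event as it would arrive on the message topic (slide 27)
    val raw = Seq(
      """{"contents":"answers","identifier":7854395,"operator":"9999","operatorType":"SYSTEM","tags":[48353]}"""
    ).toDS()

    // apply the declared StructType instead of letting Spark infer a schema per batch
    val events = spark.read.schema(schema).json(raw)
    events.show(false)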

  29. About the runtime environment

  30. Cluster Management
    • Spark Standalone
    • Apache Mesos
• Hadoop YARN (Yet Another Resource Negotiator)
    • Kubernetes


  31. Apache Spark
    "I want to run this kind of processing!"

  32. Apache Spark
    The resources the job needs are allocated

  33. Apache Spark
    The job is executed

  34. Apache Spark with PHP
    • You cannot drive Spark Streaming or other Spark processing from PHP
    • Send values via Kafka (or similar) and implement the Spark side in another language
    • You can query tables created with Spark SQL and use Spark as a distributed query engine, just as you would use Presto or Hive (see the sketch below)
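    A minimal Scala sketch of the Spark side of that query-engine setup: persist a result as a table that the Spark Thrift Server can then serve to beeline or a PHP client. The Hive-enabled session, the sample rows, and the table name are assumptions for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("expose-results")
      .enableHiveSupport() // needed so saved tables are visible to the Thrift Server
      .getOrCreate()
    import spark.implicits._

    // assumption: a small result set standing in for real processing output
    val results = Seq((7854395, "answers"), (7854396, "questions"))
      .toDF("identifier", "contents")

    // persist as a managed table; clients can then run
    // SELECT identifier, contents FROM contents_summary over the Thrift (HiveServer2) protocol
    results.write.mode("overwrite").saveAsTable("contents_summary")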

  35. Apache Spark with PHP
    "I want to query the distributed query engine!"
    Apache Thrift
    composer require apache/thrift
    start thrift server
    beeline

  36. A concrete example of adopting this in an application

  37. Creating a Scalable Recommender
    with
    Apache Spark & Elasticsearch


  38. Flow
    • Fetch data from CSV files or databases (review ratings, movie data, etc.)
    • Preprocess with Apache Spark so the different data sets can be combined and processed together
    • After preprocessing, write to Elasticsearch and train (see the sketch after this list)
    • Store the trained model back in Elasticsearch
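    A minimal Scala sketch of the "write to Elasticsearch" step using the elasticsearch-hadoop connector (the later demo slides run this flow in PySpark); the node address, CSV path, and index name are assumptions for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("write-ratings-to-es")
      // assumptions: local Elasticsearch and the elasticsearch-hadoop ("elasticsearch-spark") package on the classpath
      .config("es.nodes", "localhost")
      .config("es.port", "9200")
      .getOrCreate()

    // assumption: the ratings CSV loaded as in the following slides (userId, movieId, rating, timestamp)
    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("../data/ml-latest-small/ratings.csv")

    // write the preprocessed data to the index/type the later slides query ("demo/ratings")
    ratings.write
      .format("org.elasticsearch.spark.sql")
      .mode("append")
      .save("demo/ratings")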

  39. Load Data
    PATH_TO_DATA = "../data/ml-latest-small"
    ratings = spark.read.csv(PATH_TO_DATA + "/ratings.csv", header=True, inferSchema=True)
    ratings.cache()
    print("Number of ratings: %i" % ratings.count())
    print("Sample of ratings:")
    ratings.show(5)


  40. Load Data
    Number of ratings: 100836
    Sample of ratings:
    +------+-------+------+---------+
    |userId|movieId|rating|timestamp|
    +------+-------+------+---------+
    | 1| 1| 4.0|964982703|
    | 1| 3| 4.0|964981247|
    | 1| 6| 4.0|964982224|
    | 1| 47| 5.0|964983815|
    | 1| 50| 5.0|964982931|
    +------+-------+------+---------+
    only showing top 5 rows


  41. Load Data
    • Reshape data that was never designed for analytics
    • Do not add analytics-only columns to the production database itself
    • Access logs alone are not enough
    • For large data sets, if you want to run this against an RDBMS, always split it out onto a dedicated database server
    * If that database is also serving the application, you risk unexpected incidents

  42. Load Data
    Raw movie data:
    +-------+----------------------------------+-------------------------------------------+
    |movieId|title |genres |
    +-------+----------------------------------+-------------------------------------------+
    |1 |Toy Story (1995) |Adventure|Animation|Children|Comedy|Fantasy|
    |2 |Jumanji (1995) |Adventure|Children|Fantasy |
    |3 |Grumpier Old Men (1995) |Comedy|Romance |
    |4 |Waiting to Exhale (1995) |Comedy|Drama|Romance |
    |5 |Father of the Bride Part II (1995)|Comedy |
    +-------+----------------------------------+-------------------------------------------+
    only showing top 5 rows
The movie title and release year are combined in one column
    Genres are separated by "|"


  43. Load Data
    import re

    def extract_year_fn(title):
        result = re.search("\(\d{4}\)", title)
        try:
            if result:
                group = result.group()
                year = group[1:-1]
                start_pos = result.start()
                title = title[:start_pos-1]
                return (title, year)
            else:
                # fall back to a string year so it matches the declared StringType
                return (title, "1970")
        except:
            print(title)

    extract_year = udf(extract_year_fn,
        StructType([StructField("title", StringType(), True),
                    StructField("release_date", StringType(), True)]))

  44. Load Data
    link_data = spark.read.csv(PATH_TO_DATA + "/links.csv", header=True, inferSchema=True)
    # join movies with links to get TMDB id
    movie_data = movies.join(link_data, movies.movieId == link_data.movieId)\
    .select(movies.movieId, movies.title, movies.release_date, movies.genres, link_data.tmdbId)
    num_movies = movie_data.count()
    print("Cleaned movie data with tmdbId links:")
    movie_data.show(5, truncate=False)


  45. Load Data
    Cleaned movie data with tmdbId links:
    +-------+---------------------------+------------+-------------------------------------------------+------+
    |movieId|title |release_date|genres |tmdbId|
    +-------+---------------------------+------------+-------------------------------------------------+------+
    |1 |Toy Story |1995 |[adventure, animation, children, comedy, fantasy]|862 |
    |2 |Jumanji |1995 |[adventure, children, fantasy] |8844 |
    |3 |Grumpier Old Men |1995 |[comedy, romance] |15602 |
    |4 |Waiting to Exhale |1995 |[comedy, drama, romance] |31357 |
    |5 |Father of the Bride Part II|1995 |[comedy] |11862 |
    +-------+---------------------------+------------+-------------------------------------------------+------+
    only showing top 5 rows
The data has been split into separate columns
    and joined with another data source (the links file)


  46. Machine learning
    • Use ALS (Alternating Least Squares)
    and train with Spark MLlib

  47. fitting
    from pyspark.ml.recommendation import ALS
    from pyspark.sql.functions import col
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", regParam=0.01, rank=20, seed=12)
    model = als.fit(ratings_from_es)
    model.userFactors.show(5)
    model.itemFactors.show(5)


  48. fitting
    +---+--------------------+
    | id| features|
    +---+--------------------+
    | 10|[-1.2723869, -1.0...|
    | 20|[-0.28433216, -0....|
    | 30|[-0.23606953, 0.6...|
    | 40|[0.31492335, -0.0...|
    | 50|[0.006268716, 0.0...|
    +---+--------------------+
    only showing top 5 rows
    +---+--------------------+
    | id| features|
    +---+--------------------+
    | 10|[0.07222334, 0.37...|
    | 20|[-0.40369913, 0.5...|
    | 30|[-0.65819603, -0....|
    | 40|[-0.2619177, 0.49...|
    | 50|[-0.46155798, 0.1...|
    +---+--------------------+
    only showing top 5 rows


  49. Elasticsearch Query
    "query": {
      "function_score": {
        "query": {
          "query_string": {
            "query": q
          }
        },
        "script_score": {
          "script": {
            "inline": "payload_vector_score",
            "lang": "native",
            "params": {
              "field": "@model.factor",
              "vector": query_vec,
              "cosine": cosine
            }
          }
        },
        "boost_mode": "replace"
      }
    }

  50. Example
    {
      "took": 60,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 111188,
        "max_score": 1.0,
        "hits": [
          {
            "_index": "demo",
            "_type": "ratings",
            "_id": "AWeY8pNfmDlwHiSe3lD3",
            "_score": 1.0,
            "_source": {
              "userId": 338,
              "movieId": 190215,
              "rating": 1.5,
              "timestamp": 1530148477000
            }
          }
        ]
      }
    }

  51. Apache Spark with Data
    Preprocessing based on various kinds of data
    Store both the data and the model
    ALS
    from Application

  52. example


  53. Summary

  54. Summary
    • Understand the mechanism correctly
    • Engineers can also make a strong contribution to the service
    • There is more than one way to do this: take on the challenge in the way that fits your application and requirements
    • https://github.com/IBM/elasticsearch-spark-recommender