
PHPとApache Sparkで始めるデータ解析処理 / php-with-apache-spark

yuuki takezawa
December 11, 2018

Transcript

  1. Profile
     • Yuuki Takezawa / ytake
     • CTO, istyle Inc.
     • PHP, Hack, Go, Scala
     • Apache Hadoop, Apache Spark, Apache Kafka
     • twitter: https://twitter.com/ex_takezawa
     • facebook: https://www.facebook.com/yuuki.takezawa
     • github: https://github.com/ytake
  2. Apache Spark
     • Spark SQL: joins data from many different sources; fast through distributed processing
     • Spark MLlib: machine learning
     • Spark GraphX: graph processing
     • Spark Streaming: stream processing / micro-batching
  3. When to use Apache Spark
     • The data you handle is large (alongside Hadoop, Cassandra, etc.)
     • You need near-real-time processing over large data (recommendations, etc.)
     • You have already built (or plan to build) a Hadoop-like environment
     • Your databases are split up, e.g. across microservices, and you want to
       combine them to create something new
  4. Apache Spark with Data
     • Apache Spark itself runs without a Hadoop environment
     • Query-engine use: applications query Spark directly and fetch results
     • Batch-like use: combinations of jobs that write their various processing
       results to databases suited to each application
  5. Apache Spark with Data

     protected def sparkReadJdbcDatabase(spark: SparkSession): DataFrame = {
       spark.read.format("jdbc").options(
         Map(
           "url" -> connect,
           "dbtable" -> targetTable(),
           "driver" -> jdbcDriver))
         .load
     }

     protected def execute(spark: SparkSession, df: DataFrame): Unit = {
       df.count match {
         case 0L => spark.stop
         case _ =>
           SparkInsertExecutor.saveToCassandra(
             df,
             config.cassandraKeyspace(),
             config.cassandraHosts(),
             Map("contentTable" -> config.cassandraContentTable())
           )
       }
     }
  6. Spark Streaming
     • A combination that is easy to adopt in web applications:
       Spark Streaming + Kafka / Kinesis
     • Triggered by a message or event, run the analysis and write the
       result to a database
     • Stream-ready
  7. val schema = StructType(
       StructField("contents", StringType, true) ::
       StructField("identifier", IntegerType, true) ::
       StructField("operator", StringType, true) ::
       StructField("operatorType", StringType, true) ::
       StructField("tags", ArrayType(IntegerType, true), true) :: Nil)
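The StructType above describes the events arriving on the stream. A JSON message matching it might look like the following; the message content here is a hypothetical example (not from the deck), checked against the schema's field types with plain Python:

```python
import json

# Hypothetical event matching the StructType on the slide above:
# contents: String, identifier: Integer, operator: String,
# operatorType: String, tags: Array[Integer]
raw = '''{
  "contents": "new review posted",
  "identifier": 1001,
  "operator": "ytake",
  "operatorType": "user",
  "tags": [3, 7, 42]
}'''

event = json.loads(raw)

# Verify each field against the schema's declared types before processing
assert isinstance(event["contents"], str)
assert isinstance(event["identifier"], int)
assert isinstance(event["tags"], list)
assert all(isinstance(t, int) for t in event["tags"])
print(event["identifier"], event["tags"])
```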
  8. Cluster Management
     • Spark Standalone
     • Apache Mesos
     • Hadoop YARN (Yet Another Resource Negotiator)
     • Kubernetes
  9. Apache Spark with PHP
     • PHP cannot drive Spark Streaming and similar APIs directly
     • Send values in via Kafka or the like; implement the Spark side in
       another language
     • Querying tables created with Spark SQL, i.e. using Spark as a
       distributed query engine, is possible; it can be used much like
       Presto or Hive
  10. Load Data

      PATH_TO_DATA = "../data/ml-latest-small"
      ratings = spark.read.csv(PATH_TO_DATA + "/ratings.csv",
                               header=True, inferSchema=True)
      ratings.cache()
      print("Number of ratings: %i" % ratings.count())
      print("Sample of ratings:")
      ratings.show(5)
  11. Load Data

      Number of ratings: 100836
      Sample of ratings:
      +------+-------+------+---------+
      |userId|movieId|rating|timestamp|
      +------+-------+------+---------+
      |     1|      1|   4.0|964982703|
      |     1|      3|   4.0|964981247|
      |     1|      6|   4.0|964982224|
      |     1|     47|   5.0|964983815|
      |     1|     50|   5.0|964982931|
      +------+-------+------+---------+
      only showing top 5 rows
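The `inferSchema=True` option in the load step is why these columns come back as integers and doubles rather than strings. Conceptually, for one ratings row, the inference amounts to (plain-Python illustration, not Spark's actual implementation):

```python
import csv
import io

# One header plus one row in the shape of ratings.csv
sample = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"
reader = csv.DictReader(io.StringIO(sample))
row = next(reader)  # every value is a string at this point

# The types schema inference arrives at for this file:
# integer ids, a double rating, a long timestamp
typed = {
    "userId": int(row["userId"]),
    "movieId": int(row["movieId"]),
    "rating": float(row["rating"]),
    "timestamp": int(row["timestamp"]),
}
print(typed)
```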
  12. Load Data
      • Reshape data that was not designed for analysis use
      • Do not add analysis-only columns to the database itself
      • Access logs alone, and the like, are not enough
      • With large data, if you must run against an RDBMS, always split it
        out onto a dedicated database server
        * if the application is also using that database, you invite
          unexpected incidents
  13. Load Data

      Raw movie data:
      +-------+----------------------------------+-------------------------------------------+
      |movieId|title                             |genres                                     |
      +-------+----------------------------------+-------------------------------------------+
      |1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
      |2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
      |3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
      |4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
      |5      |Father of the Bride Part II (1995)|Comedy                                     |
      +-------+----------------------------------+-------------------------------------------+
      only showing top 5 rows

      The movie title and release year are stuck together, and the genres
      are pipe-separated.
  14. Load Data

      import re
      from pyspark.sql.functions import udf
      from pyspark.sql.types import StructType, StructField, StringType

      def extract_year_fn(title):
          result = re.search(r"\(\d{4}\)", title)
          try:
              if result:
                  group = result.group()
                  year = group[1:-1]
                  start_pos = result.start()
                  title = title[:start_pos - 1]
                  return (title, year)
              else:
                  return (title, "1970")
          except:
              print(title)

      extract_year = udf(extract_year_fn,
                         StructType([StructField("title", StringType(), True),
                                     StructField("release_date", StringType(), True)]))
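The function behind the UDF is plain Python, so it can be exercised without a Spark session. A simplified local copy of the same logic, applied to a sample title:

```python
import re

def extract_year_fn(title):
    # Split "Title (YYYY)" into ("Title", "YYYY");
    # fall back to 1970 when no year is present
    result = re.search(r"\(\d{4}\)", title)
    if result:
        year = result.group()[1:-1]          # strip the parentheses
        return (title[:result.start() - 1], year)
    return (title, "1970")

print(extract_year_fn("Toy Story (1995)"))   # ('Toy Story', '1995')
```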
  15. Load Data

      link_data = spark.read.csv(PATH_TO_DATA + "/links.csv",
                                 header=True, inferSchema=True)

      # join movies with links to get TMDB id
      movie_data = movies.join(link_data, movies.movieId == link_data.movieId)\
          .select(movies.movieId, movies.title, movies.release_date,
                  movies.genres, link_data.tmdbId)

      num_movies = movie_data.count()

      print("Cleaned movie data with tmdbId links:")
      movie_data.show(5, truncate=False)
  16. Load Data

      Cleaned movie data with tmdbId links:
      +-------+---------------------------+------------+-------------------------------------------------+------+
      |movieId|title                      |release_date|genres                                           |tmdbId|
      +-------+---------------------------+------------+-------------------------------------------------+------+
      |1      |Toy Story                  |1995        |[adventure, animation, children, comedy, fantasy]|862   |
      |2      |Jumanji                    |1995        |[adventure, children, fantasy]                   |8844  |
      |3      |Grumpier Old Men           |1995        |[comedy, romance]                                |15602 |
      |4      |Waiting to Exhale          |1995        |[comedy, drama, romance]                         |31357 |
      |5      |Father of the Bride Part II|1995        |[comedy]                                         |11862 |
      +-------+---------------------------+------------+-------------------------------------------------+------+
      only showing top 5 rows

      The data has been split into columns and joined with another data source.
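The join above matches movies and links on movieId. The same operation in miniature, on plain Python dicts with sample values taken from the output shown:

```python
# Miniature version of the movies-links join on movieId
movies = [
    {"movieId": 1, "title": "Toy Story", "release_date": "1995"},
    {"movieId": 2, "title": "Jumanji", "release_date": "1995"},
]
links = [
    {"movieId": 1, "tmdbId": 862},
    {"movieId": 2, "tmdbId": 8844},
]

# Index links by the join key, then attach tmdbId to each matching movie
links_by_id = {row["movieId"]: row for row in links}
movie_data = [
    {**m, "tmdbId": links_by_id[m["movieId"]]["tmdbId"]}
    for m in movies if m["movieId"] in links_by_id
]
print(movie_data[0])
```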
  17. fitting

      from pyspark.ml.recommendation import ALS
      from pyspark.sql.functions import col

      als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                regParam=0.01, rank=20, seed=12)
      model = als.fit(ratings_from_es)
      model.userFactors.show(5)
      model.itemFactors.show(5)
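ALS factorizes the ratings matrix into user and item factor vectors of length `rank` (20 above), and a predicted rating is the dot product of the two. A minimal sketch with made-up 3-dimensional factors (real ones have 20 entries):

```python
# Hypothetical factor vectors; real ALS factors have rank=20 entries
user_factors = {10: [-1.27, -1.05, 0.33]}
item_factors = {10: [0.07, 0.37, -0.21]}

def predict(user_id, movie_id):
    # ALS prediction: dot product of user and item factor vectors
    u = user_factors[user_id]
    v = item_factors[movie_id]
    return sum(a * b for a, b in zip(u, v))

score = predict(10, 10)
print(round(score, 4))  # -0.5467
```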
  18. fitting

      +---+--------------------+
      | id|            features|
      +---+--------------------+
      | 10|[-1.2723869, -1.0...|
      | 20|[-0.28433216, -0....|
      | 30|[-0.23606953, 0.6...|
      | 40|[0.31492335, -0.0...|
      | 50|[0.006268716, 0.0...|
      +---+--------------------+
      only showing top 5 rows

      +---+--------------------+
      | id|            features|
      +---+--------------------+
      | 10|[0.07222334, 0.37...|
      | 20|[-0.40369913, 0.5...|
      | 30|[-0.65819603, -0....|
      | 40|[-0.2619177, 0.49...|
      | 50|[-0.46155798, 0.1...|
      +---+--------------------+
      only showing top 5 rows
  19. Elasticsearch Query

      "query": {
        "function_score": {
          "query": {
            "query_string": { "query": q }
          },
          "script_score": {
            "script": {
              "inline": "payload_vector_score",
              "lang": "native",
              "params": {
                "field": "@model.factor",
                "vector": query_vec,
                "cosine": cosine
              }
            }
          },
          "boost_mode": "replace"
        }
      }
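The script score here ranks documents by the similarity between the query factor vector and each document's stored `@model.factor`; with `cosine` enabled that is cosine similarity. The same computation in plain Python, with hypothetical short vectors:

```python
import math

def cosine_similarity(a, b):
    # cosine = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

# Hypothetical query and document factor vectors
query_vec = [1.0, 0.0, 1.0]
doc_vec = [0.5, 0.0, 0.5]
print(round(cosine_similarity(query_vec, doc_vec), 6))  # 1.0 (same direction)
```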
  20. Example

      {
        "took": 60,
        "timed_out": false,
        "_shards": { "total": 5, "successful": 5, "failed": 0 },
        "hits": {
          "total": 111188,
          "max_score": 1.0,
          "hits": [
            {
              "_index": "demo",
              "_type": "ratings",
              "_id": "AWeY8pNfmDlwHiSe3lD3",
              "_score": 1.0,
              "_source": {
                "userId": 338,
                "movieId": 190215,
                "rating": 1.5,
                "timestamp": 1530148477000
              }
            }
          ]
        }
      }
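A client application (PHP or otherwise) consumes this response by walking `hits.hits[]` and reading each `_source`. In Python, against a trimmed copy of the response above:

```python
import json

# Trimmed copy of the Elasticsearch response shown above
response = json.loads('''{
  "hits": {
    "total": 111188,
    "hits": [
      {"_id": "AWeY8pNfmDlwHiSe3lD3",
       "_score": 1.0,
       "_source": {"userId": 338, "movieId": 190215,
                   "rating": 1.5, "timestamp": 1530148477000}}
    ]
  }
}''')

# Pull the rating documents out of the hit list
ratings = [hit["_source"] for hit in response["hits"]["hits"]]
print(ratings[0]["movieId"])  # 190215
```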