Introduction of GCP Dataflow

soymsk
April 12, 2018

(In Japanese)

Transcript

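  1. There are many things in the world we want to distribute
    • Data aggregation: Sum, Average, etc.
    • Data transformation: JSON parsing, MapMatch
    • I/O: input and output across multiple resources
    • Matrix operations
    • Machine learning: distributed training (CV, RL)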
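  2. Do any of these problems sound familiar?
    • Implementing distributed processing is hard.
      → You can write it quickly in Python (※).
    • We want batch processing now, but real-time processing in the future!
      → The same code (※) can run in both batch and streaming mode.
    • Operating a distributed processing platform (Hadoop, YARN, Spark) is painful.
      → It is fully managed, so no operations work is needed (※).
    ※: almost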
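  3. Distributed vs. parallel
    • ※ The definitions are not firmly established.
    • Implementing distributed processing yourself is close to impossible.
      • How do you keep processing going on a cluster where failures and network outages can occur?
        • Re-execution?
      • How do you spread load appropriately across multiple machines with different specs?
        • Appropriate scheduling
      • Who does all of the above?
        • A master ← what if the master goes down?
        • Multi-master?
        • Master election algorithms, consensus algorithms (Raft, Paxos)
        • http://block-chain.jp/blockchain/distributed-consensus-algorithm-protocol/

                      Parallel                      Distributed
    Definition (※)    Parallel processing on a      Parallel processing across
                      single machine                multiple machines
    How to achieve    • Multi-process               • Implement it yourself
                      • Multi-thread                • Implement on a "distributed
                      • CPU instructions              processing platform"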
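  4. Distributed processing platforms
    • Each one is good at different things
      • Batch or streaming
      • Implementation language (Java, Python)
      • Which layer it covers
        • Handles everything, including cluster management
        • Delegates cluster management to another component (on YARN)
    • Implementation differs per framework, so learning costs add up and code is not portable.
    • Operations are hard
      • Debugging the code is hard enough, and the platform needs maintenance too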
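  5. What is Dataflow (Apache Beam)?
    • A fully managed distributed processing service on GCP
    • Can be implemented in Python, Java, (Scala), (Go)
    • Implement once, run on any runtime.
      • The framework itself is published under the name Apache Beam.
      • Anything implemented with Apache Beam can run on a variety of runtimes.
      • Dataflow is, in fact, positioned as one of the "runtimes that Apache Beam runs on".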
  6. I/O
    Language: Java
      File-based: Beam Java supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems.
        FileIO (general-purpose reading, writing, and matching of files), AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO
      Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud PubSub, JMS, MQTT
      Database: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Solr, Elasticsearch (v2.x and v5.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis
    Language: Python
      File-based: Beam Python supports Google Cloud Storage and local filesystems.
        avroio, textio, tfrecordio, vcfio
      Database: Google BigQuery, Google Cloud Datastore
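  7. Transforms
    [Diagram: "Data" collections connected by the transform types ParDo (Map), GroupBy, Side Input, Multiple Output, and Combine. Cardinality labels in the figure: 1, 1, N, M (near Side Input / Multiple Output) and N, 1 (near Combine).]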
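The transform types named on this slide can be sketched in plain Python. This is only a toy, in-memory model under the assumption that a collection is an ordinary iterable; the function names `par_do`, `group_by_key`, and `combine_per_key` are made up for illustration and are not the Apache Beam API.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def par_do(records: Iterable, fn: Callable[[object], Iterable]) -> Iterator:
    """ParDo-like step: fn may emit zero, one, or many outputs per input."""
    for record in records:
        yield from fn(record)


def group_by_key(pairs: Iterable[tuple]) -> dict:
    """GroupBy-like step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def combine_per_key(groups: dict, combiner: Callable[[list], object]) -> dict:
    """Combine-like step: reduce the N values of each key to one value."""
    return {key: combiner(values) for key, values in groups.items()}


lines = ["a b a", "b c"]
# Each line fans out into several (word, 1) pairs.
pairs = list(par_do(lines, lambda line: [(word, 1) for word in line.split()]))
grouped = group_by_key(pairs)
counts = combine_per_key(grouped, sum)
print(counts)
```

In a real pipeline each step would run across many workers; the sketch only shows how the shapes of the data change from step to step.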
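  8. Windowing
    • As a mechanism for abstracting over Streaming and Batch processing, there are the concepts of "Bounded data" and "Unbounded data".
      • Bounded: the start and end of the data are defined (batch)
      • Unbounded: the end is undetermined (streaming)
    • Bounded data can be grouped with GroupBy.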
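  9. Windowing
    • Unbounded data just needs to be converted into Bounded data.
    • "Windowing" is the mechanism that partitions the data to make it Bounded.
    • In fact, batch processing can be regarded as equivalent to streaming over data that all sits in a single Window called the "GlobalWindow".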
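A minimal sketch of the windowing idea, assuming events are `(timestamp_in_seconds, key)` tuples and windows are fixed, non-overlapping 60-second intervals. The names `window_start`, `count_per_window`, and `count_global` and the window size are assumptions made for this example; watermarks, late data, and triggers are ignored entirely.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size for this sketch


def window_start(timestamp: float, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the fixed window that contains the timestamp."""
    return int(timestamp // size) * size


def count_per_window(events, size: int = WINDOW_SECONDS) -> dict:
    """Group events by (window, key), which makes each group bounded."""
    counts = defaultdict(int)
    for timestamp, key in events:
        counts[(window_start(timestamp, size), key)] += 1
    return dict(counts)


def count_global(events) -> dict:
    """Batch view: every event belongs to one single, global window."""
    counts = defaultdict(int)
    for _timestamp, key in events:
        counts[key] += 1
    return dict(counts)


events = [(5, "click"), (42, "click"), (61, "view"), (119, "click"), (120, "view")]
per_window = count_per_window(events)
overall = count_global(events)
print(per_window)
print(overall)
```

The grouping logic is the same in both functions; only the window assignment differs, which is the sense in which batch can be read as streaming with one global window.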
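  10. Points to consider
    • In a distributed environment, no memory space is shared at all
      • Stateful logic takes extra care
      • State shared between workers goes through an external component (memcached, GCS, etc.)
    • Insufficient worker resources
    • A Transform runs at least once per record.
      • It may be re-executed, for example when a worker goes down.
      • If you, say, increment a processed-record counter in an external DB, the count may not match the number of input records.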
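The at-least-once point can be illustrated with a small simulation. This is a sketch under stated assumptions: the two sink classes and `run_with_retry` are hypothetical stand-ins for an external database and a retried unit of work, not part of any Dataflow or Beam API.

```python
class NaiveCounterSink:
    """Increments a counter on every write, so retries inflate the total."""

    def __init__(self):
        self.processed = 0

    def write(self, record_id, value):
        self.processed += 1


class IdempotentSink:
    """Upserts by record ID, so writing a record twice leaves one row."""

    def __init__(self):
        self.rows = {}

    def write(self, record_id, value):
        self.rows[record_id] = value


def run_with_retry(records, sink, fail_after):
    """Write records, simulate a crash after `fail_after` writes,
    then re-run the whole batch from the start (at-least-once)."""
    for index, (record_id, value) in enumerate(records):
        if index == fail_after:
            break  # simulated worker failure in the middle of the batch
        sink.write(record_id, value)
    for record_id, value in records:  # the retry processes every record again
        sink.write(record_id, value)


records = [("r1", 10), ("r2", 20), ("r3", 30)]

naive = NaiveCounterSink()
run_with_retry(records, naive, fail_after=2)

idempotent = IdempotentSink()
run_with_retry(records, idempotent, fail_after=2)

print(naive.processed, len(idempotent.rows))
```

The counter ends up larger than the number of input records, while the keyed upsert converges to one row per record, which is why side effects in a Transform are usually written so that repeating them is harmless.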
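  11. Dataflow tuning
    • There are not many tuning knobs
      • Region (keep data and instances as close as possible)
      • Instance type x "maximum number of workers"
    • Basically, watch the UI and improve the code wherever it is getting stuck.
    • Input/output may become the bottleneck?
      • Raise the PubSub quota
      • Reading a large number of GCS files was slow.
        • Combine them before reading
        • Has this been improved recently?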
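The few knobs listed here map onto pipeline options in the Beam Python SDK. The sketch below only builds the argument list; it does not launch a job. The flag names (`--runner`, `--project`, `--region`, `--temp_location`, `--machine_type`, `--max_num_workers`) are ones the SDK accepts, but every value is a placeholder assumed for illustration, `flags_to_dict` is a made-up helper, and the names should be checked against the SDK version in use.

```python
# All values below are hypothetical placeholders for illustration.
dataflow_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",            # hypothetical project ID
    "--region=asia-northeast1",            # choose the region closest to the data
    "--temp_location=gs://my-bucket/tmp",  # hypothetical bucket
    "--machine_type=n1-standard-4",        # instance type for the workers
    "--max_num_workers=20",                # upper bound on the number of workers
]


def flags_to_dict(args):
    """Turn ["--key=value", ...] into {"key": "value", ...} for inspection."""
    return dict(arg.lstrip("-").split("=", 1) for arg in args)


options = flags_to_dict(dataflow_args)
print(options["machine_type"], options["max_num_workers"])
```

A list like this is typically passed to the pipeline's option parser when the job is submitted.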
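  12. Tuning - Fusion -
    • Fusion
      • Dataflow has a mechanism that analyzes the pipeline structure, fuses multiple Transforms, and executes them as a single Transform.
      • For fused Transforms, the wrapper PCollection is not materialized and the input data flows straight into the next Transform, which improves performance.
      • A fused chain of Transforms is executed as a single Transform.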
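Fusion can be pictured with Python generators. This is an analogy only, not how Dataflow is implemented: `parse` and `double` are made-up steps, and the `log` list exists just to make the execution order visible.

```python
def parse(lines, log):
    """First step: turn each text line into an integer."""
    for line in lines:
        log.append(f"parse {line}")
        yield int(line)


def double(numbers, log):
    """Second step: double each number."""
    for number in numbers:
        log.append(f"double {number}")
        yield number * 2


def run_unfused(lines):
    """The intermediate collection is fully materialized between steps."""
    log = []
    parsed = list(parse(lines, log))
    result = list(double(parsed, log))
    return result, log


def run_fused(lines):
    """Each record flows straight from one step into the next."""
    log = []
    result = list(double(parse(lines, log), log))
    return result, log


unfused_result, unfused_log = run_unfused(["1", "2"])
fused_result, fused_log = run_fused(["1", "2"])
print(unfused_log)
print(fused_log)
```

Both variants produce the same result; the fused one never builds the intermediate list, which is the saving described on the slide.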
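  13. Tuning - Fusion -
    • Fusion is basically harmless, but when the pipeline contains a ParDo where "number of input records" <<< "number of output records" (a "high fan-out" ParDo), processing can get stuck in the downstream Transforms.
    • As a result of fusion, the intermediate Transforms are not distributed. ※ The number of workers assigned is decided from the number of input records.
    [Diagram: a chain of four Transforms executed as one "Fusioned Transform", next to "the distribution originally intended"; labels along the chain: record: 1, record: 10000, record: 1]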
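  14. Tuning - Fusion -
    • For example, a ParDo that takes an array of file names as input and outputs the records inside those files is dangerous (and common).
    • Workarounds
      • Insert a GroupBy -> UnGroup step right after the high fan-out ParDo
      • Pass the PCollection output by one step into the next step as a "Side Input" (a Side Input is always materialized)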
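The effect of the GroupBy -> UnGroup workaround can be sketched as a toy load model. Everything here is an assumption for illustration: `NUM_WORKERS`, the fake `read_file` step, the file name `a.csv`, and the round-robin sharding are invented, and real worker assignment in Dataflow is more involved. The Beam Python SDK also ships a `Reshuffle` transform that is commonly used for this kind of redistribution.

```python
from collections import Counter

NUM_WORKERS = 4  # hypothetical cluster size


def read_file(name: str) -> list:
    """Stand-in for a high fan-out step: one file name becomes many records."""
    return [f"{name}:{i}" for i in range(10_000)]


def fused_load(file_names) -> Counter:
    """Fused case: every record stays on the worker that expanded its file."""
    load = Counter()
    for index, name in enumerate(file_names):
        worker = index % NUM_WORKERS
        load[worker] += len(read_file(name))
    return load


def reshuffled_load(file_names) -> Counter:
    """GroupBy -> ungroup after the fan-out: records are regrouped under a
    synthetic key, so downstream work is spread over all workers."""
    load = Counter()
    position = 0
    for name in file_names:
        for _record in read_file(name):
            load[position % NUM_WORKERS] += 1  # synthetic key: round-robin shard
            position += 1
    return load


fused = fused_load(["a.csv"])
spread = reshuffled_load(["a.csv"])
print(dict(fused))
print(dict(spread))
```

With a single input file, the fused model puts all 10,000 records on one worker, while the regrouped model splits them evenly, which mirrors the "record: 1 / record: 10000" picture on the previous slide.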