
Distributed Processing with Spark / 2015-01-16 PyData.Tokyo#3

shunsukeaihara

January 17, 2015

Transcript

  1. Distributed Processing with Spark
    (and Distributed Processing in Python)
    Gunosy Inc.
    Shunsuke Aihara

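  2. About me
    • Shunsuke Aihara (http://argmax.jp) @shunsukeaihara
    • Manager at Gunosy
    • Responsible for overall development of the ad delivery system and for R&D
    • Specialty: computational linguistics
    • Fond of Python and asynchronous distributed systems
    • Has written various libraries for image processing, audio signal processing, and so on
    • https://bitbucket.org/aihara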

  3. Agenda
    • Spark overview
    • Distributed processing (and Spark)
    • Spark use cases at Gunosy
    • The distributed processing ecosystem in Python

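  4. About Spark (1)
    • An in-memory distributed processing system that integrates with the Hadoop ecosystem (HDFS, Mesos, YARN, etc.)
    • A distributed programming environment for Resilient Distributed Datasets (RDDs), a fault-tolerant distributed data structure
    • Parallel computations applied to an RDD are written as chains of higher-order functions and run from Scala or Python
    • RDDs are an immutable data structure
    • The elements of an RDD are distributed and replicated across the memory of the cluster
    • Damaged or lost data is restored from the persisted source data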

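  5. About Spark (2)
    • The following are implemented on top of the distributed processing foundation for RDDs
    • Data stream processing (Spark Streaming)
    • Distributed SQL (SparkSQL)
    • Distributed machine learning library (MLlib)
    • Distributed graph processing library (GraphX)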

  6. Distributed processing (and Spark)

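  7. Key points of large-scale distributed data processing
    • Cluster management
    • Automated distributed placement of data
    • Speedup through data replication and parallel reads
    • Computation that preserves data locality
    • Fault tolerance / resending and recomputation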

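  8. The road to Hadoop
    • Implementing complex parallel processing on your own with message passing is hard
    • Skeleton parallel programming (Cole, 1989)
    • A functional programming framework, with multiple implementations, for building various kinds of parallel processing compositionally by combining frequently occurring parallel computation patterns
    • Data-parallel skeletons (map, fold/reduce, filter, zip…)
    • Computation patterns that apply the same operation to different parts of the data at the same time
    • Task-parallel skeletons (pipe, farm…)
    • Patterns that apply a computation to each item of a data stream and return a data stream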
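The skeletons named on this slide can be illustrated on a single machine with plain Python. The sketch below is only an illustration of the patterns, not code from the talk: the helper names `pipe` and `farm` and the sample values are assumptions, and a thread pool stands in for a set of remote workers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Data-parallel skeletons: the same operation applied to different parts of the data.
values = [1, 2, 3, 4, 5, 6]
squares = list(map(lambda x: x * x, values))          # map
evens = list(filter(lambda x: x % 2 == 0, squares))   # filter
total = reduce(lambda a, b: a + b, evens, 0)          # fold/reduce
pairs = list(zip(values, squares))                    # zip


def pipe(stream, *stages):
    """Task-parallel 'pipe' (hypothetical helper): pass each stream item through the stages in order."""
    for stage in stages:
        stream = map(stage, stream)
    return stream


def farm(worker, stream, n_workers=4):
    """Task-parallel 'farm' (hypothetical helper): hand stream items to a pool of identical workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, stream))


piped = list(pipe(iter(values), lambda x: x + 1, lambda x: x * 10))
farmed = farm(lambda x: x * x, values)
```

Because each skeleton takes the computation as a function argument, the same program can in principle be run on one core or on many machines by swapping the implementation of the skeleton, which is the idea Spark's RDD operations build on.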

  9. Skeleton parallel programming
    Zhenjiang Hu, Hideya Iwasaki, "Skeleton Parallel Programming", Joho Shori (IPSJ Magazine), Vol. …, No. …, pp. …

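  10. Distributed processing before Hadoop
    • Implemented with MPI or grid shells
    • You manage data placement yourself
    • Assumes you place the data yourself in shared memory or on a shared file system
    • Placing huge datasets is very tedious
    • Fault tolerance has to be guaranteed by your own implementation
    • Handling data that does not fit in memory is also difficult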

  11. T-shirts message@WOMPAT2001
    “Life is too short for MPI.”

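  12. What Hadoop solved
    • Specialized for large-scale data rather than scientific computing
    • Automatically manages the placement of huge datasets and the execution of processing
    • Automatic distributed placement on HDFS, with map processing run where the data is placed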

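  13. New needs after Hadoop
    • Hadoop and Hive are throughput-oriented batch systems
    • Data scientists' needs have moved toward interactive analysis and real-time processing
    • Waiting several hours after launching a job is painful
    • In exchange for high reliability, Hadoop and Hive make disk I/O on intermediate data the bottleneck
    • Memory capacity per server has also grown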

  14. Products after Hadoop
    • In-memory acceleration of Hive
    • Real-time processing of stream data
    • Fast aggregation across multiple data sources / DBs
    • Optimized task execution to achieve low latency

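  15. Spark
    • A general-purpose distributed programming environment
    • A skeleton parallel programming environment built on RDDs
    • Achieves low-latency distributed computation by using in-memory RDDs
    • Data that does not fit in memory is saved to disk
    • Machine learning and stream data processing are realized by combining operations on RDDs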

  16. Basic operations on RDDs
    • The higher-order functions of Scala's Seq, plus some extras, run in a distributed fashion
    • map, flatMap, filter, sort, union, zip
    • reduce, fold, reduceByKey, groupBy, groupByKey, count, cogroup, cross
    • join, leftOuterJoin, rightOuterJoin
    • sample, take, first, partitionBy, mapWith, pipe, save
    • etc.
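As a rough single-process illustration of one of these operations, the sketch below mimics what `reduceByKey` does conceptually: values are combined per key inside each partition first, and the partial results are then merged. The function `reduce_by_key` and the sample partitions are hypothetical and written in plain Python; this is not Spark's implementation.

```python
from itertools import chain


def reduce_by_key(partitions, func):
    """Combine values per key inside each partition, then merge the partial results."""
    def combine(pairs):
        acc = {}
        for key, value in pairs:
            acc[key] = func(acc[key], value) if key in acc else value
        return acc

    partials = [combine(part) for part in partitions]                  # local combine per partition
    return combine(chain.from_iterable(p.items() for p in partials))   # merge across partitions


# Two hypothetical partitions of text lines, expanded to (word, 1) pairs as flatMap + map would do.
partitions = [["spark rdd", "spark"], ["rdd hadoop"]]
pairs = [[(word, 1) for line in part for word in line.split()] for part in partitions]
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

Combining inside each partition before merging keeps the amount of data that has to move between machines small, which is why `reduceByKey` is usually preferred over `groupByKey` followed by a reduction.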

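  17. Data locality of RDDs
    • Where and in what order tasks run is managed as an optimal DAG representation, based on where the data source is placed
    [Diagram: HDFS, RDD, four parallel map tasks, RDD, Reduce, RDD]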

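  18. Fault tolerance of RDDs
    • Each element of an RDD records the path by which it was generated
    [Diagram: HDFS -> map -> map, with the result marked ☓ damaged; HDFS -> map -> map -> map]
    When the data is needed again, it is regenerated from the data source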
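The idea of recording how data was generated and replaying it after a loss can be sketched in a few lines of plain Python. `ToyRDD`, `lose`, and `read_source` below are hypothetical names for illustration only; real RDDs track lineage per partition and are far more elaborate.

```python
class ToyRDD:
    """Immutable toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, func=None):
        self._source = source    # callable that re-reads the durable source data
        self._parent = parent    # parent ToyRDD, if this one was derived
        self._func = func        # transformation applied to the parent's elements
        self._cache = None       # in-memory copy; may be lost

    def map(self, func):
        return ToyRDD(parent=self, func=func)   # returns a new dataset; the parent is untouched

    def collect(self):
        if self._cache is None:                 # missing or lost: recompute from the lineage
            if self._parent is None:
                self._cache = list(self._source())
            else:
                self._cache = [self._func(x) for x in self._parent.collect()]
        return list(self._cache)

    def lose(self):
        """Simulate a failure that drops the in-memory copy."""
        self._cache = None


reads = []


def read_source():
    reads.append("read")          # record each time the durable source is read
    return [1, 2, 3]


base = ToyRDD(source=read_source)
doubled = base.map(lambda x: x * 2)
result = doubled.map(lambda x: x + 1)

first = result.collect()
for rdd in (base, doubled, result):
    rdd.lose()                    # every in-memory copy is gone
second = result.collect()         # rebuilt by replaying the lineage from the source
```

Only the recipe (source plus the chain of functions) has to survive a failure, not the computed data itself.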

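  19. About Spark (2)
    • The following are implemented on top of the distributed processing foundation for RDDs
    • Data stream processing (Spark Streaming)
    • Distributed SQL (SparkSQL)
    • Distributed machine learning library (MLlib)
    • Distributed graph processing library (GraphX)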

  20. PySpark + IPython Notebook
    • PySpark can be run on IPython
    • On AWS, a cluster can be built with a single command
    • Run Spark on EMR (with YARN support)
    • http://qiita.com/shunsukeaihara/items/1524b66579e91d1cf7cf

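  21. Spark use cases at Gunosy
    • Regular batch jobs are handled with fluentd -> Redshift
    • Ad hoc log analysis goes through Fluentd -> S3 -> Spark
    • Large numbers of files on S3 can be processed with little effort
    [Diagram: API server, Spark on AWS EMR, Redshift Cluster]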

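  22. Spark use cases at Gunosy (1)
    • For example, searching CloudTrail logs for the credentials in use and revoking them…
    • Load a large number of JSON files and run HiveQL on them

    # sc is the SparkContext provided by the PySpark shell
    import pyspark.sql

    # S3 URI scheme and wildcard are assumed; the slide preserves only the path components
    data = sc.textFile("s3://bucket_name/path/*.gz")
    hive = pyspark.sql.HiveContext(sc)
    ht = hive.jsonRDD(data)
    ht.registerTempTable("traills")
    hive.cacheTable("traills")  # cacheTable is a method of the SQL context
    hive.sql("""SELECT DISTINCT record.userIdentity.accessKeyId
                FROM traills LATERAL VIEW explode(Records) s AS record""")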
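To make the query easier to read, here is a plain-Python sketch of what `LATERAL VIEW explode(Records)` followed by `SELECT DISTINCT` computes: each log file holds one JSON object with a `Records` array, the array is flattened into one row per record, and the distinct access key IDs are collected. The sample lines and key values are made up, and `distinct_access_keys` is a hypothetical helper.

```python
import json

# Two hypothetical CloudTrail-style log lines; the key values are made up.
lines = [
    '{"Records": [{"userIdentity": {"accessKeyId": "KEY_A"}},'
    ' {"userIdentity": {"accessKeyId": "KEY_B"}}]}',
    '{"Records": [{"userIdentity": {"accessKeyId": "KEY_A"}}]}',
]


def distinct_access_keys(json_lines):
    keys = set()
    for line in json_lines:
        for record in json.loads(line)["Records"]:      # LATERAL VIEW explode(Records)
            key = record.get("userIdentity", {}).get("accessKeyId")
            if key is not None:
                keys.add(key)                             # SELECT DISTINCT
    return sorted(keys)


access_keys = distinct_access_keys(lines)
```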

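  23. Spark use cases at Gunosy (2)
    • Gender classification from users' article logs
    • For each user, the list of IDs of clicked articles is saved to S3 as CSV
    • Weight the features with TF-IDF

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext()
    # S3 URI scheme and wildcards are assumed; the slide preserves only the path components
    male = sc.textFile("s3://bucket/path/male_*.gz")
    female = sc.textFile("s3://bucket/path/female_*.gz")
    tf = HashingTF(numFeatures=...)  # the value is not legible in the slide text
    # CSV input, so a comma delimiter is assumed
    male = male.map(lambda x: tf.transform(x.split(",")))
    female = female.map(lambda x: tf.transform(x.split(",")))
    idf = IDF()
    idf_model = idf.fit(male.union(female))
    male = idf_model.transform(male)
    female = idf_model.transform(female)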
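The weighting step can be sketched without Spark. The code below is a minimal plain-Python illustration of the hashing trick plus the smoothed IDF formula idf = log((m + 1) / (df + 1)); `hashing_tf`, `fit_idf`, `tfidf`, the article IDs, and the feature count are all hypothetical, and `zlib.crc32` is used as a stable stand-in for the hash function that a real library would choose.

```python
import math
import zlib


def hashing_tf(terms, num_features):
    """Map each term to a bucket with a hash and count occurrences (sparse dict)."""
    vec = {}
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] = vec.get(index, 0.0) + 1.0
    return vec


def fit_idf(tf_vectors):
    """idf = log((m + 1) / (df + 1)): m documents, df documents containing the bucket."""
    m = len(tf_vectors)
    df = {}
    for vec in tf_vectors:
        for index in vec:
            df[index] = df.get(index, 0) + 1
    return {index: math.log((m + 1.0) / (count + 1.0)) for index, count in df.items()}


def tfidf(vec, idf):
    return {index: tf * idf.get(index, 0.0) for index, tf in vec.items()}


# Hypothetical click logs: one comma-separated list of article IDs per user.
num_features = 1 << 20
logs = ["a1,a2,a2", "a2,a3"]
tf_vectors = [hashing_tf(line.split(","), num_features) for line in logs]
idf = fit_idf(tf_vectors)
weighted = [tfidf(vec, idf) for vec in tf_vectors]
```

An article clicked by every user (here `a2`) gets weight zero, so only articles that distinguish users contribute to the classifier's features.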

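  24. Spark use cases at Gunosy (2)
    • Gender classification from users' article logs
    • Convert to LabeledPoint and train/classify with logistic regression

    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    # The label digits are not legible in the slide text; 0 = male, 1 = female is an assumption
    male = male.map(lambda x: LabeledPoint(0, x))
    female = female.map(lambda x: LabeledPoint(1, x))
    training = male.union(female)
    training.cache()
    model = LogisticRegressionWithSGD.train(training)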

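  25. Spark use cases at Gunosy (2)
    • Gender classification from users' article logs
    • Predict from lists whose first element is the user ID, followed by article IDs

    def parse(x):
        data = [i for i in x.split(",")]
        return LabeledPoint(data[0], data[1:])

    # S3 URI scheme and wildcard are assumed; the slide preserves only the path components
    unknown = sc.textFile("s3://bucket/path/unknown_*.gz").map(lambda x: x.split(","))
    user_ids = unknown.map(lambda x: x[0])
    # IDFModel.transform calls into the JVM, so it is applied to the whole RDD rather than inside a lambda
    features = idf_model.transform(tf.transform(unknown.map(lambda x: x[1:])))
    user_ids.zip(features).map(lambda x: (x[0], model.predict(x[1]))).collect()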

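  26. PySpark is easy to get started with, but…
    • Python functions are pickled and then executed in a distributed fashion, so many things are slow
    • To use a Java library (kuromoji, etc.) you need a Scala wrapper plus a py4j wrapper
    • From Scala, such libraries can be used as they are
    • I tried hard but gave up. py4j is simply painful
    • For the level of Scala that Spark requires, the learning cost is low
    • That said, sbt is a hassle too…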

  27. Distributed processing environments for Python

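  28. Distributed processing libraries for Python
    • For cluster computing
    • PyRC, dispy, Pyro4 (used as the distributed backend for Gensim's LSI and LDA)
    • Distributed task queues
    • Celery: adding a decorator is enough to make individual functions asynchronous and distributed
    • IPython Cluster: for simple task distribution
    • Spartan: distributes NumPy arrays over ZeroMQ (inspired by Spark's RDDs)
    • Disco: a MapReduce framework written in Python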

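  29. Gunosy's Python distributed processing environment
    • When integrating machine learning into services, task parallelism (parallel stream processing) is what matters, and naive distributed processing is usually good enough (e.g. Jubatus)
    • On AWS, basically all data is accumulated in S3
    • Task management and retries are left to Celery (AMQP)
    • Worker deployment is fully automated with Chef + OpsWorks
    • Online learning is distributed with iterative parameter mixing
    • The EM algorithm is distributed by partitioning the data horizontally, computing parameters independently, and averaging them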
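Iterative parameter mixing can be sketched briefly: every worker trains on its own shard starting from the shared weights, the resulting weights are averaged, and the average is sent back out for the next round. The code below is a minimal single-process sketch using a perceptron as the online learner; the function names, the toy shards, and the choice of learner are assumptions for illustration, not Gunosy's implementation.

```python
def perceptron_epoch(weights, shard):
    """One perceptron pass over a shard, starting from the given weights."""
    w = list(weights)
    for features, label in shard:                 # label is +1 or -1
        score = sum(wi * xi for wi, xi in zip(w, features))
        if label * score <= 0:                    # misclassified: update the weights
            w = [wi + label * xi for wi, xi in zip(w, features)]
    return w


def iterative_parameter_mixing(shards, dim, epochs=10):
    w = [0.0] * dim
    for _ in range(epochs):
        # Each shard would be trained on a separate worker; here they run sequentially.
        local = [perceptron_epoch(w, shard) for shard in shards]
        # Mix: average the per-shard weights and redistribute them for the next epoch.
        w = [sum(ws) / len(local) for ws in zip(*local)]
    return w


# Hypothetical toy data: (bias, x) features, split into two shards.
shards = [
    [([1.0, 2.0], 1), ([1.0, -2.0], -1)],
    [([1.0, 3.0], 1), ([1.0, -1.0], -1)],
]
weights = iterative_parameter_mixing(shards, dim=2)
```

In a setup like the one on this slide, each call to `perceptron_epoch` would correspond to a task handed to a worker, with only the weight vectors travelling between the controller and the workers.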

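  30. Gunosy's Python distributed processing environment
    • Article collection and per-user recommendation are spread across workers
    [Diagram: article crawler (celery worker), recommendation engine (celery worker), article click log, controller (django celery)]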

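  31. Summary
    • The essence of Spark is the RDD data structure and a general-purpose parallel programming environment based on skeleton parallelism
    • Being able to use distributed processing and distributed machine learning easily from Python is convenient
    • However, doing anything complex from Python is really tough, so write it in Scala
    • If you really want Python, consider the other parts of the Python distributed processing ecosystem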