Slide 1

Slide 1 text

Distributed Processing with Spark (and Distributed Processing in Python)
Gunosy Inc. Shunsuke Aihara

Slide 2

Slide 2 text

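About me
• Shunsuke Aihara (http://argmax.jp) @shunsukeaihara
• Manager at Gunosy
• Responsible for overall development of the ad delivery system and for R&D
• Specialty: computational linguistics
• Likes Python and asynchronous distributed systems
• Has written various libraries for image processing, audio signal processing, and more
• https://bitbucket.org/aihara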

Slide 3

Slide 3 text

Agenda
• Spark overview
• Distributed processing (and Spark)
• Spark use cases at Gunosy
• The distributed processing ecosystem in Python

Slide 4

Slide 4 text

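About Spark (1)
• An in-memory distributed processing system that works with the Hadoop ecosystem (HDFS, Mesos, YARN, etc.)
• A distributed programming environment built on Resilient Distributed Datasets (RDDs), a fault-tolerant distributed data structure
• Parallel computations on an RDD are written as chains of higher-order functions and run from Scala or Python
• RDDs are immutable data structures
• The elements of an RDD are distributed and replicated across the memory of the cluster
• Damaged or lost data is restored from the persisted source data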
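To make the "chain of higher-order functions over an immutable, partitioned dataset" idea concrete, here is a minimal local sketch in plain Python. `ToyRDD` is a hypothetical stand-in written for this illustration; it is not Spark's API and does no distribution.

```python
from functools import reduce


class ToyRDD:
    """Immutable, partitioned collection; a local stand-in for an RDD."""

    def __init__(self, partitions):
        # Store partitions as tuples so they cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    def map(self, f):
        # Returns a NEW dataset; the receiver is left untouched.
        return ToyRDD([f(x) for x in p] for p in self._partitions)

    def filter(self, pred):
        return ToyRDD([x for x in p if pred(x)] for p in self._partitions)

    def reduce(self, op):
        # Reduce each non-empty partition on its own, then combine the partials.
        partials = [reduce(op, p) for p in self._partitions if p]
        return reduce(op, partials)

    def collect(self):
        return [x for p in self._partitions for x in p]


numbers = ToyRDD([[1, 2, 3], [4, 5], [6]])
total = (numbers.map(lambda x: x * x)
                .filter(lambda x: x % 2 == 0)
                .reduce(lambda a, b: a + b))
```

Each call returns a new dataset, so `numbers` still holds its original elements after the chain has run.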

Slide 5

Slide 5 text

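About Spark (2)
• The following are implemented on top of the distributed processing platform for RDDs
  • Data stream processing (Spark Streaming)
  • Distributed SQL (Spark SQL)
  • Distributed machine learning library (MLlib)
  • Distributed graph processing library (GraphX)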

Slide 6

Slide 6 text

Distributed processing (and Spark)

Slide 7

Slide 7 text

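The essentials of large-scale distributed data processing
• Cluster management
• Automated distributed placement of data
• Speedup through data replication and parallel reads
• Computation that preserves data locality
• Fault tolerance / resend and recompute handling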

Slide 8

Slide 8 text

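The road to Hadoop
• Implementing complex parallel processing yourself with message passing is hard work
• Skeletal parallel programming (Cole, 1989)
  • A functional programming framework, with several implementations, that builds a wide range of parallel programs compositionally from combinations of frequently occurring parallel computation patterns
  • Data-parallel skeletons (map, fold/reduce, filter, zip…)
    • Patterns that apply the same operation simultaneously to different parts of the data
  • Task-parallel skeletons (pipe, farm…)
    • Patterns that take a stream of data and return a stream with a computation applied to each element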
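The skeleton idea can be sketched in a few lines of Python. This is an illustration only: the names `par_map`, `par_reduce`, and `pipe` are made up for this sketch, a thread pool stands in for a cluster, and `pipe` shows only the shape of the composition (its stages run lazily in a single thread).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def par_map(f, chunks, pool):
    """Data-parallel map skeleton: apply f to every element, one task per chunk."""
    return list(pool.map(lambda chunk: [f(x) for x in chunk], chunks))


def par_reduce(op, chunks, pool):
    """Data-parallel reduce skeleton: reduce each chunk, then combine the partials.

    op must be associative for the result to match a sequential reduce.
    """
    partials = list(pool.map(lambda chunk: reduce(op, chunk), chunks))
    return reduce(op, partials)


def pipe(*stages):
    """Task-parallel pipe skeleton: feed a stream through the stages in order."""
    def run(stream):
        for stage in stages:
            stream = map(stage, stream)
        return stream
    return run


chunks = [[1, 2, 3], [4, 5, 6], [7, 8]]
with ThreadPoolExecutor(max_workers=3) as pool:
    squared = par_map(lambda x: x * x, chunks, pool)
    sum_of_squares = par_reduce(lambda a, b: a + b, squared, pool)

pipeline = pipe(lambda x: x + 1, lambda x: x * 10)
streamed = list(pipeline(iter([1, 2, 3])))
```

The caller only composes patterns; how the chunks are scheduled is hidden inside the skeletons, which is the property Hadoop and Spark later exploit at cluster scale.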

Slide 9

Slide 9 text

Skeletal parallel programming
Zhenjiang Hu and Hideya Iwasaki, "Skeletal Parallel Programming", Joho Shori (IPSJ Magazine), Vol. …, No. …, pp. …

Slide 10

Slide 10 text

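Distributed processing before Hadoop
• Implemented with MPI or a grid shell
• Data placement was managed by hand
  • It was assumed that you placed data yourself in shared memory or on a shared file system
  • Placing huge datasets was very tedious
• Fault tolerance had to be guaranteed by your own implementation
• Handling data that does not fit in memory was also difficult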

Slide 11

Slide 11 text

T-shirts message@WOMPAT2001 “Life is too short for MPI.”

Slide 12

Slide 12 text

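What Hadoop solved
• Specialized for large-scale data rather than scientific computing
• Automatically manages the placement of huge datasets and the execution of processing
• Automatic distributed placement in HDFS, with the map phase run where the data is placed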

Slide 13

Slide 13 text

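New needs after Hadoop
• Hadoop / Hive are throughput-oriented batch systems
• Data scientists' needs are moving toward interactive analysis and real-time processing
  • Waiting several hours after submitting a job is painful
• In Hadoop and Hive, disk I/O for intermediate data becomes the bottleneck, the price paid for high reliability
• Memory capacity per server has also grown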

Slide 14

Slide 14 text

Products after Hadoop
• In-memory acceleration of Hive
• Real-time stream data processing
• Fast aggregation across multiple data sources / DBs
• Low latency achieved by optimizing task execution

Slide 15

Slide 15 text

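Spark
• A general-purpose distributed programming environment
• A skeletal parallel programming environment built on RDDs
• Achieves low-latency distributed computation by using in-memory RDDs
  • Whatever does not fit in memory is stored on disk
• Machine learning and stream data processing are realized by combining operations on RDDs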

Slide 16

Slide 16 text

Basic operations on RDDs
• The higher-order functions of Scala's Seq, plus some extras, run in distributed fashion
• map, flatMap, filter, sort, union, zip
• reduce, fold, reduceByKey, groupBy, groupByKey, count, cogroup, cross
• join, leftOuterJoin, rightOuterJoin
• sample, take, first, partitionBy, mapWith, pipe, save
• etc.
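For readers unfamiliar with the key-based operations in this list, the following plain-Python sketch shows what `reduceByKey`, `groupByKey`, and `join` compute on a local list of (key, value) pairs. It illustrates the semantics only; the function names and sample data are made up here, and the output ordering (sorted by key) is a choice of this sketch, not something Spark guarantees.

```python
from collections import defaultdict


def reduce_by_key(pairs, op):
    """Merge the values of each key with op, like reduceByKey."""
    out = {}
    for k, v in pairs:
        out[k] = op(out[k], v) if k in out else v
    return sorted(out.items())


def group_by_key(pairs):
    """Collect all values of each key into a list, like groupByKey."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())


def join(left, right):
    """Inner join on the key: one output pair per matching (left, right) value."""
    right_groups = dict(group_by_key(right))
    return [(k, (lv, rv)) for k, lv in left for rv in right_groups.get(k, [])]


clicks = [("u1", 3), ("u2", 1), ("u1", 4)]
genders = [("u1", "f"), ("u3", "m")]

totals = reduce_by_key(clicks, lambda a, b: a + b)
grouped = group_by_key(clicks)
joined = join(clicks, genders)
```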

Slide 17

Slide 17 text

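Data locality of RDDs
• Where and in what order tasks run is managed as an optimal DAG representation derived from where the data sources are placed
[Figure labels: HDFS, RDD, RDD, map (×4), RDD, Reduce]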

Slide 18

Slide 18 text

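Fault tolerance of RDDs
• Each element of an RDD records the path by which it was generated
[Figure: HDFS → map → map, with the result marked ✕ (damaged); below it, HDFS → map → map → map]
When the data is needed again, it is regenerated from the data source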
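The lineage mechanism can be imitated locally as follows. This is a toy sketch under stated assumptions: `LineageDataset`, `read_block`, and the in-memory `blocks` dictionary are hypothetical names standing in for an RDD, an HDFS read, and the durable source data.

```python
class LineageDataset:
    """Toy dataset that remembers how it was derived instead of replicating results."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable: partition index -> list (durable data)
        self._lineage = tuple(lineage)   # functions applied, in order, to each element
        self._cache = {}                 # partition index -> computed list (may be lost)

    def map(self, f):
        return LineageDataset(self._source, self._lineage + (f,))

    def partition(self, i):
        if i not in self._cache:
            data = self._source(i)       # go back to the durable source
            for f in self._lineage:      # replay the recorded transformations
                data = [f(x) for x in data]
            self._cache[i] = data
        return self._cache[i]

    def lose(self, i):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(i, None)


reads = []
blocks = {0: [1, 2], 1: [3, 4]}


def read_block(i):
    reads.append(i)                      # record every access to the source
    return list(blocks[i])


ds = LineageDataset(read_block).map(lambda x: x + 1).map(lambda x: x * 2)
first = ds.partition(1)
ds.lose(1)
second = ds.partition(1)
```

After the simulated loss, only partition 1 is re-read and recomputed; partition 0 is never touched.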

Slide 19

Slide 19 text

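About Spark (2)
• The following are implemented on top of the distributed processing platform for RDDs
  • Data stream processing (Spark Streaming)
  • Distributed SQL (Spark SQL)
  • Distributed machine learning library (MLlib)
  • Distributed graph processing library (GraphX)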

Slide 20

Slide 20 text

PySpark + IPython Notebook
• PySpark can run on IPython
• On AWS, a cluster can be built with a single command line
  • Running Spark on EMR (with YARN support)
  • http://qiita.com/shunsukeaihara/items/1524b66579e91d1cf7cf

Slide 21

Slide 21 text

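Gunosy's Spark use cases
• Periodic batch jobs are handled with Fluentd -> Redshift
• Ad hoc log analysis goes through Fluentd -> S3 -> Spark
• Large numbers of files on S3 can be processed with little effort
[Figure labels: API server, Spark on AWS EMR, Redshift Cluster]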

Slide 22

Slide 22 text

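Gunosy's Spark use cases (1)
• For example, finding the credentials in use from CloudTrail logs and revoking them…
• Load a large number of JSON files and run HiveQL

    import pyspark.sql

    # sc is the SparkContext provided by the PySpark shell / notebook
    data = sc.textFile("s3://bucket_name/path/*.gz")
    hive = pyspark.sql.HiveContext(sc)
    ht = hive.jsonRDD(data)
    ht.registerTempTable("traills")
    hive.cacheTable("traills")  # cacheTable is a method of the context, not of the table
    hive.sql("SELECT DISTINCT record.userIdentity.accessKeyId "
             "FROM traills LATERAL VIEW explode(Records) s AS record")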
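What the `LATERAL VIEW explode(Records)` query computes can be shown without a cluster. The sketch below assumes the CloudTrail layout of a top-level `Records` array whose entries carry `userIdentity.accessKeyId`. The function name and the sample documents are made up for illustration, and records without a key are skipped rather than reported as NULL.

```python
import json


def distinct_access_keys(documents):
    """Return the set of accessKeyId values found in CloudTrail-style JSON documents."""
    keys = set()
    for doc in documents:
        # "explode": one row per element of the top-level Records array
        for record in json.loads(doc).get("Records", []):
            key = record.get("userIdentity", {}).get("accessKeyId")
            if key is not None:
                keys.add(key)
    return keys


# Made-up sample documents (the key IDs are placeholders, not real credentials).
docs = [
    '{"Records": [{"userIdentity": {"accessKeyId": "AKIAEXAMPLE1"}},'
    ' {"userIdentity": {"accessKeyId": "AKIAEXAMPLE2"}}]}',
    '{"Records": [{"userIdentity": {"accessKeyId": "AKIAEXAMPLE1"}},'
    ' {"userIdentity": {"type": "AWSService"}}]}',
]
keys = distinct_access_keys(docs)
```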

Slide 23

Slide 23 text

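Gunosy's Spark use cases (2)
• Gender classification from users' article logs
• For each user, the list of IDs of clicked articles is saved to S3 as CSV
• Weighted with TF-IDF

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext()
    male = sc.textFile("s3://bucket/path/male_*.gz")
    female = sc.textFile("s3://bucket/path/female_*.gz")
    tf = HashingTF(numFeatures=…)  # value lost in extraction
    male = male.map(lambda x: tf.transform(x.split(",")))
    female = female.map(lambda x: tf.transform(x.split(",")))
    idf = IDF()
    idf_model = idf.fit(male.union(female))
    male = idf_model.transform(male)
    female = idf_model.transform(female)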
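As a rough illustration of the weighting step, here is a plain-Python sketch of hashed term frequencies followed by IDF scaling. It uses the smoothed formula idf = log((m + 1) / (df + 1)) documented for MLlib's IDF. Everything else is an assumption made for this example: CRC32 as the hash function, the helper names, the feature count, and the sample rows.

```python
import math
import zlib


def hashing_tf(terms, num_features):
    """Map a list of terms to a sparse {index: count} term-frequency vector."""
    vec = {}
    for term in terms:
        # CRC32 is used only to keep the example deterministic (an assumption).
        idx = zlib.crc32(term.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def fit_idf(tf_vectors):
    """Compute idf[index] = log((m + 1) / (df + 1)) over m documents."""
    m = len(tf_vectors)
    df = {}
    for vec in tf_vectors:
        for idx in vec:
            df[idx] = df.get(idx, 0) + 1
    return {idx: math.log((m + 1) / (count + 1)) for idx, count in df.items()}


def tfidf(vec, idf):
    # Indices never seen while fitting get weight 0.
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in vec.items()}


# Each row is a comma-separated list of clicked article IDs (made-up data).
rows = ["101,102,103", "101,104", "101,102"]
tf_vectors = [hashing_tf(row.split(","), num_features=1 << 16) for row in rows]
idf = fit_idf(tf_vectors)
weighted = [tfidf(vec, idf) for vec in tf_vectors]
```

An article that every user clicked (here `101`) gets weight zero, so it carries no information for the classifier, while rarer articles keep a positive weight.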

Slide 24

Slide 24 text

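Gunosy's Spark use cases (2)
• Gender classification from users' article logs
• Convert to LabeledPoint, then train and classify with logistic regression

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    male = male.map(lambda x: LabeledPoint(…, x))      # label value lost in extraction
    female = female.map(lambda x: LabeledPoint(…, x))  # label value lost in extraction
    training = male.union(female)
    training.cache()
    model = LogisticRegressionWithSGD.train(training)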
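The training step can be pictured with a small single-machine sketch of logistic regression fitted by stochastic gradient descent on sparse feature dictionaries. This is not MLlib's implementation: the function names, step size, epoch count, absence of a bias term, and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(weights, features):
    return sum(weights.get(i, 0.0) * v for i, v in features.items())


def train_logistic_sgd(examples, epochs=50, step=0.5, seed=0):
    """examples: list of (label in {0, 1}, sparse feature dict). Returns weights."""
    rng = random.Random(seed)
    examples = list(examples)
    weights = {}
    for _ in range(epochs):
        rng.shuffle(examples)
        for label, features in examples:
            # Gradient of the log loss with respect to the score.
            error = sigmoid(dot(weights, features)) - label
            for i, v in features.items():
                weights[i] = weights.get(i, 0.0) - step * error * v
    return weights


def predict(weights, features):
    return 1 if sigmoid(dot(weights, features)) >= 0.5 else 0


# Toy data: feature 0 occurs only with label 1, feature 3 only with label 0.
training = [
    (1, {0: 1.0, 1: 1.0}),
    (1, {0: 1.0, 2: 1.0}),
    (0, {3: 1.0, 1: 1.0}),
    (0, {3: 1.0, 4: 1.0}),
]
weights = train_logistic_sgd(training)
```

Calling `predict(weights, {0: 1.0})` then yields class 1 and `predict(weights, {3: 1.0})` yields class 0, because each of those features is only ever pushed in one direction during training.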

Slide 25

Slide 25 text

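Gunosy's Spark use cases (2)
• Gender classification from users' article logs
• Estimate from rows whose first field is the user ID and whose remaining fields are article IDs

    def parse(x):
        data = …  # right-hand side lost in extraction
        return LabeledPoint(data[0], data[1:])

    unknown = sc.textFile("s3://bucket/path/unknown_*.gz")
    unknown = unknown.map(lambda x: x.split(","))
    # IDFModel.transform is applied to the whole RDD, not inside map()
    features = idf_model.transform(unknown.map(lambda x: tf.transform(x[1:])))
    user_ids = unknown.map(lambda x: x[0])
    user_ids.zip(features).map(lambda x: (x[0], model.predict(x[1]))).collect()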

Slide 26

Slide 26 text

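PySpark is easy, but…
• Python functions are pickled for distributed execution, so many things are slow
• To use Java libraries (kuromoji, etc.), you need a Scala wrapper plus a py4j wrapper
  • From Scala they can be used as-is
  • I tried hard but gave up. py4j is simply painful
• For Spark-level usage, the learning cost of Scala is low
  • That said, sbt is a hassle too…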

Slide 27

Slide 27 text

Distributed processing environments in Python

Slide 28

Slide 28 text

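Distributed processing libraries for Python
• For cluster computing
  • PyRC, dispy, Pyro4 (used as the distributed backend for Gensim's LSI and LDA)
• Distributed task queues
  • Celery: asynchronous, distributed execution per function just by adding a decorator
• IPython Cluster: for simple task distribution
• Spartan: distributes NumPy arrays over ZeroMQ (inspired by Spark's RDD)
• Disco: a MapReduce framework written in Python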

Slide 29

Slide 29 text

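Gunosy's Python distributed processing environment
• For integrating machine learning into services, task parallelism (parallel stream processing) is what matters, and naive distributed processing is usually good enough (ex. Jubatus)
• On AWS, basically all data is accumulated in S3
• Task management and retries are left to Celery (AMQP)
• Worker deployment is fully automated with Chef + OpsWorks
• Online learning is distributed with iterative parameter mixing
• The EM algorithm is distributed by sharding the data horizontally and averaging the independently computed parameters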
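Iterative parameter mixing can be sketched as follows: every epoch, each shard trains from the current shared weights, and the resulting weight vectors are averaged. The example uses a perceptron as the online learner and runs the shards in a plain loop. The learner, the function names, and the toy data are assumptions for illustration, not Gunosy's actual setup.

```python
def perceptron_epoch(weights, shard):
    """One pass of perceptron updates over a shard; returns a new weight list."""
    w = list(weights)
    for features, label in shard:            # label is +1 or -1
        score = sum(wi * xi for wi, xi in zip(w, features))
        if label * score <= 0:               # misclassified (or on the boundary)
            w = [wi + label * xi for wi, xi in zip(w, features)]
    return w


def iterative_parameter_mixing(shards, dim, epochs=10):
    """Each epoch: train every shard from the shared weights, then average."""
    mixed = [0.0] * dim
    for _ in range(epochs):
        # In a real deployment each call would run on a separate worker.
        local = [perceptron_epoch(mixed, shard) for shard in shards]
        mixed = [sum(ws) / len(local) for ws in zip(*local)]
    return mixed


def classify(weights, features):
    return 1 if sum(w * x for w, x in zip(weights, features)) > 0 else -1


# Toy data: the label is the sign of the first feature; the last feature is a bias input.
shards = [
    [((2.0, 1.0, 1.0), 1), ((-1.0, 0.5, 1.0), -1)],
    [((1.5, -1.0, 1.0), 1), ((-2.0, -0.5, 1.0), -1)],
]
weights = iterative_parameter_mixing(shards, dim=3)
```

The same "compute independently per shard, then average" structure is what the last bullet describes for the EM algorithm, with the per-shard parameter estimates taking the place of the perceptron weights.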

Slide 30

Slide 30 text

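Gunosy's Python distributed processing environment
• Article collection and per-user recommendation are fanned out to workers
[Figure labels: article crawler (celery worker), recommendation engine (celery worker), article click log, controller (django-celery)]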

Slide 31

Slide 31 text

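Summary
• The essence of Spark is the RDD data structure and a general-purpose parallel programming environment based on skeletal parallelism
• From Python, distributed processing and distributed machine learning are easy to use and convenient
• But doing anything complex from Python is really painful, so write it in Scala
• If you really must use Python, consider the other distributed processing ecosystems for Python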