Distributed Processing with Spark (and Distributed Processing in Python)
Gunosy Inc.
Shunsuke Aihara
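Self-introduction
• Shunsuke Aihara (http://argmax.jp) @shunsukeaihara
• Manager at Gunosy
• In charge of overall development of the ad delivery system and of R&D
• Specialty: computational linguistics
• Fond of Python and asynchronous distributed systems
• Has written various libraries for image processing and audio signal processing
• https://bitbucket.org/aihara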
Agenda
• Spark overview
• History of distributed processing (and Spark)
• Spark use cases at Gunosy
• The distributed processing ecosystem in Python
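About Spark (1)
• An in-memory distributed processing system that integrates with the Hadoop ecosystem (HDFS, MESOS, YARN)
• A distributed programming environment built on Resilient Distributed Datasets (RDDs), a fault-tolerant distributed data structure
• Parallel computations on RDDs are written as chains of higher-order functions and run from Scala or Python
• RDDs are immutable data structures
• RDD elements are distributed and replicated across the memory of the cluster
• Damaged or lost data is restored from the persisted original data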
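About Spark (2)
• The following are implemented on top of the distributed processing layer for RDDs:
  • Data stream processing (Spark Streaming)
  • Distributed SQL (SparkSQL)
  • Distributed machine learning library (MLlib)
  • Distributed graph processing library (GraphX)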
History of distributed processing (and Spark)
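Key points of large-scale distributed data processing
• Cluster management
• Automated distributed placement of data
• Speedup through data replication and parallel reads
• Computation that preserves data locality
• Fault tolerance / retransmission and recomputation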
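Up to Hadoop
• Implementing complex parallel processing by hand with message passing is hard
• Skeleton parallel programming (Cole, 1989)
  • A functional-programming framework, with several implementations, that builds diverse parallel computations constructively by combining frequently occurring parallel computation patterns
• Data-parallel skeletons (map, fold/reduce, filter, zip…)
  • Computation patterns that apply the same operation simultaneously to different parts of the data
• Task-parallel skeletons (pipe, farm…)
  • Patterns that take a stream of data and return a data stream with a computation applied to each item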
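As a rough illustration of the two skeleton families on this slide, the sketch below writes one computation as a composition of map, filter, and reduce (data-parallel skeletons) and another as a pipe of generator stages (a task-parallel skeleton). It runs sequentially in plain Python. The helper names `pipe`, `parse_stage`, and `square_stage` are hypothetical and only show the programming style, not a parallel runtime.

```python
from functools import reduce


def sum_of_even_squares(values):
    """Data-parallel style: the same operation is applied to every element."""
    squares = map(lambda v: v * v, values)            # map skeleton
    evens = filter(lambda v: v % 2 == 0, squares)     # filter skeleton
    return reduce(lambda acc, v: acc + v, evens, 0)   # reduce/fold skeleton


def pipe(source, *stages):
    """Task-parallel style: chain stages, each turning one stream into another."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


def parse_stage(lines):
    # Hypothetical stage: convert text items to integers.
    for line in lines:
        yield int(line)


def square_stage(numbers):
    # Hypothetical stage: square each item as it flows through.
    for n in numbers:
        yield n * n


result = sum_of_even_squares(range(1, 7))
piped = list(pipe(["1", "2", "3"], parse_stage, square_stage))
```

Because each skeleton only fixes the shape of the computation, a runtime is free to run the per-element work on different machines; that is the property Hadoop and Spark build on.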
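Skeleton parallel programming
Source: Zhenjiang Hu and Hideya Iwasaki, "Skeleton Parallel Programming", Joho Shori (IPSJ Magazine), Vol. …, No. …, pp. …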
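Distributed processing before Hadoop
• Implemented with MPI or grid shells
• Data placement had to be managed by hand
• The premise was that you placed data yourself on shared memory or a shared file system
• Placing huge datasets was very tedious
• Fault tolerance had to be guaranteed by a custom implementation
• Handling data that does not fit in memory was difficult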
T-shirt message @ WOMPAT 2001
“Life is too short for MPI.”
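What Hadoop solved
• Specialized for large-scale data rather than scientific computing
• Placement of huge datasets and execution of the processing are managed automatically
• Automatic distributed placement on HDFS, with the map phase executed where the data is placed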
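New needs after Hadoop
• Hadoop / Hive are throughput-oriented batch systems
• Data scientists need interactive analysis and real-time processing
  • Waiting on long-running jobs is painful
• Hadoop and Hive secure high reliability at the price of disk I/O for intermediate data, which becomes the bottleneck
• Memory capacity per server has grown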
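Products after Hadoop
• Speeding up Hive with in-memory processing
• Real-time processing of streaming data
• Fast aggregation across multiple data sources / DBs
• Low latency achieved by optimizing task execution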
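Spark
• A general-purpose distributed programming environment
• A skeleton parallel programming environment built on RDDs
• Achieves low-latency distributed computation by using in-memory RDDs
  • Data that does not fit in memory is stored on disk
• Machine learning and stream data processing are realized by combining operations on RDDs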
Basic operations on RDDs
• The higher-order functions of Scala's Seq, plus some extras, executed in a distributed fashion
• map, flatMap, filter, sort, union, zip
• reduce, fold, reduceByKey, groupBy, groupByKey, count, cogroup, cross
• join, leftOuterJoin, rightOuterJoin
• sample, take, first, partitionBy, mapWith, pipe, save
• etc.
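To make the key-based operations concrete, here is a plain-Python sketch of what `reduceByKey` computes. Values are first combined per key inside each partition, and the partial results are then merged across partitions. The functions `combine_partition` and `reduce_by_key` and the list-of-lists "partitions" are hypothetical stand-ins for illustration. This is not Spark's implementation, and nothing here runs on a cluster.

```python
def combine_partition(pairs, func):
    """Combine values per key within a single partition."""
    partial = {}
    for key, value in pairs:
        partial[key] = func(partial[key], value) if key in partial else value
    return partial


def reduce_by_key(partitions, func):
    """Merge the per-partition partial results into one result per key."""
    merged = {}
    for part in partitions:
        for key, value in combine_partition(part, func).items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Two made-up partitions of (word, 1) pairs, as in a word count.
partitions = [
    [("spark", 1), ("python", 1), ("spark", 1)],
    [("python", 1), ("scala", 1)],
]
counts = reduce_by_key(partitions, lambda a, b: a + b)
```

Combining inside each partition first means only one value per key has to cross the network, which is why `func` needs to be associative.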
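Data locality of RDDs
• Where and in what order tasks run is managed as an optimized DAG, based on where the data sources are placed
[Figure: diagram labeled HDFS, RDD, RDD, map ×4, RDD, Reduce]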
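Fault tolerance of RDDs
• Each element of an RDD records the path (lineage) by which it was generated
• When damaged data is needed again, it is regenerated from the data source
[Figure: HDFS, map, map, ✕ damaged; HDFS, map, map, map]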
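The lineage idea can be sketched in a few lines of plain Python. The class `LineageDataset` below is a hypothetical toy, not Spark code. It stores only a reference to the durable source plus the list of transformations, so a lost in-memory result can be rebuilt by replaying them.

```python
class LineageDataset:
    """Toy dataset that remembers how it was derived instead of replicating data."""

    def __init__(self, source, transforms=()):
        self.source = source                  # callable that re-reads the durable source
        self.transforms = tuple(transforms)   # recorded lineage
        self._cache = None                    # in-memory result, may be lost

    def map(self, func):
        # Immutable style: return a new dataset with one more lineage step.
        return LineageDataset(self.source, self.transforms + (func,))

    def compute(self):
        if self._cache is None:
            data = list(self.source())
            for func in self.transforms:
                data = [func(item) for item in data]
            self._cache = data
        return self._cache

    def lose_cache(self):
        """Simulate a node failure that drops the in-memory data."""
        self._cache = None


reads = []


def read_source():
    # Stand-in for reading from durable storage such as HDFS.
    reads.append(1)
    return [1, 2, 3]


dataset = LineageDataset(read_source).map(lambda v: v * 2).map(lambda v: v + 1)
first = dataset.compute()
dataset.lose_cache()
recovered = dataset.compute()   # replays the lineage from the source
```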
PySpark + IPython Notebook
• PySpark can be run on IPython
• On AWS, a cluster can be built with a single command line
• Running Spark on EMR (YARN support)
• http://qiita.com/shunsukeaihara/items/1524b66579e91d1cf7cf
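Gunosy's Spark use cases
• Periodic batch jobs are processed with fluentd -> Redshift
• Ad hoc log analysis goes through Fluentd -> S3 -> Spark
• Large numbers of files on S3 can be processed with little effort
[Figure: API server, Spark on AWS EMR, Redshift Cluster]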
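Gunosy's Spark use cases (1)
• For example, finding the credentials that are actually in use from CloudTrail logs so they can be revoked…
• Load a large number of JSON files and run HiveQL on them

    import pyspark.sql

    # sc is the SparkContext provided by the PySpark shell.
    data = sc.textFile("s3://bucket_name/path/*.gz")
    hive = pyspark.sql.HiveContext(sc)
    ht = hive.jsonRDD(data)
    ht.registerTempTable("traills")
    hive.cacheTable("traills")
    hive.sql("SELECT DISTINCT record.userIdentity.accessKeyId "
             "FROM traills LATERAL VIEW explode(Records) s AS record")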
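For readers unfamiliar with `LATERAL VIEW explode`, the following plain-Python sketch mirrors what the query on this slide computes. Each log file holds one JSON object whose `Records` array is flattened into individual records, and the distinct `userIdentity.accessKeyId` values are collected. The sample documents and the function name `distinct_access_keys` are made up for illustration. Only the field names `Records`, `userIdentity`, and `accessKeyId` come from the slide.

```python
import json


def distinct_access_keys(json_documents):
    """Collect the distinct access key IDs found in CloudTrail-style documents."""
    keys = set()
    for document in json_documents:
        log = json.loads(document)
        for record in log.get("Records", []):       # corresponds to explode(Records)
            identity = record.get("userIdentity") or {}
            key = identity.get("accessKeyId")
            if key is not None:                     # this sketch skips records without a key
                keys.add(key)                       # the set gives DISTINCT semantics
    return keys


# Made-up sample input: two log files, one record without an access key.
sample_documents = [
    json.dumps({"Records": [
        {"userIdentity": {"accessKeyId": "KEY_A"}},
        {"userIdentity": {"accessKeyId": "KEY_B"}},
    ]}),
    json.dumps({"Records": [
        {"userIdentity": {"accessKeyId": "KEY_A"}},
        {"userIdentity": {}},
    ]}),
]
used_keys = distinct_access_keys(sample_documents)
```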
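Gunosy's Spark use cases (2)
• Gender classification from users' article logs
• For each user, the list of IDs of clicked articles is saved to S3 as CSV
• Weighted with TF-IDF

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext()
    male = sc.textFile("s3://bucket/path/male_*.gz")
    female = sc.textFile("s3://bucket/path/female_*.gz")
    tf = HashingTF(numFeatures=…)
    # Hash each user's clicked article IDs into a term-frequency vector.
    male = male.map(lambda x: tf.transform(x.split(",")))
    female = female.map(lambda x: tf.transform(x.split(",")))
    # Fit IDF weights on both classes together, then reweight each class.
    idf = IDF()
    idf_model = idf.fit(male.union(female))
    male = idf_model.transform(male)
    female = idf_model.transform(female)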
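The weighting step can be illustrated without a cluster. The sketch below is a minimal plain-Python version of hashed term frequencies followed by IDF weighting, under some assumptions: the bucket index comes from `zlib.crc32` (chosen here only for determinism; MLlib uses its own hash), the IDF formula is the smoothed `log((m + 1) / (df + 1))`, and the rows of article IDs are invented. The function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Map each token to a bucket index and count occurrences (sparse dict)."""
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)
    return dict(counts)


def fit_idf(tf_vectors):
    """Per-bucket IDF weights, log((m + 1) / (df + 1)), over m documents."""
    m = len(tf_vectors)
    doc_freq = Counter(index for vec in tf_vectors for index in vec)
    return {index: math.log((m + 1) / (df + 1)) for index, df in doc_freq.items()}


def apply_idf(tf_vector, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {index: count * idf.get(index, 0.0) for index, count in tf_vector.items()}


# Made-up rows: one user per line, comma-separated clicked article IDs.
rows = ["101,102,103", "101,104", "101,102"]
tf_vectors = [hashing_tf(row.split(","), num_features=1024) for row in rows]
idf = fit_idf(tf_vectors)
weighted = [apply_idf(vec, idf) for vec in tf_vectors]
```

An article clicked by every user (here `101`) gets weight zero, so the classifier is driven by the articles that actually differ between users.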
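Gunosy's Spark use cases (2)
• Gender classification from users' article logs
• Convert to LabeledPoint, then train and classify with logistic regression

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    # Attach a class label to each TF-IDF vector.
    male = male.map(lambda x: LabeledPoint(…, x))
    female = female.map(lambda x: LabeledPoint(…, x))
    training = male.union(female)
    training.cache()
    model = LogisticRegressionWithSGD.train(training)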
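Gunosy's Spark use cases (2)
• Gender classification from users' article logs
• Predict from lists whose first element is the user ID and whose remaining elements are article IDs

    def parse(x):
        data = [i for i in x.split(",")]
        return LabeledPoint(data[0], data[1:])

    unknown = sc.textFile("s3://bucket/path/unknown_*.gz")
    unknown = unknown.map(lambda x: x.split(","))
    user_ids = unknown.map(lambda x: x[0])
    # The IDF model is applied to the whole RDD rather than inside a lambda.
    features = idf_model.transform(tf.transform(unknown.map(lambda x: x[1:])))
    user_ids.zip(features.map(lambda v: model.predict(v))).collect()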
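PySpark is easy to get started with, but…
• Python functions are pickled and executed in a distributed fashion, which causes various problems
• Using a Java library (kuromoji) requires a Scala wrapper plus a py4j wrapper
  • From Scala it can be used as-is
  • I tried hard but gave up; py4j is simply painful
• For Spark-level usage, the cost of learning Scala is low
  • That said, sbt is a hassle…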
Distributed processing environments in Python
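Distributed processing libraries for Python
• For cluster computing
  • PyRC, dispy, Pyro4 (used as the backend for distributed LSI and LDA in Gensim)
• Distributed task queues
  • Celery: makes individual functions asynchronous and distributed just by adding a decorator
• IPython Cluster: for simple task distribution
• Spartan: distributes NumPy arrays over ZeroMQ (inspired by Spark's RDDs)
• Disco: a MapReduce framework written in Python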
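Gunosy's Python distributed processing environment
• For integrating machine learning with services, task parallelism (parallel stream processing) is what matters, and naive distributed processing is usually good enough (ex. Jubatus)
• On AWS, essentially all data is accumulated in S3
• Task management and retries are left to Celery (AMQP)
• Worker deployment is fully automated with Chef + OpsWorks
• Online learning is distributed with iterative parameter mixing
• The EM algorithm is distributed by partitioning the data horizontally and averaging the parameters computed independently on each partition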
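The last two bullets share one idea: train independently on each data shard, then average the parameters. Below is a minimal sketch of iterative parameter mixing in plain Python. The choice of a perceptron as the online learner, the toy data, and all function names are assumptions made for illustration, since the slide does not say which learner is used. In production each shard would be handled by a separate worker (for example a Celery task), whereas here the shards are processed in a simple loop.

```python
def perceptron_epoch(weights, shard):
    """One pass of perceptron updates over a shard; returns new weights."""
    w = list(weights)
    for features, label in shard:                 # label is +1 or -1
        score = sum(wi * xi for wi, xi in zip(w, features))
        if label * score <= 0:                    # misclassified: move toward the example
            w = [wi + label * xi for wi, xi in zip(w, features)]
    return w


def average(weight_vectors):
    """Mix the parameters by taking their element-wise mean."""
    n = len(weight_vectors)
    return [sum(column) / n for column in zip(*weight_vectors)]


def iterative_parameter_mixing(shards, dim, rounds):
    weights = [0.0] * dim
    for _ in range(rounds):
        # Every shard starts from the same mixed weights and trains independently.
        local = [perceptron_epoch(weights, shard) for shard in shards]
        weights = average(local)
    return weights


def predict(weights, features):
    return 1 if sum(w * x for w, x in zip(weights, features)) > 0 else -1


# Made-up, linearly separable toy data: (features incl. bias term, label).
shards = [
    [([1.0, 2.0, 1.0], 1), ([1.0, -1.5, 1.0], -1)],
    [([0.5, 1.0, 1.0], 1), ([2.0, -2.0, 1.0], -1)],
]
weights = iterative_parameter_mixing(shards, dim=3, rounds=5)
```

Only the weight vectors travel between workers and the mixer, so the training data can stay partitioned where it is.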
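Gunosy's Python distributed processing environment
• Article collection and per-user recommendation are fanned out to workers
[Figure: article crawler (celery worker), recommendation engine (celery worker), article click logs, controller (django celery)]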
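Summary
• The heart of Spark is the RDD data structure together with a general-purpose, skeleton-based parallel programming environment
• Being able to use distributed processing and distributed machine learning easily from Python is convenient
• But doing anything complex from Python gets really painful, so write it in Scala
• If it really has to be Python, consider the other parts of Python's distributed processing ecosystem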