Distributed Processing with Spark / 2015-01-16 PyData.Tokyo#3

shunsukeaihara

January 17, 2015

Transcript

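  1. Self-introduction
     • Shunsuke Aihara (http://argmax.jp) @shunsukeaihara
     • Manager at Gunosy
     • Responsible for overall development of the ad delivery system and for R&D
     • Specialty: computational linguistics
     • Likes Python and asynchronous distributed systems
     • Has written various libraries for image processing, audio signal processing, and so on
     • https://bitbucket.org/aihara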
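  2. About Spark (1)
     • An in-memory distributed processing system that integrates with the Hadoop ecosystem (HDFS, Mesos, YARN, etc.)
     • A distributed programming environment built on Resilient Distributed Datasets (RDDs), a fault-tolerant distributed data structure
     • Parallel computations on an RDD are written as chains of higher-order functions and run from Scala or Python
     • RDDs are immutable data structures
     • The elements of an RDD are distributed and replicated across the memory of the cluster
     • Damaged or lost data is restored from the persisted source data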
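
The combination of immutability, chained higher-order functions, and recovery from the source data can be illustrated with a small toy in plain Python. This is only a sketch under stated assumptions: `ToyRDD` is a hypothetical class invented here, it runs on one machine, and it is not how Spark is implemented. It records the chain of transformations and recomputes from the source on every `collect`, as if the in-memory data had been lost.

```python
from functools import reduce


class ToyRDD:
    """Immutable toy dataset: a source loader plus a recorded chain of transformations."""

    def __init__(self, load_source, steps=()):
        self._load_source = load_source   # callable that re-reads the persisted source
        self._steps = tuple(steps)        # recorded transformations, never mutated

    def map(self, fn):
        # Return a new object; the receiver is left unchanged.
        return ToyRDD(self._load_source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._load_source, self._steps + (("filter", pred),))

    def collect(self):
        # Recompute from the source each time, as if in-memory data had been lost.
        data = iter(self._load_source())
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def reduce(self, fn):
        return reduce(fn, self.collect())


source = [1, 2, 3, 4, 5, 6]               # stands in for persisted data such as a file
numbers = ToyRDD(lambda: list(source))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = evens_squared.reduce(lambda a, b: a + b)
```

Because every transformation returns a new object, `numbers` still yields the original six elements after `evens_squared` has been derived from it.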
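  3. The road to Hadoop
     • Implementing complex parallel processing yourself with message passing is hard work
     • Skeletal parallel programming (Cole, 1989)
       • A functional programming framework, with several implementations, that builds a wide range of parallel computations compositionally by combining frequently occurring parallel computation patterns
     • Data-parallel skeletons (map, fold/reduce, filter, zip…)
       • Computation patterns that apply the same operation simultaneously to different parts of the data
     • Task-parallel skeletons (pipe, farm…)
       • Patterns that take a stream of data and return a stream with a computation applied to each element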
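
The two families of skeletons can be sketched with the Python standard library. This is an illustration of how the patterns compose, not of any particular skeleton framework: the helper names `pipe`, `farm`, `strip_stage`, and `upper_stage` are made up here, the data-parallel part runs sequentially, and only `farm` actually uses a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Data-parallel skeletons: the same operation applied to different parts of the data.
words = ["spark", "hadoop", "spark", "yarn"]
lengths = list(map(len, words))                           # map
long_words = list(filter(lambda w: len(w) > 4, words))    # filter
total_length = reduce(lambda a, b: a + b, lengths, 0)     # fold/reduce
paired = list(zip(words, lengths))                        # zip


# Task-parallel skeleton "pipe": stages that each consume and produce a stream.
def pipe(*stages):
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


def strip_stage(stream):
    for line in stream:
        yield line.strip()


def upper_stage(stream):
    for line in stream:
        yield line.upper()


# Task-parallel skeleton "farm": a pool of workers applies one task to each stream item.
def farm(worker, stream, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, stream))   # results come back in input order


piped = list(pipe(strip_stage, upper_stage)([" a ", "b  "]))
farmed = farm(lambda x: x * x, [1, 2, 3])
```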
  4. Basic operations on RDDs
     • The higher-order functions of Scala's Seq, plus a few extras, executed in a distributed fashion
     • map, flatMap, filter, sort, union, zip
     • reduce, fold, reduceByKey, groupBy, groupByKey, count, cogroup, cross
     • join, leftOuterJoin, rightOuterJoin
     • sample, take, first, partitionBy, mapWith, pipe, save
     • etc.
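
For readers unfamiliar with the key-based operations in this list, the following sketch shows what `reduceByKey` and `join` compute, using plain Python lists of `(key, value)` pairs. The functions `reduce_by_key` and `join` are local stand-ins written for this note, not Spark APIs, and the sample data is hypothetical.

```python
from collections import defaultdict


def reduce_by_key(pairs, fn):
    """Local stand-in for reduceByKey: merge the values that share a key using fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())


def join(left, right):
    """Local stand-in for join: inner join of two lists of (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]


clicks = [("u1", 1), ("u2", 1), ("u1", 1)]              # hypothetical click events
click_counts = reduce_by_key(clicks, lambda a, b: a + b)
genders = [("u1", "male"), ("u3", "female")]            # hypothetical labels
joined = join(click_counts, genders)
```

Only `u1` appears on both sides, so the inner join keeps a single pair; `leftOuterJoin` would additionally keep `u2` with a missing right-hand value.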
  5. PySpark + IPython Notebook
     • PySpark can run on IPython
     • On AWS, a cluster can be built with a single command
     • Running Spark on EMR (with YARN support)
       • http://qiita.com/shunsukeaihara/items/1524b66579e91d1cf7cf
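  6. Spark use cases at Gunosy
     • Scheduled batch jobs are handled with Fluentd -> Redshift
     • Ad hoc log analysis goes through Fluentd -> S3 -> Spark
     • Large numbers of files on S3 can be processed with little effort
     [Diagram labels: API server, Spark on AWS EMR, Redshift Cluster]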
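  7. Spark use cases at Gunosy (1)
     • Finding the credentials in use from CloudTrail logs and shutting them down, for example
     • Load a large number of JSON files and run HiveQL on them

       import pyspark.sql

       data = sc.textFile("s3://bucket_name/path/*.gz")
       hive = pyspark.sql.HiveContext(sc)
       ht = hive.jsonRDD(data)
       ht.registerTempTable("traills")
       hive.cacheTable("traills")  # cacheTable is called on the context, by table name
       hive.sql("SELECT DISTINCT record.userIdentity.accessKeyId "
                "FROM traills LATERAL VIEW explode(Records) s AS record")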
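
What the HiveQL query computes can be checked on a small sample without a cluster. The sketch below is a single-machine equivalent in plain Python, assuming each input line is one CloudTrail log document with a top-level `Records` array; the function name `distinct_access_keys` and the sample key IDs are made up for illustration.

```python
import json


def distinct_access_keys(json_lines):
    """Collect the distinct userIdentity.accessKeyId values from CloudTrail documents."""
    keys = set()
    for line in json_lines:
        document = json.loads(line)
        for record in document.get("Records", []):       # the explode(Records) step
            key = record.get("userIdentity", {}).get("accessKeyId")
            if key is not None:
                keys.add(key)                             # the DISTINCT step
    return sorted(keys)


# Hypothetical sample documents; the key IDs are invented.
sample = [
    json.dumps({"Records": [
        {"userIdentity": {"accessKeyId": "AKIAEXAMPLE1"}},
        {"userIdentity": {"accessKeyId": "AKIAEXAMPLE2"}},
    ]}),
    json.dumps({"Records": [
        {"userIdentity": {"accessKeyId": "AKIAEXAMPLE1"}},
        {"userIdentity": {"type": "AWSService"}},         # record without an access key
    ]}),
]
access_keys = distinct_access_keys(sample)
```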
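  8. Spark use cases at Gunosy (2)
     • Gender classification from users' article logs
     • For each user, the list of IDs of clicked articles is saved to S3 as CSV
     • Weighted with TF-IDF

       from pyspark import SparkContext
       from pyspark.mllib.feature import HashingTF, IDF

       sc = SparkContext()
       male = sc.textFile("s3://bucket/path/male_*.gz")
       female = sc.textFile("s3://bucket/path/female_*.gz")
       tf = HashingTF(numFeatures=...)  # feature count not legible
       # Hash each user's clicked article IDs into a term-frequency vector
       male = male.map(lambda x: tf.transform(x.split(",")))
       female = female.map(lambda x: tf.transform(x.split(",")))
       # Fit IDF on both classes together, then reweight each class
       idf = IDF()
       idf_model = idf.fit(male.union(female))
       male = idf_model.transform(male)
       female = idf_model.transform(female)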
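
The weighting step can be sketched in plain Python to show what hashed term frequencies and IDF weights look like. This is a minimal sketch under assumptions: `zlib.crc32` stands in for the hash function (the hash MLlib uses is not reproduced here), the IDF uses the smoothed form log((m + 1) / (df + 1)) that the MLlib documentation gives, and the helper names and click logs are hypothetical.

```python
import math
import zlib
from collections import Counter


def hashing_tf(terms, num_features):
    """Map each term to a bucket by hashing and count the occurrences per bucket."""
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in terms)
    return dict(counts)                  # sparse vector: {bucket index: term frequency}


def fit_idf(vectors):
    """Inverse document frequency per bucket: log((m + 1) / (df + 1))."""
    m = len(vectors)
    df = Counter(index for vec in vectors for index in vec)
    return {index: math.log((m + 1.0) / (count + 1.0)) for index, count in df.items()}


def tf_idf(vec, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {index: tf * idf.get(index, 0.0) for index, tf in vec.items()}


# Hypothetical click logs: one list of article IDs per user.
docs = [["101", "102", "101"], ["102", "103"]]
tf_vectors = [hashing_tf(d, num_features=1024) for d in docs]
idf = fit_idf(tf_vectors)
weighted = [tf_idf(v, idf) for v in tf_vectors]
```

Article `102` was clicked by both users, so its bucket receives an IDF of log(3/3) = 0 and drops out of the weighted vectors, while articles seen by only one user keep a positive weight.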
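  9. Spark use cases at Gunosy (2)
     • Gender classification from users' article logs
     • Convert to LabeledPoint, then train and classify with logistic regression

       from pyspark.mllib.classification import LogisticRegressionWithSGD
       from pyspark.mllib.regression import LabeledPoint

       # Attach a class label to each TF-IDF vector
       male = male.map(lambda x: LabeledPoint(..., x))      # class label not legible
       female = female.map(lambda x: LabeledPoint(..., x))  # class label not legible
       training = male.union(female)
       training.cache()
       model = LogisticRegressionWithSGD.train(training)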
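
To make the training step concrete, here is a minimal logistic regression trained with stochastic gradient descent in plain Python. It is a sketch of the general technique, not MLlib's implementation: there is no bias term, regularization, or mini-batching, the step size and epoch count are arbitrary, and the choice of which class gets label 1 in the hypothetical data is an assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(points, num_features, step=0.5, epochs=100):
    """points: list of (label, sparse_features) pairs with label in {0, 1}."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for label, features in points:
            margin = sum(weights[i] * v for i, v in features.items())
            error = sigmoid(margin) - label      # derivative of the log loss w.r.t. margin
            for i, v in features.items():
                weights[i] -= step * error * v   # one SGD step on this example
    return weights


def predict(weights, features):
    margin = sum(weights[i] * v for i, v in features.items())
    return 1 if sigmoid(margin) > 0.5 else 0


# Hypothetical training data as sparse {feature index: weight} vectors.
training = [
    (1, {0: 1.0}),
    (0, {1: 1.0}),
    (1, {0: 1.0, 2: 0.5}),
    (0, {1: 1.0, 2: 0.5}),
]
weights = train_logistic_sgd(training, num_features=3)
```

Feature 0 only ever appears in class 1 and feature 1 only in class 0, so their weights move in opposite directions, and `predict` separates the two classes on those features.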
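  10. Spark use cases at Gunosy (2)
     • Gender classification from users' article logs
     • Predict from lists whose first element is the user ID, followed by article IDs

       def parse(x):
           # First field is the user ID; the remaining fields are article IDs
           data = [int(i) for i in x.split(",")]
           return LabeledPoint(data[0], data[1:])

       unknown = sc.textFile("s3://bucket/path/unknown_*.gz")
       unknown = unknown.map(lambda x: tf.transform(x.split(",")))
       unknown = unknown.map(lambda x: (x[0], idf_model.transform(tf.transform(x[1]))))
       # Pair each user ID with the predicted class
       unknown.map(lambda x: (x[0], model.predict(x[1]))).collect()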
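  11. Distributed processing libraries for Python
     • For cluster computing
       • PyRC, dispy, Pyro4 (used as the backend for distributing Gensim's LSI and LDA)
     • Distributed task queues
       • Celery: functions become asynchronous distributed tasks just by adding a decorator
     • IPython Cluster: for simple task distribution
     • Spartan: distributes NumPy arrays over ZeroMQ (inspired by Spark's RDD)
     • Disco: a MapReduce framework written in Python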
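  12. Gunosy's distributed processing environment for Python
     • For integrating machine learning into services, task parallelism (parallel stream processing) is what matters, and naive distributed processing is usually sufficient (e.g. Jubatus)
     • On AWS, essentially all data is accumulated in S3
     • Task management and retries are left to Celery (AMQP)
     • Worker deployment is fully automated with Chef + OpsWorks
     • Online learning is distributed with iterative parameter mixing
     • The EM algorithm is distributed by partitioning the data horizontally, computing parameters independently on each partition, and averaging them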
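
Iterative parameter mixing can be sketched in a few lines: every shard trains for one pass starting from the shared weights, the per-shard weights are averaged, and the average becomes the starting point for the next round. The slides do not say which online learner Gunosy uses, so a perceptron is swapped in here as an assumption; the shards run one after another rather than on separate workers, and the data is hypothetical.

```python
def perceptron_epoch(weights, shard):
    """One perceptron pass over a single shard, starting from the shared weights."""
    w = list(weights)
    for label, features in shard:                 # label is +1 or -1
        score = sum(wi * xi for wi, xi in zip(w, features))
        if label * score <= 0:                    # misclassified: move toward the example
            w = [wi + label * xi for wi, xi in zip(w, features)]
    return w


def iterative_parameter_mixing(shards, num_features, rounds=5):
    weights = [0.0] * num_features
    for _ in range(rounds):
        # Each shard would run on its own worker; here they run sequentially.
        local = [perceptron_epoch(weights, shard) for shard in shards]
        # Mix: uniform average of the per-shard parameters.
        weights = [sum(column) / len(local) for column in zip(*local)]
    return weights


# Hypothetical, linearly separable data split horizontally into two shards.
shards = [
    [(+1, [1.0, 0.0]), (-1, [0.0, 1.0])],
    [(+1, [2.0, 0.5]), (-1, [0.5, 2.0])],
]
mixed = iterative_parameter_mixing(shards, num_features=2)
```

On this data the first round produces per-shard weights `[1.0, -1.0]` and `[1.5, -1.5]`, whose average already classifies every example correctly, so later rounds leave it unchanged. The averaging step is the same operation the last bullet describes for distributed EM.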