Distributed Processing with Spark (Sparkによる分散処理) / 2015-01-16 PyData.Tokyo#3
shunsukeaihara
January 17, 2015
Transcript
Distributed Processing with Spark (and Distributed Processing in Python)
Gunosy Inc., Shunsuke Aihara
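About me
• Shunsuke Aihara (http://argmax.jp) @shunsukeaihara
• Manager at Gunosy
• In charge of overall development of the ad delivery system and of R&D
• Specialty: computational linguistics
• Fond of Python and asynchronous distributed systems
• Has written various libraries for image processing and audio signal processing
• https://bitbucket.org/aihara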
Agenda
• Spark overview
• History of distributed processing (and Spark)
• Spark use cases at Gunosy
• The distributed processing ecosystem in Python
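About Spark (1)
• An in-memory distributed processing system that works with the Hadoop ecosystem (HDFS, Mesos, YARN)
• A distributed programming environment over Resilient Distributed Datasets (RDDs), a fault-tolerant distributed data structure
• Parallel computations on an RDD are written as chains of higher-order functions and run from Scala or Python
• RDDs are immutable data structures
• The elements of an RDD are distributed and replicated across the memory of the cluster
• Corrupted or lost data is restored from the persisted source data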
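About Spark (2)
• The following are implemented on top of the distributed processing foundation for RDDs
• Data stream processing (Spark Streaming)
• Distributed SQL (Spark SQL)
• Distributed machine learning library (MLlib)
• Distributed graph processing library (GraphX)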
History of distributed processing (and Spark)
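Keys to large-scale distributed data processing
• Cluster management
• Automated distributed placement of data
• Speedup through data replication and parallel reads
• Computation that preserves data locality
• Fault tolerance: retransmission and recomputation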
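The road to Hadoop
• Implementing complex parallel processing yourself with message passing is hard work
• Skeletal parallel programming (Cole, 1989)
• A functional programming framework, with several implementations, that builds varied parallel computations in a structured way by combining frequently occurring parallel computation patterns
• Data-parallel skeletons (map, fold/reduce, filter, zip…)
• Patterns that apply the same operation to different parts of the data at the same time
• Task-parallel skeletons (pipe, farm…)
• Patterns that take a stream of data, apply a computation to each item, and return the resulting data stream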
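Skeletal parallel programming
• Zhenjiang Hu and Hideya Iwasaki, "スケルトン並列プログラミング" (Skeletal Parallel Programming), 情報処理 (IPSJ Magazine); volume, issue, and page numbers lost in extraction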
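The two skeleton families named above can be illustrated in a few lines of plain Python. This is only a sequential sketch of the patterns, not a parallel implementation; the function names skel_map, skel_filter, skel_reduce, and pipe are made up for illustration.

```python
from functools import reduce

# Data-parallel skeletons: the same operation is applied to every part of the data.
def skel_map(f, xs):
    return [f(x) for x in xs]

def skel_filter(pred, xs):
    return [x for x in xs if pred(x)]

def skel_reduce(op, xs, init):
    return reduce(op, xs, init)

# Task-parallel skeleton: a pipe chains stages over a stream of items.
def pipe(*stages):
    def run(stream):
        for stage in stages:
            # map() is lazy, so each stage consumes the previous stage's output
            stream = map(stage, stream)
        return stream
    return run

# Compose data-parallel skeletons: keep values > 2, double them, sum the result.
clicks = [3, 8, 1, 6]
total = skel_reduce(lambda a, b: a + b,
                    skel_map(lambda x: x * 2,
                             skel_filter(lambda x: x > 2, clicks)),
                    0)

# Compose a two-stage pipe over a stream.
pipeline = pipe(lambda x: x + 1, lambda x: x * 10)
streamed = list(pipeline(iter([1, 2, 3])))
```

A real skeleton library would run each skeleton across processes or machines; the point here is only that programs are built by composing a small set of named patterns.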
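Distributed processing before Hadoop
• Implemented with MPI or grid shells
• You manage data placement yourself
• Assumes you place the data in shared memory or on a shared filesystem yourself
• Placing huge datasets is very tedious
• Fault tolerance has to be guaranteed by a custom implementation
• Handling data that does not fit in memory is difficult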
T-shirts message@WOMPAT2001 “Life is too short for MPI.”
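What Hadoop solved
• Specialized for large-scale data rather than for scientific computing
• Automatically manages the placement of huge data and the execution of processing
• Automatic distributed placement on HDFS, with map processing run where the data resides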
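New needs after Hadoop
• Hadoop and Hive are throughput-oriented batch systems
• Data scientists need interactive analysis and real-time processing
• Submitting a job and then waiting a long time for it is painful
• In Hadoop and Hive, disk I/O for intermediate data becomes the bottleneck, the price paid for high reliability
• Memory capacity per server has grown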
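Post-Hadoop products
• In-memory acceleration of Hive
• Real-time stream data processing
• Fast aggregation across multiple data sources and databases
• Low latency achieved by optimizing task execution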
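Spark
• A general-purpose distributed programming environment
• A skeletal parallel programming environment built on RDDs
• Achieves low-latency distributed computation by using in-memory RDDs
• Data that does not fit in memory is saved to disk
• Machine learning and stream data processing are realized by composing operations on RDDs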
Basic operations on RDDs
• The higher-order functions of Scala's Seq, plus some extras, run in a distributed fashion
• map, flatMap, filter, sort, union, zip
• reduce, fold, reduceByKey, groupBy, groupByKey, count, cogroup, cross
• join, leftOuterJoin, rightOuterJoin
• sample, take, first, partitionBy, mapWith, pipe, save
• etc.
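To make the key-based operations in the list above concrete, here is a minimal single-machine sketch of what reduceByKey and join compute on collections of (key, value) pairs. It uses plain Python lists rather than Spark; reduce_by_key and join are illustrative helper names, and the sample data is invented.

```python
from collections import defaultdict

def reduce_by_key(pairs, op):
    """Combine all values that share a key with the binary function op."""
    acc = {}
    for key, value in pairs:
        acc[key] = op(acc[key], value) if key in acc else value
    return sorted(acc.items())

def join(left, right):
    """Inner join: emit (key, (v, w)) for every matching pair of values."""
    index = defaultdict(list)
    for key, w in right:
        index[key].append(w)
    return sorted((key, (v, w)) for key, v in left for w in index.get(key, []))

# Count clicks per user, then attach a known attribute to each counted user.
clicks = [("u1", 1), ("u2", 1), ("u1", 1)]
counts = reduce_by_key(clicks, lambda a, b: a + b)
genders = [("u1", "f"), ("u3", "m")]
joined = join(counts, genders)
```

In Spark the same operations run per partition and shuffle data between nodes by key; the results are the same up to ordering.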
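Data locality of RDDs
• Where and in what order tasks run is managed as an optimal DAG representation, based on where the data source is placed
[Figure: DAG with nodes HDFS, RDD, RDD, map ×4, RDD, Reduce]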
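Fault tolerance of RDDs
• Each element of an RDD records the path by which it was generated
• When it is needed again, it is regenerated from the data source
[Figure: a chain HDFS, map, map whose result is corrupted (✕), and the recomputed chain HDFS, map, map, map]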
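The recovery idea on this slide, recording how data was derived and recomputing it on loss, can be sketched with a toy class. ToyRDD and its methods are hypothetical names for illustration only; Spark's actual implementation tracks lineage per partition and is far more involved.

```python
class ToyRDD:
    """An immutable dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # callable that re-reads the persisted source data
        self._parent = parent  # the dataset this one was derived from
        self._fn = fn          # the function applied to each parent element
        self._cache = None     # in-memory copy; may be lost at any time

    def map(self, fn):
        # Nothing is computed here; only the lineage is recorded.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:  # missing or lost: rebuild from lineage
            if self._parent is None:
                self._cache = list(self._source())
            else:
                self._cache = [self._fn(x) for x in self._parent.collect()]
        return list(self._cache)

    def lose(self):
        """Simulate losing the in-memory copy (for example, a node failure)."""
        self._cache = None


reads = []

def read_source():
    reads.append(1)  # count how often the source is re-read
    return [1, 2, 3]

base = ToyRDD(source=read_source)
plus_one = base.map(lambda x: x * 2).map(lambda x: x + 1)
first = plus_one.collect()

# Drop every in-memory copy, then ask for the result again.
base.lose()
plus_one.lose()
plus_one._parent.lose()
second = plus_one.collect()
```

Because each dataset is immutable and derived by a deterministic function, recomputing from the source gives the same result as before the loss.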
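About Spark (2)
• The following are implemented on top of the distributed processing foundation for RDDs
• Data stream processing (Spark Streaming)
• Distributed SQL (Spark SQL)
• Distributed machine learning library (MLlib)
• Distributed graph processing library (GraphX)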
PySpark + IPython Notebook
• PySpark can run on IPython
• On AWS, a cluster can be built with a single command
• Running Spark on EMR (with YARN support)
• http://qiita.com/shunsukeaihara/items/1524b66579e91d1cf7cf
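Spark use cases at Gunosy
• Periodic batch jobs are handled with Fluentd -> Redshift
• Ad hoc log analysis goes through Fluentd -> S3 -> Spark
• Large numbers of files on S3 can be processed with little effort
[Figure: API server, Spark on AWS EMR, Redshift cluster]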
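Spark use cases at Gunosy (1)
• For example, finding the credentials that are in use from the CloudTrail logs and disabling them
• Load a large number of JSON files and run HiveQL
    import pyspark.sql
    # Punctuation and digits were lost from the slide; the path glob is reconstructed.
    data = sc.textFile("s3://bucket_name/path/*.gz")  # sc: SparkContext from the PySpark shell
    hive = pyspark.sql.HiveContext(sc)
    ht = hive.jsonRDD(data)  # infer a schema from the JSON records
    ht.registerTempTable("traills")
    hive.cacheTable("traills")  # cacheTable is a method of the context
    hive.sql("SELECT DISTINCT record.userIdentity.accessKeyId "
             "FROM traills LATERAL VIEW explode(Records) s AS record")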
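Spark use cases at Gunosy (2)
• Gender classification from users' article logs
• For each user, the list of IDs of clicked articles is saved to S3 as CSV
• Weighted with TF-IDF
    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF
    sc = SparkContext()
    male = sc.textFile("s3://bucket/path/male_*.gz")
    # The slide reads the male_ files here as well, which looks like a copy-paste slip.
    female = sc.textFile("s3://bucket/path/female_*.gz")
    tf = HashingTF()  # the slide passes numFeatures; its value was lost in extraction
    male = male.map(lambda x: tf.transform(x.split(",")))
    female = female.map(lambda x: tf.transform(x.split(",")))
    idf = IDF()
    idf_model = idf.fit(male.union(female))  # document frequencies over both classes
    male = idf_model.transform(male)
    female = idf_model.transform(female)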
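As a rough illustration of the weighting step, the sketch below computes TF-IDF in plain Python over lists of article IDs. It keeps terms as dictionary keys for readability, whereas MLlib's HashingTF hashes each term into a fixed-size vector. The smoothed formula log((m + 1) / (df + 1)) is the one MLlib documents for IDF; the function names and the sample article IDs are invented.

```python
import math
from collections import Counter

def inverse_document_frequencies(docs):
    """idf(t) = log((m + 1) / (df(t) + 1)), with m the number of documents."""
    m = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1.0) / (n + 1.0)) for term, n in df.items()}

def tf_idf(doc, idf):
    """Weight each term's raw count in doc by its inverse document frequency."""
    return {term: count * idf.get(term, 0.0)
            for term, count in Counter(doc).items()}

# Each document is the list of article IDs one user clicked (made-up data).
docs = [["a1", "a2", "a2"], ["a1", "a3"], ["a1"]]
idf = inverse_document_frequencies(docs)
weights = tf_idf(docs[0], idf)
```

An article that every user clicked ("a1" here) gets weight zero, so only the articles that distinguish users contribute to the feature vector.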
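Spark use cases at Gunosy (2)
• Gender classification from users' article logs
• Convert to LabeledPoint, then train and classify with logistic regression
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    # The label values were lost from the slide; 0 = male, 1 = female is an assumed encoding.
    male = male.map(lambda x: LabeledPoint(0, x))
    female = female.map(lambda x: LabeledPoint(1, x))
    training = male.union(female)
    training.cache()
    model = LogisticRegressionWithSGD.train(training)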
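Spark use cases at Gunosy (2)
• Gender classification from users' article logs
• Predict from lists in which the first item is the user ID and the rest are article IDs
    # Indices were lost from the slide and are inferred from the bullet above.
    def parse(x):
        fields = x.split(",")
        # Article IDs stay strings so that they hash the same way as in training.
        return int(fields[0]), fields[1:]
    unknown = sc.textFile("s3://bucket/path/unknown_*.gz").map(parse)
    user_ids = unknown.map(lambda x: x[0])
    # IDFModel.transform works on a whole RDD; it cannot be called inside map().
    features = idf_model.transform(unknown.map(lambda x: tf.transform(x[1])))
    predictions = user_ids.zip(features.map(lambda v: model.predict(v))).collect()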
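PySpark is handy, but…
• Python functions are pickled for distributed execution, so many things are slow
• To use a Java library (kuromoji), you need a Scala wrapper plus a py4j wrapper
• From Scala you can use it as-is
• I tried hard but gave up; py4j is simply painful
• For the level of Scala that Spark requires, the learning cost is low
• That said, sbt is a hassle…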
Distributed processing environments for Python
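Distributed processing libraries for Python
• For cluster computing
• PyRC, dispy, Pyro4 (used as the distributed backend for Gensim's LSI and LDA)
• Distributed task queues
• Celery: adding a decorator is enough to make a function run asynchronously and distributed
• IPython Cluster: for simple task distribution
• Spartan: distributes NumPy arrays over ZeroMQ (inspired by Spark's RDD)
• Disco: a MapReduce framework in Python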
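Gunosy's distributed processing environment in Python
• For integrating machine learning into services, task parallelism (parallel stream processing) is what matters, and naive distributed processing is usually fine (e.g. Jubatus)
• On AWS, essentially all data is accumulated in S3
• Task management and retries are left to Celery (AMQP)
• Worker deployment is fully automated with Chef + OpsWorks
• Online learning is distributed with iterative parameter mixing
• The EM algorithm is distributed by partitioning the data horizontally, computing parameters independently, and taking their average
• Article collection and per-user recommendation are fanned out to workers
[Figure: Gunosy's distributed processing environment in Python: article crawler (celery worker), recommendation engine (celery worker), article click log, controller (django-celery)]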
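The parameter-averaging idea mentioned above can be sketched in plain Python. This is a minimal illustration, assuming a one-pass SGD learner for least squares as the local model; the slide does not say which learner Gunosy uses, and local_epoch, mix, and the sample shards are hypothetical. In a setup like the one described, each local_epoch call could be dispatched to a worker as a task.

```python
def local_epoch(w, shard, lr=0.1):
    """One SGD pass for least squares (y ≈ w·x) over a single data shard."""
    w = list(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def mix(weight_vectors):
    """Average the parameter vectors returned by the shards."""
    n = len(weight_vectors)
    return [sum(ws) / n for ws in zip(*weight_vectors)]

def iterative_parameter_mixing(shards, dim, rounds=50):
    w = [0.0] * dim
    for _ in range(rounds):
        # Every shard starts from the same mixed parameters, trains
        # independently, and the results are averaged for the next round.
        w = mix([local_epoch(w, shard) for shard in shards])
    return w

# Two horizontally partitioned shards of data generated by y = 2x.
shards = [
    [([1.0], 2.0), ([2.0], 4.0)],
    [([3.0], 6.0)],
]
w = iterative_parameter_mixing(shards, dim=1)
```

Only the parameter vectors cross the network in each round, which is why the scheme fits a task queue: the data never has to leave the shard it lives on.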
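Summary
• The heart of Spark is the RDD data structure and a general-purpose parallel programming environment based on skeletal parallelism
• It is convenient that distributed processing and distributed machine learning are easy to use from Python
• But doing anything complex from Python is really painful, so write it in Scala
• If you really must use Python, consider the rest of the Python distributed processing ecosystem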