Slide 1

Slide 1 text

Detecting Game Abuse with Machine Learning
Kim Jeongju (김정주)
PyCon APAC 2016

Slide 2

Slide 2 text

About the speaker
Kim Jeongju ([email protected])
- Formerly: game development
  - NHN / NPLUTO
  - 3D engine / game client development
- Currently: game data collection / analysis
  - Webzen NPlay
  - Log forwarder, Pandas, Scikit-Learn, PySpark

Slide 3

Slide 3 text

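This talk
- Is aimed at people who have basic knowledge of machine learning
- Shares case studies of data analysis and machine learning with Python
- Will hopefully prompt you to bring machine learning into your own development and services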

Slide 4

Slide 4 text

Motivation
- Sanctioning game abuse through user reports, GM monitoring, and manual pattern hunting has its limits
- Let's build an abuse detection system that minimizes human intervention

Slide 5

Slide 5 text

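What is game abuse?
- "Acquiring game information in bulk, or helping others do so, in ways the game design did not intend"
- Examples
  - Play that exploits loopholes in the service implementation
  - Abnormal play using hacking tools
  - Flooding the global chat window with advertisements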

Slide 6

Slide 6 text

Statistics and exploratory data analysis

Slide 7

Slide 7 text

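First, statistics
- Statistics developed in an environment where data and computing power were scarce
- Statisticians studied ways to reduce the amount of data and computation needed
- Because it was shaped by those harsh conditions, it can find value even in small data
- Basic statistical knowledge is a big help in development, game design, service operation, and more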

Slide 8

Slide 8 text

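Exploratory data analysis
- The process of finding information hidden in data by summarizing and visualizing it from various angles
- Start with this step for any data you meet for the first time
- We developed and use our own system (WzDat)
  - Jupyter + Utility + Dashboard
  - https://github.com/haje01/wzdat
  - http://www.pycon.kr/2014/program/14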

Slide 9

Slide 9 text

Case 1
Detecting spammers with a simple statistical idea

Slide 10

Slide 10 text

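Situation
- The chat window of a newly launched game was filled with game item advertisements
- Even when an account was sanctioned, the ads continued right away from a new account
- Fast sanctions were needed, so there was not enough time to go through machine learning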

Slide 11

Slide 11 text

Spam via chat
- Advertising the sale of money/items through the in-game global chat
- Abusers obfuscate their messages to evade programmatic detection

Slide 12

Slide 12 text

Detecting spammers
- Many approaches are possible, but rather than advanced ones such as natural language processing or machine learning, we tried a simple statistical idea

Slide 13

Slide 13 text

Distribution of online chat message lengths
- Generally known to follow a log-normal distribution
- (Figure: message length distribution of the NPS Chat Corpus)

Slide 14

Slide 14 text

Distribution of in-game chat message lengths
- Similar to online chat in general, but tends to be somewhat thicker
- Messages of certain lengths spike (?) → confirmed to be spam

Slide 15

Slide 15 text

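Idea
- Normal user: message lengths vary, and messages are not very frequent
- Spammer: message lengths do not vary, and messages are frequent
- In other words, a user who chats frequently with little variety in message length is a spammer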

Slide 16

Slide 16 text

A simple detection formula
- Per user: number of chat messages / number of distinct message lengths
- The more often a user sends chat messages of similar length, the larger the value

Slide 17

Slide 17 text

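Classification
- Treat users whose spam_ratio is at or above a threshold as spammers
- The threshold is chosen heuristically...
- Set it roughly, then tune it by checking the messages of the characters it classifies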
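The detection formula and the threshold rule from the two slides above can be sketched as follows. This is only an illustration, not the speaker's code: the `(user, message)` input format, the function names, the sample chat lines, and the threshold of 5.0 are all assumptions made for the example. Only the ratio itself (chat count divided by the number of distinct message lengths) comes from the slides.

```python
from collections import defaultdict


def spam_ratios(chats):
    """Return {user: chat count / number of distinct message lengths}.

    `chats` is an iterable of (user, message) pairs (assumed input format).
    """
    counts = defaultdict(int)
    lengths = defaultdict(set)
    for user, message in chats:
        counts[user] += 1
        lengths[user].add(len(message))
    return {user: counts[user] / len(lengths[user]) for user in counts}


def find_spammers(chats, threshold=5.0):
    """Users whose spam_ratio is at or above the heuristic threshold."""
    return {user for user, ratio in spam_ratios(chats).items() if ratio >= threshold}


# Made-up sample: one user repeats same-length ads, another chats normally.
sample = [("seller", "BUY GOLD %02d" % i) for i in range(10)]
sample += [("player", text) for text in ("hi", "anyone for a raid?", "lol", "gg wp")]

ratios = spam_ratios(sample)
spammers = find_spammers(sample)
print(ratios, spammers)
```

In this toy data the advertiser sends ten messages of a single length, while the ordinary player sends four messages of four different lengths, so only the advertiser crosses the threshold.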

Slide 18

Slide 18 text

Message length distribution after classification
- The high-frequency messages of specific lengths (= spam) have been separated out

Slide 19

Slide 19 text

Applying the result
- The implementation was simple, but false positives are possible
- We set the threshold high to raise reliability
- Sanctions were imposed based on this result

Slide 20

Slide 20 text

Directions for improvement
- Choose the threshold in a more scientific way
- Introduce natural language processing (NLP)
  - Consider word frequency (Zipf's Law) and importance (TF-IDF)
- Apply machine learning algorithms

Slide 21

Slide 21 text

Introduction to machine learning

Slide 22

Slide 22 text

Why use machine learning
- Decent results for little effort
- A general solution for a wide range of problems
- Can take many characteristics (features) into account at once
- Robust to changes in the data

Slide 23

Slide 23 text

Classification and regression
- Machine learning is broadly divided into classification and regression
- Classification: predicting a category
- Regression: predicting a continuous value
- Abuse detection is a classification problem

Slide 24

Slide 24 text

Supervised and unsupervised learning
- Supervised learning
  - When sample data already classified from past experience is available
- Unsupervised learning
  - When no classified sample data is available
- Most data is not properly classified → a problem to be solved

Slide 25

Slide 25 text

Machine learning algorithms
- Basic
  - Linear/Logistic Regression
  - Decision Tree
- Advanced
  - Random Forest
  - SVM (Support Vector Machine)
  - Neural Network

Slide 26

Slide 26 text

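Which algorithm to choose?
- In general, advanced algorithms can learn more complex models
- However, an advanced algorithm is not always the better choice
- Basic algorithms are better when people need to understand what was learned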

Slide 27

Slide 27 text

Evaluating predictions
- We need a definition of accuracy
- Q: We want to detect the 2 abusers among 100 users. If we mistakenly judge everyone to be a normal user, what is the accuracy?
- A: 2 out of 100 were wrong, so… 98% !?#@

Slide 28

Slide 28 text

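Metrics
- There are various metrics, such as precision and recall
- Precision: of those we flagged, how many are real abusers?
- Recall: of all abusers, how many did we find?
- When the data is imbalanced, precision and recall especially need to be considered together
- In the previous example, recall is 0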
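The 100-user example can be checked with a few lines of arithmetic. The sketch below is illustrative: the helper name and the convention of reporting precision as 0.0 when nothing is flagged are assumptions, not something the slides specify.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = abuser)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 100 users, 2 of them abusers; the classifier calls everyone normal.
y_true = [1, 1] + [0] * 98
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

Accuracy comes out at 98% even though not a single abuser was found, which is why recall (0 here) has to be reported alongside it.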

Slide 29

Slide 29 text

P/R Curve and AUC
Which is the better classifier?
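One way to compare classifiers along these lines is to compute the area under each precision/recall curve. The sketch below uses scikit-learn's `precision_recall_curve` and `auc`. The labels and scores are made up for illustration, and the slide does not say how its own curves were produced.

```python
from sklearn.metrics import auc, precision_recall_curve


def pr_auc(y_true, scores):
    """Area under the precision/recall curve for the given scores."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return auc(recall, precision)


# Made-up labels (1 = abuser) and scores from two hypothetical classifiers.
y_true = [0, 0, 1, 1, 0, 1]
good_scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]   # ranks every abuser above every normal user
poor_scores = [0.6, 0.2, 0.4, 0.9, 0.8, 0.3]   # mixes the two groups

good_auc = pr_auc(y_true, good_scores)
poor_auc = pr_auc(y_true, poor_scores)
print(good_auc, poor_auc)
```

The classifier whose curve stays closer to the top-right corner has the larger area, so a larger value indicates the better classifier under this metric.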

Slide 30

Slide 30 text

Case 2
Detecting farming with machine learning

Slide 31

Slide 31 text

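Situation
- Farming with all kinds of hacking tools was rampant in a live game
- Farming: acquiring in-game goods by abnormal means
- A bot's characteristics are hard to pin down with one or two rules → machine learning is needed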

Slide 32

Slide 32 text

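Choosing a learning method
- There seemed to be no need to go as far as neural nets / deep learning…
- Past logs were already being stored, and
- the operations team had a list of known abuser characters → machine learning, and supervised learning in particular, was possible
- We settled on supervised learning with a Decision Tree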

Slide 33

Slide 33 text

Preparation
1. Check the state of log collection
2. Understand the structure/meaning of the logs
3. Extract features for learning

Slide 34

Slide 34 text

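Machine learning also starts with log collection
- Collecting logs systematically is not easy either
- Analysis/learning takes only about 10~20% of the time
- Most of the time goes into gathering and processing data
- Use the log format as-is whenever possible (for the studio's sake…)
- Store logs properly categorized (by server/log type and by time)
- Cloud storage (S3) is recommended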

Slide 35

Slide 35 text

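Collecting logs from Windows servers
- Most game servers are Windows-based
- We wanted to use good open source tools (fluentd, logstash, etc.), but
- they were not easy to install on Windows servers and lacked some features
- So we developed our own
  - https://github.com/haje01/wdfwd
  - Syncs log files left on the server via RSync, or
  - connects to the game DB, dumps it, and transfers the dump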

Slide 36

Slide 36 text

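Once logs are collected, build features
- Feature: a value that describes a characteristic of the thing being learned
- Example: when predicting house prices → the size, orientation, surroundings, transportation, and amenities of the house are features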

Slide 37

Slide 37 text

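Feature Engineering
- The work of finding and generating features from structured or unstructured data
- Sometimes it uncovers features latent in other features
- Sometimes it requires complex code (hard to do in SQL)
- We generated features from three months of logs using Hadoop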

Slide 38

Slide 38 text

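Do we really need Hadoop?
- Not if the data is not big
- Without it, however…
  - batch jobs may have to run for a long time, or
  - a periodic ETL step that loads data into a DB may be needed
- It is a good fit if you frequently develop features from unstructured/large-volume data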

Slide 39

Slide 39 text

How should we use it?
- You can build and run your own Hadoop cluster, but setup and operation are difficult
- Use the Hadoop service offered by a cloud provider
  - AWS EMR (Elastic MapReduce)

Slide 40

Slide 40 text

Isn't AWS expensive?
- Not if you optimize
- Use a transient cluster that runs only when needed
- Run Task nodes as auction-priced Spot Instances
- m4.xlarge (4 vCPU, 16 GiB RAM): $0.036 per hour (Seoul region, as of 2016-08-09)

Slide 41

Slide 41 text

(Screenshot: AWS EMR cluster launch screen)

Slide 42

Slide 42 text

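Preparing logs for Hadoop
- Hadoop handles large numbers of small files (< 100MB) poorly
- Small files need to be merged, sorted, and compressed
- We could not find a suitable tool, so we developed one
  - https://github.com/haje01/mersoz
  - Processes only changed files, and runs in parallel while respecting dependencies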

Slide 43

Slide 43 text

(Screenshot: logs stored in S3 after merging, sorting & compression)

Slide 44

Slide 44 text

Hadoop MapReduce coding with mrjob
- A Python package made by Yelp
- Lets you write MR jobs in Python using Hadoop Streaming
- Develop locally with sample data, then run it on EMR
- Runs somewhat slower than the Java version, but development is faster

Slide 45

Slide 45 text

    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):                 # for each line of the log file
            for word in WORD_RE.findall(line):     # for every word in it
                yield word.lower(), 1              # emit ('word', 1)

        def combiner(self, word, counts):          # aggregate results within a node
            yield word, sum(counts)

        def reducer(self, word, counts):           # aggregate results across the cluster
            yield word, sum(counts)


    if __name__ == '__main__':
        MRWordFreqCount.run()

Slide 46

Slide 46 text

(Figure: system architecture diagram)

Slide 47

Slide 47 text

Assessing the current situation
- For machine learning, we asked for
  - the grounds on which GMs impose sanctions (= features), and
  - the list of sanctioned characters

Slide 48

Slide 48 text

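Tips for generating features
- Compute them per character from the logs
- Prefer a variety of features over a few elaborate ones
  - The judgment is made on the combination anyway
- Start with short time windows, and lengthen them once things stabilize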

Slide 49

Slide 49 text

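Features we extracted at first
- Number of logins
- Play time
  - Logouts are often not clearly recorded
  - Introduced a session timeout: 5 minutes
- Number of items/money acquired
- Number of quests completed
- Number of NPC/PC battles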
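A play-time feature with a session timeout could be computed roughly as below. The slide only states that a 5-minute session timeout was introduced; how it was applied is not described, so the rule used here (ignore any gap between consecutive log events longer than the timeout) and the integer-seconds timestamp format are assumptions.

```python
SESSION_TIMEOUT = 5 * 60  # seconds; the 5-minute value is from the slide


def play_time(timestamps, timeout=SESSION_TIMEOUT):
    """Estimate play time in seconds from one character's event timestamps.

    Gaps longer than `timeout` are treated as the end of a session and
    are not counted (assumed rule).
    """
    ordered = sorted(timestamps)
    total = 0
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap <= timeout:
            total += gap
    return total


# Made-up event times: activity, a long silence, then activity again.
events = [0, 60, 200, 1000, 1100]
total_seconds = play_time(events)
print(total_seconds)
```

With these events, the 800-second silence is dropped and the remaining gaps (60 + 140 + 100) add up to 300 seconds.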

Slide 50

Slide 50 text

What about feature types?
- Broadly divided into real-valued, categorical, and Boolean types
- It is best to unify them as real values wherever possible
  - Bool becomes 0 or 1
  - Categorical types become real values using OneHotEncoder
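A minimal sketch of unifying the three feature types as real values, using scikit-learn's `OneHotEncoder` for the categorical column. The feature names and rows are hypothetical and exist only for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical raw features per character: (play_hours, is_guild_member, char_class)
rows = [
    (12.5, True, "warrior"),
    (3.0, False, "mage"),
    (40.0, False, "warrior"),
]

# Real-valued features stay as they are; Booleans become 0.0 / 1.0.
numeric = [[hours, float(flag)] for hours, flag, _ in rows]

# The categorical column becomes one 0/1 column per category,
# with categories ordered alphabetically: mage, warrior.
classes = [[char_class] for _, _, char_class in rows]
onehot = OneHotEncoder().fit_transform(classes).toarray()

features = [num + hot.tolist() for num, hot in zip(numeric, onehot)]
print(features)
```

Each row ends up as a flat list of floats, which is the form most scikit-learn estimators expect.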

Slide 51

Slide 51 text

Example of the generated features
- A plain text (.txt) file
- Format: character name + feature array

Slide 52

Slide 52 text

Running the machine learning

Slide 53

Slide 53 text

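The machine learning itself turns out to be light
- The final feature file is small, and training is fairly light as well
  - It runs on a local PC
  - Training that has to look at all the data, as in a recommender system, would be heavy
- The challenge is choosing a model and finding the best hyperparameters
  - You have to experiment many times with different settings
  - In some cases a distributed system is used...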

Slide 54

Slide 54 text

Which algorithm/model should we choose?
- Start with something simple
- If there is prior research on a similar case, consult it
- Evaluate and select models using AUC or ROC

Slide 55

Slide 55 text

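Starting with a Decision Tree
- Not complicated, and its decision process is easy to understand
- We used the one in the Python Scikit-Learn package
  - It provides a solid range of machine learning algorithms
  - Its interface is uniform, so swapping models is easy
- Feed in the features (X) and whether each character is an abuser (y), and train
- Conveniently, a DT does not require feature normalization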

Slide 56

Slide 56 text

DT usage example (iris classification)

    from sklearn.datasets import load_iris
    from sklearn import tree

    iris = load_iris()
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(iris.data, iris.target)

    >>> clf.predict(iris.data[:1, :])
    array([0])

Slide 57

Slide 57 text


Slide 58

Slide 58 text

Decision Tree training process
1. Find the features of known abusers in the feature file
2. Take the features of an equal number of normal users
   - Under Sampling
3. Split the data into Train/Test sets
4. Start training with default parameters
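The four steps above can be sketched as follows. The real feature file is not available, so the two-feature synthetic data, the group sizes, and the 70/30 split are assumptions made purely for illustration. Only the overall procedure (under-sample normal users to match the abusers, split, train with default parameters) follows the slide.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = random.Random(0)

# 1. Synthetic stand-in for the feature file: few abusers, many normal users.
abusers = [[rng.gauss(20, 3), rng.gauss(2, 1)] for _ in range(50)]
normals = [[rng.gauss(5, 3), rng.gauss(8, 2)] for _ in range(1000)]

# 2. Under-sampling: keep only as many normal users as there are abusers.
normals_sampled = rng.sample(normals, len(abusers))

X = abusers + normals_sampled
y = [1] * len(abusers) + [0] * len(normals_sampled)

# 3. Split into train and test sets, keeping the class balance in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 4. Train with default parameters (random_state only fixes the result).
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Because the synthetic groups are well separated, the score here says nothing about real game data. The point is only the shape of the pipeline.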

Slide 59

Slide 59 text

Initial results
- Average accuracy of about 80%
  - Binary classification tends to score well
- That may not look bad, but
- it falls far short given that the predictions are used as grounds for sanctions

Slide 60

Slide 60 text

Raising the accuracy
- Split the data set for cross validation, and
- find the best hyperparameters with GridSearchCV
- Average accuracy improved to 91%
- We wanted to see what criteria the model uses, so we drew the tree with tree.export_graphviz
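A hyperparameter search of this kind might look like the sketch below. The game features are not public, so the iris data set stands in for them, and the parameter grid and 5-fold setting are assumptions rather than the values used in the talk. Note that in scikit-learn releases from the time of the talk `GridSearchCV` lived in `sklearn.grid_search`; current releases provide it in `sklearn.model_selection`, as used here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()  # stand-in data set

# Assumed grid: candidate values for two complexity-related parameters.
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 5, 10],
}

# Each parameter combination is scored by 5-fold cross validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
print(best_params, best_score)
```

After fitting, `search.best_estimator_` holds a tree retrained on all the data with the best parameters, which can then be passed to `tree.export_graphviz` for inspection.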

Slide 61

Slide 61 text


Slide 62

Slide 62 text

Looking at the decision tree...
- You can see what criteria the trained model uses → it can be shared with people in various roles
- It gets more complicated the further down you go
- A DT overfits easily, so take care that the depth does not get too deep

Slide 63

Slide 63 text

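The score would not go any higher from here
- After consulting the GMs, we added new features
  - Number of items/money obtained at the same time
  - Number of map repetitions
  - Selecting only a specific class
  - Number of items obtained without moving
- The know-how lies in turning even things that seem intractable into features
  - Example: "Bots have randomly generated names"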

Slide 64

Slide 64 text

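Example: judging the randomness of a character name (consonant/vowel patterns)

    ## Pseudocode for judging whether a character name is pronounceable
    # Convert the name into consonant/vowel symbols (1 = consonant, 2 = vowel)
    # e.g. anything -> '21211211'
    symbols = get_cv_symbols(char_name)
    # The name is pronounceable if any pattern like these occurs (more patterns in practice)
    patterns = ['2121', '2112', '1121', '22122', …]
    can_pron = any(p in symbols for p in patterns)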
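A runnable version of this idea might look like the sketch below. The slide does not show `get_cv_symbols`, so its body here is a guess, including the choice to count "y" as a vowel (which reproduces the slide's `anything -> '21211211'` example). Only the four patterns quoted on the slide are used, although the slide notes that the real list is longer. The sketch follows the slide's comment that a pattern match means the name is pronounceable.

```python
VOWELS = set("aeiouy")  # assumption: 'y' counts as a vowel, matching the slide's example

# Patterns quoted on the slide; the real list is said to be more varied.
PRONOUNCEABLE_PATTERNS = ("2121", "2112", "1121", "22122")


def get_cv_symbols(name):
    """Map each letter to '1' (consonant) or '2' (vowel); skip non-letters."""
    return "".join(
        "2" if ch in VOWELS else "1"
        for ch in name.lower()
        if ch.isalpha()
    )


def can_pronounce(name):
    """True if the name contains any of the pronounceable patterns."""
    symbols = get_cv_symbols(name)
    return any(pattern in symbols for pattern in PRONOUNCEABLE_PATTERNS)


print(get_cv_symbols("anything"), can_pronounce("anything"), can_pronounce("xkqzwrt"))
```

The Boolean result can be stored as a 0/1 feature per character, so that a name looking randomly generated becomes one more signal for the classifier.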

Slide 65

Slide 65 text

It is not an exact method... but it helps, because the judgment is made on a combination of features

Slide 66

Slide 66 text

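The extra features improved the score, however…
- Average accuracy rose to 96%. The score is fairly high, but
- when we actually applied it,
- quite a few false positives turned up during the GMs' verification
- We attributed this to the Decision Tree's chronic overfitting problem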

Slide 67

Slide 67 text

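Switching to Random Forest
- An ensemble technique that combines many Decision Trees
- Trains a number of DTs separately (= a regularization effect) and lets them vote
- The score is lower, but the results are stable
- DecisionTree: an unstable 96%; RandomForest: a stable 95%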

Slide 68

Slide 68 text

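Training a Random Forest
- Basically similar to a Decision Tree
- max_depth, min_samples_leaf
  - Control the complexity of the model. Start small and increase little by little
- n_estimators
  - Decides how many trees (DTs) to plant
  - Too large and training takes long; too small and it ends up being just a DT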

Slide 69

Slide 69 text

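Results after applying RF
- Accuracy is 95%
- To make sure nobody is sanctioned unfairly,
  - we also obtain the probability of each prediction with predict_proba, and
  - include only predictions with a high probability (> 70%)
- This is estimated to lower recall by about 10~20%
- However, precision…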
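Keeping only high-probability predictions can be sketched with `predict_proba` as below. The data comes from `make_classification` and the forest's parameter values are arbitrary, so both are assumptions for illustration. Only the 70% cutoff and the use of `predict_proba` come from the slide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: label 1 plays the role of "abuser".
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, min_samples_leaf=3, random_state=0
)
clf.fit(X_train, y_train)

THRESHOLD = 0.7  # the cutoff mentioned on the slide

# Column 1 of predict_proba is the probability of class 1.
proba_abuser = clf.predict_proba(X_test)[:, 1]
flagged_strict = proba_abuser > THRESHOLD
flagged_default = clf.predict(X_test) == 1


def recall_of(flags, truth):
    """Share of true positives in `truth` that `flags` marks."""
    found = int((flags & (truth == 1)).sum())
    return found / int((truth == 1).sum())


recall_default = recall_of(flagged_default, y_test)
recall_strict = recall_of(flagged_strict, y_test)
print(recall_default, recall_strict)
```

Raising the cutoff can only shrink the set of flagged characters, so recall cannot go up. That is the trade the slide describes: accept some loss of recall in exchange for fewer wrongly flagged characters.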

Slide 70

Slide 70 text

100% achieved
The result of the GMs kindly reviewing the predictions by hand…

Slide 71

Slide 71 text

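Now that we have found them, on to sanctions...
- Sanctions were imposed over a period of a little more than two months
- Tool-based farming mostly disappeared!
- Sanctions have to be regular and continuous to be effective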

Slide 72

Slide 72 text

(Figure: for reference, the importance of the final features)

Slide 73

Slide 73 text

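Directions for improvement
- Improve the trained model using the detection results
- Collecting PII on bot accounts would make it easier to learn new bots
- Variant bots need to be monitored after sanctions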

Slide 74

Slide 74 text

Afterword

Slide 75

Slide 75 text

Takeaways
- The whole process, from data collection to processing and analysis, was done in Python!
- Exploratory data analysis with Jupyter notebooks
- Machine learning looks applicable to many more areas

Slide 76

Slide 76 text

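Going deeper into machine learning
- Study more of the basic theory in order to apply it in depth
  - You become able to form a good hypothesis
  - You become able to optimize
- Try more than one algorithm
  - Various classifiers such as SVM and Neural Net
  - Ensembles in the Super Learner style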

Slide 77

Slide 77 text

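Time has passed... a new log collection/analysis environment
- RSync -> real-time log collection with Fluentd/Kinesis
- gzipped CSV -> stored in S3 in Parquet format
  - Columnar binary format, 30x speedup
- MRJob -> PySpark
  - Powerful distributed processing / cache feature (a strength for iterative learning)
  - Used as a transient Spark cluster (20 VMs = 80 cores, 320GB RAM) (about 3,000 won per hour)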

Slide 78

Slide 78 text

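Advice
- Decide whether machine learning suits what you are trying to do
  - If the abuse is simple in nature, traditional methods will do
  - First understand its characteristics through exploratory data analysis
- Test a variety of models and features
- Depending on the model, feature normalization/orthogonalization may be needed, so check
- Watch out for imbalance between classes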

Slide 79

Slide 79 text

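Deep learning? Machine learning?
- Deep learning
  - No elaborate feature engineering needed
  - Many parameters = a lot of data needed
- Machine learning
  - Feature work is important, but
  - few parameters = effective even with little data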

Slide 80

Slide 80 text

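Stray thoughts
- Data engineering is hard
- Securing the data matters most
- Fields in the spotlight may actually have dimmer prospects
  - Unless you are a top researcher, traditional industries and niche data are the real blue ocean
- We live in a time when every company needs a data analyst
  - What if computers could test every combination of models and variables?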

Slide 81

Slide 81 text

Finally... Spurious Correlations
- Cases where there is no real relationship, but there appears to be one
- Don't fixate on the data alone; understand the domain!

Slide 82

Slide 82 text

Thank you.

Slide 83

Slide 83 text

Reference links
- http://www.aladin.co.kr/shop/wproduct.aspx?ItemId=28946323
- http://www.tylervigen.com/spurious-correlations
- http://scikit-learn.org/stable/modules/tree.html
- http://www.cimerr.net/conference/board/data/conference/1331626266/P15.pdf
- http://stackoverflow.com/questions/20463281/how-do-i-solve-overfitting-in-random-forest-of-python-sklearn
- http://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
- https://www.quora.com/Is-Scala-a-better-choice-than-Python-for-Apache-Spark
- http://statkclee.github.io/data-science/data-handling-pipeline.html
- https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html