Detecting Game Abuse with Machine Learning

Slides presented at PyCon APAC 2016.

JeongJu Kim

August 16, 2016

Transcript

  1. Detecting Game Abuse with Machine Learning
    JeongJu Kim
    PyCon APAC 2016

  2. About the Speaker
    JeongJu Kim ([email protected])
    Formerly: game development
    - NHN / NPLUTO
    - 3D engine / game client development
    Currently: game data collection / analysis
    - Webzen NPlay
    - Log forwarder, Pandas, Scikit-Learn, PySpark

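  3. This Talk
    - Is aimed at people with basic knowledge of machine learning
    - Shares case studies of data analysis and machine learning with Python
    - Will hopefully prompt you to bring machine learning into your development and services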

  4. Motivation
    - Sanctioning game abuse
    - User reports / GM monitoring / pattern hunting only go so far
    - Let's build an abuse detection system that minimizes human intervention

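  5. What Is Game Abuse?
    - "Acquiring game information in bulk, or helping others do so, in ways the game design did not intend"
    - Examples
      - Play that exploits loopholes in the service implementation
      - Abnormal play using hacking tools
      - Flooding the global chat window with ads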

  6. Statistics and Exploratory Data Analysis

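  7. First, Statistics
    - Statistics developed in an environment where data and computing power were scarce
    - Statisticians studied ways to reduce the data and computation required
    - Because it was built for harsh conditions, it can find value even in small data
    - Basic statistical knowledge is a big help in development, game design, service operation, and more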

  8. Exploratory Data Analysis
    - The process of finding the information hidden in data by summarizing and visualizing it from various angles
    - Start here with any data you are seeing for the first time
    - Built and used an in-house system (WzDat)
      - Jupyter + Utility + Dashboard
      - https://github.com/haje01/wzdat
      - http://www.pycon.kr/2014/program/14

  9. Case 1
    Detecting spammers with a simple statistical idea

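  10. Situation
    - The chat window of a newly launched game was full of in-game item ads
    - Even when the accounts were sanctioned, the ads continued right away from new accounts
    - Sanctions had to be fast, so there was no time to go through machine learning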

  11. Spam via Chat
    - Ads selling money/items through global chat inside the game
    - Abusers obfuscate their messages to evade programmatic detection

  12. Spammer Detection
    - Many approaches are possible, but
    - rather than advanced approaches such as natural language processing or machine learning,
    - we tried a simple statistical idea

  13. Distribution of Online Chat Message Lengths
    - Generally known to follow a log-normal distribution
    - The chart shows the message length distribution of the NPS Chat Corpus

  14. Distribution of In-Game Chat Message Lengths
    - Similar to online chat, but tends to be somewhat thicker
    - Messages of certain lengths spike (?) → confirmed to be spam

  15. Idea
    - Normal users: message lengths vary, and frequency is not high
    - Spammers: message lengths do not vary, and frequency is high
    - In other words, a user who chats frequently with little variety in length is a spammer

  16. A Simple Detection Formula
    - Per user: number of chats / number of distinct message lengths
    - The more often a user sends chat messages of similar length, the larger the value

  17. Classification
    - Treat users whose spam_ratio is at or above a threshold as spammers
    - The threshold is chosen heuristically...
    - Set it roughly, then tune it by checking the messages of the characters it classified
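
The detection formula and threshold from the two slides above can be sketched in plain Python. This is a minimal illustration, not the speaker's code: the `(user, message)` log format, the function names, the sample data, and the threshold of 5.0 are all assumptions. Only the ratio itself (chat count divided by the number of distinct message lengths) and the name `spam_ratio` come from the slides.

```python
from collections import defaultdict


def spam_ratios(chat_log):
    """Return {user: chat count / number of distinct message lengths}."""
    counts = defaultdict(int)
    lengths = defaultdict(set)
    for user, message in chat_log:
        counts[user] += 1
        lengths[user].add(len(message))
    return {user: counts[user] / len(lengths[user]) for user in counts}


def find_spammers(chat_log, threshold):
    """Users whose spam_ratio is at or above the (heuristic) threshold."""
    ratios = spam_ratios(chat_log)
    return sorted(user for user, ratio in ratios.items() if ratio >= threshold)


# Hypothetical sample: one user repeats same-length ads, one chats normally.
sample_log = [("seller01", "CHEAP GOLD visit site-%02d" % i) for i in range(20)]
sample_log += [("alice", "hi"),
               ("alice", "anyone up for a dungeon run?"),
               ("alice", "brb")]

print(find_spammers(sample_log, threshold=5.0))
```

Obfuscating the text of an ad does not help the spammer here as long as the message length stays the same, which is presumably why such a simple ratio was enough in this case.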

  18. Message Length Distribution After Classification
    - The high-frequency messages of specific lengths (= spam) were separated out

  19. Applying the Results
    - The implementation was simple, but false positives are possible
    - Raised confidence by setting the threshold high
    - Sanctions were issued based on these results

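  20. Directions for Improvement
    - Choose the threshold in a more scientific way
    - Introduce natural language processing (NLP) techniques
      - Consider per-word frequency (Zipf's Law) and importance (TF-IDF)
    - Apply machine learning algorithms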

  21. Introduction to Machine Learning

  22. Why Use Machine Learning
    - Decent results for little effort
    - A general solution to a wide range of problems
    - Can take many characteristics (features) into account at once
    - Resilient to changes in the data (robustness)

  23. Classification and Regression
    - Machine learning is broadly divided into classification and regression
    - Classification: predicting a category
    - Regression: predicting a continuous value
    - Abuse detection is a classification problem

  24. Supervised and Unsupervised Learning
    - Supervised learning
      - When sample data already classified from past experience is available
    - Unsupervised learning
      - When no classified sample data is available
    - Most data is not properly classified → a problem to be solved

  25. Machine Learning Algorithms
    - Basic
      - Linear/Logistic Regression
      - Decision Tree
    - Advanced
      - Random Forest
      - SVM (Support Vector Machine)
      - Neural Network

  26. Which Algorithm to Choose?
    - In general, advanced algorithms can learn more complex models
    - However, advanced algorithms are not always better
    - Basic algorithms are better when people need to understand what was learned

  27. Evaluating Predictions
    - We need a definition of what "accurate" means
    - Q: We want to detect the 2 abusers among 100 users. If we mistakenly judge everyone to be a normal user, what is the accuracy?
    - A: 2 out of 100 were wrong, so… 98%!?

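  28. Metrics
    - There are various metrics, such as precision and recall
    - Precision: of those we flagged, how many are real abusers?
    - Recall: of all abusers, how many did we find?
    - When the data is imbalanced, precision and recall in particular must be considered together
    - In the previous example, recall is 0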
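
The 100-user example can be checked with a few lines of plain Python. This is a sketch for illustration: the function names and the 1 = abuser / 0 = normal label encoding are assumptions, while the scenario (2 abusers among 100 users, everyone predicted normal) is the one from the slides.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = abuser, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Undefined when nothing is flagged; return None rather than guess.
    return tp / (tp + fp) if (tp + fp) else None


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else None


# 100 users, 2 of them abusers; the "classifier" calls everyone normal.
y_true = [1, 1] + [0] * 98
y_pred = [0] * 100

print(accuracy(y_true, y_pred))   # 0.98
print(recall(y_true, y_pred))     # 0.0
print(precision(y_true, y_pred))  # None (no user was flagged)
```

Accuracy looks excellent while recall shows that not a single abuser was found, which is why accuracy alone is misleading on imbalanced data.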

  29. P/R Curve and AUC
    What makes a good classifier?

  30. Case 2
    Detecting farming with machine learning

  31. Situation
    - Farming play using all kinds of hacking tools was rampant in a live game
    - Farming: acquiring in-game goods by abnormal means
    - A bot's traits are hard to pin down to one or two → machine learning is needed

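  32. Choosing a Learning Approach
    - There seems to be no real need to go as far as neural nets/deep learning…
    - Past logs were already being stored, and
    - the operations team had a list of known abuser characters
      → Machine learning, and supervised learning in particular, is possible!
    - Decided on supervised learning with a Decision Tree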

  33. Preparation
    1. Check the state of log collection
    2. Understand the structure/meaning of the logs
    3. Extract features for training

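  34. Machine Learning, Too, Starts with Log Collection
    - Collecting logs systematically is not easy either
    - Analysis/training takes only about 10-20% of the time
    - Most of the time goes into collecting and processing the data
    - Use the log format as-is wherever possible (for the studios' sake…)
    - Store logs properly categorized (by server/log type and by time)
    - Cloud storage (S3) is recommended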

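  35. Collecting Logs from Windows Servers
    - Most game servers are Windows-based
    - We wanted to use good open-source tools (fluentd, logstash, etc.), but
    - they were not easy to install on Windows servers and lacked some features
    - Developed our own
      - https://github.com/haje01/wdfwd
      - Syncs log files left on the server via RSync, or
      - connects to the game DB, dumps it, and transfers the dump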

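  36. Once the Logs Are Collected, Build Features
    - Feature: a value that describes a characteristic of the thing being learned
    - Example: predicting house prices
      → The size, orientation, surroundings, transportation, amenities, etc. of the house are features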

  37. Feature Engineering
    - The work of finding and generating features from (un)structured data
    - Sometimes uncovers features latent in other features
    - Sometimes requires complex code (hard to do in SQL)
    - Generated features from 3 months of logs using Hadoop

  38. Do We Really Need Hadoop?
    - Not if the data is not "big"
    - Instead…
      - you may have to run batch jobs for a long time, or
      - periodically load the data into a DB through ETL
    - Worth it if you frequently develop features from unstructured/large-volume data

  39. How Should We Use It?
    - You can build and run your own Hadoop cluster, but it is hard to set up and operate
    - Use the Hadoop service offered by a cloud provider
      - AWS EMR (Elastic MapReduce)

  40. Isn't AWS Expensive?
    - Not if you optimize
    - Use a transient cluster that runs only when needed
    - Run Task nodes as auction-priced Spot Instances
    - m4.xlarge (4 vCPU, 16 GiB RAM): $0.036 per hour (Seoul region, as of 2016-08-09)

  41. AWS EMR Cluster Launch Screen

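  42. Preparing Logs for Hadoop
    - Hadoop copes poorly with large numbers of small files (< 100MB)
    - Small files need to be merged, sorted, and compressed
    - Could not find a suitable tool, so we developed one
      - https://github.com/haje01/mersoz
      - Processes only changed files, in parallel, respecting dependencies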
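
The merge, sort, and compress step can be sketched with the standard library alone. This is not how mersoz works internally (it also tracks changed files and dependencies and runs in parallel). The function name, the file names, and the assumption that every log line starts with a sortable timestamp are hypothetical, and the sketch loads all lines into memory, so it only suits modest inputs.

```python
import gzip
import os
import tempfile


def merge_sort_compress(paths, out_path):
    """Merge small text logs, sort the lines, and write one gzip file."""
    lines = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    lines.sort()  # assumes each line starts with a sortable timestamp
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    return len(lines)


# Demo on two tiny, made-up log files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, text in enumerate(["2016-08-01 10:05 login b\n",
                              "2016-08-01 10:01 login a\n"]):
        part = os.path.join(tmp, "part%d.log" % i)
        with open(part, "w", encoding="utf-8") as f:
            f.write(text)
        parts.append(part)
    merged = os.path.join(tmp, "merged.log.gz")
    count = merge_sort_compress(parts, merged)
    with gzip.open(merged, "rt", encoding="utf-8") as f:
        merged_lines = f.read().splitlines()

print(count, merged_lines)
```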

  43. Logs Stored in S3 After Merging, Sorting & Compressing

  44. Hadoop MapReduce Coding - mrjob
    - A Python package made by Yelp
    - Write MR jobs in Python using Hadoop Streaming
    - Develop locally with sample data, then run on EMR
    - Runs somewhat slower than the Java version, but development is faster

  45. from mrjob.job import MRJob
    import re
    WORD_RE = re.compile(r"[\w']+")
    class MRWordFreqCount(MRJob):
        def mapper(self, _, line):  # for each line of the log file,
            for word in WORD_RE.findall(line):  # for every word in it,
                yield word.lower(), 1  # emit ('word', 1)
        def combiner(self, word, counts):  # aggregate results within a node
            yield word, sum(counts)
        def reducer(self, word, counts):  # aggregate results across the cluster
            yield word, sum(counts)
    if __name__ == '__main__':
        MRWordFreqCount.run()

  46. System Architecture Diagram

  47. Understanding the Current State
    - For machine learning, we asked for
    - the grounds on which GMs issue sanctions (= features) and
    - the list of sanctioned characters

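  48. Feature Generation Tips
    - Compute them per character from the logs
    - Favor a variety of features over elaborate ones
      - They are judged in combination anyway
    - Start with a short time window, and lengthen it once things stabilize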

  49. Features Extracted Initially
    - Number of logins
    - Play time
      - Logouts are often not clearly recorded
      - Introduced a session timeout: 5 minutes
    - Number of items/money acquired
    - Number of quests completed
    - Number of NPC/PC battles
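
One way to derive play time when logouts are unreliable is to treat any gap longer than the session timeout as a break between sessions. The sketch below assumes per-character event timestamps are available from the logs. The function name and the sample events are hypothetical; only the 5-minute timeout comes from the slide.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=5)  # value from the slide


def play_time(event_times, timeout=SESSION_TIMEOUT):
    """Sum the gaps between consecutive events that fall within `timeout`.

    A gap longer than `timeout` is treated as the end of a session, so it
    contributes nothing to the total.
    """
    times = sorted(event_times)
    total = timedelta()
    for prev, cur in zip(times, times[1:]):
        gap = cur - prev
        if gap <= timeout:
            total += gap
    return total


# Hypothetical log timestamps for one character.
events = [
    datetime(2016, 8, 1, 10, 0),
    datetime(2016, 8, 1, 10, 3),
    datetime(2016, 8, 1, 10, 7),
    datetime(2016, 8, 1, 11, 0),   # 53-minute gap: a new session starts here
    datetime(2016, 8, 1, 11, 2),
]
print(play_time(events))
```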

  50. What About Feature Types?
    - Broadly: real-valued, categorical, and Boolean
    - Preferably unify everything as real values
      - Booleans become 0 and 1
      - Categorical types are converted to real values with OneHotEncoder
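
The type unification described above can be illustrated without any library. This is a hand-rolled stand-in for the idea behind one-hot encoding, not Scikit-Learn's `OneHotEncoder` itself. The field names (`play_hours`, `is_guild_member`, `char_class`) and the category values are made up for the example.

```python
def to_float_vector(row, categories):
    """Flatten one feature row into a list of floats.

    row: dict with a real-valued, a Boolean and a categorical feature.
    categories: the fixed, ordered list of known category values.
    """
    vector = [float(row["play_hours"])]                    # real value: as-is
    vector.append(1.0 if row["is_guild_member"] else 0.0)  # Boolean: 0 / 1
    # Categorical: one column per category, 1.0 for the matching one.
    vector.extend(1.0 if row["char_class"] == c else 0.0 for c in categories)
    return vector


CLASSES = ["warrior", "mage", "archer"]  # hypothetical category values
row = {"play_hours": 12.5, "is_guild_member": True, "char_class": "mage"}
print(to_float_vector(row, CLASSES))
```

Keeping the category list fixed and ordered matters: every row must produce columns in the same positions, or the trained model would see shifted features.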

  51. Example of the Generated Features
    - A plain text (.txt) file
    - Format: character name + feature array

  52. Doing the Machine Learning

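  53. The Machine Learning Itself Is Lightweight
    - The final feature file is small, and the training itself is fairly light
      - Done on a local PC
      - Training that must look at all the data, as in recommender systems, would be heavy
    - The challenge is choosing a model and finding the best hyperparameters
      - You need to experiment many times with different settings
      - Some cases do call for a distributed system...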

  54. Which Algorithm/Model to Choose?
    - Start with something simple
    - If there is prior research on a similar case, refer to it
    - Evaluate and select models using AUC or ROC

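  55. Starting with a Decision Tree
    - Not complicated, and the decision process is easy to understand
    - Used the one in the Python Scikit-Learn package
      - Provides a solid range of machine learning algorithms
      - The unified interface makes swapping models easy
    - Train by passing in the features (X) and whether each is an abuser (y)
    - Conveniently, DTs need no feature normalization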

  56. DT Usage Example (Iris Classification)
    from sklearn.datasets import load_iris
    from sklearn import tree
    iris = load_iris()
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(iris.data, iris.target)
    >>> clf.predict(iris.data[:1, :])
    array([0])

  57. (Image only; no text on this slide)

  58. Decision Tree Training Process
    1. Find the features of known abusers in the feature file
    2. Take the features of an equal number of normal users
      - Under sampling
    3. Split the data into Train/Test sets
    4. Start training with default parameters
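
Steps 2 and 3 above can be sketched with the standard library. This is an illustration under assumptions: the helper names, the made-up two-dimensional feature vectors, the 50/5000 class sizes, and the 25% test ratio are all hypothetical. In a Scikit-Learn workflow the split would more likely be done with the library's own helper.

```python
import random


def under_sample(abusers, normals, rng):
    """Keep all abusers and draw an equal number of normal users."""
    return abusers, rng.sample(normals, len(abusers))


def train_test_split(samples, labels, test_ratio, rng):
    """Shuffle (sample, label) pairs and split them into train/test parts."""
    pairs = list(zip(samples, labels))
    rng.shuffle(pairs)
    n_test = int(len(pairs) * test_ratio)
    return pairs[n_test:], pairs[:n_test]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical feature vectors: 50 known abusers, 5000 normal users.
abusers = [[rng.random() + 1.0, rng.random()] for _ in range(50)]
normals = [[rng.random(), rng.random()] for _ in range(5000)]

abusers, normals = under_sample(abusers, normals, rng)
X = abusers + normals
y = [1] * len(abusers) + [0] * len(normals)
train, test = train_test_split(X, y, test_ratio=0.25, rng=rng)
print(len(train), len(test))
```

Under-sampling discards most of the normal users, so the balanced set is small. That is the price paid for keeping the classifier from simply predicting "normal" for everyone.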

  59. Initial Results
    - Mean accuracy of about 80%
    - Binary-class classification tends to score well
    - It does not look bad, but
    - it falls well short given that predictions are used as grounds for sanctions

  60. Let's Raise the Accuracy
    - Split the data set for cross validation
    - Found the best hyperparameters with GridSearchCV
    - Mean accuracy improved to 91%
    - To see what criteria the model uses, we drew it with tree.export_graphviz

  61. (Image only; no text on this slide)

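  62. Looking at the Decision Tree...
    - You can see what criteria the trained model uses to decide
      → Can be shared with people in various roles
    - The problem: it gets more complex the further down you go
    - DTs overfit easily, so take care that the depth does not get too deep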

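  63. The Score Stopped Improving Here
    - Added new features after consulting with the GMs
      - Number of items/money acquired at the same time
      - Number of map repetitions
      - Choosing only a specific class
      - Number of items acquired without moving
    - The know-how is turning even obscure-sounding observations into features
      - e.g. "Bots have randomly generated names"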

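  64. Example: Judging the Randomness of a Character Name (Consonant/Vowel Occurrence Patterns)
    ## Pseudocode that judges whether a character name is pronounceable
    # Convert the name to consonant/vowel symbols (1 = consonant, 2 = vowel)
    # e.g. anything -> '21211211'
    symbols = get_cv_symbols(char_name)
    # Pronounceable if any pattern like these occurs (the real list is more varied)
    patterns = ['2121', '2112', '1121', '22122']  # … more in practice
    can_pron = any(p in symbols for p in patterns)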
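
The pseudocode above leaves `get_cv_symbols` undefined. A minimal runnable version, assuming names written in Latin letters, might look like this. The vowel set (with 'y' counted as a vowel so that 'anything' maps to '21211211' as on the slide) and the handling of non-letters are assumptions, and the pattern list holds only the four examples shown on the slide, not the real, longer list.

```python
VOWELS = set("aeiouy")  # 'y' counts as a vowel so 'anything' -> '21211211'
PATTERNS = ("2121", "2112", "1121", "22122")  # the slide's examples only


def get_cv_symbols(name):
    """Map each letter to '1' (consonant) or '2' (vowel); skip other chars."""
    return "".join(
        "2" if ch in VOWELS else "1"
        for ch in name.lower()
        if ch.isalpha()
    )


def can_pronounce(name):
    """True if any known pronounceable pattern occurs in the symbol string."""
    symbols = get_cv_symbols(name)
    return any(pattern in symbols for pattern in PATTERNS)


print(get_cv_symbols("anything"))   # 21211211
print(can_pronounce("anything"))    # True
print(can_pronounce("xkrtqzw"))     # False
```

The Boolean result would then be stored as a 0/1 feature alongside the others rather than used as a verdict on its own.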

  65. Not an exact method, but...
    it helps when making a combined judgment

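  66. Extra Features Improved the Score, But…
    - Mean accuracy improved to 96%. The score is on the high side, but
    - when applied in practice
      - quite a few false positives turned up during the GMs' review
    - Attributed to the Decision Tree's chronic overfitting problem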

  67. Switching to Random Forest
    - An ensemble technique that combines many Decision Trees
    - Trains many DTs in a distributed manner (= a regularization effect) and has them vote
    - Stable results even if the score is lower
    - DecisionTree - a shaky 96%
      RandomForest - a stable 95%
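
The voting part of the idea can be shown in a few lines. This sketch covers only the majority vote: a real Random Forest also trains each tree on a random sample of the data and features, which is omitted here. The three one-feature "stumps", their thresholds, and the feature names are invented for illustration.

```python
from collections import Counter

# Hypothetical one-feature "trees": (feature name, threshold).
STUMPS = [("items_per_hour", 100), ("idle_pickups", 10), ("logins_per_day", 50)]


def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def stump_predict(features, name, threshold):
    """Vote 1 (abuser) if the feature exceeds its threshold, else 0."""
    return 1 if features[name] > threshold else 0


def forest_predict(features, stumps=STUMPS):
    """Collect one vote per stump and return the majority label."""
    votes = [stump_predict(features, name, t) for name, t in stumps]
    return majority_vote(votes)


suspect = {"items_per_hour": 250, "idle_pickups": 3, "logins_per_day": 80}
regular = {"items_per_hour": 20, "idle_pickups": 30, "logins_per_day": 2}
print(forest_predict(suspect), forest_predict(regular))
```

A single tree that happened to overfit one feature (here `idle_pickups` for the regular player) gets outvoted, which is the stabilizing effect the slide describes.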

  68. Training a Random Forest
    - Basically similar to a Decision Tree
    - max_depth, min_samples_leaf
      - Control the model's complexity. Start small and increase little by little
    - n_estimators
      - Decides how many trees (DTs) to plant
      - Too many and training takes long; too few and it is just a DT

  69. Results After Applying RF
    - Accuracy is 95%
    - To make sure no one is punished unfairly
      - used predict_proba to also get the probability of each prediction
      - included only high-probability (>70%) predictions
    - This is estimated to lower recall by about 10-20%
    - But precision…
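
The probability cut-off can be sketched independently of any classifier. The example assumes the per-character probability of the "abuser" class has already been obtained, for example from one column of a classifier's `predict_proba` output. The function name, character names, and probabilities are made up; only the 70% threshold comes from the slide.

```python
def confident_abusers(names, probabilities, threshold=0.7):
    """Keep only characters whose predicted abuser probability > threshold."""
    return [name for name, p in zip(names, probabilities) if p > threshold]


# Hypothetical per-character probabilities of the "abuser" class.
names = ["char_a", "char_b", "char_c", "char_d"]
abuser_proba = [0.95, 0.55, 0.72, 0.10]

print(confident_abusers(names, abuser_proba))  # ['char_a', 'char_c']
```

`char_b` would have been flagged at the usual 50% cut-off but is left alone here. Raising the threshold trades some recall for fewer wrongly sanctioned players.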

  70. Reached 100%
    According to the GMs' manual review…

  71. Found Them, So On to Sanctions...
    - Sanctions issued over a period of about 2 months
    - Most tool-based farming disappeared!
    - Sanctions must be periodic/continuous to be effective

  72. For Reference: Importance of the Final Features

  73. Directions for Improvement
    - Improve the trained model using the detected results
    - Collecting PII on bot accounts should make it easier to learn new bots
    - Variant bots need to be monitored after sanctions

  74. Afterword

  75. Takeaways
    - Every step, from data collection through processing to analysis, done in Python
    - Exploratory data analysis with Jupyter notebooks
    - Machine learning looks applicable to many more areas

  76. Going Deeper into Machine Learning
    - Study the underlying theory more, for deeper use
      - You become able to form a good hypothesis
      - You become able to optimize
    - Try more than one algorithm
      - Various classifiers such as SVM and Neural Net
      - Ensemble them in the Super Learner style

  77. Time Has Passed... The New Log Collection/Analysis Environment
    - RSync → real-time log collection with Fluentd/Kinesis
    - gzipped CSV → stored in S3 in Parquet format
      - Columnar binary format, 30x speedup
    - MRJob → PySpark
      - Powerful distributed processing / cache feature (a strength for iterative training)
    - Currently using a transient Spark cluster (20 VMs = 80 cores, 320GB RAM), at about 3,000 KRW per hour

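  78. Advice
    - Judge whether machine learning suits what you are trying to do
      - If the abuse pattern is simple, traditional methods will do
      - First understand the characteristics through exploratory data analysis
    - Test a variety of models/features
    - Depending on the model, features may need normalization/orthogonalization, so check
    - Watch out for class imbalance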

  79. Deep Learning? Machine Learning?
    - Deep learning
      - No need for elaborate feature engineering
      - Many parameters = needs a lot of data
    - Machine learning
      - Feature work matters, but
      - few parameters = effective even with little data

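  80. Random Thoughts
    - Data engineering is hard
    - Securing data matters most
    - Fields in the spotlight actually have dimmer prospects
      - Unless you are a top researcher, smokestack industries/niche data are the real blue ocean
    - An era in which every company needs a data analyst
    - What if computers could test every combination of models/variables?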

  81. Finally... Spurious Correlations
    - Cases where there is no actual relationship, but there appears to be one
    - Don't fixate on the data alone; understand the domain!

  82. Thank you.

  83. Reference Links
    - http://www.aladin.co.kr/shop/wproduct.aspx?ItemId=28946323
    - http://www.tylervigen.com/spurious-correlations
    - http://scikit-learn.org/stable/modules/tree.html
    - http://www.cimerr.net/conference/board/data/conference/1331626266/P15.pdf
    - http://stackoverflow.com/questions/20463281/how-do-i-solve-overfitting-in-random-forest-of-python-sklearn
    - http://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
    - https://www.quora.com/Is-Scala-a-better-choice-than-Python-for-Apache-Spark
    - http://statkclee.github.io/data-science/data-handling-pipeline.html
    - https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html