Detecting Game Abuse with Machine Learning

Presented at PyCon APAC 2016.

JeongJu Kim

August 16, 2016

Transcript

  1. Detecting Game Abuse with Machine Learning

    JeongJu Kim, PyCon APAC 2016
  2. Speaker introduction

    JeongJu Kim (haje01@naver.com)
    - Formerly: game development at NHN / NPLUTO (3D engine, game client development)
    - Currently: game data collection and analysis at Webzen NPlay (log forwarder, Pandas, Scikit-Learn, PySpark)
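  3. This talk

    - Is aimed at people who already have basic knowledge of machine learning
    - Shares case studies of data analysis and machine learning with Python
    - Will hopefully prompt you to bring machine learning into your own development and services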
  4. Motivation

    - Sanctioning game abuse through user reports, GM monitoring, and pattern hunting has its limits
    - Goal: build an abuse detection system that keeps human intervention to a minimum
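  5. What is game abuse?

    - "Acquiring game information in bulk, or helping others do so, in ways the game design did not intend"
    - Examples
      - Play that exploits loopholes in the service implementation
      - Abnormal play using hacking tools
      - Flooding the global chat window with advertisements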
  6. Statistics and exploratory data analysis

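  7. First, statistics

    - Statistics developed in an environment where data and computing power were scarce
    - Statisticians studied ways to reduce the data and computation required
    - Because it was built for harsh conditions, it can find value even in small data
    - Basic statistical knowledge is a great help in development, game design, service operation, and more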
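  8. Exploratory data analysis

    - The process of finding the information hidden in data by summarizing and visualizing it from various angles
    - Start here with any data you meet for the first time
    - We developed and use our own system (WzDat)
      - Jupyter + Utility + Dashboard
      - https://github.com/haje01/wzdat
      - http://www.pycon.kr/2014/program/14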
  9. Case 1: detecting spammers with a simple statistical idea

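  10. Situation

    - The chat window of a newly opened game was filled with ads for game items
    - Sanctioning an account did not help: the ads continued right away from a new account
    - Sanctions had to be fast, so there was not enough time to set up machine learning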
  11. Spam via chat

    - Ads selling in-game money and items, sent through the global chat
    - Abusers obfuscate their messages to evade programmatic detection
  12. Detecting spammers

    - Many approaches are possible, but
    - rather than advanced approaches such as natural language processing or machine learning,
    - we tried a simple statistical idea
  13. Distribution of online chat message lengths

    - Generally known to follow a log-normal distribution
    - Figure: message length distribution of the NPS Chat Corpus
  14. Distribution of in-game chat message lengths

    - Similar to online chat, but tends to be somewhat thicker
    - Messages of certain lengths spike (?) → confirmed to be spam
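  15. The idea

    - Ordinary user: message lengths vary, and messages are not frequent
    - Spammer: message lengths do not vary, and messages are frequent
    - In other words, a user who chats frequently with little variety in message length is a spammer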
  16. A simple detection formula

    - Per user: number of chat messages / number of distinct message lengths
    - The more often a user sends messages of similar length, the larger the value
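  17. Classification

    - Users whose spam_ratio is at or above a threshold are treated as spammers
    - The threshold is chosen heuristically...
    - Set it roughly, then tune it by checking the messages of the characters that were classified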
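
The ratio and threshold from slides 16 and 17 can be sketched as follows. This is a minimal illustration, not the speaker's code: only the name `spam_ratio` comes from the slides, while the function names, the sample chat log, and the threshold of 3.0 are assumptions made for the example.

```python
from collections import defaultdict


def spam_ratios(chat_log):
    """Return {user: message count / number of distinct message lengths}.

    chat_log is an iterable of (user, message) pairs (assumed log shape).
    """
    counts = defaultdict(int)
    lengths = defaultdict(set)
    for user, message in chat_log:
        counts[user] += 1
        lengths[user].add(len(message))
    return {user: counts[user] / len(lengths[user]) for user in counts}


def find_spammers(chat_log, threshold):
    """Users whose spam_ratio is at or above the (heuristic) threshold."""
    ratios = spam_ratios(chat_log)
    return sorted(user for user, ratio in ratios.items() if ratio >= threshold)


# Hypothetical chat log: the bot obfuscates its text but keeps the same length.
chat_log = [
    ("alice", "hi"),
    ("alice", "anyone up for a raid?"),
    ("alice", "brb"),
    ("bot42", "G0LD 4 SALE!!"),
    ("bot42", "GOLD 4 S@LE!!"),
    ("bot42", "G0LD 4 S4LE!!"),
    ("bot42", "GOLD 4 SALE!?"),
]

ratios = spam_ratios(chat_log)
spammers = find_spammers(chat_log, threshold=3.0)  # threshold is an assumption
print(ratios, spammers)
```

Obfuscation changes the characters but usually not the length, which is why counting distinct lengths still separates the bot from the ordinary user here.
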
  18. Message length distribution after classification

    - The high-frequency messages of specific lengths (= spam) were separated out
  19. Applying the result

    - The implementation was simple, but false positives are possible
    - We set the threshold high to raise confidence
    - Sanctions were applied based on this result
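  20. Directions for improvement

    - Choose the threshold in a more scientific way
    - Introduce natural language processing (NLP)
      - Take word frequency (Zipf's law) and importance (TF-IDF) into account
    - Apply machine learning algorithms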
  21. Introduction to machine learning

  22. Why use machine learning

    - Decent results for little effort
    - A general solution to a wide range of problems
    - Can consider many characteristics (features) at the same time
    - Robust to changes in the data
  23. Classification and regression

    - Machine learning divides broadly into classification and regression
    - Classification: predicting a category
    - Regression: predicting a continuous value
    - Abuse detection is a classification problem
  24. Supervised and unsupervised learning

    - Supervised learning
      - When sample data already labeled from past experience is available
    - Unsupervised learning
      - When no labeled sample data is available
    - Most data is not properly labeled → a problem that has to be solved
  25. Machine learning algorithms

    - Basic
      - Linear / Logistic Regression
      - Decision Tree
    - Advanced
      - Random Forest
      - SVM (Support Vector Machine)
      - Neural Network
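  26. Which algorithm to choose?

    - In general, advanced algorithms can learn more complex models
    - However, an advanced algorithm is not automatically better
    - Basic algorithms are better when people need to understand what was learned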
  27. Evaluating predictions

    - We need a definition of correctness
    - Q: We want to detect the 2 abusers among 100 users. If we mistakenly judge everyone to be a normal user, what is the accuracy?
    - A: 2 out of 100 were wrong, so... 98%!?
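  28. Metrics

    - There are various metrics, such as precision and recall
    - Precision: of those we flagged, how many are real abusers?
    - Recall: of all abusers, how many did we find?
    - When the data is imbalanced, precision and recall must be considered together
    - In the previous example, recall is 0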
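
The 100-user example from slides 27 and 28 can be checked with a few lines of plain Python. This is a small sketch; the function name is an assumption, and returning 0.0 for precision when nothing was flagged is a convention chosen here, not something the slides specify.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = abuser."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Convention (assumption): report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# 2 abusers among 100 users, and a "classifier" that calls everyone normal.
y_true = [1, 1] + [0] * 98
y_pred = [0] * 100

accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

Accuracy comes out at 0.98 even though not a single abuser was found, which is why recall (0 here) has to be reported alongside it for imbalanced data.
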
  29. P/R curve and AUC: what makes a good classifier?
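
As a rough illustration of comparing classifiers by the area under the precision/recall curve, the sketch below uses scikit-learn's `precision_recall_curve` and `auc`. The labels and scores are made up for the example and are not data from the talk.

```python
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical ground truth (1 = abuser) and scores from two classifiers.
y_true = [0, 0, 1, 1]
scores_good = [0.1, 0.2, 0.8, 0.9]  # ranks both abusers above the normal users
scores_poor = [0.9, 0.8, 0.2, 0.1]  # ranks them below the normal users


def pr_auc(y, scores):
    """Area under the precision/recall curve."""
    precision, recall, _thresholds = precision_recall_curve(y, scores)
    return auc(recall, precision)


pr_auc_good = pr_auc(y_true, scores_good)
pr_auc_poor = pr_auc(y_true, scores_poor)
print(pr_auc_good, pr_auc_poor)
```

A classifier whose curve stays close to precision 1.0 across all recall levels has the larger area, so the area gives a single number for comparing models.
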

  30. Case 2: detecting farming with machine learning

  31. Situation

    - Farming with various hacking tools was rampant in a live game
    - Farming: acquiring in-game goods by abnormal means
    - A bot's behavior cannot be pinned down to one or two traits → machine learning is needed
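  32. Choosing the learning approach

    - There seemed to be no real need for neural networks or deep learning...
    - Past logs were already being stored, and
    - the operations team had a list of known abuser characters
      → machine learning, and supervised learning in particular, was possible
    - We decided on supervised learning with a Decision Tree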
  33. Preparation

    1. Check the state of log collection
    2. Understand the structure and meaning of the logs
    3. Extract features for training
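  34. Machine learning, too, starts with log collection

    - Collecting logs systematically is not easy either
    - Analysis and training take only about 10 to 20% of the time
    - Most of the time goes into gathering and processing the data
    - Use the log format as-is wherever possible (for the studio's sake...)
    - Store logs in a sensible classification (by server / log type and by time)
    - Cloud storage (S3) is recommended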
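  35. Collecting logs from Windows servers

    - Most game servers run on Windows
    - We wanted to use good open-source tools (fluentd, logstash, etc.), but
    - they were not easy to install on Windows servers and lacked some features
    - So we built our own
      - https://github.com/haje01/wdfwd
      - Syncs the log files left on the server with RSync, or
      - connects to the game DB, dumps it, and sends the dump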
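  36. Once the logs are collected, build features

    - Feature: a value that describes a characteristic of the thing being learned
    - Example: predicting house prices
      → the size, orientation, surroundings, transport, and amenities of the house are the features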
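  37. Feature engineering

    - The work of finding and generating features from structured or unstructured data
    - Sometimes this means uncovering features latent in other features
    - Sometimes it needs complex code (hard to do in SQL)
    - We generated features with Hadoop from three months of logs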
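  38. Do we have to use Hadoop?

    - Not if the data is not "big"
    - Instead...
      - you may need to run batch jobs for a long time, or
      - periodically load the data into a DB through ETL
    - It is a good fit if you frequently develop features on unstructured or large-volume data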
  39. How should we use it?

    - You can build and run your own Hadoop cluster, but setup and operation are difficult
    - Use the Hadoop service offered by a cloud provider: AWS EMR (Elastic Map Reduce)
  40. Isn't AWS expensive?

    - Not if you optimize
    - Use a transient cluster that runs only when needed
    - Run Task nodes as auction-priced Spot Instances
    - m4.xlarge (4 vCPU, 16 GiB RAM): $0.036 per hour (Seoul region, as of 2016-08-09)
  41. AWS EMR cluster launch screen

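  42. Preparing logs for Hadoop

    - Hadoop handles large numbers of small files (< 100MB) poorly
    - Small files need to be merged, sorted, and compressed
    - We could not find a suitable tool, so we built one
      - https://github.com/haje01/mersoz
      - Processes only the files that changed, in parallel, taking dependencies into account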
  43. Logs stored in S3 after merging, sorting, and compression
  44. Hadoop MapReduce coding with mrjob

    - A Python package made by Yelp
    - Write MR jobs in Python using Hadoop streaming
    - Develop locally on sample data, then run on EMR
    - Execution is somewhat slower than the Java version, but development is faster
  45.

        from mrjob.job import MRJob
        import re

        WORD_RE = re.compile(r"[\w']+")


        class MRWordFreqCount(MRJob):

            def mapper(self, _, line):
                # For every word in each line of the log file,
                # emit the pair ('word', 1)
                for word in WORD_RE.findall(line):
                    yield word.lower(), 1

            def combiner(self, word, counts):
                # Aggregate the results within a node
                yield word, sum(counts)

            def reducer(self, word, counts):
                # Aggregate the results across the cluster
                yield word, sum(counts)


        if __name__ == '__main__':
            MRWordFreqCount.run()
  46. System architecture diagram

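  47. Understanding the current state

    - For the machine learning work, we asked the GMs for
      - the grounds on which they sanction (= features), and
      - the list of sanctioned characters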
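  48. Tips for generating features

    - Compute them per character from the logs
    - Prefer a variety of features over a few elaborate ones
      - They are judged in combination anyway
    - Start with a short time window, and lengthen it once things stabilize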
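  49. Features extracted at first

    - Number of logins
    - Play time
      - Logouts are often not recorded clearly
      - Introduced a session timeout: 5 minutes
    - Number of items / amount of money acquired
    - Number of quests completed
    - Number of NPC / PC battles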
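
One way to read the 5-minute session timeout is to sum the gaps between a character's consecutive log events and drop any gap longer than the timeout. The sketch below illustrates that reading under stated assumptions: the function name and event timestamps are hypothetical, and the talk does not show how play time was actually computed.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=5)  # the 5-minute value is from the slide


def estimate_play_time(event_times, timeout=SESSION_TIMEOUT):
    """Sum gaps between consecutive events, skipping gaps above the timeout.

    A gap longer than the timeout is treated as the end of a session,
    so a missing logout record does not inflate the play time.
    """
    ordered = sorted(event_times)
    total = timedelta()
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap <= timeout:
            total += gap
    return total


# Hypothetical log events for one character: minutes past 10:00.
day = datetime(2016, 8, 1)
events = [day + timedelta(hours=10, minutes=m) for m in (0, 2, 4, 30, 33)]

play_time = estimate_play_time(events)
print(play_time)
```

The 26-minute silence between 10:04 and 10:30 is treated as a session break, so only the gaps inside the two sessions are counted.
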
  50. What types do features have?

    - Broadly: real-valued, categorical, and Boolean
    - It is desirable to unify them as real values where possible
      - Booleans become 0 and 1
      - Categorical features become real-valued via OneHotEncoder
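
A minimal sketch of that unification with scikit-learn's `OneHotEncoder` follows. The feature names and values are invented for illustration; calling `.toarray()` on the encoder's default sparse output is used here so the example does not depend on version-specific constructor arguments.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical per-character features of three different types.
play_hours = np.array([1.5, 20.0, 3.0, 18.5])                          # real
is_moving = np.array([True, False, True, True])                        # Boolean
char_class = np.array([["warrior"], ["mage"], ["archer"], ["mage"]])   # category

# Categories are one-hot encoded; columns follow sorted category order
# (archer, mage, warrior).
encoder = OneHotEncoder()
class_onehot = encoder.fit_transform(char_class).toarray()

# Booleans become 0.0 / 1.0, and everything is stacked into one float matrix.
X = np.column_stack([play_hours, is_moving.astype(float), class_onehot])
print(X)
```

Each row of `X` is now a purely numeric feature vector that any scikit-learn estimator can accept.
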
  51. Example of the generated features

    - A plain text (.txt) file
    - Format: character name + feature array
  52. Running the machine learning

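  53. The machine learning itself is lightweight

    - The final feature file is small, and training is fairly light
      - It runs on a local PC
      - Training that must look at all the data, as in a recommender system, would be heavy
    - The real task is choosing a model and finding the best hyperparameters
      - You have to experiment many times with different settings
      - In some cases a distributed system is used for this...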
  54. Which algorithm/model should we choose?

    - Start with something simple
    - If there is prior work on a similar case, consult it
    - Evaluate and select models using AUC or ROC
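  55. Starting with a Decision Tree

    - It is not complex, and its decision process is easy to understand
    - We used the one in the Python Scikit-Learn package
      - It provides a solid range of machine learning algorithms
      - Its interface is uniform, so swapping models is easy
    - Train by passing in the features (X) and the abuser labels (y)
    - Conveniently, a DT does not need feature normalization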
  56. DT usage example (iris classification)

        from sklearn.datasets import load_iris
        from sklearn import tree

        iris = load_iris()
        clf = tree.DecisionTreeClassifier()
        clf = clf.fit(iris.data, iris.target)

        >>> clf.predict(iris.data[:1, :])
        array([0])
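  58. Decision Tree training process

    1. Find the features of known abusers in the feature file
    2. Take the features of an equal number of normal users (under-sampling)
    3. Split the data into train and test sets
    4. Start training with the default parameters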
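
Steps 1 to 3 above can be sketched with the standard library alone. Everything named here (the helper functions, the two-column feature rows, the 25% test ratio, and the fixed seed) is an assumption for illustration, not the speaker's actual pipeline.

```python
import random


def undersample(abuser_rows, normal_rows, seed=0):
    """Pair every abuser row with one randomly chosen normal-user row.

    Returns a shuffled list of (features, label), label 1 = abuser.
    """
    rng = random.Random(seed)
    sampled_normals = rng.sample(normal_rows, len(abuser_rows))
    labeled = [(row, 1) for row in abuser_rows]
    labeled += [(row, 0) for row in sampled_normals]
    rng.shuffle(labeled)
    return labeled


def split_train_test(labeled, test_ratio=0.25):
    """Split already-shuffled data into (train, test)."""
    n_test = int(len(labeled) * test_ratio)
    return labeled[n_test:], labeled[:n_test]


# Hypothetical feature rows: 20 known abusers against 1000 normal users.
abuser_rows = [[50 + i, 900 + i] for i in range(20)]
normal_rows = [[i % 10, 30 + i % 60] for i in range(1000)]

labeled = undersample(abuser_rows, normal_rows)
train, test = split_train_test(labeled)
print(len(train), len(test))
```

Under-sampling throws away most normal users, which keeps the two classes balanced for training at the cost of not using all the available data.
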
  59. Initial results

    - Mean accuracy of about 80%
      - Scores tend to come out well for binary classification
    - That may not seem bad, but
    - it falls well short given that the predictions are used as grounds for sanctions
  60. Raising the accuracy

    - Split the data set for cross validation, and
    - find the best hyperparameters with GridSearchCV
    - Mean accuracy improved to 91%
    - We wanted to see what criteria the model uses, so we drew the tree with tree.export_graphviz
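
The cross-validated search and the tree export might look roughly like this with scikit-learn. The parameter grid and the use of the iris data set are assumptions for the sake of a runnable example; the talk does not list the grid that was actually searched.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()

# Hypothetical grid over two parameters that control tree complexity.
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross validation for every parameter combination.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_, search.best_score_)

# Export the best tree as Graphviz DOT text (out_file=None returns a string).
dot_source = export_graphviz(
    search.best_estimator_,
    out_file=None,
    feature_names=iris.feature_names,
)
print(dot_source[:80])
```

The DOT text can be rendered to an image with the Graphviz `dot` tool, which is what makes the decision criteria easy to share with non-developers.
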
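  62. Looking at the decision tree...

    - You can see what criteria the trained model uses
      → it can be shared with people in many different roles
    - It gets more complicated the further down you go
    - DTs overfit easily, so take care that the depth does not get too deep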
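  63. The score would not go any higher from here

    - After consulting the GMs, we added new features
      - Number of items / amount of money obtained at the same moment
      - Number of map repetitions
      - Choosing only a specific class
      - Number of items obtained without moving
    - The know-how lies in turning even hard-to-pin-down observations into features
      - Example: "Bots have randomly generated names"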
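  64. Example: judging the randomness of a character name (consonant/vowel occurrence patterns)

        ## Pseudocode that judges whether a character name is pronounceable
        # Convert the name into consonant/vowel symbols (1 = consonant, 2 = vowel)
        # e.g. anything -> '21211211'
        symbols = get_cv_symbols(char_name)
        # The name is pronounceable if it contains a pattern like these
        # (the real list is more varied)
        patterns = ['2121', '2112', '1121', '22122']  # , ...
        can_pron = any(p in symbols for p in patterns)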
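
The slide leaves `get_cv_symbols` undefined. One possible implementation is sketched below, assuming names made of ASCII letters and treating "y" as a vowel so that "anything" maps to '21211211' as in the slide's example. The pattern tuple contains only the four patterns shown on the slide; the full list used in production is not given.

```python
VOWELS = set("aeiouy")  # assumption: 'y' counts as a vowel

# Only the patterns shown on the slide; the real list is longer.
PRONOUNCEABLE_PATTERNS = ("2121", "2112", "1121", "22122")


def get_cv_symbols(name):
    """Map each letter to '2' (vowel) or '1' (consonant); skip non-letters."""
    return "".join(
        "2" if ch in VOWELS else "1"
        for ch in name.lower()
        if ch.isalpha()
    )


def can_pronounce(name):
    """True if the consonant/vowel string contains a known pattern."""
    symbols = get_cv_symbols(name)
    return any(pattern in symbols for pattern in PRONOUNCEABLE_PATTERNS)


print(get_cv_symbols("anything"), can_pronounce("anything"))
print(get_cv_symbols("xkqzwrt"), can_pronounce("xkqzwrt"))
```

The resulting Boolean can be fed to the classifier as one more 0/1 feature rather than used as a verdict on its own.
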
  65. It is not an exact method... but it helps when judging in combination with other features
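  66. The extra features improved the score, but...

    - Mean accuracy improved to 96%, which is a high score, but
    - when applied in practice,
      - quite a few false positives turned up during the GMs' review
    - We attributed this to the Decision Tree's chronic overfitting problem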
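  67. Switching to Random Forest

    - An ensemble technique that combines many Decision Trees
    - Many DTs are trained in a distributed fashion (= a regularizing effect) and then vote
    - The score is lower, but the results are stable
    - DecisionTree: an unstable 96%; RandomForest: a stable 95%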
  68. Training a Random Forest

    - Basically similar to a Decision Tree
    - max_depth, min_samples_leaf
      - Control the model's complexity. Start small and grow them little by little
    - n_estimators
      - Decides how many trees (DTs) to plant
      - Too many and training takes long; too few and it becomes just a DT
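  69. Results after applying RF

    - Accuracy is 95%
    - To make sure no one is disciplined unfairly,
      - we also obtained prediction probabilities with predict_proba, and
      - kept only predictions with high probability (> 70%)
      - We estimate this lowers recall by about 10 to 20%
    - Precision, however...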
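
The probability filter described above can be sketched as follows. The synthetic data from `make_classification` and the specific hyperparameter values are assumptions made so the example runs; only `predict_proba`, the parameter names, and the 70% cut-off come from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the character features (X) and abuser labels (y).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical settings; the talk only names the parameters.
clf = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_leaf=5, random_state=0
)
clf.fit(X_train, y_train)

# Column 1 holds the predicted probability of class 1 (abuser).
abuser_proba = clf.predict_proba(X_test)[:, 1]

flagged_default = clf.predict(X_test) == 1   # plain majority decision
flagged_confident = abuser_proba > 0.7       # keep only confident predictions
print(flagged_default.sum(), flagged_confident.sum())
```

Raising the cut-off can only shrink the flagged set, so it trades recall for precision, which matches the direction of the trade-off reported on the slide.
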
  70. 100% achieved: the result of the GMs' manual review...
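  71. We found them, so on to sanctions...

    - Sanctions were applied over a period of about two months
    - Tool-based farming mostly disappeared!
    - Sanctions have to be periodic and sustained to be effective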
  72. For reference: importance of the final features

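  73. Directions for improvement

    - Use the detection results to improve the trained model
    - Collecting PII for bot accounts would make it easier to learn new bots
    - Variant bots need to be monitored after sanctions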
  74. Afterword

  75. Impressions

    - The whole process, from data collection through processing to analysis, was done in Python
    - Exploratory data analysis with Jupyter notebooks
    - Machine learning looks applicable to many more areas
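  76. Going deeper into machine learning

    - Study more of the basic theory in order to use it in depth
      - It lets you form good hypotheses
      - It lets you optimize
    - Try using more than one algorithm
      - Various classifiers such as SVM and neural nets
      - Ensembles in the Super Learner style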
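  77. Time has passed... the new log collection and analysis environment

    - RSync → real-time log collection with Fluentd / Kinesis
    - gzipped CSV → stored in S3 in Parquet format
      - A columnar binary format, 30x faster
    - MRJob → PySpark
      - Powerful distributed processing and caching (a strength for iterative training)
      - Run as a transient Spark cluster (20 VMs = 80 cores, 320GB RAM), at about 3,000 won per hour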
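  78. Advice

    - Judge whether machine learning fits what you are trying to do
      - If the abuse has simple characteristics, traditional methods will do
      - First understand the characteristics through exploratory data analysis
    - Test a variety of models and features
    - Depending on the model, features may need normalization or orthogonalization, so check
    - Watch out for class imbalance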
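  79. Deep learning? Machine learning?

    - Deep learning
      - No need for elaborate feature engineering
      - Many parameters = needs a lot of data
    - Machine learning
      - Feature work is important, but
      - few parameters = effective even with little data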
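  80. Miscellaneous thoughts

    - Data engineering is hard
    - Securing the data matters most
    - The fields in the spotlight actually have darker prospects
      - Unless you are a top researcher, traditional industries and niche data are the real blue ocean
    - We live in a time when every company needs a data analyst
    - What if computers could test every combination of models and variables?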
  81. Finally... spurious correlations

    - Cases where there appears to be a relationship although none actually exists
    - Do not fixate on the data alone; understand the domain!
  82. Thank you.

  83. Reference links

    - http://www.aladin.co.kr/shop/wproduct.aspx?ItemId=28946323
    - http://www.tylervigen.com/spurious-correlations
    - http://scikit-learn.org/stable/modules/tree.html
    - http://www.cimerr.net/conference/board/data/conference/1331626266/P15.pdf
    - http://stackoverflow.com/questions/20463281/how-do-i-solve-overfitting-in-random-forest-of-python-sklearn
    - http://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
    - https://www.quora.com/Is-Scala-a-better-choice-than-Python-for-Apache-Spark
    - http://statkclee.github.io/data-science/data-handling-pipeline.html
    - https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html