This document was presented at PyCon APAC 2016.
Detecting Game Abuse with Machine Learning
Presenter: Kim
PyCon APAC 2016
Speaker introduction
Kim ([email protected])
- Previously: game development
  - NHN / NPLUTO
  - 3D engine / game client development
- Currently: game data collection / analysis
  - Webzen NPlay
  - Log forwarder, Pandas, Scikit-Learn, PySpark
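This talk
- Is aimed at people who have basic knowledge of machine learning
- Shares cases of data analysis and machine learning with Python
- Will hopefully prompt you to bring machine learning into development and services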
Motivation
- Sanctioning game abuse
- User reports / GM monitoring / pattern-based rules have their limits
- Build an abuse detection system that minimizes human intervention
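What is game abuse?
- "Acquiring in-game information or gaining an advantage in ways the game design did not intend"
- Examples
  - Play that exploits loopholes in the service implementation
  - Abnormal play using hacking tools
  - Flooding the chat window with advertisements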
Statistics and Exploratory Data Analysis
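First, statistics
- Statistics developed in an environment without abundant data or computing power
- Statisticians studied methods that make do with little data and little computation
- Because it was built under harsh conditions, it can find value even in small data
- Basic statistical knowledge helps in development, game design, service operation, and more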
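Exploratory data analysis
- The process of finding information hidden in data
- by summarizing and visualizing it from various angles
- It starts with finding the data you want
- Developed and use an in-house system (WzDat)
- Jupyter + Utility + Dashboard
- https://github.com/haje01/wzdat
- http://www.pycon.kr/2014/program/14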
Case 1
Detecting spammers with a simple statistical idea
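Situation
- The chat window of a newly launched game was filled with game item advertisements
- Even when an account was sanctioned, the ads continued right away from a new account
- Quick sanctions were needed, so there was not enough time to set up machine learning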
Spam via chat
- Advertising money/item sales through in-game chat
- Abusers obfuscate their messages to evade programmatic detection
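Spammer detection
- Many approaches are possible, but
- rather than an advanced approach such as natural language processing or machine learning,
- we tried a simple statistical idea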
Distribution of online chat message lengths
- Generally known to follow a log-normal distribution.
- The slide plots the message length distribution of the NPS Chat Corpus
Distribution of in-game chat message lengths
- Similar to online chat, but tends to be somewhat thicker
- Messages of certain lengths stand out (?) → confirmed to be spam
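Idea
- Ordinary user: message lengths vary, and messages are not frequent
- Spammer: message lengths do not vary, and messages are frequent
- That is, a user who chats frequently with little variety in message length is a spammer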
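A simple detection formula
- spam_ratio = chat count per user / number of distinct message lengths
- The more often a user sends messages of similar length, the larger the value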
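Classification
- Users whose spam_ratio is at or above a threshold are treated as spammers
- The threshold is chosen heuristically...
- After setting it, tune the value by checking the messages of the classified characters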
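The detection formula and threshold rule can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the code used in the talk: the `(user, message)` log format, the sample messages, the function names, and the threshold value of 3.0 are all made up for illustration.

```python
from collections import defaultdict


def spam_ratios(chat_log):
    """Return {user: chat count / number of distinct message lengths}."""
    counts = defaultdict(int)    # number of chat messages per user
    lengths = defaultdict(set)   # distinct message lengths per user
    for user, message in chat_log:
        counts[user] += 1
        lengths[user].add(len(message))
    return {user: counts[user] / len(lengths[user]) for user in counts}


def find_spammers(chat_log, threshold):
    """Users whose spam_ratio is at or above the (heuristic) threshold."""
    ratios = spam_ratios(chat_log)
    return sorted(user for user, ratio in ratios.items() if ratio >= threshold)


# Hypothetical chat log: the seller obfuscates the text, but the length stays the same.
chat_log = [
    ("alice", "hi"),
    ("alice", "anyone up for a raid?"),
    ("alice", "brb"),
    ("seller01", "CHEAP G0LD w w w.x.y"),
    ("seller01", "CHEAP GOLD w-w-w.x.y"),
    ("seller01", "CHE4P GOLD w.w.w x y"),
    ("seller01", "CHEAP G0LD w_w_w.x.y"),
]

ratios = spam_ratios(chat_log)
spammers = find_spammers(chat_log, threshold=3.0)
```

Obfuscation changes the characters of a message but rarely its length, which is why counting distinct lengths is enough to separate the two users here.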
Message length distribution after classification
- The high-frequency messages of specific lengths (= spam) were separated out
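Applying the results
- The implementation was simple, but false positives are possible
- Set the threshold high to raise confidence
- Sanctions were applied based on the results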
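Directions for improvement
- Choose the threshold in a more scientific way
- Introduce natural language processing (NLP) techniques
- Consider per-word frequency (Zipf's law) and importance (TF-IDF)
- Apply machine learning algorithms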
Introduction to Machine Learning
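Why use machine learning
- Decent results with little effort
- A general solution for a variety of problems
- Can consider many characteristics (features) at the same time
- Robust to variation in the data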
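Classification and regression
- Machine learning is broadly divided into classification and regression
- Classification: predicting a category
- Regression: predicting a continuous value
- Abuse detection is a classification problem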
Supervised and unsupervised learning
- Supervised learning
  - When sample data labeled from prior experience is available
- Unsupervised learning
  - When no labeled sample data is available
- Most data is unlabeled → a problem to be solved
Machine learning algorithms
- Basic
  - Linear/Logistic Regression
  - Decision Tree
- Advanced
  - Random Forest
  - SVM (Support Vector Machine)
  - Neural Network
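Choosing an algorithm?
- In general, advanced algorithms can learn more complex models
- However, advanced algorithms are not always better
- Basic algorithms are better when people need to understand what was learned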
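Evaluating predictions
- Accuracy needs a definition
- Q: We want to detect the 2 abusers among 100 users. If we mistakenly judge everyone to be a normal user, what is the accuracy?
- A: 2 out of 100 were wrong, so… 98%!?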
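Measures of prediction
- There are various measures, such as precision and recall
- Precision: of those we caught, how many are abusers?
- Recall: of the abusers, how many did we catch?
- When the data is imbalanced, precision and recall especially need to be considered together
- In the case above, recall is 0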
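The 100-user example can be worked through directly. The sketch below computes accuracy, precision, and recall from scratch; the label encoding (1 = abuser, 0 = normal user) and the function names are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = abuser, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall); None where a measure is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else None  # nothing was flagged
    recall = tp / (tp + fn) if (tp + fn) else None     # no abusers exist
    return accuracy, precision, recall


# 2 abusers among 100 users, and a "classifier" that says everyone is normal.
y_true = [1, 1] + [0] * 98
y_pred = [0] * 100

accuracy, precision, recall = evaluate(y_true, y_pred)
```

Here accuracy comes out at 0.98 while recall is 0.0 and precision is undefined, which is why accuracy alone is misleading on imbalanced data.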
P/R curve and AUC
Which is the better classifier?
Case 2
Detecting farming with machine learning
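Situation
- In a live game, farming play using various hacking tools was rampant
- Farming: acquiring in-game goods by abnormal means
- Bot characteristics are hard to pin down to one or two traits → machine learning is needed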
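Choosing a learning approach
- There seems to be no need to go as far as neural nets or deep learning…
- Past logs are being stored, and
- the operations team has a list of known abusing characters → machine learning, specifically supervised learning, is possible!
- Decided on supervised learning with a Decision Tree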
Preparation
1. Check the state of log collection
2. Understand the structure and meaning of the logs
3. Extract features for learning
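Machine learning also starts with log collection
- Collecting logs systematically is not easy
- Analysis and learning take only about 10~20% of the time
- Most of the time goes into gathering and processing data.
- Use the log format as-is where possible (for the sake of the studios…)
- Store logs by category (server / log type, and by date and time)
- Cloud storage (S3) is recommended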
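Collecting logs from Windows servers
- Most game servers run on Windows
- We wanted to use good open source tools (fluentd, logstash, etc.), but
- they are not easy to install on Windows servers, and some features are missing
- Built our own
- https://github.com/haje01/wdfwd
- Syncs log files left on the server with RSync, or
- connects to the game DB, dumps it, and transfers the dump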
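Once logs are collected, build features
- Feature: a value that describes a characteristic of the learning target
- e.g. when predicting house prices → size, orientation, surroundings, transport, amenities, and so on are features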
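Feature Engineering
- The work of finding and generating features from (un)structured data
- Sometimes it uncovers features latent in other features
- Sometimes complex code is needed (hard to do in SQL)
- Generated features from three months of logs using Hadoop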
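Do we have to use Hadoop?
- Not needed if the data is not "Big"
- Instead…
- you may have to run batch jobs for a long time, or
- periodically load the data into a DB through ETL
- It pays off if you frequently develop features from unstructured or large-volume data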
How should we use it?
- You can build and run your own Hadoop cluster, but setup and operation are hard
- Use a Hadoop service offered by a cloud provider
  - AWS EMR (Elastic Map Reduce)
Isn't AWS expensive?
- Not if you optimize
- Run it as a transient cluster that is used only when needed
- Use auction-style Spot Instances for Task nodes
- m4.xlarge (4 vCPU, 16 GiB RAM): $0.036 per hour (Seoul region, as of 2016-08-09)
AWS EMR cluster launch screen
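Log processing for Hadoop
- Hadoop copes poorly with many small files (< 100MB)
- Files need to be merged, sorted, and compressed
- Could not find a suitable tool, so built our own
- https://github.com/haje01/mersoz
- Processes only changed files, with parallel processing that respects dependencies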
Logs stored in S3 after merging, sorting & compression
Hadoop MapReduce coding: mrjob
- A Python package made by Yelp
- Write MR jobs in Python via Hadoop Streaming
- Develop locally with sample data, then run on EMR
- Runs a little slower than the Java version, but development is faster
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):  # each line of a log file
        for word in WORD_RE.findall(line):  # for every word
            yield word.lower(), 1  # emit ('word', 1)

    def combiner(self, word, counts):  # aggregate results within a node
        yield word, sum(counts)

    def reducer(self, word, counts):  # aggregate results across the cluster
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
System architecture diagram
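Understanding the current state
- For machine learning, we asked for
- the grounds on which GMs impose sanctions (= features) and
- the list of sanctioned characters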
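Feature generation
- Computed per character from the logs
- Prefer a variety of features over finely crafted ones
- They are judged in combination anyway
- Start with a short time window, then lengthen it once things stabilize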
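Features extracted at first
- Number of logins
- Play time
  - Logout is often not clearly recorded
  - Introduced a session timeout: 5 minutes
- Number of items/money acquired
- Number of quests completed
- Number of NPC/PC battles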
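When logouts are not recorded reliably, play time can be estimated by closing a session whenever the gap between two consecutive log events exceeds the timeout. The following is a minimal sketch of that idea, assuming each character's log events are available as a list of `datetime` values; the function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=5)


def play_time(timestamps, timeout=SESSION_TIMEOUT):
    """Sum session durations; a gap longer than `timeout` ends the session."""
    events = sorted(timestamps)
    total = timedelta(0)
    if not events:
        return total
    session_start = prev = events[0]
    for ts in events[1:]:
        if ts - prev > timeout:
            total += prev - session_start  # close the previous session
            session_start = ts             # and open a new one
        prev = ts
    total += prev - session_start          # close the last session
    return total


# Hypothetical log event times for one character.
events = [
    datetime(2016, 8, 13, 10, 0),
    datetime(2016, 8, 13, 10, 2),
    datetime(2016, 8, 13, 10, 4),
    datetime(2016, 8, 13, 10, 30),  # 26-minute gap: a new session starts here
    datetime(2016, 8, 13, 10, 33),
]

total = play_time(events)
```

With these events the two sessions last 4 and 3 minutes, so the estimated play time is 7 minutes rather than the 33 minutes between the first and last event.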
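Feature types?
- Broadly divided into real-valued, categorical, and Boolean types
- It is preferable to unify them as real values where possible
- Bool becomes 0 or 1
- Categorical types are converted to real values with OneHotEncoder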
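To show what the conversion produces, here is a small pure-Python stand-in for one-hot encoding. It is not scikit-learn's `OneHotEncoder`, only a sketch of the same idea; the feature names (`play_hours`, `is_guild_member`, character class) are hypothetical.

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 float row per input value)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0
        rows.append(row)
    return categories, rows


def to_feature_row(play_hours, is_guild_member, class_one_hot):
    """Unify real, Boolean, and categorical inputs into one list of floats."""
    return [float(play_hours), 1.0 if is_guild_member else 0.0] + class_one_hot


# Hypothetical characters: (play hours, guild member?, character class)
characters = [(12, True, "mage"), (300, False, "warrior"), (5, False, "mage")]

categories, encoded = one_hot([c[2] for c in characters])
feature_rows = [
    to_feature_row(hours, member, row)
    for (hours, member, _), row in zip(characters, encoded)
]
```

Each category gets its own 0/1 column, so a model never treats "warrior" as numerically larger than "mage".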
The generated features
- A plain text (.txt) file
- Format: character name + feature array
Running Machine Learning
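Machine learning is lightweight
- The final feature file is small, and the learning itself is fairly light
- Run on a local PC
- Learning that must look at all the data, as in recommender systems, would be heavy
- The challenge is choosing a model and deciding the optimal hyperparameters
- You have to run it many times with various settings
- Sometimes a distributed system is used for this...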
Which algorithm and model to choose?
- Start with something simple
- Refer to prior research on similar cases if any exists
- Evaluate and select models using AUC or ROC
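Starting with a Decision Tree
- Not complicated, and the decision process is easy to understand
- Used the one in the Python Scikit-Learn package
- It provides a solid range of machine learning algorithms
- The interface is uniform, so swapping models is easy
- Train by passing in features (X) and abuse labels (y)
- A DT needs no feature normalization, which is convenient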
DT usage example (iris classification)
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

>>> clf.predict(iris.data[:1, :])
array([0])
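Decision Tree training steps
1. Find the features of known abusers in the feature file
2. Take features for the same number of normal users
   - Under Sampling
3. Split the data into Train/Test sets
4. Start training with default parameters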
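Steps 1 to 3 can be sketched with the standard library alone. This is an illustrative sketch, not the code from the talk: the `(name, features)` records, the 25% test ratio, and the fixed random seeds are assumptions, and in practice scikit-learn's own splitting utilities would normally be used instead of the hand-written split below.

```python
import random


def under_sample(abusers, normals, seed=0):
    """Keep every abuser and draw an equal number of normal users at random."""
    rng = random.Random(seed)
    sampled = rng.sample(normals, len(abusers))
    X = [features for _, features in abusers + sampled]
    y = [1] * len(abusers) + [0] * len(sampled)  # 1 = abuser, 0 = normal
    return X, y


def split_train_test(X, y, test_ratio=0.25, seed=0):
    """Shuffle the indices and cut them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([X[i] for i in train_idx], [y[i] for i in train_idx])
    test = ([X[i] for i in test_idx], [y[i] for i in test_idx])
    return train, test


# Hypothetical feature records: (character name, feature array)
abusers = [("bot%d" % i, [50.0, 1.0]) for i in range(10)]
normals = [("user%d" % i, [3.0, 0.0]) for i in range(1000)]

X, y = under_sample(abusers, normals)
(X_train, y_train), (X_test, y_test) = split_train_test(X, y)
```

Under-sampling discards most normal users, so the balanced set is small; that trade-off is accepted here to keep the two classes equal in size.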
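Initial results
- Average accuracy of about 80%
- Binary class classification tends to score well
- It does not look bad, but
- it falls well short given that predictions are used as grounds for sanctions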
Raising the accuracy
- Split the data set for cross validation and
- find the optimal hyperparameters with GridSearchCV
- Average accuracy improved to 91%
- We wanted to see what criteria the model uses to decide
  - Drew the tree with tree.export_graphviz
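The mechanics behind a cross-validated grid search can be sketched without any library: try every hyperparameter combination, score each with k-fold cross validation, and keep the best mean score. The code below is a hedged illustration only. It is not scikit-learn's `GridSearchCV`, and it swaps in a tiny 1-D k-nearest-neighbour classifier as a stand-in model instead of a decision tree; all names and data are hypothetical.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def fit_knn(X, y, k):
    """Stand-in model: 'training' a k-NN classifier just stores the data."""
    return {"X": X, "y": y, "k": k}


def predict_knn(model, x):
    order = sorted(range(len(model["X"])),
                   key=lambda i: abs(model["X"][i][0] - x[0]))
    nearest = order[:model["k"]]
    votes = sum(model["y"][i] for i in nearest)
    return int(votes * 2 > len(nearest))  # majority vote


def accuracy(model, X, y):
    return sum(predict_knn(model, x) == t for x, t in zip(X, y)) / len(y)


def grid_search_cv(fit, score, X, y, param_grid, folds=5):
    """Return (best params, best mean CV score) over the full grid."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train, test in k_fold_indices(len(X), folds):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            scores.append(score(model, [X[i] for i in test], [y[i] for i in test]))
        mean = sum(scores) / len(scores)
        if mean > best_score:  # ties keep the earlier combination
            best_params, best_score = params, mean
    return best_params, best_score


# Toy data: one feature, label is 1 when the feature is 10 or more.
X = [[i] for i in range(20)]
y = [int(i >= 10) for i in range(20)]

best_params, best_score = grid_search_cv(fit_knn, accuracy, X, y, {"k": [1, 3, 5]})
```

Because every combination is trained once per fold, the cost grows with the product of the grid sizes, which is one reason to start with a coarse grid.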
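Looking at the decision tree...
- You can see what criteria the trained model uses → it can be shared with people in various job roles
- It gets more complicated the further down you go
- A DT overfits easily, so take care that the depth does not get too deep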
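The score stops improving here
- Added new features after consulting the GMs
  - Number of simultaneous item/money pickups
  - Number of map repetitions
  - Selecting only a particular character class
  - Number of items picked up without fighting
- The know-how lies in turning even seemingly intractable clues into features
  - e.g. "Bots have randomly generated names"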
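e.g. Judging the randomness of a character name (consonant/vowel occurrence pattern)
# Pseudocode that decides whether a character name is pronounceable
# Convert the name to consonant/vowel symbols (1 = consonant, 2 = vowel)
# e.g. anything -> '21211211'
symbols = get_cv_symbols(char_name)
# If any pattern like the following appears, the name is unpronounceable
# (in practice the list is more varied)
patterns = ('2121', '2112', '1121', '22122')  # … and more
can_pron = not any(p in symbols for p in patterns)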
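The slide leaves `get_cv_symbols` undefined. One way it could be written is sketched below, under two assumptions: names are ASCII letters, and 'y' counts as a vowel (this reproduces the slide's `anything -> '21211211'` example). The `looks_random` rule based on long consonant runs is a hypothetical stand-in for the pattern list, which the slide does not give in full.

```python
VOWELS = set("aeiouy")  # assumption: 'y' is treated as a vowel


def get_cv_symbols(name):
    """Map each ASCII letter to '1' (consonant) or '2' (vowel); skip the rest."""
    symbols = []
    for ch in name.lower():
        if not ("a" <= ch <= "z"):
            continue  # digits, spaces, and non-ASCII characters are ignored
        symbols.append("2" if ch in VOWELS else "1")
    return "".join(symbols)


def longest_consonant_run(symbols):
    """Length of the longest run of '1' (consonants) in the symbol string."""
    return max(len(run) for run in symbols.split("2"))


def looks_random(name, max_run=3):
    """Hypothetical rule: more than `max_run` consonants in a row looks random."""
    return longest_consonant_run(get_cv_symbols(name)) > max_run


symbols = get_cv_symbols("anything")
flags = {name: looks_random(name) for name in ["anything", "xkqzvrt", "Gandalf99"]}
```

A rule like this misfires on real names in some languages, which is acceptable only because the result is one feature among many rather than a verdict on its own.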
It is not an exact method, but...
it helps because the judgment is made in combination with other features
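The added features improved the score, but…
- Average accuracy improved to 96%. The score is fairly high, but
- when actually applied,
- quite a few false positives turned up during the GMs' verification
- Attributed to the Decision Tree's chronic overfitting problem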
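Switching to Random Forest
- An ensemble technique that combines many Decision Trees
- Trains a large number of DTs on spread-out samples (= a regularization effect) and lets them vote
- Stable results even if the score drops
- DecisionTree: an unstable 96%
- RandomForest: a stable 95%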
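Random Forest training
- Basically similar to a Decision Tree
- max_depth, min_samples_leaf: control model complexity. Start small and increase little by little
- n_estimators
  - Decides how many trees (DTs) to plant
  - Too many makes training slow; too few and it becomes just a DT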
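Results after applying RF
- Accuracy is 95%
- To make sure no account is sanctioned unfairly,
- also consult the prediction probability using predict_proba
- Include only predictions with high probability (>70%)
- This is estimated to lower recall by about 10~20%
- However, precision…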
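The effect of keeping only high-probability predictions can be illustrated with made-up numbers. In the sketch below the probabilities stand in for what a classifier's `predict_proba` would return for the abuser class; the character names, probabilities, and ground truth are all hypothetical.

```python
def filter_confident(names, probabilities, min_proba=0.7):
    """Keep only characters whose predicted abuse probability exceeds min_proba."""
    return [name for name, p in zip(names, probabilities) if p > min_proba]


def precision_recall(flagged, actual_abusers):
    """Return (precision, recall); None where a measure is undefined."""
    flagged, actual = set(flagged), set(actual_abusers)
    hits = len(flagged & actual)
    precision = hits / len(flagged) if flagged else None
    recall = hits / len(actual) if actual else None
    return precision, recall


# Hypothetical predicted probabilities of being an abuser, per character.
names = ["c1", "c2", "c3", "c4", "c5", "c6"]
probabilities = [0.95, 0.82, 0.65, 0.55, 0.30, 0.10]
actual_abusers = ["c1", "c2", "c3"]

loose = filter_confident(names, probabilities, min_proba=0.5)
strict = filter_confident(names, probabilities, min_proba=0.7)

loose_scores = precision_recall(loose, actual_abusers)
strict_scores = precision_recall(strict, actual_abusers)
```

In this toy data, raising the cut-off from 0.5 to 0.7 lifts precision from 0.75 to 1.0 while recall falls from 1.0 to about 0.67, the same direction of trade-off the slide describes.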
100% achieved
The result of manual review by the GMs…
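Caught them, so on to sanctions...
- Sanctions were carried out over a little more than two months
- Tool-based farming mostly disappeared!
- Sanctions have to be regular and sustained to be effective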
Reference: final feature importances
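Directions for improvement
- Improve the trained model using the detection results
- Collecting PII for bot accounts would be useful for learning new bots
- Variant bots need to be monitored after sanctions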
Afterword
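Takeaways
- Every step, from data collection through processing to analysis, was done in Python
- Exploratory data analysis with Jupyter notebooks
- Machine learning looks applicable to many more areas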
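Going deeper into machine learning
- Study the underlying theory further to use it in depth
  - You become able to form a good hypothesis
  - You become able to optimize
- Try more than one algorithm
  - Various classifiers such as SVM and Neural Net
  - Ensemble them in the Super Learner style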
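Time has passed... the new log collection/analysis environment
- RSync → real-time log collection with Fluentd/Kinesis
- gzipped CSV → stored in S3 in Parquet format
  - Columnar binary format, 30x speedup
- MRJob → PySpark
  - Powerful distributed processing / cache feature (a strength for iterative learning)
- Run as a transient Spark cluster (20 VMs = 80 cores, 320GB RAM), about 3,000 KRW per hour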
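Advice
- Judge whether machine learning suits what you are trying to do
  - If the abuse pattern is simple, traditional methods will do
- Understand the characteristics first through exploratory data analysis
- Test a variety of models and features
- Note that some models may require feature normalization/orthogonalization
- Watch out for imbalance between classes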
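Deep learning? Machine learning?
- Deep learning
  - No need for elaborate feature engineering
  - Many parameters = needs a lot of data
- Machine learning
  - Feature work is required, but
  - few parameters = effective even with little data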
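Thoughts
- Data engineering is hard
- Securing data matters most
- Fields in the spotlight may actually have dimmer prospects
- Unless you are a researcher, traditional industries and new data are the real blue ocean
- An era in which every company needs data analysts
- What if computers could test every combination of models and variables?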
Finally... Spurious Correlations
- Cases that appear related but in fact are not
- Do not rely on data alone; understand the domain!
Thank you.
Reference links
- http://www.aladin.co.kr/shop/wproduct.aspx?ItemId=28946323
- http://www.tylervigen.com/spurious-correlations
- http://scikit-learn.org/stable/modules/tree.html
- http://www.cimerr.net/conference/board/data/conference/1331626266/P15.pdf
- http://stackoverflow.com/questions/20463281/how-do-i-solve-overfitting-in-random-forest-of-python-sklearn
- http://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
- https://www.quora.com/Is-Scala-a-better-choice-than-Python-for-Apache-Spark
- http://statkclee.github.io/data-science/data-handling-pipeline.html
- https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html