# A Message for Everyone Exhausted by Feature Management in Data Analysis Competitions

## Takanobu Nozawa

November 05, 2019

## Transcript

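1. Supporting every step a mom takes (ママの一歩を支える)
In data analysis competitions:
a message to everyone worn out by feature management
~ With a training and inference pipeline on the side ~
Connehito Inc., Takanobu Nozawa (野澤哲照)

Connehito Marché vol.… ~ Machine Learning and Data Analysis Market ~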

2. Hello!

3. This is a bit sudden, but...

4. Who has heard of data analysis competitions (Kaggle, SIGNATE, etc.)?

5. Who has taken part in a data analysis competition?

6. How do you manage your features?
(for tabular data)

7. A common pattern (personal experience)

8. A common pattern (ipynb)
• Create features
col: [y, A, B, C, D] → [y, A, B, C, D, E, F]
# e.g.
train['A'] = train['A'].fillna(0)
train['B'] = np.log1p(train['B'])
train['E'] = train['A'] + train['B']
df_group = train.groupby('D')['E'].mean()
train['F'] = train['D'].map(df_group)
...

9. A common pattern (ipynb)
• Select only the feature columns to use
# e.g.
feat_col = ['A', 'C', 'D', 'E', 'F', 'J']
x_train = train[feat_col]
y_train = train['y']
• Train the model
# e.g.
clf.fit(x_train, y_train)
...

10. A common pattern (ipynb)
Wait...
what kind of feature was 'F' again?

13. A common pattern (ipynb)
Found it!
(near the top of the notebook)
With only a few features this is still bearable, but as they pile up, having to work out (or hunt down) how each feature was computed every single time is a real chore, and it eats up time.

14. Another common pattern (personal experience)

15. Another common pattern (ipynb)
All right! I got a really great score!
I'll duplicate this notebook and build an even better model!

16. Meanwhile, inside the notebook...

17. Another common pattern (ipynb)
Notebook contents
[ ] import numpy as np
    import pandas as pd
...
[ ] submission.to_csv('submission.csv', index=False)

20. Another common pattern (ipynb)
You have to run the same computations over and over
+
× ( ) → wasted effort

21. Another common pattern (ipynb)
Repeated duplication can also drag you into notebook hell...
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
notebook/LightGBM_score_….ipynb
...

23. What I'll talk about today

24. In data analysis competitions:
a message to everyone worn out by feature management
~ With a training and inference pipeline on the side ~

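25. Agenda
Self-introduction
Feature management
‣ Manage features as pickle files, one per column
‣ Generate a memo file at the same time as each feature
Training and inference pipeline
‣ Run everything from training to creating the submission file with a single command
‣ Save the features and model parameters used for training together with the log
‣ Visualize feature contributions with shap to find what to focus on in the next training run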

26. Agenda
I borrowed the wisdom of those who came before me, and it worked out really well (*).
That is the story I'll tell today.
(*) Strictly my personal opinion.

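27. Self-introduction

28. Self-introduction
Name: Takanobu Nozawa (野澤哲照)
Affiliation: Connehito Inc.
Handle: たかぱい (takapai), @takapy…
• …~ Joined Connehito as an ML engineer
• Mainly working on machine learning (NLP, recommender systems) while also studying infrastructure (AWS)
• I do Kaggle, write a blog (https://www.takapy.work), play baseball, and eat ramen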

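29. Feature management
* I drew on the following article.
・Kaggleで使えるFeather形式を利用した特徴量管理法 (A feature management method for Kaggle using the Feather format)
https://amalog.hateblo.jp/entry/kaggle-feature-management
‣ Manage features as pickle files, one per column
‣ Generate a memo file at the same time as each feature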

30. Feature management
First, let me share the overall picture.

31. Feature management
Manage features as pickle files, "column by column"

32. Feature management
Manage features as pickle files, "column by column"
Survived  Pclass  Sex     Age  Embarked
…         …       male    …    S
…         …       male    …    C
…         …       female  …    C
…         …       male    …    C
…         …       female  …    C
…         …       female  …    S
…         …       male    …    S

33. Feature management
Manage features as pickle files, "column by column"
Survived  Pclass  Sex  Age  Embarked
survived_train.pkl
pclass_train.pkl
pclass_test.pkl
sex_train.pkl
sex_test.pkl
age_train.pkl
age_test.pkl
embarked_train.pkl
embarked_test.pkl
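
The column-to-file mapping on this slide could be produced along the following lines. This is only a minimal sketch: the helper `save_columns`, the toy data, and the temporary output directory are assumptions made for illustration, and only the `<column>_<split>.pkl` naming comes from the slide.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_columns(df, split, out_dir):
    """Write each column of df to <out_dir>/<lowercased column>_<split>.pkl."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for col in df.columns:
        path = out_dir / f"{col.lower()}_{split}.pkl"
        # A single-column DataFrame keeps the column name inside the pickle.
        df[[col]].to_pickle(path)
        paths.append(path)
    return paths


# Toy stand-in for two of the Titanic columns (values are made up).
train = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]})

with tempfile.TemporaryDirectory() as tmp:
    written = save_columns(train, "train", tmp)
    names = sorted(p.name for p in written)
    restored = pd.read_pickle(Path(tmp) / "sex_train.pkl")
```

Because every file holds exactly one column, any subset of features can later be reassembled with `pd.concat(..., axis=1)` without recomputing anything.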

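34. Feature management
Create a feature memo at the same time as each feature is generated

36. Feature management
Create a feature memo at the same time as each feature is generated
That sounds like quite a lot of work...

37. Feature management
Create a feature memo at the same time as each feature is generated
But all you do is run a single Python script.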

38. Feature management
Just run hoge.py from the command line
class Pclass(Feature):
    def create_features(self):
        self.train['Pclass'] = train['Pclass']
        self.test['Pclass'] = test['Pclass']
        create_memo('Pclass', 'Ticket class: one of 1st, 2nd, or 3rd')
class Sex(Feature):
    def create_features(self):
        self.train['Sex'] = train['Sex']
        self.test['Sex'] = test['Sex']
        create_memo('Sex', 'Sex')
class Age(Feature):
    def create_features(self):
        self.train['Age'] = train['Age']
        self.test['Age'] = test['Age']
        create_memo('Age', 'Age')
class Age_mis_val_median(Feature):
    def create_features(self):
        self.train['Age_mis_val_median'] = train['Age'].fillna(train['Age'].median())
        self.test['Age_mis_val_median'] = test['Age'].fillna(test['Age'].median())
        create_memo('Age_mis_val_median', 'Age with missing values filled with the median')
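
The slides never show the `Feature` base class itself. The sketch below is one way it might look, loosely modelled on the approach in the article cited earlier. Everything in it is an assumption: the method names `run` and `save`, the file-name rule, and the toy data. The rule that turns a class name into a file name is chosen so that `Family_Size` becomes `family__size`, matching the feature names used later in the talk.

```python
import re
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

import pandas as pd


class Feature(ABC):
    """Hypothetical base class: one subclass = one feature = one pickle per split."""

    dir = "."  # output directory (the slides refer to Feature.dir)

    def __init__(self):
        # "Family_Size" -> "family__size", "Pclass" -> "pclass"
        snake = re.sub(r"([A-Z])", lambda m: "_" + m.group(1).lower(), type(self).__name__)
        self.name = snake.lstrip("_")
        self.train = pd.DataFrame()
        self.test = pd.DataFrame()
        self.train_path = Path(self.dir) / f"{self.name}_train.pkl"
        self.test_path = Path(self.dir) / f"{self.name}_test.pkl"

    @abstractmethod
    def create_features(self):
        """Fill self.train and self.test with the new column(s)."""

    def run(self):
        self.create_features()
        return self

    def save(self):
        self.train.to_pickle(self.train_path)
        self.test.to_pickle(self.test_path)


# Toy data standing in for the competition's train/test frames.
train = pd.DataFrame({"Parch": [0, 1], "SibSp": [1, 2]})
test = pd.DataFrame({"Parch": [2], "SibSp": [0]})


class Family_Size(Feature):
    def create_features(self):
        self.train["Family_Size"] = train["Parch"] + train["SibSp"]
        self.test["Family_Size"] = test["Parch"] + test["SibSp"]


with tempfile.TemporaryDirectory() as tmp:
    Feature.dir = tmp
    feature = Family_Size().run()
    feature.save()
    saved_name = feature.train_path.name
    reloaded = pd.read_pickle(feature.train_path)
```

Keeping the naming rule inside the base class means a subclass only has to describe the computation. Where the file ends up follows from the class name.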

40. Feature management
Just run hoge.py from the command line
Each feature

41. Feature management
Just run hoge.py from the command line
Feature memo file

43. Feature management
Overview of what create_memo does
class Pclass(Feature):
    def create_features(self):
        self.train['Pclass'] = train['Pclass']
        self.test['Pclass'] = test['Pclass']
        create_memo('Pclass', 'Ticket class: one of 1st, 2nd, or 3rd')

import csv
import os

# Create the feature memo CSV file
def create_memo(col_name, desc):
    file_path = Feature.dir + '/_features_memo.csv'
    # Create an empty file on first use
    if not os.path.isfile(file_path):
        with open(file_path, "w"): pass
    with open(file_path, 'r+') as f:
        lines = f.readlines()
        lines = [line.strip() for line in lines]
        # Check whether the feature being written is already recorded
        col = [line for line in lines if line.split(',')[0] == col_name]
        if len(col) != 0: return
        writer = csv.writer(f)
        writer.writerow([col_name, desc])
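
Once the memo file exists, answering "what was this feature again?" is a dictionary lookup. The following is a small sketch of reading the memo back. The helper `load_memo` and the sample rows are assumptions for illustration. Only the file name `_features_memo.csv` and the two-column layout come from the slide.

```python
import csv
import tempfile
from pathlib import Path


def load_memo(file_path):
    """Return {feature name: description} from the two-column memo CSV."""
    with open(file_path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}


with tempfile.TemporaryDirectory() as tmp:
    memo_path = Path(tmp) / "_features_memo.csv"
    # Write two sample rows the same way create_memo does (csv.writer).
    with open(memo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Pclass", "Ticket class: 1st, 2nd, or 3rd"])
        writer.writerow(["Family_Size", "Total family members"])
    memo = load_memo(memo_path)

description = memo.get("Family_Size", "(no memo recorded)")
```

Reading with `csv.reader` rather than splitting on commas keeps descriptions that themselves contain commas intact.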

45. Feature management
Overview of what create_memo does
Saving the memo in CSV format makes it easy to view on GitHub
(and of course it also displays cleanly in applications such as Excel or Numbers)

46. Feature management
When creating a new feature

47. Feature management
When creating a new feature
Write the new feature-generation code in hoge.py
class Family_Size(Feature):
    def create_features(self):
        self.train['Family_Size'] = train['Parch'] + train['SibSp']
        self.test['Family_Size'] = test['Parch'] + test['SibSp']
        create_memo('Family_Size', 'Total number of family members')

48. Feature management
When creating a new feature
Write the new feature-generation code in hoge.py
python hoge.py

49. Feature management
When creating a new feature
Write the new feature-generation code in hoge.py
python hoge.py
Only the new feature is generated
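
The slides do not show how "only the new feature is generated" is decided. One plausible mechanism, sketched here as an assumption, is to skip any feature whose pickle files already exist. The helper name `pending_features` is hypothetical.

```python
import tempfile
from pathlib import Path


def pending_features(names, feature_dir):
    """Return the feature names that still lack a train or a test pickle."""
    feature_dir = Path(feature_dir)
    return [
        name
        for name in names
        if not (
            (feature_dir / f"{name}_train.pkl").exists()
            and (feature_dir / f"{name}_test.pkl").exists()
        )
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend "pclass" was already generated on an earlier run.
    for split in ("train", "test"):
        (Path(tmp) / f"pclass_{split}.pkl").touch()
    todo = pending_features(["pclass", "family__size"], tmp)
```

With a check like this, rerunning the script after adding one class costs only the time needed for that one feature.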

50. Feature management
To read the data, just specify the features and load them
# Specify the features
features = [
    "age_mis_val_median",
    "family__size",
    "cabin",
    "fare_mis_val_median"
]
df = [pd.read_pickle(FEATURE_DIR_NAME + f'{f}_train.pkl') for f in features]
df = pd.concat(df, axis=1)
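
The same loading step can be wrapped in a function so that train and test go through one code path and a mistyped feature name fails early. This is a sketch under assumptions: the function `load_features`, its arguments, and the toy pickles are illustrative and not part of the original pipeline.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_features(features, feature_dir, split):
    """Concatenate the per-feature pickles for one split ("train" or "test") column-wise."""
    feature_dir = Path(feature_dir)
    missing = [f for f in features if not (feature_dir / f"{f}_{split}.pkl").exists()]
    if missing:
        raise FileNotFoundError(f"no {split} pickle for: {missing}")
    frames = [pd.read_pickle(feature_dir / f"{f}_{split}.pkl") for f in features]
    return pd.concat(frames, axis=1)


with tempfile.TemporaryDirectory() as tmp:
    # Two toy feature files (values are made up).
    pd.DataFrame({"Age_mis_val_median": [22.0, 38.0]}).to_pickle(
        Path(tmp) / "age_mis_val_median_train.pkl"
    )
    pd.DataFrame({"Family_Size": [1, 1]}).to_pickle(Path(tmp) / "family__size_train.pkl")

    x_train = load_features(["age_mis_val_median", "family__size"], tmp, "train")

    # A feature that was never generated is reported by name.
    try:
        load_features(["cabin"], tmp, "train")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```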

51. Feature management
What was good about this?

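52. Feature management
• Consolidating feature generation into a single script file avoids running the same computation multiple times, so time is used more effectively.
  → It also ensures the features are reproducible.
• Generating the feature memo at the same time meant I no longer had to rack my brain over "what was this feature again?"
• Managing features column by column makes them easier to handle.
  → With pickle files, both saving and loading are fast!
  → If the number of features becomes huge, it may be better to manage them in groups of some reasonable size.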

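53. Training and inference pipeline
* I drew on the pipeline presented in the following book.
・Kaggleで勝つデータ分析の技術 (Data Analysis Techniques to Win Kaggle)
https://gihyo.jp/book/…
‣ Run everything from training to creating the submission file with a single command
‣ Save the features and model parameters used for training together with the log
‣ Visualize feature contributions with shap to find what to focus on in the next training run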

54. Training and inference pipeline
Running run.py performs training and inference and creates the submission file
# Runner, ModelLGB, Submission, LOG_DIR_NAME, n_fold and name_prefix
# are defined elsewhere in the pipeline (not shown on the slide)
# Specify the features
features = [
    "age_mis_val_median",
    "family__size",
    "cabin",
    "fare_mis_val_median"
]
run_name = 'lgb_1102'
# Save the list of features used
with open(LOG_DIR_NAME + run_name + "_features.txt", 'wt') as f:
    for ele in features:
        f.write(ele + '\n')
params_lgb = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'early_stopping_rounds': 20,
    'verbose': 10,
    'random_state': 99,
    'num_round': 100
}
# Save the parameters used
with open(LOG_DIR_NAME + run_name + "_param.txt", 'wt') as f:
    for key, value in sorted(params_lgb.items()):
        f.write(f'{key}:{value}\n')
runner = Runner(run_name, ModelLGB, features, params_lgb, n_fold, name_prefix)
runner.run_train_cv()  # training
runner.run_predict_cv()  # inference
Submission.create_submission(run_name)  # create the submission file
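
The script above records features and parameters as plain text files. A variation, shown here purely as an illustration and not as what the talk uses, is to write both into one JSON file per run so that a past run can be reloaded and repeated exactly. The helper names and the `_config.json` suffix are assumptions.

```python
import json
import tempfile
from pathlib import Path


def save_run_config(log_dir, run_name, features, params):
    """Write the features and parameters of one run to <log_dir>/<run_name>_config.json."""
    path = Path(log_dir) / f"{run_name}_config.json"
    payload = {"run_name": run_name, "features": list(features), "params": dict(params)}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_run_config(path):
    """Read a run configuration back as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


features = ["age_mis_val_median", "family__size"]
params_lgb = {"objective": "binary", "random_state": 99, "num_round": 100}

with tempfile.TemporaryDirectory() as tmp:
    config_path = save_run_config(tmp, "lgb_1102", features, params_lgb)
    config_name = config_path.name
    restored = load_run_config(config_path)
```

The reloaded `features` list and `params` dict can be handed straight back to whatever runs the training, which is harder to do with free-form text logs.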

55. Training and inference pipeline
Running run.py performs training and inference and creates the submission file
Files and models are saved with this run_name as their prefix.

• The list of features used
• The hyperparameters used
• The model for each fold
• The inference results
• The submission file
• Image files of the shap results, etc.

56. Training and inference pipeline
Running run.py performs training and inference and creates the submission file
Examples of the generated files

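57. Training and inference pipeline
Examples of the generated files (split into folders as appropriate)
• lgb_…_fold….model (model built on fold …)
• lgb_…_fold….model (model built on fold …)
• lgb_…_fold….model (model built on fold …)
• lgb_…_pred.pkl (inference results on the test data)
• lgb_…_…_submission.csv (inference results converted to a CSV that can be submitted to Kaggle)
• lgb_…_…_features.txt (list of features used in this training run)
• lgb_…_…_param.txt (hyperparameters used in this training run)
• lgb_…_…_shap.png (visualization image computed with shap)
• general.log (computation log file)
• result.log (log file containing only the model scores)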
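
The split between a detailed `general.log` and a scores-only `result.log` can be reproduced with the standard `logging` module. The sketch below assumes one logger per file. The helper names, message formats, and fold scores are made up for illustration.

```python
import logging
import tempfile
from pathlib import Path


def build_logger(name, log_path):
    """Create a logger that writes timestamped lines to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the two files independent of the root logger
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def close_logger(logger):
    """Flush and detach all handlers so the files can be read or removed."""
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)


with tempfile.TemporaryDirectory() as tmp:
    general = build_logger("general", Path(tmp) / "general.log")
    result = build_logger("result", Path(tmp) / "result.log")

    general.info("lgb_1102 - start training cv")
    fold_scores = [0.81, 0.79, 0.83]  # made-up scores
    for i, score in enumerate(fold_scores):
        general.info("lgb_1102 - fold %d score: %.4f", i, score)
    mean_score = sum(fold_scores) / len(fold_scores)
    result.info("lgb_1102 - mean cv score: %.4f", mean_score)

    close_logger(general)
    close_logger(result)
    general_lines = (Path(tmp) / "general.log").read_text(encoding="utf-8").splitlines()
    result_lines = (Path(tmp) / "result.log").read_text(encoding="utf-8").splitlines()
```

Keeping scores in their own file makes it possible to compare runs at a glance without scrolling through per-fold details.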

62. Training and inference pipeline
What was good about this?

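63. Training and inference pipeline
• For a model trained with "these features" and "these parameters", the time each task took and the score of each fold plus the final score are now tracked without my having to think about it.
• Outputting the shap results and feature importance makes it easier to see what to focus on in the next training run.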
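
Recording "the time each task took" without thinking about it can be as simple as a context manager wrapped around each pipeline step. This is a hypothetical sketch: the `timer` helper and the stand-in workloads are assumptions, and the talk does not show how its pipeline measures time.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(task, durations):
    """Store the wall-clock seconds spent inside the with-block under durations[task]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[task] = time.perf_counter() - start


durations = {}

with timer("train_cv", durations):
    total = sum(i * i for i in range(100_000))  # stand-in for training

with timer("predict_cv", durations):
    time.sleep(0.01)  # stand-in for inference

# One line per task, ready to be written to a log file.
report = [f"{task}: {seconds:.3f}s" for task, seconds in durations.items()]
```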

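64. Summary

65. Summary
• Feature management is great!
Consolidating feature generation into a single script file avoids running the same computation multiple times!
Generating feature memos at the same time means fewer moments of wondering "what was this feature again?"
Managing features column by column made them easier to handle! (Though if the number of features is huge, it may be better to manage them in reasonably sized groups.)
• Pipelines are great!
Building a pipeline makes for a fast PDCA cycle!
Managing the features and parameters used for training ensures reproducibility, and psychological safety too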
66. Thank you for your attention.