データ分析コンペにおいて 特徴量管理に疲弊している全人類に伝えたい想い

C5c09bfd9ee31f5aef8ce257643d50ea?s=47 Takanobu Nozawa
November 05, 2019
7.5k

データ分析コンペにおいて 特徴量管理に疲弊している全人類に伝えたい想い

C5c09bfd9ee31f5aef8ce257643d50ea?s=128

Takanobu Nozawa

November 05, 2019
Tweet

Transcript

  1. ϚϚͷҰาΛࢧ͑Δ σʔλ෼ੳίϯϖʹ͓͍ͯ ಛ௃ྔ؅ཧʹർฐ͍ͯ͠Δશਓྨʹ఻͍͑ͨ૝͍ ʙֶशɾਪ࿦ύΠϓϥΠϯΛఴ͑ͯʙ $POOFIJUP*OD໺ᖒ఩র  $POOFIJUP.BSDIÉWPMʙػցֶशɾσʔλ෼ੳࢢʙ

  2. ͜Μʹͪ͸ʂ ϚϚͷҰาΛࢧ͑Δ

  3. ͍͖ͳΓͰ͕͢ ϚϚͷҰาΛࢧ͑Δ

  4. σʔλ෼ੳίϯϖʢ,BHHMF 4*(/"5&ͳͲʣ ฉ͍ͨ͜ͱ͋Δਓʙ! ϚϚͷҰาΛࢧ͑Δ

  5. σʔλ෼ੳίϯϖʹࢀՃͨ͜͠ͱ͋Δਓʙ! ϚϚͷҰาΛࢧ͑Δ

  6. ಛ௃ྔͷ؅ཧͬͯͲ͏ͯ͠·͔͢ʁ ʢςʔϒϧσʔλʹ͓͍ͯʣ ϚϚͷҰาΛࢧ͑Δ

  7. Α͋͘Δύλʔϯʢ࣮ମݧʣ ϚϚͷҰาΛࢧ͑Δ

  8. Α͋͘ΔύλʔϯʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ w ಛ௃ྔ࡞Δ DPM<Z " # $ %>ˠ<Z "

    # $ % & '> # e.g) train['A'] = train['A'].fillna(0) train['B'] = np.log1p(train['B']) train['E'] = train['A'] + train['B'] df_group = train.groupby('D')['E'].mean() train['F'] = train['D'].map(df_group) <> ɾ ɾ ɾ
  9. Α͋͘ΔύλʔϯʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ w ࢖͏ಛ௃ྔͷΧϥϜ͚ͩࢦఆ͢Δ # e.g) feat_col = ['A', 'C',

    'D', 'E', 'F', 'J'] x_train = train[feat_col] y_train = train['y'] # e.g) clf.fit(x_train, y_train) w ֶशͤ͞Δ <> <> ɾ ɾ ɾ
  10. Α͋͘ΔύλʔϯʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ w ࢖͏ಛ௃ྔͷΧϥϜ͚ͩࢦఆ͢Δ # e.g) feat_col = ['A', 'C',

    'D', 'E', 'F', 'J'] x_train = train[feat_col] y_train = train['y'] # e.g) clf.fit(x_train, y_train) w ֶशͤ͞Δ <> <> ɾ ɾ ɾ ͋Ε b'`ͬͯͲΜͳಛ௃ྔ͚ͩͬʁ
  11. Α͋͘ΔύλʔϯʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ # e.g) train['A'] = train['A'].fillna(0) train['B'] = np.log1p(train['B'])

    train['E'] = train['A'] + train['B'] df_group = train.groupby('D')['E'].mean() train['F'] = train['D'].map(df_group) <> # e.g) feat_col = ['A', 'C', 'D', 'E', 'F', 'J'] x_train = train[feat_col] y_train = train['y'] # e.g) clf.fit(x_train, y_train) <> <>
  12. Α͋͘ΔύλʔϯʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ # e.g) train['A'] = train['A'].fillna(0) train['B'] = np.log1p(train['B'])

    train['E'] = train['A'] + train['B'] df_group = train.groupby('D')['E'].mean() train['F'] = train['D'].map(df_group) <> # e.g) feat_col = ['A', 'C', 'D', 'E', 'F', 'J'] x_train = train[feat_col] y_train = train['y'] # e.g) clf.fit(x_train, y_train) <> <> ݟ͚ͭͨʂ ʢOPUFCPPLͷ্ͷํʣ
  13. Α͋͘ΔύλʔϯʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ # e.g) train['A'] = train['A'].fillna(0) train['B'] = np.log1p(train['B'])

    train['E'] = train['A'] + train['B'] df_group = train.groupby('D')['E'].mean() train['F'] = train['D'].map(df_group) <> # e.g) feat_col = ['A', 'C', 'D', 'E', 'F', 'J'] x_train = train[feat_col] y_train = train['y'] # e.g) clf.fit(x_train, y_train) <> <> ݟ͚ͭͨʂ ʢOPUFCPPLͷ্ͷํʣ ಛ௃ྔ͕গͳ͍৔߹͸·ͩϚγ͕ͩɺ ଟ͘ͳͬͯ͘ΔͱͲΜͳܭࢉͰٻΊͨ ಛ௃ྔ͔ͩͬͨΛ͍͍ͪͪߟ͑Δʢ୳͢ʣ ͷ͸݁ߏେมͩ͠ɺ͕͔͔࣌ؒΔ
  14. Α͋͘Δύλʔϯͦͷʢ࣮ମݧʣ ϚϚͷҰาΛࢧ͑Δ

  15. Α͋͘ΔύλʔϯͦͷʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ Αͬ͠Ό͊ʂΊͬͪΌྑ͍είΞͰͨͥʙʙʙ ͜ͷOPUFCPPLΛ%VQMJDBUFͯ͠ɺ΋ͬͱྑ͍Ϟσϧ࡞ͬͪΌ͏ͧʂ

  16. ҰํɺOPUFCPPLͷத਎͸ʜ ϚϚͷҰาΛࢧ͑Δ

  17. Α͋͘ΔύλʔϯͦͷʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ <> import numpy as np import pandas as

    pd OPUFCPPLͷத਎ ɾ ɾ ɾ <> submission.to_csv('submission.csv', index=False)
  18. Α͋͘ΔύλʔϯͦͷʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ <> import numpy as np import pandas as

    pd OPUFCPPLͷத਎ ɾ ɾ ɾ ɾ ɾ ɾ <> submission.to_csv('submission.csv', index=False)
  19. Α͋͘ΔύλʔϯͦͷʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ <> import numpy as np import pandas as

    pd OPUFCPPLͷத਎ ɾ ɾ ɾ ɾ ɾ ɾ <> submission.to_csv('submission.csv', index=False)
  20. Α͋͘ΔύλʔϯͦͷʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ <> import numpy as np import pandas as

    pd OPUFCPPLͷத਎ ɾ ɾ ɾ ɾ ɾ ɾ <> submission.to_csv('submission.csv', index=False) ಉ͡ܭࢉΛԿ౓΋΍Βͳ͍ͱ͍͚ͳ͍ ʴ ºʢ ʣˠແବ
  21. Α͋͘ΔύλʔϯͦͷʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ ౓ॏͳΔ%VQMJDBUFʹΑΓɺOPUFCPPL஍ࠈʹؕΔՄೳੑ΋ʜ dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC

    dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC ʜʜʜʜʜ
  22. Α͋͘ΔύλʔϯͦͷʢJQZOCʣ ϚϚͷҰาΛࢧ͑Δ ౓ॏͳΔ%VQMJDBUFʹΑΓɺOPUFCPPL஍ࠈʹؕΔՄೳੑ΋ʜ dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC

    dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC dOPUFCPPL-JHIU#(.@TDPSF@JQZOC ʜʜʜʜʜ
  23. ϚϚͷҰาΛࢧ͑Δ ࠓ೔࿩͢͜ͱ

  24. ϚϚͷҰาΛࢧ͑Δ σʔλ෼ੳίϯϖʹ͓͍ͯ ಛ௃ྔ؅ཧʹർฐ͍ͯ͠Δશਓྨʹ఻͍͑ͨ૝͍ ʙֶशɾਪ࿦ύΠϓϥΠϯΛఴ͑ͯʙ

  25. ΞδΣϯμ  ࣗݾ঺հ  ಛ௃ྔ؅ཧʹ͍ͭͯ ‣ ྻ͝ͱʹQJDLMFϑΝΠϧͰಛ௃ྔΛ؅ཧ ‣ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹϝϞϑΝΠϧ΋ੜ੒ 

    ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ‣ ίϚϯυҰൃͰֶशˠ4VCNJUϑΝΠϧ࡞੒·ͰΛ࣮ߦ ‣ ֶशʹ࢖༻ͨ͠ಛ௃ྔ΍Ϟσϧύϥϝʔλ͸MPHͱҰॹʹอଘ ‣ TIBQΛ༻͍ͯಛ௃ྔͷߩݙ౓ΛՄࢹԽ͠ɺ࣍ճֶश࣌ͷצॴΛݟ͚ͭΔ ϚϚͷҰาΛࢧ͑Δ
  26. ΞδΣϯμ  ࣗݾ঺հ  ಛ௃ྔ؅ཧʹ͍ͭͯ ‣ ྻ͝ͱʹQJDLMFϑΝΠϧͰಛ௃ྔΛ؅ཧ ‣ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹϝϞϑΝΠϧ΋ੜ੒ 

    ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ‣ ίϚϯυҰൃͰֶशˠ4VCNJUϑΝΠϧ࡞੒·ͰΛ࣮ߦ ‣ ֶशʹ࢖༻ͨ͠ಛ௃ྔ΍Ϟσϧύϥϝʔλ͸MPHͱҰॹʹอଘ ‣ TIBQΛ༻͍ͯಛ௃ྔͷߩݙ౓ΛՄࢹԽ͠ɺ࣍ճֶश࣌ͷצॴΛݟ͚ͭΔ ϚϚͷҰาΛࢧ͑Δ ݰਓͷ஌ܙΛ͓आΓͨ͠Β ΊͬͪΌΑ͔ͬͨ ʢ˞ʣ ͍ͬͯ͏࿩Λ͠·͢ ʢ˞ʣ͋͘·Ͱओ؍Ͱ͢
  27. ࣗݾ঺հ ϚϚͷҰาΛࢧ͑Δ

  28. ࣗݾ঺հ ϚϚͷҰาΛࢧ͑Δ ໊લɿ໺ᖒ఩রʢ/P[BXB5BLBOPCVʣ ॴଐɿίωώτגࣜձࣾ ɹɹɿ͔ͨͺ͍!UBLBQZ w ʙίωώτʹ.-ΤϯδχΞͱͯ͠+0*/ w ػցֶशʢ/-1ɺਪનγεςϜʣΛϝΠϯʹ΍ΓͭͭΠϯϑϥʢ"84ʣ΋ษڧத w

    ,BHHMFͨ͠ΓɺϒϩάʢIUUQTXXXUBLBQZXPSLʣॻ͍ͨΓɺ໺ٿͨ͠Γɺ ϥʔϝϯ৯΂ͨΓ͍ͯ͠·͢
  29. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ˞ԼههࣄΛࢀߟʹ͍͖ͤͯͨͩ͞·ͨ͠ɻ ɾ,BHHMFͰ࢖͑Δ'FBUIFSܗࣜΛར༻ͨ͠ಛ௃ྔ؅ཧ๏ IUUQTBNBMPHIBUFCMPKQFOUSZLBHHMFGFBUVSFNBOBHFNFOU ‣ ྻ͝ͱʹQJDLMFϑΝΠϧͰಛ௃ྔΛ؅ཧ ‣ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹϝϞϑΝΠϧ΋ੜ੒

  30. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ˞ԼههࣄΛࢀߟʹ͍͖ͤͯͨͩ͞·ͨ͠ɻ ɾ,BHHMFͰ࢖͑Δ'FBUIFSܗࣜΛར༻ͨ͠ಛ௃ྔ؅ཧ๏ IUUQTBNBMPHIBUFCMPKQFOUSZLBHHMFGFBUVSFNBOBHFNFOU ‣ ྻ͝ͱʹQJDLMFϑΝΠϧͰಛ௃ྔΛ؅ཧ ‣ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹϝϞϑΝΠϧ΋ੜ੒ ࠷ॳʹΠϝʔδΛڞ༗͠·͢

  31. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ lྻ͝ͱzʹಛ௃ྔΛQJDLMFϑΝΠϧͰ؅ཧ͢Δ

  32. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ 4VSWJWFE 1DMBTT 4FY "HF &NCBSLFE   NBMF

     4   NBMF  $   GFNBMF  $   NBMF  $   GFNBMF  $   GFNBMF  4   NBMF  4 lྻ͝ͱzʹಛ௃ྔΛQJDLMFϑΝΠϧͰ؅ཧ͢Δ
  33. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ 4VSWJWFE 1DMBTT 4FY "HF &NCBSLFE   NBMF

     4   NBMF  $   GFNBMF  $   NBMF  $   GFNBMF  $   GFNBMF  4   NBMF  4 TVSWJWFE@USBJOQLM QDMBTT@USBJOQLM QDMBTT@UFTUQLM TFY@USBJOQLM TFY@UFTUQLM BHF@USBJOQLM BHF@UFTUQLM FNCBSLFE@USBJOQLM FNCBSLFE@UFTUQLM lྻ͝ͱzʹಛ௃ྔΛQJDLMFϑΝΠϧͰ؅ཧ͢Δ
  34. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹಛ௃ྔϝϞΛ࡞੒͢Δ

  35. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹಛ௃ྔϝϞΛ࡞੒͢Δ

  36. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹಛ௃ྔϝϞΛ࡞੒͢Δ ݁ߏେมͦ͏ɾɾɾ

  37. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ಛ௃ྔੜ੒࣌ɺಉ࣌ʹಛ௃ྔϝϞΛ࡞੒͢Δ Ͱ΋ɺQZUIPOεΫϦϓτΛ࣮ͭߦ͢Δ͚ͩɻ

  38. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ IPHFQZΛίϚϯυϥΠϯ͔Β࣮ߦ͢Δ͚ͩ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass']

    self.test['Pclass'] = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') class Sex(Feature): def create_features(self): self.train['Sex'] = train['Sex'] self.test['Sex'] = test['Sex'] create_memo('Sex','ੑผ') class Age(Feature): def create_features(self): self.train['Age'] = train['Age'] self.test['Age'] = test['Age'] create_memo('Age','೥ྸ') class Age_mis_val_median(Feature): def create_features(self): self.train['Age_mis_val_median'] = train['Age'].fillna(train['Age'].median()) self.test['Age_mis_val_median'] = test['Age'].fillna(test['Age'].median()) create_memo('Age_mis_val_median','೥ྸͷܽଛ஋Λதԝ஋Ͱิ׬ͨ͠΋ͷ')
  39. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass'] self.test['Pclass']

    = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') class Sex(Feature): def create_features(self): self.train['Sex'] = train['Sex'] self.test['Sex'] = test['Sex'] create_memo('Sex','ੑผ') class Age(Feature): def create_features(self): self.train['Age'] = train['Age'] self.test['Age'] = test['Age'] create_memo('Age','೥ྸ') class Age_mis_val_median(Feature): def create_features(self): self.train['Age_mis_val_median'] = train['Age'].fillna(train['Age'].median()) self.test['Age_mis_val_median'] = test['Age'].fillna(test['Age'].median()) create_memo('Age_mis_val_median','೥ྸͷܽଛ஋Λதԝ஋Ͱิ׬ͨ͠΋ͷ') IPHFQZΛίϚϯυϥΠϯ͔Β࣮ߦ͢Δ͚ͩ
  40. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass'] self.test['Pclass']

    = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') class Sex(Feature): def create_features(self): self.train['Sex'] = train['Sex'] self.test['Sex'] = test['Sex'] create_memo('Sex','ੑผ') class Age(Feature): def create_features(self): self.train['Age'] = train['Age'] self.test['Age'] = test['Age'] create_memo('Age','೥ྸ') class Age_mis_val_median(Feature): def create_features(self): self.train['Age_mis_val_median'] = train['Age'].fillna(train['Age'].median()) self.test['Age_mis_val_median'] = test['Age'].fillna(test['Age'].median()) create_memo('Age_mis_val_median','೥ྸͷܽଛ஋Λதԝ஋Ͱิ׬ͨ͠΋ͷ') ֤ಛ௃ྔ IPHFQZΛίϚϯυϥΠϯ͔Β࣮ߦ͢Δ͚ͩ
  41. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass'] self.test['Pclass']

    = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') class Sex(Feature): def create_features(self): self.train['Sex'] = train['Sex'] self.test['Sex'] = test['Sex'] create_memo('Sex','ੑผ') class Age(Feature): def create_features(self): self.train['Age'] = train['Age'] self.test['Age'] = test['Age'] create_memo('Age','೥ྸ') class Age_mis_val_median(Feature): def create_features(self): self.train['Age_mis_val_median'] = train['Age'].fillna(train['Age'].median()) self.test['Age_mis_val_median'] = test['Age'].fillna(test['Age'].median()) create_memo('Age_mis_val_median','೥ྸͷܽଛ஋Λதԝ஋Ͱิ׬ͨ͠΋ͷ') ಛ௃ྔϝϞϑΝΠϧ IPHFQZΛίϚϯυϥΠϯ͔Β࣮ߦ͢Δ͚ͩ
  42. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass'] self.test['Pclass']

    = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') DSFBUF@NFNPͷॲཧ֓ཁ
  43. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass'] self.test['Pclass']

    = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') DSFBUF@NFNPͷॲཧ֓ཁ # ಛ௃ྔϝϞcsvϑΝΠϧ࡞੒ def create_memo(col_name, desc): file_path = Feature.dir + '/_features_memo.csv' if not os.path.isfile(file_path): with open(file_path,"w"):pass with open(file_path, 'r+') as f: lines = f.readlines() lines = [line.strip() for line in lines] # ॻ͖ࠐ΋͏ͱ͍ͯ͠Δಛ௃ྔ͕͢Ͱʹॻ͖ࠐ·Ε͍ͯͳ͍͔νΣοΫ col = [line for line in lines if line.split(',')[0] == col_name] if len(col) != 0:return writer = csv.writer(f) writer.writerow([col_name, desc])
  44. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass'] self.test['Pclass']

    = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') DSFBUF@NFNPͷॲཧ֓ཁ # ಛ௃ྔϝϞcsvϑΝΠϧ࡞੒ def create_memo(col_name, desc): file_path = Feature.dir + '/_features_memo.csv' if not os.path.isfile(file_path): with open(file_path,"w"):pass with open(file_path, 'r+') as f: lines = f.readlines() lines = [line.strip() for line in lines] # ॻ͖ࠐ΋͏ͱ͍ͯ͠Δಛ௃ྔ͕͢Ͱʹॻ͖ࠐ·Ε͍ͯͳ͍͔νΣοΫ col = [line for line in lines if line.split(',')[0] == col_name] if len(col) != 0:return writer = csv.writer(f) writer.writerow([col_name, desc])
  45. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Pclass(Feature): def create_features(self): self.train['Pclass'] = train['Pclass'] self.test['Pclass']

    = test['Pclass'] create_memo('Pclass','νέοτͷΫϥεɻ1st, 2nd, 3rdͷ3छྨ') DSFBUF@NFNPͷॲཧ֓ཁ # ಛ௃ྔϝϞcsvϑΝΠϧ࡞੒ def create_memo(col_name, desc): file_path = Feature.dir + '/_features_memo.csv' if not os.path.isfile(file_path): with open(file_path,"w"):pass with open(file_path, 'r+') as f: lines = f.readlines() lines = [line.strip() for line in lines] # ॻ͖ࠐ΋͏ͱ͍ͯ͠Δಛ௃ྔ͕͢Ͱʹॻ͖ࠐ·Ε͍ͯͳ͍͔νΣοΫ col = [line for line in lines if line.split(',')[0] == col_name] if len(col) != 0:return writer = csv.writer(f) writer.writerow([col_name, desc]) $47ܗࣜͰอଘ͓ͯ͘͠ͱ(JUIVC͔Βࢀর͠΍͍͢ ʢ΋ͪΖΜɺ&YDFM΍/VNCFSTͱ͍ͬͨΞϓϦέʔγϣϯ͔ΒͰ΋៉ྷʹݟ͑Δʣ
  46. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ৽͍͠ಛ௃ྔΛ࡞੒͢Δ৔߹

  47. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Family_Size(Feature): def create_features(self): self.train['Family_Size'] = train['Parch'] +

    train['SibSp'] self.test['Family_Size'] = test['Parch'] + test['SibSp'] create_memo('Family_Size','Ո଒ͷ૯਺') IPHFQZʹ৽͍͠ಛ௃ྔੜ੒ॲཧΛهड़ ৽͍͠ಛ௃ྔΛ࡞੒͢Δ৔߹
  48. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Family_Size(Feature): def create_features(self): self.train['Family_Size'] = train['Parch'] +

    train['SibSp'] self.test['Family_Size'] = test['Parch'] + test['SibSp'] create_memo('Family_Size','Ո଒ͷ૯਺') QZUIPOIPHFQZ ৽͍͠ಛ௃ྔΛ࡞੒͢Δ৔߹ IPHFQZʹ৽͍͠ಛ௃ྔੜ੒ॲཧΛهड़
  49. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ class Family_Size(Feature): def create_features(self): self.train['Family_Size'] = train['Parch'] +

    train['SibSp'] self.test['Family_Size'] = test['Parch'] + test['SibSp'] create_memo('Family_Size','Ո଒ͷ૯਺') QZUIPOIPHFQZ ৽͍͠ಛ௃ྔͷΈੜ੒ ৽͍͠ಛ௃ྔΛ࡞੒͢Δ৔߹ IPHFQZʹ৽͍͠ಛ௃ྔੜ੒ॲཧΛهड़
  50. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ σʔλΛಡΈࠐΉࡍ͸ɺಛ௃ྔΛࢦఆͯ͠ϩʔυ͢Δ͚ͩ # ಛ௃ྔͷࢦఆ features = [ "age_mis_val_median", "family__size",

    "cabin", "fare_mis_val_median" ] df = [pd.read_pickle(FEATURE_DIR_NAME + f’{f}_train.pkl') for f in features] df = pd.concat(df, axis=1)
  51. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ σʔλΛಡΈࠐΉࡍ͸ɺಛ௃ྔΛࢦఆͯ͠ϩʔυ͢Δ͚ͩ # ಛ௃ྔͷࢦఆ features = [ "age_mis_val_median", "family__size",

    "cabin", "fare_mis_val_median" ] df = [pd.read_pickle(FEATURE_DIR_NAME + f’{f}_train.pkl') for f in features] df = pd.concat(df, axis=1) Կ͕خ͔͔ͬͨ͠
  52. ಛ௃ྔ؅ཧʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ w ͭͷεΫϦϓτϑΝΠϧʹಛ௃ྔੜ੒Λ·ͱΊΔ͜ͱͰɺಉ͡ܭࢉΛෳ਺ ճ࣮ߦ͢Δ͜ͱΛආ͚ɺ࣌ؒΛ༗ޮ׆༻Ͱ͖Δɻ ɹˠಛ௃ྔͷ࠶ݱੑ΋୲อɻ w ಛ௃ྔͷϝϞΛಉ࣌ʹੜ੒͢Δ͜ͱͰʮ͜ͷಛ௃ྔͳΜ͚ͩͬʁʯͱ಄Λ࢖ ΘͣʹࡁΜͩɻ w

    ಛ௃ྔΛྻ͝ͱʹ؅ཧ͢Δ͜ͱͰऔΓճָ͕͠ʹͳΔɻ ɹˠQJDLMFϑΝΠϧͩͱอଘ΋ಡΈࠐΈ΋଎͍ʂ ɹˠಛ௃ྔ͕๲େʹͳΔ৔߹͸ɺ͋Δఔ౓ͷ୯ҐͰ·ͱΊͯ؅ཧ͢Δํ͕ྑ͍͔΋ɻ
  53. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ˞Լهॻ੶ʹܝࡌ͞Ε͍ͯΔύΠϓϥΠϯΛࢀߟʹ͠·ͨ͠ɻ ɾ,BHHMFͰউͭσʔλ෼ੳͷٕज़ IUUQTHJIZPKQCPPL ‣ ίϚϯυҰൃͰֶशˠ4VCNJUϑΝΠϧ࡞੒·ͰΛ࣮ߦ ‣ ֶशʹ࢖༻ͨ͠ಛ௃ྔ΍Ϟσϧύϥϝʔλ͸MPHͱҰॹʹอଘ ‣

    TIBQΛ༻͍ͯಛ௃ྔͷߩݙ౓ΛՄࢹԽ͠ɺ࣍ճֶश࣌ͷצॴΛݟ͚ͭΔ
  54. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ SVOQZΛ࣮ߦ͢Δ͜ͱͰɺֶशɾਪ࿦ɾ4VCNJUϑΝΠϧΛ࡞੒ # ಛ௃ྔͷࢦఆ features = [ "age_mis_val_median", "family__size",

    "cabin", "fare_mis_val_median" ] run_name = 'lgb_1102' # ࢖༻͢Δಛ௃ྔϦετͷอଘ with open(LOG_DIR_NAME + run_name + "_features.txt", 'wt') as f: for ele in features: f.write(ele+'\n') params_lgb = { 'boosting_type': 'gbdt', 'objective': 'binary', 'early_stopping_rounds': 20, 'verbose': 10, 'random_state': 99, 'num_round': 100 } # ࢖༻͢Δύϥϝʔλͷอଘ with open(LOG_DIR_NAME + run_name + "_param.txt", 'wt') as f: for key,value in sorted(params_lgb.items()): f.write(f'{key}:{value}\n') runner = Runner(run_name, ModelLGB, features, params_lgb, n_fold, name_prefix) runner.run_train_cv() # ֶश runner.run_predict_cv() # ਪ࿦ Submission.create_submission(run_name) # submit࡞੒
  55. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ SVOQZΛ࣮ߦ͢Δ͜ͱͰɺֶशɾਪ࿦ɾ4VCNJUϑΝΠϧΛ࡞੒ # ಛ௃ྔͷࢦఆ features = [ "age_mis_val_median", "family__size",

    "cabin", "fare_mis_val_median" ] run_name = 'lgb_1102' # ࢖༻͢Δಛ௃ྔϦετͷอଘ with open(LOG_DIR_NAME + run_name + "_features.txt", 'wt') as f: for ele in features: f.write(ele+'\n') params_lgb = { 'boosting_type': 'gbdt', 'objective': 'binary', 'early_stopping_rounds': 20, 'verbose': 10, 'random_state': 99, 'num_round': 100 } # ࢖༻͢Δύϥϝʔλͷอଘ with open(LOG_DIR_NAME + run_name + "_param.txt", 'wt') as f: for key,value in sorted(params_lgb.items()): f.write(f'{key}:{value}\n') runner = Runner(run_name, ModelLGB, features, params_lgb, n_fold, name_prefix) runner.run_train_cv() # ֶश runner.run_predict_cv() # ਪ࿦ Submission.create_submission(run_name) # submit࡞੒ ͜ͷSVO@OBNFΛQSFpYͱͯ͠ɺϑΝΠϧ΍ϞσϧΛอଘͯ͘͠ΕΔɻ ྫ w ࢖༻ͨ͠ಛ௃ྔϦετ w ࢖༻ͨ͠ϋΠύʔύϥϝʔλ w GPMEຖͷϞσϧ w ਪ࿦݁Ռ w TVCNJUϑΝΠϧ w TIBQͷܭࢉ݁ՌΠϝʔδϑΝΠϧͳͲ
  56. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ SVOQZΛ࣮ߦ͢Δ͜ͱͰɺֶशɾਪ࿦ɾ4VCNJUϑΝΠϧΛ࡞੒ # ಛ௃ྔͷࢦఆ features = [ "age_mis_val_median", "family__size",

    "cabin", "fare_mis_val_median" ] run_name = 'lgb_1102' # ࢖༻͢Δಛ௃ྔϦετͷอଘ with open(LOG_DIR_NAME + run_name + "_features.txt", 'wt') as f: for ele in features: f.write(ele+'\n') params_lgb = { 'boosting_type': 'gbdt', 'objective': 'binary', 'early_stopping_rounds': 20, 'verbose': 10, 'random_state': 99, 'num_round': 100 } # ࢖༻͢Δύϥϝʔλͷอଘ with open(LOG_DIR_NAME + run_name + "_param.txt", 'wt') as f: for key,value in sorted(params_lgb.items()): f.write(f'{key}:{value}\n') runner = Runner(run_name, ModelLGB, features, params_lgb, n_fold, name_prefix) runner.run_train_cv() # ֶश runner.run_predict_cv() # ਪ࿦ Submission.create_submission(run_name) # submit࡞੒ ੜ੒͞ΕΔϑΝΠϧྫ
  57. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ੜ੒͞ΕΔϑΝΠϧͷྫʢϑΥϧμ͸దٓ෼͚͍ͯ·͢ʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w

    MHC@@QSFEQLMʢUFTUσʔλͰͷਪ࿦݁Ռʣ w MHC@@@TVCNJTTJPODTWʢਪ࿦݁ՌΛLBHHMFʹఏग़Ͱ͖ΔDTWʹม׵ͨ͠΋ͷʣ w MHC@@@GFBUVSFTUYUʢࠓճͷֶशʹ࢖༻ͨ͠ಛ௃ྔϦετʣ w MHC@@@QBSBNUYUʢࠓճͷֶशʹ࢖༻ͨ͠ϋΠύʔύϥϝʔλʣ w MHC@@@TIBQQOHʢTIBQͰܭࢉͨ͠ՄࢹԽΠϝʔδʣ w HFOFSBMMPHʢܭࢉϩάϑΝΠϧʣ w SFTVMUMPHʢϞσϧͷείΞ͚͕ͩهࡌ͞ΕͨϩάϑΝΠϧʣ
  58. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ੜ੒͞ΕΔϑΝΠϧͷྫʢϑΥϧμ͸దٓ෼͚͍ͯ·͢ʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w

    MHC@@QSFEQLMʢUFTUσʔλͰͷਪ࿦݁Ռʣ w MHC@@@TVCNJTTJPODTWʢਪ࿦݁ՌΛLBHHMFʹఏग़Ͱ͖ΔDTWʹม׵ͨ͠΋ͷʣ w MHC@@@GFBUVSFTUYUʢࠓճͷֶशʹ࢖༻ͨ͠ಛ௃ྔϦετʣ w MHC@@@QBSBNUYUʢࠓճͷֶशʹ࢖༻ͨ͠ϋΠύʔύϥϝʔλʣ w MHC@@@TIBQQOHʢTIBQͰܭࢉͨ͠ՄࢹԽΠϝʔδʣ w HFOFSBMMPHʢܭࢉϩάϑΝΠϧʣ w SFTVMUMPHʢϞσϧͷείΞ͚͕ͩهࡌ͞ΕͨϩάϑΝΠϧʣ
  59. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ੜ੒͞ΕΔϑΝΠϧͷྫʢϑΥϧμ͸దٓ෼͚͍ͯ·͢ʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w

    MHC@@QSFEQLMʢUFTUσʔλͰͷਪ࿦݁Ռʣ w MHC@@@TVCNJTTJPODTWʢਪ࿦݁ՌΛLBHHMFʹఏग़Ͱ͖ΔDTWʹม׵ͨ͠΋ͷʣ w MHC@@@GFBUVSFTUYUʢࠓճͷֶशʹ࢖༻ͨ͠ಛ௃ྔϦετʣ w MHC@@@QBSBNUYUʢࠓճͷֶशʹ࢖༻ͨ͠ϋΠύʔύϥϝʔλʣ w MHC@@@TIBQQOHʢTIBQͰܭࢉͨ͠ՄࢹԽΠϝʔδʣ w HFOFSBMMPHʢܭࢉϩάϑΝΠϧʣ w SFTVMUMPHʢϞσϧͷείΞ͚͕ͩهࡌ͞ΕͨϩάϑΝΠϧʣ
  60. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ੜ੒͞ΕΔϑΝΠϧͷྫʢϑΥϧμ͸దٓ෼͚͍ͯ·͢ʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w

    MHC@@QSFEQLMʢUFTUσʔλͰͷਪ࿦݁Ռʣ w MHC@@@TVCNJTTJPODTWʢਪ࿦݁ՌΛLBHHMFʹఏग़Ͱ͖ΔDTWʹม׵ͨ͠΋ͷʣ w MHC@@@GFBUVSFTUYUʢࠓճͷֶशʹ࢖༻ͨ͠ಛ௃ྔϦετʣ w MHC@@@QBSBNUYUʢࠓճͷֶशʹ࢖༻ͨ͠ϋΠύʔύϥϝʔλʣ w MHC@@@TIBQQOHʢTIBQͰܭࢉͨ͠ՄࢹԽΠϝʔδʣ w HFOFSBMMPHʢܭࢉϩάϑΝΠϧʣ w SFTVMUMPHʢϞσϧͷείΞ͚͕ͩهࡌ͞ΕͨϩάϑΝΠϧʣ
  61. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ੜ੒͞ΕΔϑΝΠϧͷྫʢϑΥϧμ͸దٓ෼͚͍ͯ·͢ʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w

    MHC@@QSFEQLMʢUFTUσʔλͰͷਪ࿦݁Ռʣ w MHC@@@TVCNJTTJPODTWʢਪ࿦݁ՌΛLBHHMFʹఏग़Ͱ͖ΔDTWʹม׵ͨ͠΋ͷʣ w MHC@@@GFBUVSFTUYUʢࠓճͷֶशʹ࢖༻ͨ͠ಛ௃ྔϦετʣ w MHC@@@QBSBNUYUʢࠓճͷֶशʹ࢖༻ͨ͠ϋΠύʔύϥϝʔλʣ w MHC@@@TIBQQOHʢTIBQͰܭࢉͨ͠ՄࢹԽΠϝʔδʣ w HFOFSBMMPHʢܭࢉϩάϑΝΠϧʣ w SFTVMUMPHʢϞσϧͷείΞ͚͕ͩهࡌ͞ΕͨϩάϑΝΠϧʣ
  62. ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ ϚϚͷҰาΛࢧ͑Δ ੜ੒͞ΕΔϑΝΠϧͷྫʢϑΥϧμ͸దٓ෼͚͍ͯ·͢ʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w MHC@@GPMENPEFMʢGPMEͰ࡞੒͞ΕͨϞσϧʣ w

    MHC@@QSFEQLMʢUFTUσʔλͰͷਪ࿦݁Ռʣ w MHC@@@TVCNJTTJPODTWʢਪ࿦݁ՌΛLBHHMFʹఏग़Ͱ͖ΔDTWʹม׵ͨ͠΋ͷʣ w MHC@@@GFBUVSFTUYUʢࠓճͷֶशʹ࢖༻ͨ͠ಛ௃ྔϦετʣ w MHC@@@QBSBNUYUʢࠓճͷֶशʹ࢖༻ͨ͠ϋΠύʔύϥϝʔλʣ w MHC@@@TIBQQOHʢTIBQͰܭࢉͨ͠ՄࢹԽΠϝʔδʣ w HFOFSBMMPHʢܭࢉϩάϑΝΠϧʣ w SFTVMUMPHʢϞσϧͷείΞ͚͕ͩهࡌ͞ΕͨϩάϑΝΠϧʣ Կ͕خ͔͔ͬͨ͠
  63. ϚϚͷҰาΛࢧ͑Δ w ʮ͜ͷಛ௃ྔʯͱʮ͜ͷύϥϝʔλʯΛ࢖ֶͬͯशͤͨ͞Ϟσϧ ʹؔͯ͠ɺʮ֤λεΫʹཁͨ࣌ؒ͠ʯͱʮ֤GPMEʴ࠷ऴతͳεί ΞʯΛҙࣝ͠ͳͯ͘΋؅ཧͰ͖ΔΑ͏ʹɻ w TIBQͷܭࢉ݁Ռ΍GFBUVSFJNQPSUBODFΛग़ྗ͓ͯ͘͜͠ͱ Ͱɺ࣍ͷֶश࣌ͷצॴ͕௫ΊΔΑ͏ʹɻ ֶशɾਪ࿦ύΠϓϥΠϯʹ͍ͭͯ

  64. ·ͱΊ ϚϚͷҰาΛࢧ͑Δ

  65. ·ͱΊ ϚϚͷҰาΛࢧ͑Δ w ಛ௃ྔ؅ཧ͍͍ͧʂ  ͭͷεΫϦϓτϑΝΠϧʹಛ௃ྔੜ੒Λ·ͱΊΔ͜ͱͰɺಉ͡ܭࢉΛෳ਺ճ࣮ߦ͢Δ͜ͱ ΛճආͰ͖Δʂ  ಛ௃ྔͷϝϞΛಉ࣌ʹੜ੒͢Δ͜ͱͰʮ͜ͷಛ௃ྔͳΜ͚ͩͬʁʯͱ಄Λ࢖͏ճ਺͕ݮ Δʂ

     ಛ௃ྔΛྻ͝ͱʹ؅ཧ͢Δ͜ͱͰऔΓճָ͕͠ʹͳͬͨʂʢ͕ɺಛ௃ྔ͕๲େͳ৔߹͸͋ Δఔ౓ͷ·ͱ·ΓͰ؅ཧͨ͠ํ͕ྑ͍͔΋ʣ w ύΠϓϥΠϯ͍͍ͧʂ  ύΠϓϥΠϯΛߏங͢Δ͜ͱͰɺߴ଎ͳ1%$"Λ࣮ݱʂ  ֶशʹ࢖༻ͨ͠ಛ௃ྔͱύϥϝʔλΛ؅ཧ͢Δ͜ͱͰɺ࠶ݱੑ΋୲อ͞Ε৺ཧత҆શੑ΋
  66. ϚϚͷҰาΛࢧ͑Δ ͝ਗ਼ௌ͋Γ͕ͱ͏͍͟͝·ͨ͠