Upgrade to Pro — share decks privately, control downloads, hide ads and more …

spark.ml の API で XGBoost を扱いたい!#shokaispark

spark.ml の API で XGBoost を扱いたい!#shokaispark

『詳解 Apache Spark』出版記念イベントでの発表資料です。

http://connpass.com/event/30375/

KOMIYA Atsushi

May 11, 2016
Tweet

More Decks by KOMIYA Atsushi

Other Decks in Programming

Transcript

  1. on

  2. XGBoost • ޯ഑ϒʔεςΟϯάͷ࣮૷ͷҰͭ • ܾఆ໦ʹର͢Δޯ഑ϒʔεςΟϯά͸ɺ MLlib Ͱ΋ GBTClassifier / GBTRegressor

    ͱ ࣮ͯ͠૷͞Ε͍ͯΔ • ༧ଌਫ਼౓ͷߴ͞ͳͲ͔ΒɺKaggler ͳํʑΛ த৺ʹਓؾ͕͋ΔʢͬΆ͍ʣ
  3. ͜ͷൃදͰ͓࿩͢Δ͜ͱ • XGBoost on Spark ͷݱঢ় • spark.ml ͷ API

    ͰػցֶशΞϧΰϦζϜΛ
 ࣮૷͢ΔࡍͷϙΠϯτ • ಛʹΠϯλϑΣʔε෦෼ʹண໨͢Δ
  4. SparkXGBoost • https://github.com/rotationsymmetry/sparkxgboost • XGBoost ͱಉ͡ޯ഑ϒʔεςΟϯάπϦʔΛɺSpark ޲͚ ʹ pure Scala

    Ͱ࣮૷͍ͯ͠Δ • Spark packages ʹొ࿥͞Ε͍ͯΔ • ΦϦδφϧͷ XGBoost ʹͲ͜·Ͱ஧࣮ͳ࣮૷ͳͷ͔ෆ໌ • ver 0.6 ·ͰͷϩʔυϚοϓ͕͋Δ͕ɺ։ൃ͕׆ൃͰ͸ͳ͍ • ࠷ޙͷίϛοτ͸ࡢ೥ 11 ݄ɺver 0.2
  5. xgboost4j-spark • DMLC ͕ఏڙ͢Δެࣜͷ Spark integration • ͨͩ͠ɺDataFrame ʹ͸ରԠ͍ͯ͠ͳ͍ •

    XGBoost ຊମͷ git ϦϙδτϦ্Ͱϝϯς͞Ε͍ͯΔ • ֶश͓Αͼ༧ଌͷ۩ମతͳॲཧ͸ɺJNI ܦ༝Ͱ C++ ࣮૷ʹ͓೚ͤ • ֶश࣌ͷϫʔΧʔؒͷ௨৴ʹ͸ Rabit Λར༻͍ͯ͠Δ • Maven central ʹ͸ొ࿥͞Ε͍ͯͳ͍ • ར༻͢Δʹ͸໺ྑϏϧυඞਢ
  6. ࠓճ͸… • SparkXGBoost ͷΑ͏ʹɺXGBoost Λֶशث ؚΊͯ pure Scala Ͱ࠶࣮૷͢Δͷ͸ϋʔυϧ ͕ߴ͍

    • xgboost4j-spark ͕ࢀর͢Δ xgboost4j Λ
 ϕʔεʹɺspark.ml ͷ API Ͱϥοϓͯ͠ΈΔ
  7. spark.ml ͷ࣮૷ΛಡΉ • ࣮૷ΛಡΉͷʹ͓͢͢ΊͳػցֶशΞϧΰϦζϜ • ϩδεςΟοΫճؼ • LogisticRegression / LogisticRegressionModel

    • ܾఆ໦ (෼ྨ) • DecisionTreeClassifier / DecisionTreeClassificationModel • ܾఆ໦ (ճؼ) • DecisionTreeRegressor / DecisionTreeRegressionModel
  8. Predictor Ϋϥε • ΧϥϜ • label: ਖ਼ղϥϕϧΛ࣋ͭΧϥϜ • features: ಛ௃ϕΫτϧΛ࣋ͭΧϥϜ

    • prediction: ༧ଌ͞Εͨϥϕϧ͕ઃఆ͞ΕΔΧϥϜ • ϝιου • train (ந৅ϝιου): ֶशॲཧΛ࣮૷͢Δ • extractLabeledPoints: DataFrame ͔Β RDD[LabeledPoint] Λੜ੒ͯ͘͠ΕΔϝιου
  9. ProbabilisticClassifier Ϋϥε • ΧϥϜ • probability: (ೋ஋෼ྨͰ͋Ε͹) ਖ਼ղϥϕϧ͕ 1 Ͱ͋Δͱ༧ଌ͞ΕΔ֬཰͕ઃఆ͞ΕΔΧϥϜ

    • ύϥϝʔλ • threshold: ༧ଌ֬཰ (probability ΧϥϜ) ʹج͍ͮ ͯ 0/1 ʹৼΓ෼͚Δࡍͷ͖͍͠஋
  10. PredictionModel Ϋϥε • ϝιου • transform: transformImpl ϝιουΛݺͼग़͚ͩ͢ • transformImpl:

    ༩͑ΒΕͨ DataFrame ͷͦΕͧΕ ͷߦ͝ͱʹ predict ϝιουΛݺͼग़͢ • predict (ந৅ϝιου): ༩͑ΒΕͨಛ௃ϕΫτϧ͔ Β༧ଌ݁ՌΛੜ੒͢ΔॲཧΛ࣮૷͢Δ
  11. ClassificationModel Ϋϥε • ϝιου • transform: predict ϝιου΍ predictRaw &

    raw2Prediction ϝιουΛݺͼग़ͯ͠༧ଌ݁ՌΛٻΊΔ • predict: predictRaw ϝιουͷ݁ՌΛ raw2Prediction ʹ౉͠ ͯ༧ଌϥϕϧΛฦ͢ • predictRaw (ந৅ϝιου): ༧ଌϞσϧΛ༻͍ͯੜͷ༧ଌ஋Λ ฦ͢ॲཧΛ࣮૷͢Δ • raw2Prediction (ந৅ϝιου): ༧ଌϞσϧ͕ੜ੒ͨ͠ੜͷ༧ ଌ஋͔ΒϥϕϧΛ༧ଌॲཧΛ࣮૷͢Δ
  12. ProbabilisticClassificationModel Ϋϥε • ϝιου • predictRaw (ந৅ϝιου): ClassificationModel ʹಉ͡ •

    raw2ProbabilityInPlace (ந৅ϝιου): ੜͷ༧ଌ஋͔Β༧ଌ ֬཰ʹม׵͢ΔॲཧΛ࣮૷͢Δ • predictProbability: predictRaw ϝιουͷ݁ՌΛ raw2ProbabilityInPlace ϝιουʹ౉ͯ͠༧ଌ֬཰ʹม׵͢Δ • probability2Prediction: ༧ଌ֬཰͔Β༧ଌϥϕϧΛฦ͢ • raw2Prediction: ੜͷ༧ଌ஋͔Β༧ଌϥϕϧΛฦ͢
  13. ֶशثɾ༧ଌϞσϧͷ࣮૷ͷϙΠϯτ (2) • ෼ྨΞϧΰϦζϜͷ࣮૷ • ֶशثͷ࣮૷Ϋϥε͸ ProbabilisticClassifier Λ extends ͠Α͏

    • ༧ଌϞσϧͷ࣮૷Ϋϥε͸ ProbabilisticClassificationModel Λ extends ͠Α͏ • (ςϯϓϨతͳϝιουͷ࣮૷Λআ͚͹) predictRaw, raw2probabilityInPlace ϝιουΛ࣮૷͢Δ͚ͩͰࡁΉ
  14. ֶशثɾ༧ଌϞσϧͷ࣮૷ͷϙΠϯτ (3) • ճؼΞϧΰϦζϜͷ࣮૷ • ֶशثͷ࣮૷Ϋϥε͸ Predictor Λextends ͠Α͏ •

    ༧ଌϞσϧͷ࣮૷Ϋϥε͸ PredictionModel Λ extends ͠Α͏ • predict ϝιουΛ࣮૷͢Δ͚ͩͰࡁΉ
  15. ύϥϝʔλͷ࣮૷ྫ trait XGBoostGeneralParams extends Params {
 final val booster: Param[String]

    = new Param(this, "booster", // ύϥϝʔλ໊
 "which booster to use, can be gbtree or gblinear.", // આ໌ // ύϥϝʔλʹର͢ΔόϦσʔγϣϯϧʔϧ
 ParamValidators.inArray(Array("gbtree", "gblinear")))
 // setter, getter Λ༻ҙ͢Δ
 def setBooster(value: String): this.type = set(booster, value)
 def getBooster: String = $(booster)
 // σϑΥϧτ஋Λઃఆ͢Δ setDefault(booster, "gbtree")
 }
  16. ύϥϝʔλͷ࣮૷ϙΠϯτ (1) • ύϥϝʔλΛఆٛ͠Α͏ • ܕ • Param, DoubleParam, IntParam,

    FloatParam, LongParam… • ύϥϝʔλ໊ • આ໌ • όϦσʔγϣϯ • ParamValidators ͕ఏڙ͢ΔϑΝΫτϦϝιουΛར༻͢Δ