Concrete examples of model applications

(a) Predicting the traveller's context – e.g., whether the user is travelling with children
(b) Gathering content such as reviews and deciding which of it to show – e.g., summarizing why a property is popular
(c) Trends in prices and options

Figure 1: Examples of Application of Machine Learning – (a) Traveller Context Model, (b) Content Curation Model, (c) Content Augmentation Model

Excerpt from the paper:
"[...] noisy and vast, making it hard to be consumed by users. Content Curation is the process of making content accessible to humans. For example, we have collected over 171M reviews in more than 1.5M properties, which contain highly valuable information about the service a particular accommodation provides and a very rich source of selling points. A Machine Learning model 'curates' reviews, constructing brief and representative summaries of the outstanding aspects of an accommodation (Figure 1(b)).

2.1.6 Content Augmentation. The whole process of users browsing, selecting, booking, and reviewing accommodations puts at our disposal implicit signals that allow us to construct a deeper understanding of the services and the quality a particular property or destination can offer. Models in this family derive attributes of a property, destination, or even specific dates, augmenting the explicit service offer. Content Augmentation differs from Content Curation in that curation is about making already existing content easily accessible by users, whereas augmentation is about enriching an existing entity using data from many others. To illustrate this idea, we give two examples:
• Great Value: Booking.com provides a wide selection of properties, offering different levels of value in the form of amenities, location, quality of the service and facilities, policies, and many other dimensions. Users need to assess how the price asked for a room relates to the value they would obtain. [...]"
"MMNPEFMGBNJMJFTDBOQSPWJEFWBMVF Figure 2: Model Families Business Impact relative to median impact. 3 MODELING: OFFLINE MODEL PERFORMANCE IS JUST A HEALTH CHECK A common approach to quantify the quality of a model is to estimate or predict the performance the model will have when exposed to data it has never seen before. Dierent avors of cross-validation are used to estimate the value of a specic metric that depends on the task (classication, regression, ranking). In Booking.com we are very much concerned with the value a model brings to our customers and our business. Such value is estimated through Randomized Controlled Trials (RCTs) and specic business metrics like conversion, customer service tickets or cancellations. A very interesting nding is that increasing the performance of a model, does not necessarily translates to a gain in value. Figure 4 illustrates this learning. Each point represents the comparison of a successful model that proved its value through a previous RCT, versus a new model. The horizontal coordinate is given by the relative dierence between the new model and the current baseline according to an oine estimation of the performance of the models. This data is only about classiers and rankers, evaluated by ROC AUC and Mean Reciprocal Rank respectively. The vertical coordinate is given by the relative dierence in a business metric of interest as observed Applied Data Science Track Paper KDD ’19, August 4–8, 2019, Anchorage, AK, USA Figure 2: Model Families Business Impact relative to median impact. Figure 3: A sequence of experiments on a Recommendations Prod- uct. Each experiment tests a new version focusing on the indicated discipline or ML Problem Setup. The length of the bar is the ob- served impact relative to the rst version (all statistically signi- cant) improving the model behind them. We have also observed models like con interest does no this lear model t model. T between oine e only abo Recipro the rela in a RC the sam (46 mod deeper a conde with 90 of corre between same tim the exte models built in furtherm already remarka areas of exampl Machin with hu the busi This the one lDPSFNFUSJDTzʹର͢Δ֤ϞσϧಋೖͷΠϯύΫτ ʢϕϯνϚʔΫϞσϧΛͱ͢Δʣ ਪનγεςϜͷվળཤྺ ʢॳظΞϧΰϦζϜͷΠϯύΫτΛͱ͢Δʣ ࣌ؒ ʢ$POUFOU$VSBUJPOҎ֎ʣͯ͢ϙδςΟϒͳޮՌ ஈ֊తʹվળʢվળޮՌανΓؾຯʣ
Modeling: Offline Model Performance Is Just a Health Check

No correlation between offline and online evaluation. Possible explanations:
- Limits of the model itself?
- An uncanny-valley effect?
  • Predictions that are too accurate can feel creepy to users
- Overfitting to the proxy task?
  • Has the model become so specialized in raising CTR that it only recommends click-bait?

Figure 4: Relative difference in a business metric vs relative performance difference between a baseline model and a new one.
- Horizontal axis: change in the offline evaluation metric (ROC AUC, Mean Reciprocal Rank, etc.)
- Vertical axis: online business gain (conversion, etc.)

Figure 5: Uncanny valley: people do not always react positively to accurate predictions (destination recommender using Markov chains).
Deployment: Time Is Money

Artificially adding latency to the system significantly reduced the conversion rate. Mitigations (caching and bulking are sketched in code below):
- Distribute computation over a cluster
- Build an in-house engine for fast model inference
- Use sparse models (few parameters)
- Cache the results of frequent computations
- Process requests in bulk
- Approximate, e.g., treat geographically close locations as a single point

Figure 6: Impact of latency on conversion rate

Excerpt from the paper:
"[...] store. When the feature space is too big we can still cache frequent requests in memory.
• Bulking: Some products require many requests per prediction. To minimize network load, we bulk them together in [...]

Delayed feedback: In other cases the true label is only observed many days or even weeks after the prediction is made. Consider a model that predicts whether a user will submit a review or not. We might make use of this model at shopping time, but the true label will only be observed after the guest completes the stay, which can be months later. Therefore, in these situations, label-dependent metrics like precision, recall, etc., are inappropriate, which led us to the following [...]"
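A minimal illustration of the caching and bulking mitigations; the fixed linear scorer and all interfaces are assumptions for the sketch, not the paper's actual serving stack:

```python
# Sketch of two latency mitigations: in-memory caching of frequent requests
# and bulking many candidates into a single call. The linear scorer stands
# in for any expensive model.
from functools import lru_cache

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=16)          # toy "model": a fixed linear scorer

def predict_batch(features: np.ndarray) -> np.ndarray:
    """Bulking: one call over shape (n, 16) instead of n network round-trips."""
    return features @ weights

@lru_cache(maxsize=100_000)
def predict_cached(feature_key: tuple) -> float:
    """Caching: memoize scores for frequently repeated (hashable) inputs."""
    return float(predict_batch(np.asarray(feature_key)[None, :])[0])

# Bulk-score 500 candidate properties in a single call.
candidates = rng.normal(size=(500, 16))
scores = predict_batch(candidates)

# Repeated identical requests are served from memory, not recomputed.
key = tuple(candidates[0])
predict_cached(key); predict_cached(key)
print(predict_cached.cache_info())     # hits=1, misses=1
```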
Monitoring: Unsupervised Red Flags

Figure 7: Examples of Response Distribution Charts. Panel annotations: ideal / underfitting? / not learnable in the first place? / missing normalization? / mishandled outliers? / model too sparse?
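A Response Distribution Chart is just a histogram of the model's raw output scores, inspected for the shapes annotated above. A minimal sketch with synthetic scores; the healthy two-peak shape and the single-bin red-flag check are illustrative choices, not the paper's exact criteria:

```python
# Sketch of a Response Distribution Chart (RDC): histogram the output scores
# of a binary classifier. Defects show up as shapes: a single spike can mean
# underfitting, mass outside [0, 1] means missing normalization, many jagged
# peaks can mean an overly sparse model.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# A healthy chart: smooth, with one peak per class.
scores = np.concatenate([rng.beta(2, 8, size=8_000),   # negatives: low scores
                         rng.beta(8, 2, size=2_000)])  # positives: high scores

counts, _, _ = plt.hist(scores, bins=50, edgecolor="black")
plt.xlabel("model output score")
plt.ylabel("number of predictions")
plt.title("Response Distribution Chart")
plt.savefig("rdc.png")

# Crude, label-free red flag: a degenerate model piles everything in one bin.
if counts.max() / counts.sum() > 0.5:
    print("red flag: over half of all predictions fall into a single bin")
```

The appeal of the RDC is exactly that it needs no labels, so it works even under the delayed-feedback conditions described above.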
Selective triggering

Simply splitting users into control/treatment groups is not enough.
- Example: a model that recommends amusement facilities to travellers with children.
- Travellers without children must be excluded from the comparison; otherwise they only add noise.

Figure 8: Experiment design for selective triggering. (Compare the two groups outlined in blue.)
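A minimal sketch of the design in Figure 8 (all field names and the trigger are hypothetical): everyone is randomized, but only users for whom the family recommendation would actually fire enter the comparison.

```python
# Sketch of selective triggering (Figure 8): randomize all users, but compare
# conversion only within the subpopulation the treatment can actually affect.
import random

def assign_group(user_id: int) -> str:
    """Deterministic per-user split, independent of the trigger."""
    return "treatment" if random.Random(f"exp1:{user_id}").random() < 0.5 else "control"

def triggered_conversion_rates(users: list[dict]) -> dict[str, float]:
    # Users travelling without children would see identical pages in both
    # arms; including them only dilutes the measured effect with noise.
    triggered = [u for u in users if u["is_family_trip"]]
    rates = {}
    for group in ("control", "treatment"):
        subset = [u for u in triggered if assign_group(u["id"]) == group]
        rates[group] = sum(u["converted"] for u in subset) / max(len(subset), 1)
    return rates

# Hypothetical traffic: 25% family trips, ~10% base conversion.
random.seed(42)
users = [{"id": i, "is_family_trip": i % 4 == 0, "converted": random.random() < 0.1}
         for i in range(10_000)]
print(triggered_conversion_rates(users))
```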
Model-output dependent triggering

Simply splitting users into control/treatment groups is not enough.
- Example: evaluating a model Y that suggests destinations other than the queried one, but only to travellers whom a model X classifies as flexible about where to go.

Figure 9: Experiment design for model-output dependent triggering and control for performance impact. (Compare the two groups outlined in blue. Comparing the control group with the treatment group as a whole measures the latency impact of invoking the model; the control group also serves as a safety net in case something goes wrong.)
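Figure 9's routing can be written down directly. A minimal sketch (models X and Y and the result stubs are hypothetical stand-ins): model X runs for the whole treatment arm, so comparing that arm against the outer control isolates X's latency cost, while the inner split among triggered users isolates Y's effect.

```python
# Sketch of model-output dependent triggering (Figure 9). Model X predicts
# whether a traveller is flexible about the destination; model Y proposes
# alternative destinations for flexible travellers.
import random

def model_x_is_flexible(query: dict) -> bool:     # hypothetical stub
    return query.get("dates") is None

def baseline_results(query: dict) -> str:         # hypothetical stub
    return f"standard results for {query['destination']}"

def model_y_alternatives(query: dict) -> str:     # hypothetical stub
    return f"alternatives to {query['destination']}"

def serve(user_id: int, query: dict) -> str:
    in_treatment = random.Random(f"outer:{user_id}").random() < 0.5
    if not in_treatment:
        return baseline_results(query)            # control: X never runs

    flexible = model_x_is_flexible(query)         # treatment: X always runs,
    if not flexible:                              # so its latency is paid here
        return baseline_results(query)            # treated, but not triggered

    inner = random.Random(f"inner:{user_id}").random() < 0.5
    if not inner:
        return baseline_results(query)            # triggered control (safety net)
    return model_y_alternatives(query)            # triggered treatment

# Comparisons this design supports:
#  - outer control vs. whole treatment arm   -> latency impact of invoking X
#  - triggered control vs. triggered treatm. -> causal effect of acting on Y
print(serve(7, {"destination": "Amsterdam", "dates": None}))
```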
Comparing Models

Simply splitting users into control/treatment groups is not enough.
- Example: the current model 1 is 80% accurate, and the new model 2 halves model 1's error rate but introduces 5% new errors. Model 2 then fixes half of the 20% of cases model 1 gets wrong (10%) while newly breaking 5%, so the two models produce different outputs on only ~15% of cases; the remaining ~85% carry no signal about which model is better.

Figure 10: Experiment design for comparing models. (The experiment triggers only on the subjects where the two models disagree.)
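Figure 10's design can be sketched the same way (the model stubs are hypothetical): subjects on whom both models agree are served that shared output and excluded, and only the disagreements are randomized, concentrating the experiment's power where the models actually differ.

```python
# Sketch of comparing two models (Figure 10): trigger the experiment only
# where their outputs disagree, since subjects who would see identical
# output carry no signal about which model is better.
import random

def model_1(x: int) -> str:                      # hypothetical stand-in
    return "A" if x % 5 else "B"

def model_2(x: int) -> str:                      # hypothetical stand-in
    return "A" if x % 7 else "B"

def serve(user_id: int, x: int) -> str:
    out1, out2 = model_1(x), model_2(x)
    if out1 == out2:
        return out1                              # agreement: outside the RCT
    arm2 = random.Random(f"cmp:{user_id}").random() < 0.5
    # Only disagreeing subjects (~15% in the slide's example) are logged
    # and analyzed.
    print(f"user {user_id}: serving {'model 2' if arm2 else 'model 1'}")
    return out2 if arm2 else out1

for uid in range(10):
    serve(uid, x=uid)
```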