Concrete examples of model applications

(a) Predicting the traveller's context – e.g., whether the user is travelling with children
(b) Gathering content such as reviews and deciding which of it to show – e.g., summarizing why a property is popular
(c) Trends in prices and options

Figure 1: Examples of Application of Machine Learning – (a) Traveller Context Model, (b) Content Curation Model, (c) Content Augmentation Model

Excerpt from the paper:
"[...] noisy and vast, making it hard to be consumed by users. Content Curation is the process of making content accessible to humans. For example, we have collected over 171M reviews in more than 1.5M properties, which contain highly valuable information about the service a particular accommodation provides and a very rich source of selling points. A Machine Learning model 'curates' reviews, constructing brief and representative summaries of the outstanding aspects of an accommodation (Figure 1(b)).

2.1.6 Content Augmentation. The whole process of users browsing, selecting, booking, and reviewing accommodations puts at our disposal implicit signals that allow us to construct a deeper understanding of the services and the quality a particular property or destination can offer. Models in this family derive attributes of a property, destination, or even specific dates, augmenting the explicit service offer. Content Augmentation differs from Content Curation in that curation is about making already existing content easily accessible by users, whereas augmentation is about enriching an existing entity using data from many others. To illustrate this idea, we give two examples:
• Great Value: Booking.com provides a wide selection of properties, offering different levels of value in the form of amenities, location, quality of the service and facilities, policies, and many other dimensions. Users need to assess how the price asked for a room relates to the value they would obtain. [...]"
"MMNPEFMGBNJMJFTDBOQSPWJEFWBMVF Figure 2: Model Families Business Impact relative to median impact. 3 MODELING: OFFLINE MODEL PERFORMANCE IS JUST A HEALTH CHECK A common approach to quantify the quality of a model is to estimate or predict the performance the model will have when exposed to data it has never seen before. Dierent avors of cross-validation are used to estimate the value of a specic metric that depends on the task (classication, regression, ranking). In Booking.com we are very much concerned with the value a model brings to our customers and our business. Such value is estimated through Randomized Controlled Trials (RCTs) and specic business metrics like conversion, customer service tickets or cancellations. A very interesting nding is that increasing the performance of a model, does not necessarily translates to a gain in value. Figure 4 illustrates this learning. Each point represents the comparison of a successful model that proved its value through a previous RCT, versus a new model. The horizontal coordinate is given by the relative dierence between the new model and the current baseline according to an oine estimation of the performance of the models. This data is only about classiers and rankers, evaluated by ROC AUC and Mean Reciprocal Rank respectively. The vertical coordinate is given by the relative dierence in a business metric of interest as observed Applied Data Science Track Paper KDD ’19, August 4–8, 2019, Anchorage, AK, USA Figure 2: Model Families Business Impact relative to median impact. Figure 3: A sequence of experiments on a Recommendations Prod- uct. Each experiment tests a new version focusing on the indicated discipline or ML Problem Setup. The length of the bar is the ob- served impact relative to the rst version (all statistically signi- cant) improving the model behind them. We have also observed models like con interest does no this lear model t model. T between oine e only abo Recipro the rela in a RC the sam (46 mod deeper a conde with 90 of corre between same tim the exte models built in furtherm already remarka areas of exampl Machin with hu the busi This the one lDPSFNFUSJDTzʹର͢Δ֤ϞσϧಋೖͷΠϯύΫτ ʢϕϯνϚʔΫϞσϧΛͱ͢Δʣ ਪનγεςϜͷվળཤྺ ʢॳظΞϧΰϦζϜͷΠϯύΫτΛͱ͢Δʣ ࣌ؒ ʢ$POUFOU$VSBUJPOҎ֎ʣͯ͢ϙδςΟϒͳޮՌ ஈ֊తʹվળʢվળޮՌανΓؾຯʣ
Modeling: Offline Model Performance Is Just a Health Check

No correlation between offline and online evaluation. Possible explanations:
- Limits of the model itself?
- An uncanny-valley effect?
  • Predictions that are too accurate can feel creepy to users
- Overfitting to the proxy task?
  • Has the model become so specialized in raising CTR that it only recommends click-bait?

Figure 4: Relative difference in a business metric vs relative performance difference between a baseline model and a new one.
- Horizontal axis: change in the offline evaluation metric (ROC AUC, Mean Reciprocal Rank, etc.)
- Vertical axis: online business gain (conversion, etc.)

Figure 5: Uncanny valley: people do not always react positively to accurate predictions (destination recommender using Markov chains).
Deployment: Time Is Money

Artificially adding latency to the system significantly reduced the conversion rate. Mitigations (caching and bulking are sketched in code below):
- Distribute computation over a cluster
- Build an in-house engine for fast model inference
- Use sparse models (few parameters)
- Cache the results of frequent computations
- Process requests in bulk
- Approximate, e.g., treat geographically close locations as a single point

Figure 6: Impact of latency on conversion rate

Excerpt from the paper:
"[...] store. When the feature space is too big we can still cache frequent requests in memory.
• Bulking: Some products require many requests per prediction. To minimize network load, we bulk them together in [...]

Delayed feedback: In other cases the true label is only observed many days or even weeks after the prediction is made. Consider a model that predicts whether a user will submit a review or not. We might make use of this model at shopping time, but the true label will only be observed after the guest completes the stay, which can be months later. Therefore, in these situations, label-dependent metrics like precision, recall, etc., are inappropriate, which led us to the following [...]"
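A minimal illustration of the caching and bulking mitigations; the fixed linear scorer and all interfaces are assumptions for the sketch, not the paper's actual serving stack:

```python
# Sketch of two latency mitigations: in-memory caching of frequent requests
# and bulking many candidates into a single call. The linear scorer stands
# in for any expensive model.
from functools import lru_cache

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=16)          # toy "model": a fixed linear scorer

def predict_batch(features: np.ndarray) -> np.ndarray:
    """Bulking: one call over shape (n, 16) instead of n network round-trips."""
    return features @ weights

@lru_cache(maxsize=100_000)
def predict_cached(feature_key: tuple) -> float:
    """Caching: memoize scores for frequently repeated (hashable) inputs."""
    return float(predict_batch(np.asarray(feature_key)[None, :])[0])

# Bulk-score 500 candidate properties in a single call.
candidates = rng.normal(size=(500, 16))
scores = predict_batch(candidates)

# Repeated identical requests are served from memory, not recomputed.
key = tuple(candidates[0])
predict_cached(key); predict_cached(key)
print(predict_cached.cache_info())     # hits=1, misses=1
```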
Monitoring: Unsupervised Red Flags

Figure 7: Examples of Response Distribution Charts. Panel annotations: ideal / underfitting? / not learnable in the first place? / missing normalization? / mishandled outliers? / model too sparse?
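A Response Distribution Chart is just a histogram of the model's raw output scores, inspected for the shapes annotated above. A minimal sketch with synthetic scores; the healthy two-peak shape and the single-bin red-flag check are illustrative choices, not the paper's exact criteria:

```python
# Sketch of a Response Distribution Chart (RDC): histogram the output scores
# of a binary classifier. Defects show up as shapes: a single spike can mean
# underfitting, mass outside [0, 1] means missing normalization, many jagged
# peaks can mean an overly sparse model.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# A healthy chart: smooth, with one peak per class.
scores = np.concatenate([rng.beta(2, 8, size=8_000),   # negatives: low scores
                         rng.beta(8, 2, size=2_000)])  # positives: high scores

counts, _, _ = plt.hist(scores, bins=50, edgecolor="black")
plt.xlabel("model output score")
plt.ylabel("number of predictions")
plt.title("Response Distribution Chart")
plt.savefig("rdc.png")

# Crude, label-free red flag: a degenerate model piles everything in one bin.
if counts.max() / counts.sum() > 0.5:
    print("red flag: over half of all predictions fall into a single bin")
```

The appeal of the RDC is exactly that it needs no labels, so it works even under the delayed-feedback conditions described above.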
Selective triggering

Simply splitting users into control/treatment groups is not enough.
- Example: a model that recommends amusement facilities to travellers with children.
- Travellers without children must be excluded from the comparison; otherwise they only add noise.

Figure 8: Experiment design for selective triggering. (Compare the two groups outlined in blue.)
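A minimal sketch of the design in Figure 8 (all field names and the trigger are hypothetical): everyone is randomized, but only users for whom the family recommendation would actually fire enter the comparison.

```python
# Sketch of selective triggering (Figure 8): randomize all users, but compare
# conversion only within the subpopulation the treatment can actually affect.
import random

def assign_group(user_id: int) -> str:
    """Deterministic per-user split, independent of the trigger."""
    return "treatment" if random.Random(f"exp1:{user_id}").random() < 0.5 else "control"

def triggered_conversion_rates(users: list[dict]) -> dict[str, float]:
    # Users travelling without children would see identical pages in both
    # arms; including them only dilutes the measured effect with noise.
    triggered = [u for u in users if u["is_family_trip"]]
    rates = {}
    for group in ("control", "treatment"):
        subset = [u for u in triggered if assign_group(u["id"]) == group]
        rates[group] = sum(u["converted"] for u in subset) / max(len(subset), 1)
    return rates

# Hypothetical traffic: 25% family trips, ~10% base conversion.
random.seed(42)
users = [{"id": i, "is_family_trip": i % 4 == 0, "converted": random.random() < 0.1}
         for i in range(10_000)]
print(triggered_conversion_rates(users))
```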
Model-output dependent triggering

Simply splitting users into control/treatment groups is not enough.
- Example: evaluating a model Y that suggests destinations other than the queried one, but only to travellers whom a model X classifies as flexible about where to go.

Figure 9: Experiment design for model-output dependent triggering and control for performance impact. (Compare the two groups outlined in blue. Comparing the control group with the treatment group as a whole measures the latency impact of invoking the model; the control group also serves as a safety net in case something goes wrong.)
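Figure 9's routing can be written down directly. A minimal sketch (models X and Y and the result stubs are hypothetical stand-ins): model X runs for the whole treatment arm, so comparing that arm against the outer control isolates X's latency cost, while the inner split among triggered users isolates Y's effect.

```python
# Sketch of model-output dependent triggering (Figure 9). Model X predicts
# whether a traveller is flexible about the destination; model Y proposes
# alternative destinations for flexible travellers.
import random

def model_x_is_flexible(query: dict) -> bool:     # hypothetical stub
    return query.get("dates") is None

def baseline_results(query: dict) -> str:         # hypothetical stub
    return f"standard results for {query['destination']}"

def model_y_alternatives(query: dict) -> str:     # hypothetical stub
    return f"alternatives to {query['destination']}"

def serve(user_id: int, query: dict) -> str:
    in_treatment = random.Random(f"outer:{user_id}").random() < 0.5
    if not in_treatment:
        return baseline_results(query)            # control: X never runs

    flexible = model_x_is_flexible(query)         # treatment: X always runs,
    if not flexible:                              # so its latency is paid here
        return baseline_results(query)            # treated, but not triggered

    inner = random.Random(f"inner:{user_id}").random() < 0.5
    if not inner:
        return baseline_results(query)            # triggered control (safety net)
    return model_y_alternatives(query)            # triggered treatment

# Comparisons this design supports:
#  - outer control vs. whole treatment arm   -> latency impact of invoking X
#  - triggered control vs. triggered treatm. -> causal effect of acting on Y
print(serve(7, {"destination": "Amsterdam", "dates": None}))
```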
Comparing Models

Simply splitting users into control/treatment groups is not enough.
- Example: the current model 1 is 80% accurate, and the new model 2 halves model 1's error rate but introduces 5% new errors. Model 2 then fixes half of the 20% of cases model 1 gets wrong (10%) while newly breaking 5%, so the two models produce different outputs on only ~15% of cases; the remaining ~85% carry no signal about which model is better.

Figure 10: Experiment design for comparing models. (The experiment triggers only on the subjects where the two models disagree.)
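Figure 10's design can be sketched the same way (the model stubs are hypothetical): subjects on whom both models agree are served that shared output and excluded, and only the disagreements are randomized, concentrating the experiment's power where the models actually differ.

```python
# Sketch of comparing two models (Figure 10): trigger the experiment only
# where their outputs disagree, since subjects who would see identical
# output carry no signal about which model is better.
import random

def model_1(x: int) -> str:                      # hypothetical stand-in
    return "A" if x % 5 else "B"

def model_2(x: int) -> str:                      # hypothetical stand-in
    return "A" if x % 7 else "B"

def serve(user_id: int, x: int) -> str:
    out1, out2 = model_1(x), model_2(x)
    if out1 == out2:
        return out1                              # agreement: outside the RCT
    arm2 = random.Random(f"cmp:{user_id}").random() < 0.5
    # Only disagreeing subjects (~15% in the slide's example) are logged
    # and analyzed.
    print(f"user {user_id}: serving {'model 2' if arm2 else 'model 1'}")
    return out2 if arm2 else out1

for uid in range(10):
    serve(uid, x=uid)
```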