Estimating Conversion Rate in Display Advertising from Past Performance Data

jujudubai
June 13, 2014

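This is a well-known paper.
Its estimation and correction methods for highly sparse data offer many useful pointers for practical work, and of course for research as well.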

Transcript

  1. Agenda
     1. Introduction
     2. Issues
        1. Problem Setup & Formulation
        2. Data Hierarchies
     3. Conversion Rate Estimation
        1. Past Performance at Different Hierarchical Levels
        2. Combining Estimators using Logistic Regression
     4. Practical Issues (Propose)
        1. Data Imbalance
        2. Output Calibration
        3. Missing Features
        4. Feature Selection
     5. Results & Discussion
        1. Data Imbalance & Score Calibration
        2. Missing Value Imputation
        3. “Baseline Estimators” vs “Logistic Regression”
     6. Conclusion
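  2. Introduction
     1. This talk is about DSPs, which bid on ad placements.
        The goal is to estimate the optimal “bid price” from the website showing the ad and from the user.
        ⇒ CVR must be predicted accurately.
     2. Challenges
        [1] CVR is extremely small, so there are too few conversions to analyze.
            - CVR = 0.0001 ~ 0.1%
            ⇒ Addressed with hierarchies and “implicit” clustering.
        [2] In RTB, only 5 ~ 10 ms are available before the bid must be placed.
            - Because the time constraint is tight, the estimator must be both cheap to compute and accurate.
            ⇒ Addressed with Logistic Regression.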
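  3. Issues: §2.2 Problem Setup & Formulation
     1. Explanation of the variables
        user: $u_i$, page: $p_j$, ad: $a_k$
     2. Selecting the optimal ad
     3. Predicting CVR with logistic regression
        • Treat a conversion (CV) probabilistically.
     4. Problem setup for selecting the optimal ad
        Assume a Bernoulli distribution:
        $Y_{ijk} \sim \mathrm{Bernoulli}(p_{ijk}), \quad p_{ijk} = p(Y = 1 \mid u_i, p_j, a_k)$
        $ad^{*} = \arg\max_{k = 1, \ldots, n} \; p(Y = 1 \mid u_i, p_j, a_k)$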
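The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `select_ad`, `toy_cvr`, and the `TOY_CVR` table are hypothetical names, and the lookup table stands in for whatever CVR model is actually used.

```python
from typing import Callable, Sequence


def select_ad(user: str, page: str, ads: Sequence[str],
              cvr: Callable[[str, str, str], float]) -> str:
    """Return the ad a_k that maximizes the estimated p(Y=1 | u_i, p_j, a_k)."""
    if not ads:
        raise ValueError("need at least one candidate ad")
    return max(ads, key=lambda ad: cvr(user, page, ad))


# Hypothetical per-ad conversion rates; a real system would query a model.
TOY_CVR = {"ad_a": 0.0002, "ad_b": 0.0010, "ad_c": 0.0005}


def toy_cvr(user: str, page: str, ad: str) -> float:
    # Ignores user and page; only meant to exercise the argmax.
    return TOY_CVR[ad]


best = select_ad("u1", "p1", list(TOY_CVR), toy_cvr)
```

With the toy table, the ad with the largest estimated CVR is chosen; in practice the score would usually be combined with the bid value before ranking.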
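  4. §2.3 Data Hierarchies
     1. Give the data a hierarchical structure and use “implicit” clustering.
        Enumerating the nodes for ad, page, and user, each event belongs to $l_a \times l_p \times l_u$ (= $M$) clusters.
     2. What “implicit” user clustering means
        ⇒ Rather than “explicit” clustering (using factors such as demographic information or income),
           it uses factors such as the categories of the websites the user has visited.
           {User × Publisher Type × Campaign}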
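  5. 3. Assign each event to the appropriate cluster using past count data (i.e., number of imp and cv).
        Approximating an individual's CVR by the CVR of the cluster:
        $p_{ijk} = p(Y = 1 \mid u_i, p_j, a_k) \approx \tilde{p}_{ijk} = p(\tilde{Y} = 1 \mid u \in C_{u_i}, p_j, a_k)$
        $C_{u_i}$: the cluster the user belongs to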
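  6. Conversion Rate Estimation: §3.1 Past Performance at Different Hierarchical Levels
     1. The first challenge:
        Identify the “user” and “group users” (= Cluster) that share the same or a very similar CVR.
     2. The main challenge:
        How to handle the scarce “converted observations”.
        Correcting for missing data.
        ex) for a true conversion rate on the order of 10^-5 (impressions on the order of millions)
     3. Redefining CVR (user, page, and ad are each grouped):
        $\tilde{p}_{ijk} = p(Y = 1 \mid u \in C_{u_i}, p \in C_{p_j}, a \in C_{a_k})$
     4. The estimates computed by maximum likelihood are crossed over the levels of the data hierarchy, defining $M$ estimators of $p_{ijk}$:
        $p_{ijk} = f(\hat{\tilde{p}}^{1}_{ijk}, \ldots, \hat{\tilde{p}}^{M}_{ijk})$
        ⇒ Each per-level CVR estimator is weak on its own, but combining them gives high accuracy.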
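A minimal sketch of the past-performance estimates at different hierarchical levels, assuming each estimate is simply conversions divided by impressions within a (user cluster, page cluster, ad cluster) cell. The hierarchy tables `USER_LEVELS`, `PAGE_LEVELS`, `AD_LEVELS` and the function `level_estimates` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical hierarchies: each entity lists its cluster at every level,
# from the most specific (itself) to the most general.
USER_LEVELS = {"u1": ["u1", "sports_fans"], "u2": ["u2", "sports_fans"]}
PAGE_LEVELS = {"p1": ["p1", "news"], "p2": ["p2", "news"]}
AD_LEVELS = {"a1": ["a1", "campaign_x"]}


def level_estimates(events, user, page, ad):
    """Return the M = l_u * l_p * l_a past-performance CVR estimates.

    events: iterable of (user, page, ad, converted) tuples, one per impression.
    A cell with no impressions yields None, i.e. a missing feature.
    """
    imps = defaultdict(int)
    convs = defaultdict(int)
    for u, p, a, y in events:
        # Credit the impression to every cell that contains it.
        for key in product(USER_LEVELS[u], PAGE_LEVELS[p], AD_LEVELS[a]):
            imps[key] += 1
            convs[key] += int(y)
    estimates = []
    for key in product(USER_LEVELS[user], PAGE_LEVELS[page], AD_LEVELS[ad]):
        estimates.append(convs[key] / imps[key] if imps[key] else None)
    return estimates


events = [
    ("u1", "p1", "a1", 0),
    ("u1", "p1", "a1", 1),
    ("u2", "p2", "a1", 0),
    ("u2", "p1", "a1", 0),
]
features = level_estimates(events, "u1", "p2", "a1")
```

Here user `u1` has never been seen on page `p2`, so the most specific cells are empty, while the coarser cells still provide an estimate. That is the sparsity argument for using several levels at once.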
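  7. 3. The “log-likelihood function” over the PVs
        Formula annotations: the campaign's s-th impression count; the method for computing an accurate CVR for a given combination of user, page, and ad.
     4. Estimating the optimal “β” (L-BFGS-B and similar methods). Reference: http://en.wikipedia.org/wiki/L-BFGS
        ➡ A kind of optimization algorithm
        ➡ In R, optim()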
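To make the estimation step concrete, here is a small dependency-free sketch that maximizes the standard logistic-regression log-likelihood. Two assumptions are worth stating plainly: the likelihood below is the textbook Bernoulli form, not necessarily the exact expression on the slide, and the optimizer is plain batch gradient ascent swapped in for L-BFGS-B so that the example runs with the standard library only. The data and the names `sigmoid`, `log_likelihood`, and `fit` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(beta, xs, ys):
    """Sum over impressions of y*log(p) + (1-y)*log(1-p), p = sigmoid(beta . x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit(xs, ys, lr=0.5, steps=2000):
    """Maximize the log-likelihood by batch gradient ascent (not L-BFGS-B)."""
    beta = [0.0] * len(xs[0])
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            err = y - sigmoid(sum(b * v for b, v in zip(beta, x)))
            for d, v in enumerate(x):
                grad[d] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Each row: [bias term, one past-performance feature]; labels are conversions.
xs = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.8], [1.0, 0.9], [1.0, 0.85], [1.0, 0.15]]
ys = [0, 0, 1, 1, 0, 1]
beta = fit(xs, ys)
```

If SciPy is available, the negated `log_likelihood` could instead be passed to `scipy.optimize.minimize` with `method="L-BFGS-B"`, which plays the same role as `optim()` in R.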
  8. Practical Issues (Propose): §4.1 Data Imbalance
     1. Two kinds of “Data Imbalance”
        1. The average conversion rate of an advertising campaign is inherently very low and …
           - CVR is low on average, so it is hard to collect enough samples.
             ex) 10^-3 to 10^-6
        2. The ratio of the number of no-conversion events to that of conversion events is very large…
           - The (huge) volume of impression data relative to conversions has to be adjusted.
     2. Countermeasures
        - Use all of the converted (CV) data.
        - Use the non-CV data after sampling it.
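The countermeasure above (keep every conversion, sample the non-conversions) might look like the following sketch. The names `downsample_negatives` and `recalibrate` and the `keep_rate` parameter are hypothetical. The recalibration formula is the commonly used odds correction for negative downsampling and is an assumption here, not necessarily the calibration step used in the paper.

```python
import random


def downsample_negatives(events, keep_rate, seed=0):
    """Keep every converted event and a random keep_rate fraction of the rest.

    events: iterable of (features, converted) pairs.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    rng = random.Random(seed)
    return [(x, y) for x, y in events if y == 1 or rng.random() < keep_rate]


def recalibrate(p_sampled, keep_rate):
    """Map a probability learned on downsampled data back to the original scale.

    Dropping negatives inflates the odds by 1 / keep_rate; this undoes it.
    """
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


# Hypothetical log: 10 conversions among 10,010 impressions.
events = [((i,), 1) for i in range(10)] + [((i,), 0) for i in range(10_000)]
sampled = downsample_negatives(events, keep_rate=0.1)
n_pos = sum(y for _, y in sampled)
n_neg = len(sampled) - n_pos
```

A model trained on `sampled` sees a far higher conversion rate than the raw log has, which is why its scores need to be mapped back before they are used for bidding.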
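  9. §4.3 Missing Features
     1. Correcting for missing data
        - MAR pattern (missing at random)
          → The probability that a variable is missing depends only on the observed data,
            not on the true value of the missing variable.
          ex) A userID not found on the user profile server; a Web page that cannot be assigned to a category
        - Imputation with a conditional Gaussian distribution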
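As an illustration of conditional Gaussian imputation, the sketch below fills in one missing feature with its conditional mean given one observed feature, assuming the two are jointly Gaussian. This is the textbook two-variable case, chosen for brevity; the slide does not give the paper's exact setup, and `conditional_gaussian_impute` is a hypothetical name.

```python
from statistics import fmean


def conditional_gaussian_impute(pairs, x_obs):
    """Impute a missing feature as E[x_miss | x_obs] under a bivariate Gaussian.

    pairs: complete (x_obs, x_miss) observations used to estimate the moments.
    """
    obs = [a for a, _ in pairs]
    mis = [b for _, b in pairs]
    mu_o, mu_m = fmean(obs), fmean(mis)
    var_o = fmean([(a - mu_o) ** 2 for a in obs])
    cov = fmean([(a - mu_o) * (b - mu_m) for a, b in pairs])
    if var_o == 0.0:
        return mu_m  # the observed feature carries no information
    # Conditional mean: mu_m + cov / var_o * (x_obs - mu_o)
    return mu_m + cov / var_o * (x_obs - mu_o)


# Hypothetical complete rows in which the second feature is twice the first.
complete_rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
imputed = conditional_gaussian_impute(complete_rows, x_obs=4.0)
```

With more than one observed feature, the same idea uses the covariance matrix in place of the scalar `cov / var_o` ratio.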
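  10. Results & Discussion: §5.1 Data Imbalance & Score Calibration
     1. Experimental setup
        - Five ad slots (each with 1 million PVs per day)
        - Two weeks of logs from 2012/1
        - Week 1 is used for training, week 2 for testing
        - Predict whether a user converts when shown a given page and ad
     2. Effect of the CV/non-CV ratio in the training data
        * No difference due to IR (imbalance ratio)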
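  11. §5.3 “Baseline Estimators” vs “Logistic Regression”
     1. Effect of the proposed method
        1. Baseline 1: an estimate based on clustering users by demographic attributes (age, gender, region of residence, etc.) together with the ad group.
           $p_{ijk} \approx \hat{\tilde{p}}^{1} = p_{MLE}(Y = 1 \mid u \in C^{G}_{u_i}, a_k \in Campaign_{a_k})$
           $C^{G}_{u_i}$: the user group (cluster) that the user shares with similar users (identified from past browsing patterns, demographics, etc.)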
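  12. §6 Conclusion
     1. Conclusions
        1. A fast CVR estimation method for RTB is proposed.
           → Estimating the parameters with LR is preferable.
        2. Hierarchical structuring improves accuracy.
           → A hierarchy such as {user, page, ad} is preferable.
     2. Open issues and questions
        • The clustering method for conversion rates is not described in detail.
          The concrete procedure for “implicit” clustering is not given.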