
20211023_recsys2021_paper_reading_YoheiKikuta

yoppe
October 21, 2021


Presentation slides for https://connpass.com/event/226873/.


Transcript

  1. Yohei Kikuta (@yohei_kikuta), 2021/10/23, RecSys 2021 Paper Reading Session
     Mitigating Confounding Bias in Recommendation via Information Bottleneck
     https://dl.acm.org/doi/10.1145/3460231.3474263
  2. I'm Yohei Kikuta of Ubie, Inc.
     • Links: GitHub: https://github.com/yoheikikuta • Twitter: https://twitter.com/yohei_kikuta • Blog: https://yoheikikuta.github.io/
     • WE ARE HIRING!!! Ubie Engineer JD: https://recruit.ubie.life/engineer
     • For a casual chat, feel free to reach out via Twitter DM or Meety: https://meety.net/matches/BCiPwWPqsmJR
  3. Summary
     • We want to mitigate confounding bias in a setting where only biased feedback is available
     • The paper splits the embedding representation into biased/unbiased components, trains with an extension of the information bottleneck idea, and shows that an unbiased component useful for prediction can be obtained
     • Formulated in terms of a causal diagram
     • Reduced to a tractable form
     • Demonstrated to be effective on both open and production data
     Paper-reading memo: https://github.com/yoheikikuta/paper-reading/issues/61
     [Slide background: Figure 1 of the paper, causal diagrams of (a) biased feedback and (b) unbiased feedback over the variables x, instrumental variables I, confounder variables C, adjustment variables A, treatment T, and outcome y; in (b) the indirect effect is truncated.]
  4. Motivation of the paper
     Randomized controlled trials (RCTs) can be costly, so we would like to avoid them
     • If the only historical data is biased, new data would have to be collected
     • Collecting it anew risks putting users at a disadvantage
     The paper wants to reduce bias when only biased feedback is available; few existing works focus on the bias-generating process, so it proposes a method that does
     Prior work separates components of different nature using the information bottleneck idea, and the paper extends and applies that line of work
  5. The causal diagram under consideration
     [Figure 1 of the paper: causal diagrams of (a) biased feedback and (b) unbiased feedback, where x and y are known variables, blue arrows indicate the indirect effect and red arrows the direct effect; in (b) the indirect effect is truncated.]
     Recommendation strategy (the treatment T):
     • which items to select
     • which users to show the items to, and how
     Input features x = (u, i); outcome y = click or not
     Indirect effect on y: {I, C} → T → y. Direct effect on y: {C, A} → y.
  6. Definition of confounding bias
     From the paper (Definition 3.1): given the variables x, the outcome y, the indirect effect {I, C} → T → y, and the direct effect {C, A} → y, confounding bias refers to the confusion of the observed feedback in recommender systems caused by the indirect effect. It is essentially a collection of biases at the system level, such as position bias and popularity bias.
     This indirect effect on y is the confounding bias; what we want is training that mitigates its influence.
  7. Splitting the embedding representation into biased/unbiased parts
     [Figure 2 of the paper: (a) in the training phase the model obtains a special biased embedding vector from the variables x, consisting of a biased component r and an unbiased component z; (b) in the test phase r is discarded and only the debiased embedding z is used.]
     Assumption 4.1 of the paper: the confounding bias is reflected in the embedding learned by a traditional recommendation model and usually corrupts all dimensions, i.e., the original representation z* is biased and indistinguishable. The biased component r is responsible for the indirect effect and the unbiased component z for the direct effect; the split vector [r, z] makes the influence of the confounding bias easier to distinguish than z*.
     The model is trained to distinguish these components; at test time only the unbiased part is used.
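     As a concrete picture of this train/test asymmetry, here is a minimal MF-style sketch; the class name, dimensions, and dot-product scoring are my own illustrative choices, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class SplitEmbeddingMF(nn.Module):
    """Minimal MF-style sketch of the split embedding in Figure 2.
    Each (user, item) embedding is split into an unbiased part z and a
    biased part r; training sees both, the test phase uses only z."""

    def __init__(self, n_users: int, n_items: int, dim_z: int = 64, dim_r: int = 16):
        super().__init__()
        self.user = nn.Embedding(n_users, dim_z + dim_r)
        self.item = nn.Embedding(n_items, dim_z + dim_r)
        self.dim_z = dim_z

    def forward(self, u: torch.Tensor, i: torch.Tensor):
        eu, ei = self.user(u), self.item(i)
        # Elementwise products give the interaction features of each component.
        z = eu[..., : self.dim_z] * ei[..., : self.dim_z]   # unbiased component
        r = eu[..., self.dim_z :] * ei[..., self.dim_z :]   # biased component
        return z, r

    @torch.no_grad()
    def score(self, u: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
        # Test phase (Figure 2(b)): the biased component r is discarded.
        z, _ = self.forward(u, i)
        return z.sum(dim=-1)
```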
  8. Review of the Information Bottleneck
     Rate distortion theory: lossy compression X → X̃ of a random variable; with a distortion function d, minimize
     $I(X;\tilde{X}) + \beta\,\langle d(x,\tilde{x})\rangle_{p(x,\tilde{x})}$
     Information Bottleneck: choosing d is hard. Assuming a variable Y with Y → X is available (this corresponds to a supervision label), minimize instead
     $I(X;\tilde{X}) - \beta\, I(\tilde{X};Y)$
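     The slide leaves implicit why swapping in Y recovers a rate-distortion problem; the standard connection (due to Tishby et al., not stated on the slide) chooses the KL divergence between label posteriors as the distortion:

$$d(x,\tilde{x}) = D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|\tilde{x})\big) \;\Longrightarrow\; \langle d(x,\tilde{x})\rangle_{p(x,\tilde{x})} = I(X;Y) - I(\tilde{X};Y),$$

     so, up to the constant $I(X;Y)$, the rate-distortion objective becomes exactly $I(X;\tilde{X}) - \beta I(\tilde{X};Y)$.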
  9. The objective function for splitting the embedding representation
     Eq. (3) of the paper (the Greek weights, stripped by extraction, are restored here to match the tractable form on slide 14):
     $\mathcal{L}_{DIB} := \min\;\; \underbrace{\beta\, I(z;x)}_{①} \;-\; \underbrace{I(z;y)}_{②} \;+\; \underbrace{\gamma\, I(z;r)}_{③} \;-\; \underbrace{\alpha\, I(r;y)}_{④}$
     • ① compression term: the mutual information between the variables x and the unbiased embedding z (z should compress the information as much as possible)
     • ② accuracy term: z should retain the information needed to predict y
     • ③ de-confounder penalty: the degree of dependency between the biased embedding r and the unbiased embedding z (z and r should become independent)
     • ④ analogous to ②: r should also predict y to some extent (via the indirect effect)
     Since terms ① and ② form a standard information bottleneck, the paper calls the method a debiased information bottleneck (DIB).
     Although these are called biased/unbiased components, in practice they are distinguished only by these properties.
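     As a sanity check (my own remark, though the paper makes the same point when naming the method): setting γ = α = 0 collapses Eq. (3) to the standard information bottleneck of the previous slide, with z in the role of X̃ and y in the role of Y:

$$\mathcal{L}_{DIB}\big|_{\gamma=\alpha=0} = \beta\, I(z;x) - I(z;y).$$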
  10. Transformation into a tractable form: term ①
     Using a variational approximation and properties of the KL divergence, term ① becomes an l2 regularization of the embedding vector:
     $I(z;x) \le \|z\|_2^2$ (up to constants; in the paper's notation $\|\mu(x)\|_2^2$, with a deterministic encoder $z = \mu(x)$)
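     A sketch of how the bound arises, under standard variational-IB assumptions (a Gaussian encoder $p(z|x)=\mathcal{N}(\mu(x),\sigma^2 I)$ with fixed variance and a Gaussian variational marginal $q(z)=\mathcal{N}(0,\sigma^2 I)$; these distributional choices are my reading of the garbled handwritten notes, not verbatim from the paper):

$$I(z;x) = \mathbb{E}_x\big[D_{\mathrm{KL}}(p(z|x)\,\|\,p(z))\big] \;\le\; \mathbb{E}_x\big[D_{\mathrm{KL}}(p(z|x)\,\|\,q(z))\big] = \mathbb{E}_x\!\left[\tfrac{1}{2\sigma^2}\|\mu(x)\|_2^2\right].$$

     The inequality holds for any choice of $q(z)$ because the gap is exactly $D_{\mathrm{KL}}(p(z)\,\|\,q(z)) \ge 0$.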
  11. Transformation into a tractable form: term ②
     A simple manipulation:
     $-I(z;y) = -H(y) + H(y|z) \le H(y|z)$
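     One implementation detail worth making explicit: $H(y|z)$ is in turn upper-bounded by the cross-entropy of any classifier head $q_\theta(y|z)$ (Gibbs' inequality), which is what makes the term trainable; the code sketch after slide 14 relies on this standard bound:

$$H(y|z) = \mathbb{E}\big[-\log p(y|z)\big] \;\le\; \mathbb{E}\big[-\log q_\theta(y|z)\big].$$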
  12. Transformation into a tractable form: term ③
     A simple manipulation:
     $I(z;r) = H(y) - H(y|z) - H(y|r) + H(y|z,r) \;\ge\; -H(y|z) - H(y|r) + H(y|z,r)$
     On its own this only bounds the term from below, but combined with the $-H(y)$ coming from term ②, the $H(y)$ contributions collect into $-(1-\gamma)H(y)$, so for $\gamma \le 1$ the objective as a whole is still bounded from above.
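     For reference, the chain behind this identity, reconstructed from the paper's Eq. (4)–(7); the key step is that z is determined by x alone, so $H(z|y,r) = H(z|y)$ and the conditional mutual information $I(z;r|y)$ vanishes:

$$I(z;r) = I(z;y) - I(z;y|r) + I(z;r|y), \qquad I(z;r|y) = H(z|y) - H(z|y,r) = 0,$$
$$I(z;y|r) = H(y|r) - H(y|z,r) \;\Longrightarrow\; I(z;r) = I(z;y) - H(y|r) + H(y|z,r).$$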
  13. Transformation into a tractable form: term ④
     A simple manipulation, just as for term ②:
     $-I(r;y) = -H(y) + H(y|r) \le H(y|r)$
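     Bookkeeping of the $H(y|r)$ contributions (my own remark) previews the coefficient in the final loss: term ③ contributes $-\gamma H(y|r)$ exactly, while the bound on term ④ contributes $+\alpha H(y|r)$, so together

$$-\gamma H(y|r) + \alpha H(y|r) = -(\gamma-\alpha)\,H(y|r),$$

     and the paper's constraint $0 < \alpha < \gamma < 1$ keeps this coefficient negative, i.e., r is still allowed to carry some information about y.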
  14. The tractable form
     Combining the bounds on the four terms (the paper's Eq. (13)–(14)):
     $\mathcal{L}_{DIB} \;\le\; \beta\|\mu(x)\|_2^2 + (1-\gamma)H(y|z) - (\gamma-\alpha)H(y|r) + \gamma H(y|z,r)$
     $\hat{\mathcal{L}}_{DIB} = \underbrace{(1-\gamma)H(y|z)}_{(a)} \;-\; \underbrace{(\gamma-\alpha)H(y|r)}_{(b)} \;+\; \underbrace{\gamma H(y|z,r)}_{(c)} \;+\; \underbrace{\beta\|\mu(x)\|_2^2}_{(d)}, \qquad 0 < \alpha < \gamma < 1$
     • (a) given z, the information remaining in y should be small
     • (b) given r, the information remaining in y may stay large
     • (c) given z and r, the information remaining in y should be small
     • (d) z should be compressed
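     Putting the four pieces together, a minimal PyTorch-style sketch of this loss; it assumes binary click labels, uses cross-entropy as the standard variational surrogate for each conditional entropy (the bound noted after slide 11), and all names and default weights are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dib_loss(logit_z, logit_r, logit_zr, z, y, alpha=0.05, gamma=0.15, beta=0.01):
    """Transcription of Eq. (14):
    (1-gamma)*H(y|z) - (gamma-alpha)*H(y|r) + gamma*H(y|z,r) + beta*||mu(x)||^2.
    logit_z / logit_r / logit_zr are outputs of three prediction heads fed
    with z, r, and [z, r] respectively; z plays the role of mu(x)."""
    assert 0.0 < alpha < gamma < 1.0
    h_y_z  = F.binary_cross_entropy_with_logits(logit_z,  y)  # surrogate for H(y|z)
    h_y_r  = F.binary_cross_entropy_with_logits(logit_r,  y)  # surrogate for H(y|r)
    h_y_zr = F.binary_cross_entropy_with_logits(logit_zr, y)  # surrogate for H(y|z,r)
    reg = z.pow(2).sum(dim=-1).mean()                          # term (d): l2 compression of z
    return ((1.0 - gamma) * h_y_z
            - (gamma - alpha) * h_y_r
            + gamma * h_y_zr
            + beta * reg)
```

     At test time only the z-based head is used for recommendation, matching Figure 2(b).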
  15. Experimental setup
     • Experiments on open data (Yahoo! R3) and real product data (no detailed description of the latter)
     • Training and AUC-based parameter tuning are done on the biased data
     • The debiasing effect is verified by testing on the random set included in the data
     • MF/NCF serve as backbone models; prior debiasing methods such as IPS serve as baselines

     Table 1: Statistics of the datasets (P/N = ratio of positive to negative feedback).
                    Yahoo! R3            Product
                    #Feedback   P/N      #Feedback   P/N
     training       254,713     67.02%   4,160,414   2.21%
     validation     56,991      67.00%   897,449     2.15%
     test           54,000      9.64%    225,762     1.03%

     Table 2: Hyper-parameters tuned in the experiments: rank (embedding dimension) in {5, 10, ..., 195, 200}; regularization weight in {1e-5, 1e-4, ..., 1e-1, 1}; batch size in {2^7, 2^8, ..., 2^13, 2^14}; learning rate in {1e-4, 5e-4, ..., 5e-2, 1e-1}; loss weightings in [0.1, 0.2] and [0.001, 0.1].
  16. Experimental results are favorable
     • With MF as the backbone, clearly the strongest
     • With NCF the baselines are already strong, but the results are still generally good
     • P@K and R@K are less impressive when K is small

     Table 3: Unbiased evaluation with MF as the backbone (AUC is the main metric; per dataset the columns are AUC, nDCG, P@5, P@10, R@5, R@10).
                 Yahoo! R3                                    | Product
     MF          0.7081 0.0341 0.0043 0.0043 0.0123 0.0273   | 0.6936 0.0324 0.0085 0.0079 0.0407 0.0752
     IPS-MF      0.7040 0.0259 0.0031 0.0033 0.0091 0.0182   | 0.7125 0.0408 0.0095 0.0105 0.0456 0.1019
     SNIPS-MF    0.7124 0.0390 0.0057 0.0048 0.0182 0.0293   | 0.7098 0.0403 0.0092 0.0104 0.0438 0.1003
     CVIB-MF     0.7086 0.0488 0.0080 0.0075 0.0253 0.0471   | 0.7143 0.0521 0.0106 0.0109 0.0520 0.1076
     AT-MF       0.7290 0.0676 0.0113 0.0101 0.0345 0.0629   | 0.7191 0.0443 0.0110 0.0113 0.0532 0.1092
     Rel-MF      0.7469 0.0843 0.0159 0.0131 0.0526 0.0863   | 0.6709 0.0656 0.0157 0.0142 0.0776 0.1386
     DIB-MF      0.7602 0.0932 0.0177 0.0151 0.0566 0.0960   | 0.7365 0.0819 0.0198 0.0173 0.0965 0.1688

     Table 4: Unbiased evaluation with NCF as the backbone (same columns).
                 Yahoo! R3                                    | Product
     NCF         0.7251 0.0350 0.0037 0.0036 0.0097 0.0203   | 0.7211 0.1465 0.0174 0.0136 0.0865 0.1343
     IPS-NCF     0.7221 0.0322 0.0032 0.0039 0.0085 0.0219   | 0.7284 0.1463 0.0176 0.0135 0.0870 0.1330
     SNIPS-NCF   0.7230 0.0310 0.0030 0.0031 0.0087 0.0175   | 0.7257 0.1454 0.0173 0.0135 0.0856 0.1323
     CVIB-NCF    0.7265 0.0347 0.0036 0.0030 0.0105 0.0177   | 0.7291 0.1440 0.0149 0.0137 0.0732 0.1350
     AT-NCF      0.7139 0.0333 0.0031 0.0033 0.0084 0.0179   | 0.6814 0.1464 0.0156 0.0139 0.0774 0.1375
     Rel-NCF     0.6867 0.0507 0.0071 0.0064 0.0233 0.0408   | 0.6653 0.1404 0.0156 0.0145 0.0763 0.1423
     DIB-NCF     0.7553 0.0686 0.0108 0.0101 0.0339 0.0630   | 0.7345 0.1483 0.0175 0.0163 0.0865 0.1613
  17. An ablation study verifies that each term contributes
     In a few cases removing term (c) gives better results; the authors attribute this to possible noise from tuning parameters on AUC alone (not entirely convincing).

     Table 5: Ablation studies with MF and NCF backbones (same columns as Tables 3 and 4; AUC is the main metric).
                    Yahoo! R3                                    | Product
     DIB-MF         0.7602 0.0932 0.0177 0.0151 0.0566 0.0960   | 0.7365 0.0819 0.0198 0.0173 0.0965 0.1688
     w/o term (b)   0.7505 0.0893 0.0175 0.0138 0.0563 0.0909   | 0.7173 0.0546 0.0126 0.0115 0.0614 0.1112
     w/o term (c)   0.7545 0.0915 0.0173 0.0142 0.0569 0.0937   | 0.7156 0.0511 0.0109 0.0113 0.0527 0.1083
     w/o term (d)   0.7342 0.0769 0.0144 0.0117 0.0478 0.0737   | 0.6809 0.0719 0.0178 0.0153 0.0867 0.1504
     DIB-NCF        0.7553 0.0686 0.0108 0.0101 0.0339 0.0630   | 0.7345 0.1483 0.0175 0.0163 0.0865 0.1613
     w/o term (b)   0.7326 0.0553 0.0089 0.0081 0.0271 0.0489   | 0.7274 0.1474 0.0181 0.0131 0.0901 0.1294
     w/o term (c)   0.7373 0.0592 0.0097 0.0093 0.0292 0.0585   | 0.7276 0.1474 0.0181 0.0131 0.0901 0.1294
     w/o term (d)   0.7243 0.0597 0.0102 0.0099 0.0318 0.0603   | 0.7133 0.1438 0.0177 0.0123 0.0876 0.1214

     From the paper's discussion of the loss curves: the biased component r alone gives very poor results, but combined with the unbiased component z the best result is achieved; r by itself does not reflect user preferences well.
  18. Summary
     • We want to mitigate confounding bias in a setting where only biased feedback is available
     • The paper splits the embedding representation into biased/unbiased components, trains with an extension of the information bottleneck idea, and shows that an unbiased component useful for prediction can be obtained
     • Formulated in terms of a causal diagram
     • Reduced to a tractable form
     • Demonstrated to be effective on both open and production data
     Paper-reading memo: https://github.com/yoheikikuta/paper-reading/issues/61
     [Slide background: the same Figure 1 of the paper as on slide 3.]