because the degree of dependence between them is determined by the strength of the bias in the feedback data, and blindly optimizing it may lead to bad results. Inspired by information theory, we can then derive the required objective function to be minimized from the above analysis,
\[
\mathcal{L}_{\mathrm{DIB}} := \min\; \underbrace{\beta I(z;x)}_{(1)} \;-\; \underbrace{I(z;y)}_{(2)} \;+\; \underbrace{\alpha I(z;r)}_{(3)} \;-\; \underbrace{\gamma I(r;y)}_{(4)},
\tag{3}
\]
where term (1) is a compression term that describes the mutual information between the variables x and the unbiased embedding z; term (2) is an accuracy term that describes the performance of the unbiased embedding z; term (3) is a de-confounder penalty term, which describes the dependency degree between the biased embedding r and the unbiased embedding z; and term (4) is similar to term (2), which is used for the potential gain from the biased embedding r. Note that α, β and γ are the weight parameters. Since terms (1) and (2) in Eq.(3) are similar to a standard information bottleneck, we call the proposed method a debiased information bottleneck, or DIB for short. By optimizing L_DIB, we expect to get the desired biased and unbiased components, and can then prune the confounding bias.
5 A TRACTABLE OPTIMIZATION
Based on the chain rule of mutual information, we have the following equation about the de-confounder penalty term I(z;r),
\[
I(z;r) = I(z;y) - I(z;y\mid r) + I(z;r\mid y).
\tag{4}
\]
We further inspect the term I(z;r|y) in Eq.(4). Since the generation of z depends solely on the variables x, we have H(z|y,r) = H(z|y), where H(·|·) denotes the conditional entropy [15, 26]. Combining the properties of mutual information and entropy, we have,
\[
I(z;r\mid y) = H(z\mid y) - H(z\mid y,r) = H(z\mid y) - H(z\mid y) = 0.
\tag{5}
\]
By substituting Eq.(5) into Eq.(4), we have,
\[
I(z;r) = I(z;y) - I(z;y\mid r).
\tag{6}
\]
Since the term I(z;y|r) in Eq.(6) is still difficult to compute, we use the form of conditional entropy to further simplify it,
\[
I(z;r) = I(z;y) - H(y\mid r) + H(y\mid z,r).
\tag{7}
\]
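As a sanity check on this derivation (not part of the paper), the identity in Eq.(7) can be verified numerically on a small discrete example in which z and r are conditionally independent given y, which is the assumption behind Eq.(5). All distributions below are illustrative stand-ins:

```python
import itertools
import math

# Tiny discrete sanity check of Eq.(7): when z and r are conditionally
# independent given y, we should have I(z;r) = I(z;y) - H(y|r) + H(y|z,r).
p_y = {0: 0.4, 1: 0.6}
p_z_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p(z|y)
p_r_given_y = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # p(r|y)

# Joint p(y,z,r) = p(y) p(z|y) p(r|y), so z ⊥ r | y by construction.
joint = {(y, z, r): p_y[y] * p_z_given_y[y][z] * p_r_given_y[y][r]
         for y, z, r in itertools.product([0, 1], repeat=3)}

def marginal(keep):
    """Marginal distribution over the variables named in `keep` (subset of 'yzr')."""
    out = {}
    for (y, z, r), p in joint.items():
        key = tuple(v for n, v in zip("yzr", (y, z, r)) if n in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def H(names):
    """Joint entropy of the named variables, in nats."""
    return -sum(p * math.log(p) for p in marginal(names).values() if p > 0)

def I(a, b):
    """Mutual information I(a;b) = H(a) + H(b) - H(a,b)."""
    return H(a) + H(b) - H(a + b)

H_y_given_r = H("yr") - H("r")        # H(y|r)
H_y_given_zr = H("yzr") - H("zr")     # H(y|z,r)
lhs = I("z", "r")
rhs = I("z", "y") - H_y_given_r + H_y_given_zr
print(lhs, rhs)  # the two sides agree up to floating-point error
```

The check relies only on the definitions of entropy and mutual information, so any choice of p(y), p(z|y), p(r|y) would do.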
RecSys '21, September 27-October 1, 2021, Amsterdam, Netherlands
By substituting Eq.(7) into the objective L_DIB in Eq.(3), we have,
\[
\mathcal{L}_{\mathrm{DIB}} = \beta I(z;x) - (1-\alpha) I(z;y) - \alpha\left[H(y\mid r) - H(y\mid z,r)\right] - \gamma I(r;y).
\tag{8}
\]
Since the compression term I(z;x) is related to an intractable marginal distribution, we describe a variational approximation using the relationship between mutual information and the Kullback-Leibler (KL) divergence, as follows,
\[
I(z;x) = \mathbb{E}_{p(x)}\left[\mathrm{KL}\left(p(z\mid x)\,\|\,p(z)\right)\right] \le \mathbb{E}_{p(x)}\left[\mathrm{KL}\left(p(z\mid x)\,\|\,q(z)\right)\right],
\tag{11}
\]
where q(z) is a tractable variational approximation of the marginal p(z). Similarly, since I(z;y) = H(y) − H(y|z) and H(y) is a constant, minimizing −I(z;y) is equivalent to minimizing the conditional entropy H(y|z),
\[
-I(z;y) = H(y\mid z) - H(y) \le H(y\mid z).
\tag{12}
\]
This inequality also applies to the mutual information I(r;y) in Eq.(8). Combining Eq.(11) and Eq.(12), we can rewrite Eq.(8) as follows,
\[
\begin{aligned}
\mathcal{L}_{\mathrm{DIB}} &= \beta I(z;x) - (1-\alpha) I(z;y) - \alpha H(y\mid r) + \alpha H(y\mid z,r) - \gamma I(r;y) \\
&\le \beta \lVert \mu(x) \rVert_2^2 + (1-\alpha) H(y\mid z) + (\gamma-\alpha) H(y\mid r) + \alpha H(y\mid z,r).
\end{aligned}
\tag{13}
\]
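The ‖µ(x)‖²₂ penalty arises from the KL bound on the compression term under a Gaussian parameterization; that parameterization is an assumption of this sketch, not stated explicitly in this excerpt. For an encoder p(z|x) = N(µ, I) and a standard Gaussian q(z) = N(0, I), the KL divergence equals ½‖µ‖²₂, which a quick Monte Carlo estimate confirms:

```python
import numpy as np

# Assumed Gaussian parameterization: p(z|x) = N(mu, I), q(z) = N(0, I),
# so KL(p || q) = 0.5 * ||mu||^2, the origin of the ||mu(x)||^2 penalty.
rng = np.random.default_rng(1)
mu = rng.normal(size=8)

# Monte Carlo estimate of KL = E_p[log p(z) - log q(z)]; the shared
# normalizing constants of the two Gaussians cancel in the ratio.
z = mu + rng.normal(size=(200_000, 8))
log_p = -0.5 * np.sum((z - mu) ** 2, axis=1)
log_q = -0.5 * np.sum(z ** 2, axis=1)
kl_mc = np.mean(log_p - log_q)

kl_closed = 0.5 * np.sum(mu ** 2)  # closed form: 0.5 * ||mu||^2
print(kl_mc, kl_closed)  # the estimate matches the closed form closely
```

With 200,000 samples the Monte Carlo error is well below 1% here, so the agreement is easy to see directly.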
Finally, we get a tractable solution for L_DIB,
\[
\hat{\mathcal{L}}_{\mathrm{DIB}} = \underbrace{(1-\alpha) H(y\mid z)}_{(a)} + \underbrace{(\gamma-\alpha) H(y\mid r)}_{(b)} + \underbrace{\alpha H(y\mid z,r)}_{(c)} + \underbrace{\beta \lVert \mu(x) \rVert_2^2}_{(d)},
\tag{14}
\]
where 0 < α < γ < 1. Let ŷ_r, ŷ_z, and ŷ_{z,c} be the predicted labels generated by the biased component r, the unbiased component z, and the biased embedding vector [z,r] as shown in Figure 2(a),
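In practice, each conditional entropy in Eq.(14) is typically estimated with a cross-entropy loss over the observed labels. The following is a minimal numpy sketch of assembling the objective on one mini-batch; the shapes, the stand-in logits, and names such as `mu_x` and `logits_zr` are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

# Minimal sketch of the tractable DIB objective, Eq.(14), on one mini-batch.
rng = np.random.default_rng(0)

def bce(logits, labels):
    """Binary cross-entropy: a tractable estimate of a conditional entropy H(y|.)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(p + 1e-12) + (1 - labels) * np.log(1 - p + 1e-12))

alpha, gamma, beta = 0.1, 0.3, 0.01   # weights; must satisfy 0 < alpha < gamma < 1
y = rng.integers(0, 2, size=32).astype(float)   # observed feedback labels
logits_z = rng.normal(size=32)    # stand-in prediction from the unbiased embedding z
logits_r = rng.normal(size=32)    # stand-in prediction from the biased embedding r
logits_zr = rng.normal(size=32)   # stand-in prediction from the concatenation [z, r]
mu_x = rng.normal(size=(32, 16))  # stand-in encoder mean vectors mu(x)

loss = ((1 - alpha) * bce(logits_z, y)             # term (a): H(y|z)
        + (gamma - alpha) * bce(logits_r, y)       # term (b): H(y|r)
        + alpha * bce(logits_zr, y)                # term (c): H(y|z,r)
        + beta * np.mean(np.sum(mu_x ** 2, axis=1)))  # term (d): ||mu(x)||^2
print(loss)
```

Because 0 < α < γ < 1, every coefficient in the sum is positive, so each term can be minimized as an ordinary supervised loss.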