Slide 1

Deep learning solutions to estimation and detection
Ami Wiesel
The Hebrew University of Jerusalem (HUJI)
October 17, 2022

Slide 2

Thanks
▶ Tzvi Diskin
▶ Yiftach Beer
▶ Yoav Wald
▶ Uri Ukon
▶ Yonina Eldar
▶ Google

Slide 3

2002 vs 2022
2002:
▶ Model
▶ Parameter estimation
▶ Hypothesis testing
▶ Algorithms
2022:
▶ (synthetic) Data
▶ Regression
▶ Classification
▶ Neural networks

Slide 4

Learning without bias
Detection with constant false alarm rate

Slide 5

Outline
Learning without bias
Detection with constant false alarm rate

Slide 6

Parameter estimation: 2002 vs 2022
2002:
▶ Model
▶ Maximum Likelihood
▶ Inference was slow
▶ Asymptotically unbiased
▶ Cramer-Rao Bound for all
2022:
▶ (synthetic) Data
▶ Regression
▶ Inference is fast
▶ Fitted on training set
▶ Best if train = test

Slide 7

Estimation metrics
▶ Classical metrics:
\mathrm{BIAS}_{\hat y}(y) = \mathbb{E}\left[\hat y(x) \mid y\right] - y
\mathrm{VAR}_{\hat y}(y) = \mathbb{E}\left[\left\|\hat y(x) - \mathbb{E}[\hat y(x) \mid y]\right\|^2 \mid y\right]
\mathrm{MSE}_{\hat y}(y) = \mathbb{E}\left[\left\|\hat y(x) - y\right\|^2 \mid y\right] = \mathrm{VAR}_{\hat y}(y) + \left\|\mathrm{BIAS}_{\hat y}(y)\right\|^2
▶ Bayesian metric:
\mathrm{BMSE}_{\hat y} = \mathbb{E}\left[\mathrm{MSE}_{\hat y}(y)\right]
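
As a sanity check on the decomposition above, here is a minimal NumPy sketch (not from the talk) that estimates bias, variance and MSE by Monte Carlo for a toy problem: y is a scalar mean and \hat y(x) is the sample mean of K noisy observations; the model and all sizes are illustrative assumptions.

```python
# Monte Carlo check of the bias/variance/MSE decomposition for a toy problem
# (illustrative only: y is a scalar mean, x is K noisy samples, the estimator is the sample mean).
import numpy as np

rng = np.random.default_rng(0)
y, K, trials = 2.0, 10, 100_000           # true parameter, samples per trial, Monte Carlo runs

x = y + rng.standard_normal((trials, K))  # x | y ~ N(y, 1)
y_hat = x.mean(axis=1)                    # estimator y_hat(x)

bias = y_hat.mean() - y                   # BIAS(y) = E[y_hat(x) | y] - y
var = y_hat.var()                         # VAR(y)
mse = np.mean((y_hat - y) ** 2)           # MSE(y)

print(f"bias={bias:.4f}  var={var:.4f}  mse={mse:.4f}  var+bias^2={var + bias**2:.4f}")
```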

Slide 8

Parameter estimation
▶ Classical approaches:
\min_{\hat y(\cdot)} \mathrm{MSE}_{\hat y}(y) (crossed out on the slide: the MSE cannot be minimized uniformly over all y)
▶ Minimum Variance Unbiased Estimation (MVUE): minimize the variance subject to \mathrm{BIAS}_{\hat y}(y) = 0 \ \forall y
▶ Maximum Likelihood is asymptotically MVUE
▶ Bayesian approach: minimize the BMSE
\min_{\hat y(\cdot)} \mathbb{E}\left[\mathrm{MSE}_{\hat y}(y)\right]
Learning is Bayesian with respect to the training set.

Slide 9

Bias Constrained Estimation (BCE)
▶ Standard learning:
\mathcal{D}_N = \{y_i, x_i\}_{i=1}^N
\min_{\hat y \in \mathcal{H}} \hat{\mathbb{E}}_N\left[\left\|\hat y(x) - y\right\|^2\right]
▶ BCE: penalize the average squared bias:
\mathcal{D}_{NM} = \left\{y_i, \{x_{ij}\}_{j=1}^M\right\}_{i=1}^N
\min_{\hat y \in \mathcal{H}} \hat{\mathbb{E}}_{NM}\left[\left\|\hat y(x) - y\right\|^2\right] + \lambda\, \hat{\mathbb{E}}_N\left[\left\|\hat{\mathbb{E}}_M\left[\hat y(x) - y \mid y\right]\right\|^2\right]
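
A minimal PyTorch sketch of this objective, assuming the BCE dataset is arranged as a tensor x of shape (N, M, d) with M replicated measurements per parameter y_i of shape (N, p); this is my reading of the slide's loss, not the authors' code.

```python
import torch

def bce_loss(model, x, y, lam):
    # x: (N, M, d) replicated measurements, y: (N, p) parameters, model: maps (batch, d) -> (batch, p)
    N, M, d = x.shape
    y_hat = model(x.reshape(N * M, d)).reshape(N, M, -1)   # y_hat(x_ij), shape (N, M, p)
    err = y_hat - y.unsqueeze(1)                           # y_hat(x_ij) - y_i
    mse_term = (err ** 2).sum(dim=-1).mean()               # E_NM || y_hat(x) - y ||^2
    bias_term = (err.mean(dim=1) ** 2).sum(dim=-1).mean()  # E_N || E_M[ y_hat(x) - y | y ] ||^2
    return mse_term + lam * bias_term

# usage sketch: loss = bce_loss(torch.nn.Linear(d, p), x, y, lam=1.0)
```

Setting λ = 0 recovers the standard empirical-MSE objective.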

Slide 10

Collecting a BCE dataset \mathcal{D}_{NM}
Synthetic data or data augmentation:
▶ Fictitious prior p_{\mathrm{fake}}(y)
▶ Generate \{y_i\}_{i=1}^N
▶ For each y_i generate \{x_j(y_i)\}_{j=1}^M
References: Khobahi, Gabrielli, Naimipour, Dreifuerst, ...
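
A hedged sketch of generating such a dataset with synthetic data; the fictitious uniform prior, the linear measurement model H, the noise level and all dimensions below are placeholder choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, d, p = 1000, 8, 16, 4                       # N parameters, M measurements each
H = rng.standard_normal((d, p))                   # assumed linear measurement model x = Hy + n

y = rng.uniform(-1.0, 1.0, size=(N, p))           # y_i ~ p_fake(y), a fictitious prior
x = (y @ H.T)[:, None, :] + 0.5 * rng.standard_normal((N, M, d))   # x_ij given y_i, j = 1..M
# D_NM: y.shape == (N, p), x.shape == (N, M, d)
```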

Slide 11

Minimum Variance Unbiased Estimator (MVUE)
Theorem. Under technical conditions, BCE is asymptotically MVUE.
▶ Maximum Likelihood is also asymptotically MVUE.
▶ BCE approximates it using deep learning.
▶ Asymptotically in everything!
▶ Note that we penalize the average bias (rather than the max).
▶ Asymptotically, achieves the Cramer-Rao bound for any value of y.

Slide 12

BCE with linear architecture
Theorem.
\hat y = A x
A = \hat{\mathbb{E}}_{NM}\left[y x^T\right]\left(\frac{1}{\lambda+1}\, \hat{\mathbb{E}}_{NM}\left[x x^T\right] + \left(1 - \frac{1}{\lambda+1}\right) R\right)^{-1}
R = \hat{\mathbb{E}}_N\left[\hat{\mathbb{E}}_M\left[x \mid y\right] \hat{\mathbb{E}}_M\left[x^T \mid y\right]\right]
Compare to the Bayesian linear MMSE (linear regression):
A = \mathbb{E}_{NM}\left[y x^T\right]\left(\mathbb{E}_{NM}\left[x x^T\right]\right)^{-1}
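
A NumPy transcription of the closed-form linear BCE matrix (my reading of the formula above; x is the replicated dataset of shape (N, M, d) and y has shape (N, p)).

```python
import numpy as np

def linear_bce(x, y, lam):
    # x: (N, M, d) replicated measurements, y: (N, p) parameters; returns A of shape (p, d)
    N, M, d = x.shape
    xf = x.reshape(N * M, d)
    yf = np.repeat(y, M, axis=0)
    Eyx = yf.T @ xf / (N * M)                      # E_NM[ y x^T ]
    Exx = xf.T @ xf / (N * M)                      # E_NM[ x x^T ]
    xbar = x.mean(axis=1)                          # E_M[ x | y_i ], shape (N, d)
    R = xbar.T @ xbar / N                          # E_N[ E_M[x|y] E_M[x^T|y] ]
    w = 1.0 / (lam + 1.0)
    return Eyx @ np.linalg.inv(w * Exx + (1.0 - w) * R)
```

With λ = 0 the weights collapse to the empirical second moment and A reduces to the linear MMSE (regression) solution above.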

Slide 13

BCE with linear architecture and linear model
Theorem.
\hat y = A x, \qquad x = H y + n
A = \left(H^T \hat\Sigma_x^{-1} H + \frac{1}{\lambda+1} \hat\Sigma_y^{-1}\right)^{-1} H^T \hat\Sigma_x^{-1}
Compare to the Weighted Least Squares estimator (= MVUE):
A = \left(H^T \hat\Sigma_x^{-1} H\right)^{-1} H^T \hat\Sigma_x^{-1}
Gauss-Markov theorem.
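
A small NumPy sketch of the two matrices; H and the covariances below are illustrative stand-ins. As λ grows, the prior term vanishes and the BCE solution coincides with WLS, which the final check confirms numerically.

```python
import numpy as np

def bce_linear_model(H, Sx, Sy, lam):
    # BCE matrix from the theorem above
    Sx_inv, Sy_inv = np.linalg.inv(Sx), np.linalg.inv(Sy)
    return np.linalg.inv(H.T @ Sx_inv @ H + Sy_inv / (lam + 1.0)) @ H.T @ Sx_inv

def wls(H, Sx):
    # Weighted Least Squares (= MVUE for the linear Gaussian model)
    Sx_inv = np.linalg.inv(Sx)
    return np.linalg.inv(H.T @ Sx_inv @ H) @ H.T @ Sx_inv

rng = np.random.default_rng(2)
H = rng.standard_normal((8, 3))
Sx, Sy = np.eye(8), np.eye(3)
print(np.allclose(bce_linear_model(H, Sx, Sy, lam=1e9), wls(H, Sx)))  # large lambda recovers WLS
```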

Slide 14

Experiment: SNR estimation
My MSc with Messer in 2002.
x_i = a_i h + n_i, \qquad a_i = \pm 1 \ \text{w.p.}\ \tfrac{1}{2}, \qquad n_i \sim \mathcal{N}(0, \sigma^2), \qquad \rho = \frac{h^2}{\sigma^2}
MMSE is best on the training distribution. BCE is always near the MLE (EM).

Slide 15

Experiment: covariance estimation
Structured covariance [Chaudhuri]: x \sim \mathcal{N}(0, \Sigma(y)) with
\Sigma(y) = \begin{pmatrix}
1+y_1 & 0 & 0 & \tfrac12 y_6 & 0 \\
0 & 1+y_2 & 0 & \tfrac12 y_7 & 0 \\
0 & 0 & 1+y_3 & 0 & \tfrac12 y_8 \\
\tfrac12 y_6 & \tfrac12 y_7 & 0 & 1+y_4 & \tfrac12 y_9 \\
0 & 0 & \tfrac12 y_8 & \tfrac12 y_9 & 1+y_5
\end{pmatrix}
MMSE is best on the training distribution. BCE is always near the MVUE.
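
The structured covariance written out in NumPy, so the parameterization by the nine entries of y is explicit (0-based indexing); the sampling line at the end is a usage example with an arbitrary admissible y.

```python
import numpy as np

def sigma(y):
    # y has 9 entries; diagonal is 1 + y_1..y_5, off-diagonal couplings are y_6..y_9 / 2
    S = np.diag(1.0 + y[:5])
    S[0, 3] = S[3, 0] = 0.5 * y[5]
    S[1, 3] = S[3, 1] = 0.5 * y[6]
    S[2, 4] = S[4, 2] = 0.5 * y[7]
    S[3, 4] = S[4, 3] = 0.5 * y[8]
    return S

rng = np.random.default_rng(3)
y = rng.uniform(0.0, 1.0, size=9)                              # arbitrary parameter value
x = rng.multivariate_normal(np.zeros(5), sigma(y), size=100)   # x ~ N(0, Sigma(y))
```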

Slide 16

BCE for averaging in test time
▶ Example: sensor networks.
▶ Example: test-time augmentation.
Averaging: unbiasedness is necessary for consistent averaging, since averaging many estimates reduces the variance but leaves the bias untouched. BCE is asymptotically unbiased.

Slide 17

Experiment: Augmentation in test time
▶ CIFAR10
▶ Random cropping and flipping
▶ Soft labels via distillation
▶ Both in train and in test
▶ BCE outperforms MMSE
[Krizhevsky, Simonyan, Han, ...]

Slide 18

Fairness literature
▶ Related to “fairness” and “out-of-distribution generalization”.
▶ Invariant Risk Minimization (IRM) by Arjovsky
▶ Calibration and OOD by Wald, and more...
▶ We protect the labels themselves rather than the environments.

Slide 19

Outline
Learning without bias
Detection with constant false alarm rate

Slide 20

Hypothesis testing: 2002 vs 2022
2002:
▶ Model
▶ Likelihood Ratio Test
▶ Neyman-Pearson
▶ Constant false alarm rate
2022:
▶ (synthetic) Data
▶ Classification
▶ Minimum prob of error
▶ Works well if train = test

Slide 21

Simple Hypothesis Testing
Goal: x \sim p(x; y), y \in \{0, 1\}. Design a detector T(x) \gtrless \gamma that maximizes
P_{\mathrm{TPR}} = \Pr\left(T(x) > \gamma;\ y = 1\right)
subject to a false alarm constraint on
P_{\mathrm{FPR}} = \Pr\left(T(x) > \gamma;\ y = 0\right).
LRT = classifier is optimal and easy to learn:
T_{\mathrm{LRT}}(x) = 2 \log \frac{p(x;\, y = 1)}{p(x;\, y = 0)}
Easy to learn as a Bayes optimal classifier. Can also optimize AUC-ROC, e.g., Herschtal, Brefeld, etc.
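
A tiny worked example of the LRT statistic for a simple hypothesis pair I picked for illustration: two known Gaussians that differ only in their mean.

```python
import numpy as np
from scipy.stats import norm

def t_lrt(x, mu0=0.0, mu1=1.0, sigma=1.0):
    # T_LRT(x) = 2 log p(x; y=1) / p(x; y=0) for two known Gaussian densities
    return 2.0 * (norm.logpdf(x, loc=mu1, scale=sigma) - norm.logpdf(x, loc=mu0, scale=sigma))

x = np.linspace(-2.0, 3.0, 6)
print(t_lrt(x))   # monotone in x here, so thresholding T_LRT is the same as thresholding x
```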

Slide 22

Composite Hypothesis Testing
x \sim p(x; z)
y = 0:\ z \in \mathcal{Z}_0 (noise only)
y = 1:\ z \in \mathcal{Z}_1 (target)
(Ill-posed) Goal: design a detector T(x) \gtrless \gamma that maximizes
P_{\mathrm{TPR}}(z) = \Pr\left(T(x) > \gamma;\ z\right), \quad z \in \mathcal{Z}_1
subject to a constant false alarm rate (CFAR) constraint on
P_{\mathrm{FPR}}(z) = \Pr\left(T(x) > \gamma;\ z\right) \quad \text{for all } z \in \mathcal{Z}_0.

Slide 23

Generalized Likelihood Ratio Test (GLRT)
▶ GLRT is the standard approach:
T_{\mathrm{GLRT}}(x) = 2 \log \frac{\max_{z \in \mathcal{Z}_1} p(x; z)}{\max_{z \in \mathcal{Z}_0} p(x; z)}
▶ Pros: under regular asymptotic conditions,
T_{\mathrm{GLRT}}(x) \overset{\mathrm{asymp}}{\sim} \begin{cases} \chi^2_r(0) & y = 0 \\ \chi^2_r(\lambda) & y = 1 \end{cases}
and it has a constant false alarm rate (CFAR).
▶ Cons: requires a likelihood model, inner maximizations, and only asymptotic guarantees.
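
Because the asymptotic null distribution χ²_r does not depend on the nuisance parameters, the threshold can be set once for a desired false-alarm rate. A short SciPy sketch (r and the target false-alarm probability are arbitrary illustrative values):

```python
from scipy.stats import chi2

r, p_fa = 3, 1e-3
gamma = chi2.ppf(1.0 - p_fa, df=r)   # P(chi^2_r > gamma) = p_fa, independent of the nuisance z
print(gamma)
```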

Slide 24

Learning to detect targets
Learning detectors:
▶ Choose p_{\mathrm{fake}}(y) and p_{\mathrm{fake}}(z; y).
▶ For each i = 1, \dots, N: generate y_i, generate z_i given y_i, generate x_i given z_i.
▶ Solve
\min_{\hat T \in \mathcal{T}} \frac{1}{N} \sum_{i=1}^N L\left(\hat T(x_i), y_i\right).
References: Ziemann, Kucer and Theiler, Girard, De La Mata-Moya, and many more...
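
A hedged end-to-end sketch of this recipe in PyTorch: sample (y, z, x) from fictitious distributions and fit a small classifier by cross-entropy. The network, the sizes, and the toy measurement model (a unit mean shift under the target, unknown noise scale z) are my own placeholders, not the talk's experiments.

```python
import torch
from torch import nn

N, d = 4096, 16
y = torch.randint(0, 2, (N,))                              # y_i ~ p_fake(y)
z = 0.5 + torch.rand(N)                                    # nuisance z_i ~ p_fake(z; y): noise scale
x = z[:, None] * torch.randn(N, d) + y[:, None].float()    # x_i given z_i, y_i

T_hat = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T_hat.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                           # L(T_hat(x_i), y_i)

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(T_hat(x).squeeze(-1), y.float())
    loss.backward()
    opt.step()
```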

Slide 25

Learning to detect targets is easy
▶ Also in composite hypotheses (unlike estimation).
▶ Target detection in Gaussian noise with unknown variance:
x_i = A + \sigma n_i, \qquad i = 1, \dots, N
▶ A \neq 0 and \sigma are deterministic and unknown.

Slide 26

But... learned classifiers are not CFAR!

Slide 27

Learning CFAR detectors
CFAR-NET:
▶ Choose p_{\mathrm{fake}}(y) and p_{\mathrm{fake}}(z; y).
▶ For each i = 1, \dots, N: generate y_i, generate z_i given y_i, generate x_i given z_i.
▶ Solve
\min_{\hat T \in \mathcal{T}} \frac{1}{N} \sum_{i=1}^N L\left(\hat T(x_i), y_i\right) + \alpha \hat R(\hat T)
\hat R(\hat T) = \sum_{i, \tilde i\ \text{under}\ y = 0} d\left(\{\hat T(x_{ij})\}_{j=1}^M;\ \{\hat T(\tilde x_{\tilde i j})\}_{j=1}^M\right)
Ensures that \hat T has the same distribution under all z_i.
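
A sketch of the extra penalty term, assuming the noise-only detector outputs are grouped by their nuisance value z_i into a (G, M) tensor. The distance d here is a crude stand-in (squared difference of group means) used only to show the structure; the MMD on the next slide is the distance actually used in the talk.

```python
import torch

def cfar_penalty(t_outputs):
    # t_outputs: (G, M) detector outputs T_hat(x_ij) for G noise-only groups (one per z_i), M samples each
    G = t_outputs.shape[0]
    pen = t_outputs.new_zeros(())
    for i in range(G):
        for j in range(i + 1, G):
            pen = pen + (t_outputs[i].mean() - t_outputs[j].mean()) ** 2   # stand-in for d(.;.)
    return pen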

Slide 28

Learning CFAR detectors II
CFAR penalty:
\hat R(\hat T) = \sum_{i, \tilde i\ \text{under}\ y = 0} d\left(\{\hat T(x_{ij})\}_{j=1}^M;\ \{\hat T(\tilde x_{\tilde i j})\}_{j=1}^M\right)
▶ Differentiable distance between distributions.
▶ We use MMD by Gretton et al.:
d_{\mathrm{MMD}} = \frac{1}{N^2} \sum_{i,j} k(X_i, X_j) + \frac{1}{N^2} \sum_{i,j} k(Y_i, Y_j) - \frac{2}{N^2} \sum_{i,j} k(X_i, Y_j)
▶ Can also use a GAN-like loss.
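
A differentiable Gaussian-kernel MMD in PyTorch matching the estimator above; the bandwidth is an arbitrary choice, and the inputs are one-dimensional detector outputs (the 1D case mentioned later in the talk).

```python
import torch

def mmd(x, y, bandwidth=1.0):
    # x, y: 1-D tensors of detector outputs from two noise-only groups
    def k(a, b):
        return torch.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```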

Slide 29

Detection in i.i.d. noise with unknown variance
▶ Gaussian noise: (results figure)
▶ non-Gaussian noise: (results figure)

Slide 30

Detection in correlated noise
▶ Gaussian noise covariance estimated using secondary data.
▶ Adaptive Matched Filter (AMF):
x = A s + w_0, \qquad x_i = w_i, \quad i = 1, \dots, n, \qquad w_0, w_i \sim \mathcal{N}(0, \Sigma)
T_{\mathrm{AMF}}(x) = \frac{\left|s^T \hat\Sigma^{-1} x\right|^2}{s^T \hat\Sigma^{-1} s}, \qquad \hat\Sigma = \frac{1}{n} \sum_{i=1}^n w_i w_i^T
▶ Diagonally loaded (LAMF) for regularization: \hat\Sigma + \lambda I.
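
A NumPy sketch of the AMF statistic and its diagonally loaded variant (LAMF), using the secondary data to form the sample covariance as on the slide; shapes are noted in comments and the loading value is left to the caller.

```python
import numpy as np

def t_amf(x, s, secondary, loading=0.0):
    # x: primary snapshot (d,), s: known steering vector (d,), secondary: (n, d) noise-only data
    n, d = secondary.shape
    S_hat = secondary.T @ secondary / n + loading * np.eye(d)   # sample covariance (+ lambda*I for LAMF)
    S_inv = np.linalg.inv(S_hat)
    return np.abs(s @ S_inv @ x) ** 2 / (s @ S_inv @ s)
```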

Slide 31

CFAR-NET in correlated noise
▶ LAMF, NET and CFARnet are better than AMF.
▶ Unlike CFARnet, the LAMF and NET are highly non-CFAR.

Slide 32

Real Hyperspectral data
▶ Pavia University dataset.
▶ 10 labeled materials.
▶ Partial AUC in (0, 0.05).

material    net    CFARnet
unlabeled   0.49   0.47
1           0.31   0.38
2           0.74   0.77
3           0.33   0.35
4           0.69   0.73
5           0.27   0.34
6           0.49   0.53
7           0.47   0.72
8           0.41   0.49
9           0.88   0.90

Slide 33

How is this related to the classics?
Roughly speaking:
▶ Simple tests: LRT = Bayes optimal classifier.
▶ Composite tests: GLRT = Bayes + CFAR.
\text{BayesCFAR}: \quad \min_{\hat T, \gamma} \Pr\left(\mathbb{1}_{\hat T(x) \ge \gamma} \neq y\right) \quad \text{s.t. } \hat T \text{ is CFAR}
Exact equivalence requires assumptions...

Slide 34

GLRT solves Bayes CFAR
\text{BayesCFAR}: \quad \min_{\hat T, \gamma} \Pr\left(\mathbb{1}_{\hat T(x) \ge \gamma} \neq y\right) \quad \text{s.t. } \hat T \text{ is CFAR}
Theorem. Consider an asymptotic linear Gaussian model with a large enough \sigma_r^2. Then there exists a threshold \gamma such that the GLRT solves BayesCFAR.
▶ Linear model x = H z_r + n.
▶ The noise covariance is parameterized arbitrarily by z_n.
▶ CFAR-NET approximates it using deep learning.

Slide 35

Fairness literature
▶ CFAR-NET is very similar to penalties from the fairness literature.
▶ The setting is slightly different.
▶ CFAR-NET is non-symmetric.
▶ CFAR-NET is cheaper in our settings (1D MMD).

Slide 36

Conclusions
▶ Everyone is switching to deep learning.
▶ But don't forget the classics.
▶ To make a regressor closer to MLE/MVUE, add a bias penalty.
▶ To make a classifier closer to GLRT, add a CFAR penalty.
▶ Thank you!