

Paper introduction: Controlling the False Discovery Rate via Knockoffs

Tsukasa

July 08, 2021

1. Tsukasa Fukunaga (Lecturer, Waseda University Institute for Advanced Study)
Controlling the False Discovery Rate via Knockoffs
Rina Foygel Barber (Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.) and Emmanuel Candès (Department of Statistics, Stanford University, Stanford, CA 94305, U.S.A.), April 2014.
Abstract: In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR)—the expected fraction of false discoveries among all discoveries—is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, and the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap—their construction does not require any new data—and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high. (Annals of Statistics 2015)
Citations: 418.
2. Problem setting
• Assume the observed response follows the linear regression model
  y = Xβ + z,
  where y ∈ R^n is a vector of responses, X ∈ R^{n×p} is a known design matrix, β ∈ R^p is an unknown coefficient vector, and z ~ N(0, σ²I) is Gaussian noise (with n ≥ p, as otherwise the model is not identifiable).
• Moreover, in the true model most entries of β are assumed to be 0, and only a handful of them contribute to y (sparsity).
• We want to know which entries of β are nonzero. When a set of features Ŝ ⊂ {1, ..., p} is selected, the paper proposes a method for choosing Ŝ so that the false discovery rate (FDR) is controlled.
(Paper excerpt:) Informally, the FDR is the expected proportion of falsely selected variables, a false discovery being a selected variable not appearing in the true model. Formally, the FDR of a selection procedure returning a subset Ŝ ⊂ {1, ..., p} of variables is defined as
  FDR = E[ #{j : β_j = 0 and j ∈ Ŝ} / (#{j : j ∈ Ŝ} ∨ 1) ]  (1.2)
(the denominator sets the fraction to zero when no features are selected, i.e. Ŝ = ∅; here a ∨ b = max{a, b}). We will say that a selection rule controls the FDR at level q if its FDR is guaranteed to be at most q no matter the value of the coefficients β. This definition asks to control the type I error averaged over the selected variables, and is both meaningful and operational: imagine a procedure that has just made 100 discoveries; if it is known to control the FDR at the 10% level, then roughly speaking we can expect at most 10 of these discoveries to be false.
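A minimal numpy sketch of this setup; the dimensions, sparsity level, and signal amplitude are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 300, 100, 10          # samples, features, number of true signals
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)  # normalize so that ||X_j||^2 = 1 for every column

beta = np.zeros(p)
beta[:k] = 3.5                  # sparse truth: only k of the p coefficients are nonzero
y = X @ beta + rng.standard_normal(n)  # y = X beta + z,  z ~ N(0, I)
```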
3. Supplementary notes
• Methods for obtaining such sparse solutions include Lasso regression and orthogonal matching pursuit (OMP).
• In the Lasso, making λ larger yields a sparser solution, and making it smaller yields a less sparse one (λ = 0 reduces to ordinary linear regression). In Lasso terms, this work is about choosing λ so that the FDR is controlled.
• λ is generally chosen by cross-validation accuracy (I believe). But since the reason for using the Lasso here is feature selection rather than prediction accuracy, choosing λ by an accuracy criterion is, now that it is pointed out, inconsistent.
(Paper excerpt:) In OMP, at step k we choose the feature j_k = argmax_j |X_jᵀ(y − Xb^{(k−1)})|, add it to the active set S_k, and refit the coefficients via least squares, b^{(k)} = argmin_{b ∈ R^p} ||y − Xb||² subject to b_j = 0 for all j ∉ S_k; after a number of steps K, or some stopping criterion for choosing K adaptively, we take the estimate to be β̂ = β̂^{OMP}(K) := b^{(K)}. Alternately, we might choose a convex optimization method such as the Lasso,
  β̂^{Lasso}(λ) = argmin_{b ∈ R^p} (1/2)||y − Xb||²₂ + λ||b||₁.
The question is this: can we estimate the number of false discoveries in the coefficient vector produced by either method, i.e. in β̂ = β̂^{OMP} or β̂ = β̂^{Lasso}? That is to say, can we guess the number of coordinates with β̂_j ≠ 0 but β_j = 0? The knockoff filter provides answers to questions of this kind by constructing, for each variable, a knockoff variable, which essentially acts as a control.
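A quick look at this λ-versus-sparsity behavior using scikit-learn's lasso_path (note that sklearn's penalty α corresponds to λ/n relative to the paper's parameterization):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 300, 100
X = rng.standard_normal((n, p)); X /= np.linalg.norm(X, axis=0)
beta = np.zeros(p); beta[:10] = 3.5
y = X @ beta + rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, y)   # alphas is a decreasing grid of penalties
for i in range(0, len(alphas), 25):   # larger penalty -> sparser coefficient vector
    print(f"alpha={alphas[i]:.4f}  nonzero coefficients={np.count_nonzero(coefs[:, i])}")
```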
4. Procedure, step 1: constructing the knockoffs
• First, the procedure is presented without derivation. For a dataset X, normalize each feature and compute the Gram matrix Σ = XᵀX (so that Σ_jj = ||X_j||²₂ = 1 for all j).
• Then, as knockoff features resembling X, define an X̃ satisfying the following conditions (s is a nonnegative p-dimensional vector):
  X̃ᵀX̃ = Σ,  XᵀX̃ = Σ − diag{s}.  (1.3)
• In other words, we arrange that
  X_jᵀX̃_k = X_jᵀX_k for all j ≠ k,  while  X_jᵀX̃_j = Σ_jj − s_j = 1 − s_j.
(Paper excerpt:) In words, X̃ exhibits the same covariance structure as the original design X, and in addition the correlations between distinct original and knockoff variables are the same as those between the originals (because Σ and Σ − diag{s} are equal on off-diagonal entries). However, X_jᵀX_j = X̃_jᵀX̃_j = 1 while X_jᵀX̃_j = 1 − s_j. To ensure that the method has good statistical power to detect signals, we should choose the entries of s as large as possible, so that a variable X_j is not too similar to its knockoff X̃_j. A strategy for constructing X̃ is to choose s ∈ R^p_+ satisfying diag{s} ⪯ 2Σ, and construct the n × p matrix of knockoff features as
  X̃ = X(I − Σ⁻¹ diag{s}) + ŨC,  (1.4)
where Ũ ∈ R^{n×p} is an orthonormal matrix orthogonal to the span of the features X (such a matrix exists because n ≥ 2p), and CᵀC = 2 diag{s} − diag{s}Σ⁻¹diag{s} is a Cholesky decomposition (whose existence is guaranteed by the condition diag{s} ⪯ 2Σ, via a standard Schur complement calculation). One concrete choice is the equi-correlated knockoffs: s_j = 2λ_min(Σ) ∧ 1 for all j, so that ⟨X_j, X̃_j⟩ = 1 − 2λ_min(Σ) ∧ 1.
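A minimal numpy sketch of the equi-correlated construction (1.4). It assumes n ≥ 2p and full-rank, unit-norm columns, and adds a tiny jitter before the Cholesky factorization, since the equi-correlated choice of s makes CᵀC exactly singular; the function name is ours.

```python
import numpy as np

def equicorrelated_knockoffs(X, seed=0):
    """Fixed-X equi-correlated knockoffs via eq. (1.4); assumes n >= 2p and
    columns normalized to unit norm."""
    n, p = X.shape
    Sigma = X.T @ X
    Sigma_inv = np.linalg.inv(Sigma)
    s = np.full(p, min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0))  # s_j = 2*lambda_min(Sigma) ^ 1
    # C'C = 2 diag{s} - diag{s} Sigma^{-1} diag{s}; jitter since this is singular at the bound
    A = 2 * np.diag(s) - np.diag(s) @ Sigma_inv @ np.diag(s)
    C = np.linalg.cholesky(A + 1e-10 * np.eye(p)).T
    # U_tilde: n x p orthonormal columns orthogonal to the span of X (exists since n >= 2p)
    Q, _ = np.linalg.qr(np.hstack([X, np.random.default_rng(seed).standard_normal((n, p))]))
    U_tilde = Q[:, p:2 * p]
    return X @ (np.eye(p) - Sigma_inv @ np.diag(s)) + U_tilde @ C
```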
5. Procedure, step 2: computing the statistics W_j
• For [X X̃], define statistics W_j satisfying the sufficiency and antisymmetry properties, and compute them.
• For example, with Lasso regression
  β̂(λ) = argmin_b (1/2)||y − Xb||²₂ + λ||b||₁,  (1.5)
  we run the regression with [X X̃] as the data in place of X.
• Then, setting
  Z_j = sup{λ : β̂_j(λ) ≠ 0}  (1.6)
  (the more a feature contributes to y, the larger the value of Z_j), one example is to compute
  W_j = (Z_j ∨ Z̃_j) · (+1 if Z_j > Z̃_j, −1 if Z_j < Z̃_j).  (1.7)
(Paper excerpt:) These W_j's are constructed so that large positive values are evidence against the null hypothesis β_j = 0. Taking Z_j to be the point on the Lasso path at which feature X_j first enters the model, we hope that Z_j is large for most of the signals and small for most of the nulls. However, to be able to quantify this and choose an appropriate threshold for variable selection, we need to use the knockoff variables to calibrate the threshold: we compute the statistics (1.6) on the augmented n × 2p design matrix [X X̃] (the columnwise concatenation of X and X̃), so that [X X̃] replaces X in (1.5). This yields a 2p-dimensional vector (Z_1, ..., Z_p, Z̃_1, ..., Z̃_p); we can set W_j to zero in case of equality Z_j = Z̃_j. A large positive value of W_j indicates that variable X_j enters the Lasso model early (at some large value of λ) and that it does so before its knockoff copy X̃_j, an indication that this variable is a genuine signal and belongs in the model. One may also consider other alternatives for constructing the W_j's; for instance, instead of recording the variable's entry into the Lasso path, one can use forward selection methods and record the order in which the variables enter.
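A sketch of this signed-max statistic using scikit-learn's lasso_path on the augmented design. The path grid only approximates the sup in (1.6), sklearn's α is λ/n relative to (1.5), and the helper name is ours.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def lasso_signed_max_stats(X, X_knock, y):
    """W_j = (Z_j v Z~_j) * sign(Z_j - Z~_j), with Z the largest penalty at which
    each feature enters the Lasso path on the augmented design [X X~]."""
    p = X.shape[1]
    XX = np.hstack([X, X_knock])                  # n x 2p augmented design
    alphas, coefs, _ = lasso_path(XX, y, n_alphas=200)
    # entry value Z_j: largest alpha at which coefficient j is nonzero (0 if never)
    nonzero = coefs != 0                          # shape (2p, n_alphas), alphas decreasing
    Z = np.where(nonzero.any(axis=1), alphas[np.argmax(nonzero, axis=1)], 0.0)
    Zx, Zk = Z[:p], Z[p:]
    return np.maximum(Zx, Zk) * np.sign(Zx - Zk)  # W_j = 0 when Z_j = Z~_j
```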
6. Procedure, step 3: choosing the threshold
• Choose the threshold T as follows (q is the target FDR):
  T = min{ t ∈ W : #{j : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1) ≤ q }  (or T = +∞ if this set is empty),  (1.8)
  where W = {|W_j| : j = 1, ..., p} \ {0} is the set of unique nonzero values attained by the |W_j|'s.
• If we then select the features Ŝ = {j : W_j ≥ T}, the procedure satisfies
  E[ #{j : β_j = 0 and j ∈ Ŝ} / (#{j : j ∈ Ŝ} + q⁻¹) ] ≤ q  (Theorem 1).
  This is called the knockoff method. (The bounded quantity is an approximation of the FDR.)
• Moreover, if we instead set
  T = min{ t ∈ W : (1 + #{j : W_j ≤ −t}) / (#{j : W_j ≥ t} ∨ 1) ≤ q },  (1.9)
  then
  FDR = E[ #{j : β_j = 0 and j ∈ Ŝ} / (#{j : j ∈ Ŝ} ∨ 1) ] ≤ q  (Theorem 2).
  This is called the knockoff+ method. (This is the FDR itself.)
(Paper excerpt:) The fraction appearing in (1.8) is an estimate of the proportion of false discoveries if we were to select all features with W_j ≥ t; for this reason it is often called the "knockoff estimate of FDP". In both theorems the expectation is taken over the Gaussian noise z in the model (1.1), while treating X and X̃ as fixed. The "modified FDR" bounded by Theorem 1 is very close to the FDR in settings where a large number of features are selected (as adding q⁻¹ in the denominator then has little effect), but it sometimes may be preferable to control the FDR exactly, for which the slightly more conservative knockoff+ is proposed. The threshold chosen by knockoff+ is always higher than (or equal to) the one chosen in (1.8) by the knockoff filter, meaning that knockoff+ is (slightly) more conservative.
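A direct transcription of thresholds (1.8) and (1.9), assuming W is a numpy array; `plus=True` gives knockoff+, and the function name is ours.

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    """Data-dependent threshold (1.8)/(1.9): the smallest t among the unique
    nonzero |W_j| at which the knockoff estimate of FDP drops to at most q."""
    ts = np.sort(np.unique(np.abs(W[W != 0])))
    offset = 1.0 if plus else 0.0              # the "+1" gives knockoff+ (exact FDR control)
    for t in ts:
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf                              # T = +infinity: select nothing

# usage: T = knockoff_threshold(W, q=0.1); selected = np.where(W >= T)[0]
```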
7. What is actually going on?
• The method exploits the fact that, for appropriately constructed X̃ and W, the following exchangeability holds.
• If β_j = 0 in the true model, then the probability that W_j is at least t and the probability that it is at most −t are equal. That is, whether a feature enters the Lasso model before or after its corresponding knockoff feature is random, a 50/50 coin flip.
• Conversely, the X̃ and W given above without derivation are constrained exactly so that this exchangeability is satisfied.
(Paper excerpt:) The way in which W is constructed implies that the signs of the W_j's are i.i.d. random for the "null hypotheses", that is, for those j's such that β_j = 0. Therefore, for any threshold t,
  #{j : β_j = 0 and W_j ≥ t} =_d #{j : β_j = 0 and W_j ≤ −t},
where =_d means equality in distribution. (In the paper's Figure 1, which plots the pairs (Z_j, Z̃_j) with black dots for nulls and red squares for true signals, the null points are distributed approximately symmetrically across the diagonal, while the true signals lie primarily below it.) Hence, we can estimate the false discovery proportion (FDP) at the threshold t as
  #{j : β_j = 0, W_j ≥ t} / (#{j : W_j ≥ t} ∨ 1) ≈ #{j : β_j = 0, W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1) ≤ #{j : W_j ≤ −t} / (#{j : W_j ≥ t} ∨ 1) =: FDP̂(t),
where FDP̂(t) is the knockoff estimate of FDP. The knockoff procedure can be interpreted as finding T = min{t ∈ W : FDP̂(t) ≤ q}, the most liberal threshold with the property that the estimated FDP is under control. In fact, the inequality above will usually be tight, because most strong signals will be selected before their knockoff copies, so the estimate of FDP will probably be fairly tight unless the signal strength is weak. (The additional "+1" appearing in the knockoff+ method, yielding a slightly more conservative procedure, is necessary both theoretically and empirically to control FDR in scenarios where extremely few discoveries are made.)
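Tying the sketches together: a quick end-to-end run that reuses equicorrelated_knockoffs, lasso_signed_max_stats, and knockoff_threshold from the earlier blocks (so it is not self-contained), to eyeball the 50/50 sign claim for nulls and the realized FDP. Signal positions and amplitudes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 300, 100, 30
X = rng.standard_normal((n, p)); X /= np.linalg.norm(X, axis=0)
beta = np.zeros(p); beta[:k] = 3.5
y = X @ beta + rng.standard_normal(n)

X_knock = equicorrelated_knockoffs(X)
W = lasso_signed_max_stats(X, X_knock, y)
null_W = W[k:]                                    # W_j for the null features
nz = null_W[null_W != 0]
print("fraction of positive signs among nonzero null W_j:", np.mean(nz > 0))  # ~0.5

T = knockoff_threshold(W, q=0.1)                  # knockoff+ threshold
selected = np.where(W >= T)[0]
fdp = np.mean(beta[selected] == 0) if selected.size else 0.0
print(f"selected {selected.size} features, realized FDP = {fdp:.2f}")
```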
8. Conditions on W: sufficiency and antisymmetry
• W is defined to satisfy the sufficiency property if it can be computed from the Gram matrix of [X X̃] and from inner products with y alone. That is, we can write
  W = f([X X̃]ᵀ[X X̃], [X X̃]ᵀy)
  for some f : S⁺_{2p} × R^{2p} → R^p, where S⁺_{2p} is the cone of 2p × 2p positive semidefinite matrices. (It is called the sufficiency property because, under Gaussian noise, Xᵀy is a sufficient statistic for β.)
• When a feature and its corresponding knockoff feature are swapped, if the sign of the corresponding entry of W flips, then W is said to have the antisymmetry property. That is, for any S ⊆ {1, ..., p},
  W_j([X X̃]_{swap(S)}, y) = W_j([X X̃], y) · (+1 if j ∉ S, −1 if j ∈ S).
(Paper excerpt:) Here, [X X̃]_{swap(S)} means that the columns X_j and X̃_j have been swapped in the matrix [X X̃] for each j ∈ S. Formally, if V ∈ R^{n×2p} has columns V_j, then for each j = 1, ..., p,
  (V_{swap(S)})_j = V_j if j ∉ S, V_{j+p} if j ∈ S;  (V_{swap(S)})_{j+p} = V_{j+p} if j ∉ S, V_j if j ∈ S.
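A small numeric check using a toy statistic: W_j = |X_jᵀy| − |X̃_jᵀy| (example 2 on the next slide) depends only on [X X̃]ᵀy, hence is sufficient, and swapping a pair flips exactly that coordinate's sign. Helper names are ours.

```python
import numpy as np

def swap_cols(V, S, p):
    """The swap(S) operation: exchange columns j and j+p of V for each j in S."""
    S = sorted(S)
    Vs = V.copy()
    Vs[:, S] = V[:, [j + p for j in S]]
    Vs[:, [j + p for j in S]] = V[:, S]
    return Vs

def W_corr(XX, y, p):
    """W_j = |X_j' y| - |X~_j' y|: computable from [X X~]' y alone (sufficiency)."""
    c = XX.T @ y
    return np.abs(c[:p]) - np.abs(c[p:])

# Swapping S flips the signs of exactly the W_j with j in S (antisymmetry).
rng = np.random.default_rng(0)
n, p = 50, 5
XX, y = rng.standard_normal((n, 2 * p)), rng.standard_normal(n)
S = {1, 3}
flip = np.array([-1 if j in S else 1 for j in range(p)])
assert np.allclose(W_corr(swap_cols(XX, S, p), y, p), W_corr(XX, y, p) * flip)
```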
9. Examples of W
• As long as sufficiency and antisymmetry are satisfied, W can be designed in many ways. (There is no need to use the Lasso, either.)
(Paper excerpt:) The statistic W examined in Section 1.2, given in equation (1.7), obeys these two properties: the sufficiency property holds because the Lasso (1.5) is equivalent to minimizing (1/2)bᵀXᵀXb − bᵀXᵀy + λ||b||₁, and thus depends on the problem data (X, y) through XᵀX and Xᵀy only. This is only one example of a statistic of this type; other examples include:
  1. W_j = X_jᵀy − X̃_jᵀy. Under the Gaussian model for y, one can show that the p statistics are independent. For a large signal |β_j|, however, W_j may be positive or negative depending on the sign of β_j.
  2. W_j = |X_jᵀy| − |X̃_jᵀy|, which resolves the sign issue above.
  3. W_j = |β̂^{LS}_j| − |β̂^{LS}_{j+p}| or W_j = |β̂^{LS}_j|² − |β̂^{LS}_{j+p}|², where β̂^{LS} is the least-squares solution obtained by regressing y on the augmented design, β̂^{LS} = ([X X̃]ᵀ[X X̃])⁻¹[X X̃]ᵀy.
  4. Define Z_j as in Section 1.2, Z_j = sup{λ : β̂_j(λ) ≠ 0} for j = 1, ..., 2p, where β̂(λ) is the augmented Lasso model regressing y on [X X̃]. We may then take W_j = (Z_j ∨ Z_{j+p}) · sign(Z_j − Z_{j+p}), consider other options such as W_j = Z_j − Z_{j+p}, or alternately take W_j = |β̂_j(λ)| − |β̂_{j+p}(λ)| for some fixed value of λ. (Here Z_{j+p} plays the role of Z̃_j in Section 1.2.)
  5. The example above of course extends to all penalized likelihood estimation procedures.
  etc.
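For instance, example 3 is only a couple of lines, assuming n ≥ 2p (lstsq falls back to the pseudoinverse if the augmented Gram matrix is rank-deficient); the function name is ours.

```python
import numpy as np

def W_least_squares(X, X_knock, y):
    """Example 3: W_j = |b_j| - |b_{j+p}| with b the least-squares fit on [X X~]."""
    p = X.shape[1]
    XX = np.hstack([X, X_knock])
    b, *_ = np.linalg.lstsq(XX, y, rcond=None)  # requires n >= 2p for a unique fit
    return np.abs(b[:p]) - np.abs(b[p:])
```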
10. Outline of the exchangeability proof
• First, the following is shown (Lemma 2, pairwise exchangeability for the features): for any subset S ⊂ {1, ..., p},
  [X X̃]ᵀ_{swap(S)} [X X̃]_{swap(S)} = [X X̃]ᵀ [X X̃].
  That is, the Gram matrix of [X X̃] is unchanged when we swap X_j and X̃_j for each j ∈ S. (This follows trivially from the definition of G = [X X̃]ᵀ[X X̃] in (2.1).)
• Next, the following is shown (Lemma 3, pairwise exchangeability for the response): for any subset S of nulls,
  [X X̃]ᵀ_{swap(S)} y =_d [X X̃]ᵀ y.
  That is, the distribution of [X X̃]ᵀy is unchanged under the swap, as long as none of the swapped features appear in the true model. (Proof: since y ~ N(Xβ, σ²I), for any S′ we have [X X̃]ᵀ_{swap(S′)} y ~ N([X X̃]ᵀ_{swap(S′)} Xβ, σ²[X X̃]ᵀ_{swap(S′)}[X X̃]_{swap(S′)}), and the mean and variance are the same for S′ = S and S′ = ∅: Lemma 2 proves the variances equal, and for the means, since X_jᵀX_i = X̃_jᵀX_i for all i ≠ j and support(β) ∩ S = ∅, we have X_jᵀXβ = X̃_jᵀXβ for all j ∈ S, which is sufficient.)
• Once these two are shown, the exchangeability can be proved. Let ε ∈ {±1}^p be a sign sequence independent of W with ε_j = +1 for all non-null j, and set S = {j : ε_j = −1}; since S contains only nulls, the sufficiency property and Lemmas 2 and 3 give
  W_{swap(S)} = f([X X̃]ᵀ_{swap(S)}[X X̃]_{swap(S)}, [X X̃]ᵀ_{swap(S)} y) =_d f([X X̃]ᵀ[X X̃], [X X̃]ᵀ y) = W,
  while the antisymmetry property gives W_{swap(S)} = (W_1·ε_1, ..., W_p·ε_p). Hence (W_1, ..., W_p) =_d (W_1·ε_1, ..., W_p·ε_p): the signs of the null W_j's are i.i.d. random. It is in this proof that the requirements on X̃ and W are needed.
11. Why do permutation matrices fail?
• Couldn't we construct the knockoff features simply as a permuted copy $X^\pi$, with entries $X^\pi_{i,j} = X_{\pi(i),j}$ for a randomly chosen permutation $\pi$ of the sample indices $\{1, \dots, n\}$?
• Unfortunately, this does not satisfy exchangeability. When there is a feature that does not itself contribute to y but is correlated with a feature that does, that feature enters the Lasso path before its permuted counterpart, much like a spurious correlation.


• The next slide shows the result of a simulation under the conditions below (a data-generation sketch also follows).

While $X^{\pi\top} X^\pi = X^\top X$, so the permuted covariates display the same correlation structure as the originals, the permutation breaks their association with the response y. Permutation methods can be quite effective under a global null, but they are not guaranteed to give correct answers otherwise, and here they can dramatically underestimate the FDP when X has nonvanishing correlations: for centered features, the Gram matrix of the augmented design is approximately
$[X\ X^\pi]^\top [X\ X^\pi] \approx \begin{bmatrix} \Sigma & 0 \\ 0 & \Sigma \end{bmatrix},$
since $\mathbb{E}_\pi[X_i^\top X_j^\pi] = 0$, so the exchangeability results (Lemmas 2 and 3) fail for $[X\ X^\pi]$. Simulation setup: $n = 300$, $p = 100$; rows of X drawn i.i.d. from $N(0, \Theta)$ with $\Theta_{ii} = 1$ and $\Theta_{ij} = 0.3$ for $i \neq j$; columns centered and normalized; and
$y = 3.5 \cdot (X_1 + \dots + X_{30}) + z, \quad z \sim N(0, I_n). \quad (3.1)$
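A minimal sketch of this simulation setup in Python (NumPy assumed; the seed and variable names are mine):

import numpy as np

rng = np.random.default_rng(0)
n, p, k, amp, rho = 300, 100, 30, 3.5, 0.3

Theta = np.full((p, p), rho) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Theta, size=n)
X -= X.mean(axis=0)                      # center columns
X /= np.linalg.norm(X, axis=0)           # normalize columns to unit norm

beta = np.zeros(p)
beta[:k] = amp
y = X @ beta + rng.standard_normal(n)    # y = 3.5*(X1 + ... + X30) + z

# The permuted "knockoffs": same Gram matrix, but uncorrelated with X.
pi = rng.permutation(n)
X_perm = X[pi, :]
print(np.allclose(X_perm.T @ X_perm, X.T @ X))   # True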
12. Why do permutation matrices fail?

Figure 2 (Lasso path results for the simulated data in (3.1)): many of the null features $X_j$, $j = 31, \dots, 100$, enter the Lasso path at large values of $\lambda$, whereas most permuted features $X_j^\pi$ do not enter until $\lambda$ is extremely small. The difference arises because only the original nulls are correlated with the signals $X_1, \dots, X_{30}$; the $X_j^\pi$'s are not good knockoffs of the nulls and behave very differently in the Lasso regression. Computing the statistics $W$ (1.7) and the threshold $T$ (1.8) with either the knockoff features $\tilde X_j$ or the permuted features $X_j^\pi$ gives, at target FDR $q = 20\%$:

    FDR over 1000 trials (nominal level q = 20%)
    Knockoff method       12.29%
    Permutation method    45.61%

Permuted features can be plugged into FDR procedures in many other ways, but similar problems arise from the same lack of correlation between the permuted and original features.
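For reference, the selection step used here fits in a few lines. This is a sketch of the knockoff+ rule, assuming the statistics W have already been computed; the +1 in the numerator is the knockoff+ correction, and dropping it gives the plain knockoff method:

import numpy as np

def knockoff_plus_select(W, q=0.20):
    for t in np.sort(np.abs(W[W != 0])):          # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]            # selected variables
    return np.array([], dtype=int)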
13. How to construct X̃

• A necessary and sufficient condition for $\tilde X$ to exist is that $G$ is positive semidefinite, where
$[X\ \tilde X]^\top [X\ \tilde X] = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{bmatrix} := G. \quad (2.1)$
• That is, consider the Schur complement
$A = \Sigma - (\Sigma - \mathrm{diag}\{s\})\,\Sigma^{-1}(\Sigma - \mathrm{diag}\{s\}) = 2\,\mathrm{diag}\{s\} - \mathrm{diag}\{s\}\,\Sigma^{-1}\,\mathrm{diag}\{s\}, \quad (2.2)$
which must be positive semidefinite. Viewing $A$ in turn as the Schur complement of a block matrix, the condition becomes
$\begin{bmatrix} \Sigma & \mathrm{diag}\{s\} \\ \mathrm{diag}\{s\} & 2\,\mathrm{diag}\{s\} \end{bmatrix} \succeq 0.$
Finally, computing the Schur complement with respect to the other block gives the necessary and sufficient condition
$\mathrm{diag}\{s\} \succeq 0 \quad \text{and} \quad 2\Sigma - \mathrm{diag}\{s\} \succeq 0$
(a numerical feasibility check is sketched below).
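The feasibility condition is cheap to check numerically, as in this sketch; Sigma is assumed to be the normalized Gram matrix $X^\top X$:

import numpy as np

def s_is_feasible(s, Sigma, tol=1e-10):
    s = np.asarray(s)
    if np.any(s < -tol):
        return False                               # need diag{s} >= 0
    return np.linalg.eigvalsh(2 * Sigma - np.diag(s)).min() >= -tol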
14. How to construct X̃

• Take an orthonormal matrix $\tilde U \in \mathbb{R}^{n \times p}$ with $\tilde U^\top X = 0$.
  Such a matrix exists because we assume $n \ge 2p$.
• Since $A$ is positive semidefinite, we can factorize it as $A = C^\top C$ with $C \in \mathbb{R}^{p \times p}$.
• Then
$\tilde X = X(I - \Sigma^{-1}\,\mathrm{diag}\{s\}) + \tilde U C \quad (2.3)$
satisfies the required correlation structure (2.1); a construction sketch follows.
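Putting the pieces together, a sketch of the whole construction (2.3) in NumPy, assuming $n \ge 2p$ and unit-norm columns; a symmetric eigen-factorization is used for $A = C^\top C$, which is a bit more robust than Cholesky when A is only semidefinite:

import numpy as np

def construct_knockoffs(X, s):
    n, p = X.shape
    Sigma = X.T @ X
    Sigma_inv = np.linalg.inv(Sigma)
    D = np.diag(s)
    A = 2 * D - D @ Sigma_inv @ D                  # Schur complement (2.2)
    w, V = np.linalg.eigh(A)
    C = (V * np.sqrt(np.clip(w, 0, None))) @ V.T   # symmetric root: C.T @ C = A
    Q, _ = np.linalg.qr(X, mode="complete")        # Q[:, p:] spans col(X)-perp
    U_tilde = Q[:, p:2 * p]                        # needs n >= 2p
    return X @ (np.eye(p) - Sigma_inv @ D) + U_tilde @ C

Given a feasible s (two standard choices appear on the next slide), construct_knockoffs(X, s) returns an X̃ obeying (2.1).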
15. How to construct X̃

• Finally, how should $s$ be chosen?
• If $j$ belongs to the true model, we want $X_j$ to enter the model before $\tilde X_j$.
• That means the two should be as uncorrelated as possible (inner product close to 0). Since $X_j^\top \tilde X_j = 1 - s_j$, two particular choices suggest themselves (a code sketch follows after this list):
  - Equi-correlated knockoffs: $s_j = 2\lambda_{\min}(\Sigma) \wedge 1$ for all $j$, so that every correlation takes the identical value
    $\langle X_j, \tilde X_j \rangle = 1 - 2\lambda_{\min}(\Sigma) \wedge 1. \quad (2.4)$
    Among all knockoffs with this equi-variant property, this choice minimizes $|\langle X_j, \tilde X_j \rangle|$.
  - SDP knockoffs: make the average correlation between each variable and its knockoff as small as possible by solving
    minimize $\sum_j |1 - s_j|$ subject to $s_j \ge 0$, $\mathrm{diag}\{s\} \preceq 2\Sigma$,
    a highly structured semidefinite program; at the optimum $s_j \le 1$ for all $j$, so it is equivalent to minimizing $\sum_j (1 - s_j)$ over $0 \le s_j \le 1$, $\mathrm{diag}\{s\} \preceq 2\Sigma$.
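Both choices are straightforward to compute. The sketch below uses cvxpy for the SDP (an assumption on my part; any SDP solver works), in the equivalent form: maximize $\sum_j s_j$ subject to $0 \le s_j \le 1$, $\mathrm{diag}\{s\} \preceq 2\Sigma$:

import numpy as np
import cvxpy as cp   # assumed available

def s_equicorrelated(Sigma):
    lam_min = np.linalg.eigvalsh(Sigma).min()
    return np.full(Sigma.shape[0], min(2 * lam_min, 1.0))

def s_sdp(Sigma):
    p = Sigma.shape[0]
    s = cp.Variable(p)
    constraints = [s >= 0, s <= 1, 2 * Sigma - cp.diag(s) >> 0]
    cp.Problem(cp.Maximize(cp.sum(s)), constraints).solve()
    return np.clip(s.value, 0.0, 1.0)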
16. The case p ≤ n < 2p
• Assume the noise level σ is known (or can be estimated),
• and augment the data so that the orthonormal matrix $\tilde U$ can still be taken.

When $n < 2p$, there is no $p$-dimensional subspace orthogonal to the column span of $X$, so $\tilde U$ cannot be constructed directly. The knockoff filter can still be used as long as the noise level is known or estimable. Under the Gaussian model (1.1), the residual sum of squares of the full least-squares fit satisfies $\|y - X\hat\beta_{LS}\|_2^2 \sim \sigma^2 \chi^2_{n-p}$, giving an estimate $\hat\sigma$; draw an $(n-p)$-dimensional vector $y'$ with i.i.d. $N(0, \hat\sigma^2)$ entries and augment the data:
$\begin{bmatrix} y \\ y' \end{bmatrix} \approx N\left(\begin{bmatrix} X \\ 0 \end{bmatrix}\beta,\ \sigma^2 I\right),$
a linear model with $p$ variables and $2p$ observations to which the $n \ge 2p$ construction applies. An alternative that avoids estimating $\sigma$ duplicates $2p - n$ variables ($\tilde X_j = X_j$) and builds knockoffs only for the remaining $n - p$; cycling through the two halves at level $q/2$ each keeps the overall FDR below $q$, though each round has no power on the duplicated variables.
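A sketch of the row-augmentation variant, assuming the Gaussian model so that σ can be estimated from the least-squares residuals; target_rows would be 2p:

import numpy as np

def augment_rows(X, y, target_rows, rng=np.random.default_rng(0)):
    n, p = X.shape
    beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma_hat = np.sqrt(np.sum((y - X @ beta_ls) ** 2) / (n - p))  # RSS ~ sigma^2 chi2_{n-p}
    m = target_rows - n
    y_aug = np.concatenate([y, sigma_hat * rng.standard_normal(m)])
    X_aug = np.vstack([X, np.zeros((m, p))])       # zero rows for the new observations
    return X_aug, y_aug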
17. Comparison with BH
• Looks favorable.

Table 1: FDR and power with n = 3000 observations, p = 1000 variables, and k = 30 variables in the model with regression coefficients of magnitude 3.5 (nominal level q = 20%).

    Method                                 FDR (%)   Power (%)   Theoretical FDR guarantee?
    Knockoff+ (equivariant construction)    14.40      60.99      Yes
    Knockoff (equivariant construction)     17.82      66.73      No
    Knockoff+ (SDP construction)            15.05      61.54      Yes
    Knockoff (SDP construction)             18.72      67.50      No
    Benjamini-Hochberg (BHq)                18.70      48.88      No
    BHq + log-factor correction              2.20      19.09      Yes
    BHq with whitened noise                 18.79       2.33      Yes

For the BHq variants, FDR control is estimated or guaranteed at level $\pi_0 q \le q$, which is slightly conservative; in the sparse settings of interest $\pi_0 \approx 1$, so this has little effect.
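For reference, the baseline BHq threshold in the z-score form used for this comparison can be sketched as follows (SciPy assumed; this is the plain variant, without the log-factor correction or noise whitening):

import numpy as np
from scipy.stats import norm

# Reject |Z_j| >= T with T = min{ t : p * P(|N(0,1)| >= t) / #{j : |Z_j| >= t} <= q }.
def bhq_select(Z, q=0.20):
    p = len(Z)
    for t in np.sort(np.abs(Z)):                   # ascending scan finds the minimal T
        if p * 2 * norm.sf(t) <= q * np.sum(np.abs(Z) >= t):
            return np.where(np.abs(Z) >= t)[0]
    return np.array([], dtype=int)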
18. Model-X knockoffs

Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection. Emmanuel Candès, Yingying Fan, Lucas Janson, Jinchi Lv (J. R. Stat. Soc. B, 2018; citations: 250).
• The knockoff filter above is usable only for the linear model, and only when p ≤ n.
• This paper generalizes those conditions and proposes Model-X knockoffs, applicable to a much broader class of models.
• The conditions under which exchangeability holds are generalized.
19. Conditions on X̃ and W

• Construct $\tilde X$ so that the following two conditions hold, now viewing X as a random vector:
  (1) for any subset $S \subseteq \{1,\dots,p\}$: $(X, \tilde X)_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde X)$;   (3.1)
  (2) $\tilde X \perp\!\!\!\perp Y \mid X$; this is guaranteed if $\tilde X$ is constructed without looking at $Y$.
Here $(X, \tilde X)_{\mathrm{swap}(S)}$ swaps the entries $X_j$ and $\tilde X_j$ for each $j \in S$; for example, with $p = 3$ and $S = \{2,3\}$, $(X_1, X_2, X_3, \tilde X_1, \tilde X_2, \tilde X_3)_{\mathrm{swap}(\{2,3\})} = (X_1, \tilde X_2, \tilde X_3, \tilde X_1, X_2, X_3)$.
• Requirement on W (only the flip-sign property remains):
$w_j([X\ \tilde X]_{\mathrm{swap}(S)}, y) = \begin{cases} w_j([X\ \tilde X], y), & j \notin S, \\ -\,w_j([X\ \tilde X], y), & j \in S. \end{cases} \quad (3.3)$
For example, if $X \sim N(0, \Sigma)$, one joint distribution obeying (3.1) is $(X, \tilde X) \sim N(0, G)$ with
$G = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{bmatrix}, \quad (3.2)$
where $\mathrm{diag}\{s\}$ is chosen so that $G$ is positive semidefinite; swapping variables with their knockoffs leaves this Gaussian distribution invariant. Unlike in the fixed-X case, the sufficiency property is no longer required: $w_j$ may depend on the data in any way, not only through $[X\ \tilde X]^\top [X\ \tilde X]$ and $[X\ \tilde X]^\top y$. A sketch of one such statistic follows.
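One concrete flip-sign statistic is the signed maximum of the Lasso entry values. The sketch below approximates the entry values on a finite grid with scikit-learn's lasso_path (an assumption on my part; the paper computes exact entry points along the Lasso path):

import numpy as np
from sklearn.linear_model import lasso_path

# Z_j = largest lambda at which feature j enters the Lasso path on [X, X_knock];
# W_j = max(Z_j, Z~_j) * sign(Z_j - Z~_j) flips sign when X_j and X~_j are swapped.
def signed_max_stats(X, X_knock, y, n_alphas=200):
    p = X.shape[1]
    alphas, coefs, _ = lasso_path(np.hstack([X, X_knock]), y, n_alphas=n_alphas)
    Z = np.zeros(2 * p)
    for j in range(2 * p):                         # first (largest) alpha with nonzero coef
        nz = np.flatnonzero(coefs[j])
        Z[j] = alphas[nz[0]] if nz.size else 0.0
    Z_orig, Z_knock = Z[:p], Z[p:]
    return np.maximum(Z_orig, Z_knock) * np.sign(Z_orig - Z_knock)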
20. How are knockoffs actually constructed?
• Mathematically, the construction is Algorithm 1 below.
• But this sampling is not trivial in general.
• It becomes feasible when $(X, \tilde X)$ follows a Gaussian distribution, or when we are willing to regard two distributions as equal once their first two moments match.

It suffices to construct pairs that are conditionally exchangeable; if the components of X are independent, any independent copy of X works. With dependent coordinates, proceed as follows:

Algorithm 1 (Sequential Conditional Independent Pairs):
    j = 1
    while j ≤ p:
        sample $\tilde X_j$ from $\mathcal{L}(X_j \mid X_{-j}, \tilde X_{1:j-1})$
        j = j + 1

where $\mathcal{L}(X_j \mid X_{-j}, \tilde X_{1:j-1})$ is the conditional distribution of $X_j$ given $(X_{-j}, \tilde X_{1:j-1})$. For $p = 3$: sample $\tilde X_1$ from $\mathcal{L}(X_1 \mid X_{2:3})$; then $\mathcal{L}(X_{1:3}, \tilde X_1)$ is known, so sample $\tilde X_2$ from $\mathcal{L}(X_2 \mid X_1, X_3, \tilde X_1)$; finally sample $\tilde X_3$ from $\mathcal{L}(X_3 \mid X_{1:2}, \tilde X_{1:2})$. That this yields the exchangeability property (3.1) is proved in the paper's Appendix B, and any ordering of the variables gives an equally valid construction. Running Algorithm 1 may be far from simple, however, since the conditional distribution must be recomputed at every step; this motivates the approximate, second-order construction below.
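A direct transcription of Algorithm 1 as a sketch; the conditional law must be supplied by the caller, and the demo uses the trivial independent-coordinates case, where the conditional is just the marginal and knockoffs are independent copies:

import numpy as np

def scip_knockoffs(x, sample_conditional):
    p = len(x)
    x_knock = np.zeros(p)
    for j in range(p):
        # draw X~_j from L(X_j | X_-j, X~_1:j-1)
        x_knock[j] = sample_conditional(j, x, x_knock[:j])
    return x_knock

rng = np.random.default_rng(0)
marginal = lambda j, x, xk: rng.standard_normal()  # independent N(0,1) coordinates
print(scip_knockoffs(rng.standard_normal(10), marginal))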
Second-order model-X knockoffs: instead of requiring $(X, \tilde X)_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde X)$ for every $S$, only match the first two moments, i.e.
$\mathrm{cov}(X, \tilde X) = G = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{bmatrix}. \quad (3.12)$
This is the same form as (3.2), with $s$ chosen to keep $G$ positive semidefinite; when $(X, \tilde X)$ is Gaussian, matching two moments matches the joint distribution, so the construction is exact, and the equi-correlated and SDP constructions for $s$ carry over unchanged.
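In the Gaussian (or second-order) case the sampler is explicit: conditionally on X, the knockoff rows are Gaussian with mean $X(I - \Sigma^{-1}\mathrm{diag}\{s\})$ and covariance $2\,\mathrm{diag}\{s\} - \mathrm{diag}\{s\}\,\Sigma^{-1}\,\mathrm{diag}\{s\}$. A sketch:

import numpy as np

def gaussian_mx_knockoffs(X, Sigma, s, rng=np.random.default_rng(0)):
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)        # Sigma^{-1} diag{s}
    V = 2 * D - D @ Sigma_inv_D                    # conditional covariance
    w, Q = np.linalg.eigh(V)
    L = Q * np.sqrt(np.clip(w, 0, None))           # L @ L.T = V
    mean = X - X @ Sigma_inv_D                     # conditional mean of each row
    return mean + rng.standard_normal(X.shape) @ L.T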
21. Follow-up work on constructing knockoffs

Gene hunting with hidden Markov model knockoffs. M. Sesia, C. Sabatti, E. J. Candès (Biometrika, 2019, 106(1), 1-18).
• Proposes an exact and efficient knockoff sampler when X follows a (hidden) Markov model, with applications to genome-wide association studies.

Deep Knockoffs. Y. Romano, M. Sesia, E. J. Candès (J. Am. Stat. Assoc., 2019).
• Trains a deep generative model to sample approximate model-X knockoffs for arbitrary, unspecified data distributions.

Metropolized Knockoff Sampling. S. Bates, E. Candès, L. Janson, W. Wang (arXiv, 2019).
• Proposes sampling knockoffs with a Metropolis-style algorithm.
22. Applications to biological data

Multi-resolution localization of causal variants across the genome (KnockoffZoom). M. Sesia, E. Katsevich, S. Bates, E. Candès, C. Sabatti (Nature Communications, 2020).
Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Z. He et al. (Nature Communications, 2021).
Model-based and Model-free Machine Learning Techniques for Diagnostic Prediction and Classification of Clinical Outcomes in Parkinson's Disease. C. Gao et al. (Scientific Reports, 2018).
23. Application to Gaussian graphical models

False Discovery Rates in Biological Networks. Lu Yu, Tobias Kaufmann, Johannes Lederer (AISTATS 2021).
• Introduces a graph estimator based on knockoffs that imitate the partial correlation structure of unconnected nodes, controlling the FDR for edge discovery in Gaussian graphical models. (Figure: FDR-power curves comparing KO, GLASSO, MB(and), MB(or), CT, and PT for n = 300-800, p = 200-400.)
• Since everything comes down to constructing good knockoff data, the approach also applies to unsupervised learning.