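[7] Suppose we observe $E$ environments $\mathcal{E} = \{e_1, \dots, e_E\}$, where $\sigma_e^2 = 1$, $\forall e \in [1, E]$. Then, for any $\epsilon > 1$, there exists a featurizer $\Phi_\epsilon$ which, combined with the ERM-optimal classifier $\hat{\beta} = [\beta_c, \beta_{e;\mathrm{ERM}}, \beta_0]^\top$, satisfies the following:

1. The regularization term of $(\Phi_\epsilon, \hat{\beta})$ is bounded as
$$\frac{1}{E} \sum_{e \in \mathcal{E}} \left\| \nabla_{\hat{\beta}} R^e(\Phi_\epsilon, \hat{\beta}) \right\|_2^2 \in O\left( p_\epsilon^2 \left( c_\epsilon d_e + \frac{1}{E} \sum_{e \in \mathcal{E}} \|\mu_e\|_2^2 \right) \right), \tag{13}$$
for some constant $c_\epsilon$, where $p_\epsilon := \exp\{-d_e \min(\epsilon - 1, (\epsilon - 1)^2/8)\}$.

2. $(\Phi_\epsilon, \hat{\beta})$ is equivalent to the ERM-optimal predictor on at least a $1 - q$ fraction of the test distribution, where $q := \frac{2R}{\sqrt{\pi}\,\delta} \exp\{-\delta^2\}$.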
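To get a feel for how quickly the two quantities in the statement shrink, the following is a minimal numerical sketch, not part of the cited result. It assumes the definitions are read as $p_\epsilon = \exp\{-d_e \min(\epsilon - 1, (\epsilon - 1)^2/8)\}$ and $q = \frac{2R}{\sqrt{\pi}\,\delta}\exp\{-\delta^2\}$. The function names `p_eps` and `q_bound` are hypothetical, and the numerical values of $\epsilon$, $d_e$, $R$ and $\delta$ are purely illustrative, since the statement above does not fix them.

```python
import math


def p_eps(eps: float, d_e: int) -> float:
    """p_eps = exp(-d_e * min(eps - 1, (eps - 1)^2 / 8)), as read from the statement."""
    if eps <= 1:
        raise ValueError("the statement requires eps > 1")
    gap = eps - 1.0
    return math.exp(-d_e * min(gap, gap ** 2 / 8.0))


def q_bound(R: float, delta: float) -> float:
    """q = 2R / (sqrt(pi) * delta) * exp(-delta^2), assuming the fraction reading of q."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return 2.0 * R / (math.sqrt(math.pi) * delta) * math.exp(-delta ** 2)


if __name__ == "__main__":
    # Illustrative values only; none of these come from the source.
    for d_e in (10, 100, 1000):
        print(f"d_e={d_e:5d}  p_eps={p_eps(2.0, d_e):.3e}")
    for delta in (1.0, 2.0, 3.0):
        print(f"delta={delta:.1f}  q={q_bound(1.0, delta):.3e}")
```

Under this reading, the quadratic term $(\epsilon - 1)^2/8$ is the smaller argument of the minimum whenever $1 < \epsilon < 9$, so $p_\epsilon$ decays exponentially in $d_e$ at rate $(\epsilon - 1)^2/8$ in that range, while $q$ decays like a Gaussian tail in $\delta$.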