Slide 1

Human Rademacher Complexity

Created by: Masanari Kimura
Institute: The Graduate University for Advanced Studies, SOKENDAI
Department: Department of Statistical Science, School of Multidisciplinary Sciences
E-mail: [email protected]

Slide 2

Table of contents

1. Introduction
2. Rademacher complexity
3. Human Rademacher complexity
4. Conclusion and discussion

Slide 3

Introduction

Slide 4

TL;DR

⊚ NIPS 2009 [12];
⊚ In statistical learning theory, Rademacher complexity is a complexity measure of a function class, and it induces generalization bounds;
⊚ This work proposes to use Rademacher complexity as a measure of human learning capacity.

Slide 5

Introduction

⊚ Capacity is one of the main research questions in cognitive psychology:
  • How much information can humans hold [7, 6, 3]?
  • What kinds of functions can humans easily acquire [11, 4]?
  • How do humans avoid over-fitting [8]?
⊚ In statistical learning theory, there are several notions of the capacity of a function class:
  • Vapnik-Chervonenkis (VC) dimension [10];
  • Rademacher complexity [5].
⊚ These capacity notions provide generalization bounds and bounds on the probability of over-fitting;
⊚ Q. Are these capacity measures useful for evaluating human cognitive ability?

Slide 6

Rademacher complexity

Slide 7

Notations

⊚ X: input space;
⊚ x ∈ X: an instance from the input space;
⊚ P_X: underlying marginal distribution on X;
⊚ F: hypothesis space;
⊚ f : X → ℝ: a real-valued function (hypothesis).

Slide 8

Rademacher complexity

⊚ Consider a sample of n instances x_1, …, x_n drawn i.i.d. from P_X.
⊚ Generate n random variables σ_1, …, σ_n, each taking values in {−1, +1}.

Definition (Rademacher complexity)
For a set of real-valued functions F with input space X, a distribution P_X on X, and sample size n, the Rademacher complexity R(F, X, P_X, n) is

$$
R(\mathcal{F}, \mathcal{X}, P_X, n) = \mathbb{E}_{\substack{x_1, \dots, x_n \sim P_X \\ \sigma_1, \dots, \sigma_n}} \left[ \sup_{f \in \mathcal{F}} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \right], \qquad (1)
$$

where σ_1, …, σ_n are i.i.d. and take the values −1 and +1 with probability 1/2 each.

Slide 9

⊚ Rademacher complexity measures how easily F can fit random ±1 labels:
  • Flexible function class F ⇒ high complexity;
  • Inflexible function class F ⇒ low complexity.
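To make this contrast concrete, here is a minimal Monte Carlo sketch (not from the paper; the threshold class, the uniform P_X on [0, 1], and the sample sizes are illustrative assumptions) that estimates Eq. (1) for a finite function class by brute-force evaluation of the supremum:

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_complexity(fns, n, n_trials=2000):
    """Monte Carlo estimate of Eq. (1) for a finite class of functions.

    fns: list of functions mapping an array of inputs to +/-1 predictions.
    P_X is taken to be uniform on [0, 1].
    """
    estimates = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, size=n)        # x_1, ..., x_n ~ P_X
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher labels
        # sup over the finite class of |(2/n) * sum_i sigma_i f(x_i)|
        estimates[t] = max(abs(2.0 / n * np.sum(sigma * f(x))) for f in fns)
    return estimates.mean()

# Flexible class: threshold classifiers at many cut points, both signs.
flexible = [lambda x, t=t, s=s: s * np.sign(x - t)
            for t in np.linspace(0.05, 0.95, 19) for s in (-1.0, 1.0)]

# Inflexible class: a single constant function.
inflexible = [lambda x: np.ones_like(x)]

for n in (5, 10, 20, 40):
    print(n, rademacher_complexity(flexible, n),
          rademacher_complexity(inflexible, n))
```

The flexible threshold class should fit the random labels far better at small n than the constant class, and both estimates should shrink as n grows, matching the behavior described above.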


Slide 14

Empirical estimation of Rademacher complexity

⊚ We can estimate the Rademacher complexity from m independent random samples {(x_i^{(1)}, σ_i^{(1)})}_{i=1}^n, …, {(x_i^{(m)}, σ_i^{(m)})}_{i=1}^n.
⊚ From McDiarmid's inequality, we have the following theorem.

Theorem
Let F be a set of functions mapping to [−1, 1]. For any integers n and m, we have

$$
\mathbb{P}\left[ \left| R(\mathcal{F}, \mathcal{X}, P_X, n) - \frac{1}{m} \sum_{j=1}^{m} \sup_{f \in \mathcal{F}} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i^{(j)} f\big(x_i^{(j)}\big) \right| \right| \geq \epsilon \right] \leq 2 \exp\left\{ -\frac{\epsilon^2 n m}{8} \right\}. \qquad (2)
$$
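Inverting the tail bound in Eq. (2) at confidence level 1 − δ gives a half-width ε = √(8 ln(2/δ)/(nm)) for the empirical estimate. A small sketch of this computation (the m = 10 value mirrors the per-condition group size used in the experiments later):

```python
import math

def rademacher_ci_halfwidth(n, m, delta=0.05):
    """Smallest eps such that 2 * exp(-eps^2 * n * m / 8) <= delta,
    i.e. the m-sample empirical estimate of R(F, X, P_X, n) is within
    eps of the true value with probability at least 1 - delta."""
    return math.sqrt(8.0 * math.log(2.0 / delta) / (n * m))

for n in (5, 10, 20, 40):
    print(n, round(rademacher_ci_halfwidth(n, m=10), 3))
```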

Slide 15

Generalization error bounds

Theorem
Let F be a set of functions mapping X to {−1, 1}. Let P_XY be a probability distribution on X × {−1, 1} with marginal distribution P_X on X. Let {(x_i, y_i)}_{i=1}^n be a training sample of size n drawn i.i.d. from P_XY. For any δ > 0, with probability at least 1 − δ, every function f ∈ F satisfies

$$
e(f) - \hat{e}(f) \leq \frac{R(\mathcal{F}, \mathcal{X}, P_X, n)}{2} + \sqrt{\frac{\ln(1/\delta)}{2n}}, \qquad (3)
$$

where $e(f) := \mathbb{E}_{(x,y) \sim P_{XY}}[\mathbb{1}\{y \neq f(x)\}]$ is the true error and $\hat{e}(f) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{y_i \neq f(x_i)\}$ is the training error.
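As a worked illustration (the numbers below are assumed, not taken from the paper), plugging a hypothetical complexity estimate into Eq. (3):

```python
import math

def generalization_gap_bound(rademacher, n, delta=0.05):
    """Right-hand side of Eq. (3): upper bound on e(f) - e_hat(f)
    that holds with probability at least 1 - delta."""
    return rademacher / 2.0 + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Hypothetical values: R = 0.6 at n = 40, 95% confidence.
print(generalization_gap_bound(0.6, n=40))  # 0.3 + sqrt(ln(20)/80) ~= 0.494
```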

Slide 16

Human Rademacher complexity

Slide 17

⊚ Goal: measure the Rademacher complexity of the human learning system.
⊚ H_α: the set of functions F that an average human subject can come up with in the experiments.
⊚ Two assumptions:
  • Universality [1]: every individual has the same H_α.
  • Computability of the supremum over H_α: when making classification judgements, participants use the best function at their disposal.
⊚ ⇒ Participants are doing their best to perform the task.

Slide 18

Computation of Human Rademacher complexity

⊚ Each participant is presented with a training sample {(x_i, σ_i)}_{i=1}^n.
⊚ They are asked to learn the instance-label mapping.
  • The subject is not told that the labels are random.
⊚ Assume that the subject searches within H_α for the best rule, i.e., the one minimizing training error:

$$
f^* = \operatorname*{arg\,max}_{f \in H_\alpha} \sum_{i=1}^{n} \sigma_i f(x_i) = \operatorname*{arg\,min}_{f \in H_\alpha} \hat{e}(f).
$$

⊚ Later, the subject is asked to classify the same training instances {x_i}_{i=1}^n, and we approximate

$$
\sup_{f \in H_\alpha} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \approx \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f^*(x_i) \right|. \qquad (4)
$$
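A minimal sketch (an assumed interface, not the authors' code) of how one subject's test-phase responses, standing in for f*(x_i), turn into a single Rademacher term via Eq. (4):

```python
import numpy as np

def human_rademacher_term(sigma, responses):
    """One subject's term |(2/n) * sum_i sigma_i f*(x_i)| from Eq. (4).

    sigma:     the random +/-1 training labels shown to the subject.
    responses: the subject's +/-1 classifications of the same
               instances at test time, used in place of f*(x_i).
    """
    sigma = np.asarray(sigma, dtype=float)
    responses = np.asarray(responses, dtype=float)
    return abs(2.0 / sigma.size * np.sum(sigma * responses))

# Averaging this term over the m subjects in a condition gives the
# empirical estimate appearing in Eq. (2).
sigma = [1, -1, 1, 1, -1]        # random labels at training time
responses = [1, -1, -1, 1, -1]   # subject's answers at test time
print(human_rademacher_term(sigma, responses))  # (2/5) * |3| = 1.2
```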

Slide 19

Given domain X, distribution P_X, training sample size n, and number of subjects m, generate {(x_i^{(1)}, σ_i^{(1)})}_{i=1}^n, …, {(x_i^{(m)}, σ_i^{(m)})}_{i=1}^n, where the x_i^{(j)} are drawn i.i.d. from P_X and the σ_i^{(j)} are i.i.d., taking the values −1 and +1 with probability 1/2 each.

1. Participant j is shown {(x_i^{(j)}, σ_i^{(j)})}_{i=1}^n. The participant is informed that there are only two categories; that the order does not matter; that they have only three minutes to study; and that they will later be asked to use what they have learned to categorize more instances.
2. After three minutes the sheet is taken away. To prevent active maintenance of the training items in working memory, the participant performs a filler task consisting of ten two-digit addition/subtraction questions.
3. The participant is given another sheet with the same {x_i^{(j)}}_{i=1}^n without labels, presented in randomized order. The participant is not told that these are the same training instances, is encouraged to guess if necessary, and has no time limit.

Finally, a post-experiment interview is conducted in which the subject reports any insights or hypotheses they may have about the categories.

Slide 20

Experimental setup

⊚ Materials: For simplicity, P_X is uniform in all experiments.
  1) The "Shape" domain: X consists of 321 computer-generated 3D shapes, parametrized by a real number x ∈ [0, 1] such that small x produces spiky shapes while large x produces smooth ones.
  2) The "Word" domain: X consists of 321 English words. Based on the Wisconsin Perceptual Attribute Ratings Database, the words are sorted by emotional valence; the 161 most positive and the 160 most negative words are used in the experiments.
⊚ Participants: 80 undergraduate students, participating for partial course credit, divided evenly into eight groups. Each group of m = 10 subjects worked on a unique combination of domain (Shape or Word) and training sample size n ∈ {5, 10, 20, 40}.

Slide 21

Experimental results

[Figure only; not recoverable from the extraction.]

Slide 22

⊚ Observation 1: Human Rademacher complexities in both domains decrease as n increases.
  • When n = 5, one subject thought the shape categories were determined by whether the shape faces downward; another thought the word categories indicated whether the word contains the letter T.
  • When n = 40, about half the participants believed the labels to be random.
⊚ Observation 2: Human Rademacher complexities are significantly higher in the Word domain than in the Shape domain.
  • One can speculate that Human Rademacher complexities reflect the richness of the participants' pre-existing knowledge about the domain.
⊚ Observation 3: Many of these Human Rademacher complexities are relatively large.
  • This means that humans have a large capacity to learn arbitrary labels.

Slide 23

Human generalization bounds

[Figure only; not recoverable from the extraction.]

Slide 24

Conclusion and discussion

Slide 25

Conclusion and discussion

⊚ This study suggests that the complexity measures of statistical machine learning are useful for analyzing human cognitive ability.
⊚ Human Rademacher complexity may help explain the human tendency to discern patterns in random stimuli:
  • illusory correlations [2];
  • the false memory effect [9].
⊚ Human Rademacher complexity can assist experimental psychologists in assessing the likelihood of overfitting in their stimulus materials.
  • Human Rademacher complexity varies significantly across domains (as seen in the experimental results).

Slide 26

References

[1] Alfonso Caramazza and Michael McCloskey. "The case for single-patient studies". In: Cognitive Neuropsychology 5.5 (1988), pp. 517–527.
[2] Loren J. Chapman. "Illusory correlation in observational report". In: Journal of Verbal Learning and Verbal Behavior 6.1 (1967), pp. 151–155.
[3] Nelson Cowan. "The magical number 4 in short-term memory: A reconsideration of mental storage capacity". In: Behavioral and Brain Sciences 24.1 (2001), pp. 87–114.
[4] Jacob Feldman. "Minimization of Boolean complexity in human concept learning". In: Nature 407.6804 (2000), pp. 630–633.
[5] Michael J. Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[6] George A. Miller. "Some limits on our capacity for processing information". In: Psychological Review 63 (1956), pp. 81–97.

Slide 27

References

[7] George A. Miller. "The magical number seven, plus or minus two: Some limits on our capacity for processing information". In: Psychological Review 63.2 (1956), p. 81.
[8] Randall C. O'Reilly and James L. McClelland. "Hippocampal Conjunctive Encoding, Storage, and Recall: Avoiding a Tradeoff". Parallel Distributed Processing and Cognitive Neuroscience Technical Report PDP.CNS, 1994.
[9] Henry L. Roediger and Kathleen B. McDermott. "Creating false memories: Remembering words not presented in lists". In: Journal of Experimental Psychology: Learning, Memory, and Cognition 21.4 (1995), p. 803.
[10] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.
[11] William D. Wattenmaker et al. "Linear separability and concept learning: Context, relational properties, and concept naturalness". In: Cognitive Psychology 18.2 (1986), pp. 158–194.

Slide 28

References

[12] Jerry Zhu, Bryan Gibson, and Timothy T. Rogers. "Human Rademacher complexity". In: Advances in Neural Information Processing Systems 22 (2009).