
Paper Intro: Human Rademacher Complexity


Masanari Kimura

August 21, 2023

Transcript

  1. Human Rademacher Complexity
     Created by: Masanari Kimura
     Institute: The Graduate University for Advanced Studies, SOKENDAI
     Dept: Department of Statistical Science, School of Multidisciplinary Sciences
     E-mail: [email protected]
  2. TL;DR
     ⊚ NIPS 2009 [12];
     ⊚ In statistical learning theory, Rademacher complexity is one of the complexity measures of a function class and induces generalization bounds;
     ⊚ This work proposes to use Rademacher complexity as a measure of human learning capacity.
  3. Introduction
     ⊚ Capacity is one of the main research questions in cognitive psychology:
       • How much information can humans hold [7, 6, 3]?
       • What kinds of functions can humans easily acquire [11, 4]?
       • How do humans avoid over-fitting [8]?
     ⊚ In statistical learning theory, there are several notions of the capacity of a function class:
       • Vapnik-Chervonenkis (VC) dimension [10];
       • Rademacher complexity [5].
     ⊚ These capacity notions provide generalization bounds and bounds on the probability of over-fitting;
     ⊚ Q. Are these capacity measures useful for evaluating human cognitive ability?
  4. Notations
     ⊚ $\mathcal{X}$: input space;
     ⊚ $x \in \mathcal{X}$: an instance from the input space;
     ⊚ $P_X$: underlying marginal distribution on $\mathcal{X}$;
     ⊚ $\mathcal{F}$: hypothesis space;
     ⊚ $f: \mathcal{X} \to \mathbb{R}$: a real-valued function (hypothesis).
  5. Rademacher complexity
     ⊚ Consider a sample of $n$ instances $x_1, \dots, x_n$ drawn i.i.d. from $P_X$;
     ⊚ Generate $n$ i.i.d. random variables $\sigma_1, \dots, \sigma_n \in \{-1, +1\}$.

     Definition (Rademacher complexity). For a set of real-valued functions $\mathcal{F}$ with input space $\mathcal{X}$, a distribution $P_X$ on $\mathcal{X}$, and sample size $n$, the Rademacher complexity $R(\mathcal{F}, \mathcal{X}, P_X, n)$ is

     $$R(\mathcal{F}, \mathcal{X}, P_X, n) = \mathbb{E}_{\substack{x_1, \dots, x_n \sim P_X \\ \sigma_1, \dots, \sigma_n \sim \mathrm{Ber}(1/2)}} \left[ \sup_{f \in \mathcal{F}} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \right], \tag{1}$$

     where $\sigma_1, \dots, \sigma_n \sim \mathrm{Ber}(1/2)$ take values $\pm 1$.
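To make Eq. (1) concrete, here is a minimal Monte Carlo sketch in Python. The estimator itself is generic; the helper `sup_thresholds`, which evaluates the supremum for the toy class of one-dimensional threshold classifiers on $[0, 1]$, is our own illustrative choice and not from the paper.

```python
import numpy as np

def rademacher_complexity(sample_fn, sup_fn, n, n_trials=2000, seed=0):
    """Monte Carlo estimate of R(F, X, P_X, n) from Eq. (1).

    sample_fn(rng, n) draws n instances from P_X;
    sup_fn(x, sigma)  returns sup_{f in F} |(2/n) * sum_i sigma_i * f(x_i)|.
    """
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        x = sample_fn(rng, n)                # x_1, ..., x_n ~ P_X
        sigma = rng.choice([-1, 1], size=n)  # sigma_i ~ Ber(1/2) on {-1, +1}
        vals.append(sup_fn(x, sigma))
    return float(np.mean(vals))

def sup_thresholds(x, sigma):
    """Supremum over f_t(x) = +1 if x >= t else -1, and their negations."""
    n = len(x)
    s = sigma[np.argsort(x)]
    # With the k smallest points labeled -1 and the rest +1,
    # sum_i sigma_i * f_t(x_i) = total - 2 * prefix[k], for k = 0..n.
    prefix = np.concatenate([[0], np.cumsum(s)])
    corr = prefix[-1] - 2 * prefix
    return 2.0 / n * np.max(np.abs(corr))  # abs() covers the negated class
```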
  6. ⊚ Rademacher complexity measures how easily $\mathcal{F}$ fits random $\pm 1$ labels:
       • Flexible function class $\mathcal{F}$ ⇒ high complexity;
       • Inflexible function class $\mathcal{F}$ ⇒ low complexity.
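A quick illustration of this flexible/inflexible contrast, reusing the estimator sketched above; the three function classes here are toy choices for the sketch, not the paper's.

```python
import numpy as np

# Inflexible extreme: a single constant function f(x) = +1.
sup_const = lambda x, sigma: abs(2.0 / len(x) * sigma.sum())
# Flexible extreme: all functions X -> {-1, +1}; any labeling is fit exactly.
sup_all = lambda x, sigma: 2.0

uniform = lambda rng, n: rng.uniform(0.0, 1.0, size=n)
for n in (5, 10, 20, 40):
    print(n,
          rademacher_complexity(uniform, sup_const, n),       # ~ 2*sqrt(2/(pi*n)), shrinks with n
          rademacher_complexity(uniform, sup_thresholds, n),  # intermediate
          rademacher_complexity(uniform, sup_all, n))         # always 2, never shrinks
```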
  7. Empirical estimation of Rademacher complexity
     ⊚ We can estimate the Rademacher complexity from $m$ random samples $\{x_i^{(1)}, \sigma_i^{(1)}\}_{i=1}^{n}, \dots, \{x_i^{(m)}, \sigma_i^{(m)}\}_{i=1}^{n}$.
     ⊚ From McDiarmid's inequality, we have the following theorem.

     Theorem. Let $\mathcal{F}$ be a set of functions mapping to $[-1, 1]$. For any integers $n$ and $m$, we have

     $$\mathbb{P}\left[ \left| R(\mathcal{F}, \mathcal{X}, P_X, n) - \frac{1}{m} \sum_{j=1}^{m} \sup_{f \in \mathcal{F}} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i^{(j)} f(x_i^{(j)}) \right| \right| \geq \epsilon \right] \leq 2 \exp\left\{ -\frac{\epsilon^2 n m}{8} \right\}. \tag{2}$$
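One practical reading of Eq. (2): inverting the tail bound tells us how many replications $m$ are needed for a target accuracy. A small sketch (the function name is ours):

```python
import math

def replications_needed(eps, delta, n):
    """Smallest m with 2 * exp(-eps**2 * n * m / 8) <= delta, by inverting Eq. (2)."""
    return math.ceil(8.0 * math.log(2.0 / delta) / (eps ** 2 * n))

# To estimate R within eps = 0.25 with 95% confidence at n = 40:
print(replications_needed(0.25, 0.05, 40))  # -> 12
```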
  8. Generalization error bounds

     Theorem. Let $\mathcal{F}$ be a set of functions mapping $\mathcal{X}$ to $\{-1, 1\}$. Let $P_{XY}$ be a probability distribution on $\mathcal{X} \times \{-1, 1\}$ with marginal distribution $P_X$ on $\mathcal{X}$, and let $\{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P_{XY}$ be a training sample of size $n$. For any $\delta > 0$, with probability at least $1 - \delta$, every function $f \in \mathcal{F}$ satisfies

     $$e(f) - \hat{e}(f) \leq \frac{R(\mathcal{F}, \mathcal{X}, P_X, n)}{2} + \sqrt{\frac{\ln(1/\delta)}{2n}}, \tag{3}$$

     where $e(f) \coloneqq \mathbb{E}_{(x,y) \sim P_{XY}}[\mathbb{1}\{y \neq f(x)\}]$ is the true error and $\hat{e}(f) \coloneqq \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{y_i \neq f(x_i)\}$ is the training error.
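Plugging numbers into Eq. (3) shows how an estimated complexity translates into a guaranteed error gap; the value of R below is purely illustrative.

```python
import math

def generalization_gap_bound(R, n, delta):
    """Bound on e(f) - e_hat(f) from Eq. (3), holding with probability >= 1 - delta."""
    return R / 2.0 + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# With a hypothetical estimated complexity R = 0.5 at n = 40, delta = 0.05:
print(generalization_gap_bound(0.5, 40, 0.05))  # ~ 0.25 + 0.194 = 0.444
```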
  9. ⊚ Goal: measure the Rademacher complexity of the human learning system.
     ⊚ $H_\alpha$: the set of functions $\mathcal{F}$ that an average human subject can come up with in the experiments.
     ⊚ Two assumptions:
       • Universality [1]: every individual has the same $H_\alpha$;
       • Computability of the supremum over $H_\alpha$: when making classification judgments, participants use the best function at their disposal.
     ⊚ ⇒ Participants are doing their best to perform the task.
  10. Computation of Human Rademacher complexity
      ⊚ Each participant is presented with a training sample $\{(x_i, \sigma_i)\}_{i=1}^{n}$.
      ⊚ They are asked to learn the instance-label mapping.
        • The subject is not told that the labels are random.
      ⊚ Assume that the subject searches within $H_\alpha$ for the rule minimizing training error: $f^* = \mathrm{argmax}_{f \in H_\alpha} \sum_{i=1}^{n} \sigma_i f(x_i) = \mathrm{argmin}_{f \in H_\alpha} \hat{e}(f)$.
      ⊚ Later, the subject is asked to classify the same training instances $\{x_i\}_{i=1}^{n}$, and the supremum is approximated as

      $$\sup_{f \in H_\alpha} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \approx \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f^*(x_i) \right|. \tag{4}$$
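Scoring a single participant then reduces to Eq. (4): correlate the random labels shown during training with the labels the participant later assigns to the same instances. A minimal sketch, with variable names of our own choosing:

```python
import numpy as np

def subject_complexity_estimate(sigma, responses):
    """Eq. (4): |(2/n) * sum_i sigma_i * f*(x_i)|.

    sigma[i]     -- the random +/-1 label shown for training instance x_i;
    responses[i] -- the +/-1 label the participant assigns to x_i when shown
                    the same instance again without labels.
    """
    sigma = np.asarray(sigma, dtype=float)
    responses = np.asarray(responses, dtype=float)
    return abs(2.0 / len(sigma) * np.dot(sigma, responses))

# A participant who reproduces all n random labels scores 2 (perfect fit);
# one who answers at chance scores near 0 in expectation.
```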
  11. Given a domain $\mathcal{X}$, a distribution $P_X$, a training sample size $n$, and a number of subjects $m$, generate $\{(x_i^{(1)}, \sigma_i^{(1)})\}_{i=1}^{n}, \dots, \{(x_i^{(m)}, \sigma_i^{(m)})\}_{i=1}^{n}$, where $x_i^{(j)} \overset{\text{i.i.d.}}{\sim} P_X$ and $\sigma_i^{(j)} \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(1/2)$ with values $\pm 1$.
      1. Participant $j$ is shown $\{(x_i^{(j)}, \sigma_i^{(j)})\}_{i=1}^{n}$. The participant is informed that there are only two categories; that the order does not matter; that they have only three minutes to study; and that they will later be asked to use what they have learned to categorize more instances.
      2. After three minutes the sheet is taken away. To prevent active maintenance of the training items in working memory, the participant performs a filler task consisting of ten two-digit addition/subtraction questions.
      3. The participant is given another sheet with the same $\{x_i^{(j)}\}_{i=1}^{n}$ without labels, in randomized order. The participant is not told that these are the same training instances, is encouraged to guess if necessary, and has no time limit. A post-experiment interview is conducted in which the subject reports any insights or hypotheses they may have about the categories.
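The generation step at the top of this protocol might look like the following sketch, assuming a uniform $P_X$ over a finite stimulus pool as in the paper's experiments (the function name is ours). Averaging the per-subject scores from Eq. (4) over the $m$ participants then gives the Human Rademacher complexity estimate.

```python
import numpy as np

def generate_training_samples(instances, n, m, seed=0):
    """For each of m participants, draw n instances i.i.d. from the stimulus
    pool together with independent random +/-1 labels."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(m):
        x = rng.choice(instances, size=n)      # x_i ~ P_X (uniform over the pool)
        sigma = rng.choice([-1, 1], size=n)    # sigma_i ~ Ber(1/2) on {-1, +1}
        samples.append(list(zip(x, sigma)))
    return samples
```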
  12. Experimental setup
      ⊚ Materials: for simplicity, $P_X$ is uniform in all experiments.
        1) The "Shape" domain: $\mathcal{X}$ consists of 321 computer-generated 3D shapes, parametrized by a real number $x \in [0, 1]$ such that small $x$ produces spiky shapes and large $x$ produces smooth ones.
        2) The "Word" domain: $\mathcal{X}$ consists of 321 English words. Based on the Wisconsin Perceptual Attribute Ratings Database, the words are sorted by emotional valence; the 161 most positive and the 160 most negative are used in the experiments.
      ⊚ Participants: 80 undergraduate students, participating for partial course credit, divided evenly into eight groups. Each group of $m = 10$ subjects worked on a unique combination of domain (Shape or Word) and training sample size $n \in \{5, 10, 20, 40\}$.
  13. ⊚ Observation 1: Human Rademacher complexities in both domains decrease as $n$ increases.
        • When $n = 5$, one subject thought the shape categories were determined by whether the shape faces downward; another thought the word categories indicated whether the word contains the letter T.
        • When $n = 40$, about half the participants believed the labels to be random.
      ⊚ Observation 2: Human Rademacher complexities are significantly higher in the Word domain than in the Shape domain.
        • One can speculate that Human Rademacher complexities reflect the richness of the participants' pre-existing knowledge about the domain.
      ⊚ Observation 3: Many of these Human Rademacher complexities are relatively large.
        • This suggests that humans have a large capacity to fit arbitrary labels.
  14. Conclusion and discussion
      ⊚ This study suggests that complexity measures from statistical machine learning are useful for analyzing human cognitive ability.
      ⊚ Human Rademacher complexity may help explain the human tendency to discern patterns in random stimuli:
        • illusory correlations [2];
        • the false memory effect [9].
      ⊚ Human Rademacher complexity can assist experimental psychologists in assessing the likelihood of over-fitting in their stimulus materials.
        • Experimentally, Human Rademacher complexity varies significantly across domains.
  15. References
      [1] Alfonso Caramazza and Michael McCloskey. "The case for single-patient studies". In: Cognitive Neuropsychology 5.5 (1988), pp. 517–527.
      [2] Loren J. Chapman. "Illusory correlation in observational report". In: Journal of Verbal Learning and Verbal Behavior 6.1 (1967), pp. 151–155.
      [3] Nelson Cowan. "The magical number 4 in short-term memory: A reconsideration of mental storage capacity". In: Behavioral and Brain Sciences 24.1 (2001), pp. 87–114.
      [4] Jacob Feldman. "Minimization of Boolean complexity in human concept learning". In: Nature 407.6804 (2000), pp. 630–633.
      [5] Michael J. Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
      [6] George A. Miller. "Some limits on our capacity for processing information". In: Psychological Review 63 (1956), pp. 81–97.
  16. References
      [7] George A. Miller. "The magical number seven, plus or minus two: Some limits on our capacity for processing information". In: Psychological Review 63.2 (1956), p. 81.
      [8] Randall C. O'Reilly and James L. McClelland. "Hippocampal conjunctive encoding, storage, and recall: Avoiding a tradeoff". Parallel Distributed Processing and Cognitive Neuroscience Technical Report PDP.CNS.94.4, 1994.
      [9] Henry L. Roediger and Kathleen B. McDermott. "Creating false memories: Remembering words not presented in lists". In: Journal of Experimental Psychology: Learning, Memory, and Cognition 21.4 (1995), p. 803.
      [10] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.
      [11] William D. Wattenmaker et al. "Linear separability and concept learning: Context, relational properties, and concept naturalness". In: Cognitive Psychology 18.2 (1986), pp. 158–194.
  17. References
      [12] Jerry Zhu, Bryan Gibson, and Timothy T. Rogers. "Human Rademacher complexity". In: Advances in Neural Information Processing Systems 22 (2009).