Slide 1


Interpretable Machine Learning
6.3 Prototypes and Criticisms
Makoto Hiramatsu <@himkt>
* Figures are taken from the book

Slide 2


Prototypes and Criticisms
• A prototype is a data instance that is representative of all the data
• A criticism is a data instance that is not well represented by the set of prototypes

Slide 3


Prototypes and Criticisms (figure)

Slide 4


Prototypes and Criticisms
• Prototypes are often selected manually
• There are many approaches to finding prototypes
• How do we find criticisms? MMD-critic [Kim+, 2016]
• It combines prototypes and criticisms in a single framework

Slide 5


MMD-critic
• Maximum Mean Discrepancy: a measure of the discrepancy between two distributions
1. Select the number of prototypes and criticisms
2. Find prototypes (by greedy search)
 • Selected so that the distribution of the prototypes is close to the data distribution
3. Find criticisms (by greedy search)
 • Selected so that the distribution of the criticisms differs from the data distribution
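The prototype step above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the function names (`rbf_kernel`, `mmd2`, `greedy_prototypes`) and the brute-force greedy loop are mine, assuming a Gaussian RBF kernel.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian RBF kernel evaluated pairwise: k(a, b) = exp(-gamma * ||a - b||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(X, Z, gamma=1.0):
    # Squared Maximum Mean Discrepancy between data X (n, d) and prototypes Z (m, d)
    n, m = len(X), len(Z)
    return (rbf_kernel(Z, Z, gamma).sum() / m**2
            - 2.0 * rbf_kernel(Z, X, gamma).sum() / (m * n)
            + rbf_kernel(X, X, gamma).sum() / n**2)

def greedy_prototypes(X, num_prototypes, gamma=1.0):
    # Step 2: repeatedly add the data point whose inclusion yields the
    # lowest MMD^2 between the prototype set and the full data set.
    chosen, remaining = [], list(range(len(X)))
    for _ in range(num_prototypes):
        best = min(remaining, key=lambda i: mmd2(X, X[chosen + [i]], gamma))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The criticism step works analogously but scores candidates with the witness function (plus a diversity regularizer in the paper), so it is omitted here.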

Slide 6


MMD-critic: Ingredients
• Kernel function: estimates the data density
• Witness function: tells us how different two distributions are at a particular data point
• Search strategy: greedy search

witness(x) = (1/n) Σ_{i=1}^{n} k(x, x_i) − (1/m) Σ_{j=1}^{m} k(x, z_j)

Gaussian RBF kernel: k(x, x') = exp(−γ ||x − x'||²)

Slide 7


MMD-critic: Ingredients
• Kernel function: estimates the data density
• Witness function: tells us how different two distributions are at a particular data point
• Search strategy: greedy search

witness(x) = (1/n) Σ_{i=1}^{n} k(x, x_i) − (1/m) Σ_{j=1}^{m} k(x, z_j), with 0 ≤ k(x, x') ≤ 1
• k(x, x') → 0 when x and x' are infinitely far apart; k(x, x') = 1 when they are equal
• Question: what would happen to the formula if we used all n data points as prototypes?

Gaussian RBF kernel: k(x, x') = exp(−γ ||x − x'||²)
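The witness function is simple enough to write down directly. A toy NumPy sketch, assuming the Gaussian RBF kernel from this slide (gamma is a free bandwidth parameter; the function name is mine):

```python
import numpy as np

def witness(x, X, Z, gamma=1.0):
    """witness(x) = (1/n) sum_i k(x, x_i) - (1/m) sum_j k(x, z_j).

    x: query point (d,); X: data points (n, d); Z: prototypes (m, d).
    """
    k = lambda A: np.exp(-gamma * ((A - x) ** 2).sum(axis=-1))
    return k(X).mean() - k(Z).mean()
```

When the prototypes coincide with the data, the two averages are identical and the witness is zero everywhere, which previews the question above.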

Slide 8


Witness function
• Positive value at point x: the prototype distribution underestimates the data distribution
• Negative value at point x: the prototype distribution overestimates the data distribution
• We look for extreme values of the witness function in both directions, positive and negative

Slide 9


MMD² (squared Maximum Mean Discrepancy)

MMD² = (1/m²) Σ_{i,j=1}^{m} k(z_i, z_j) − (2/mn) Σ_{i=1}^{m} Σ_{j=1}^{n} k(z_i, x_j) + (1/n²) Σ_{i,j=1}^{n} k(x_i, x_j)

• First term: proximity between prototypes
• Second term: proximity between data points and prototypes
• Third term: proximity between data points
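To see what each term contributes, the three sums can be computed separately. A sketch in NumPy, assuming a Gaussian RBF kernel (`mmd2_terms` is an illustrative name, not from the paper):

```python
import numpy as np

def gaussian_k(A, B, gamma=1.0):
    # Pairwise Gaussian RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

def mmd2_terms(X, Z, gamma=1.0):
    # Returns MMD^2 and its three terms:
    #   t1: proximity between prototypes
    #   t2: proximity between data points and prototypes
    #   t3: proximity between data points
    n, m = len(X), len(Z)
    t1 = gaussian_k(Z, Z, gamma).sum() / m**2
    t2 = 2.0 * gaussian_k(Z, X, gamma).sum() / (m * n)
    t3 = gaussian_k(X, X, gamma).sum() / n**2
    return t1 - t2 + t3, (t1, t2, t3)
```

Because this biased estimator equals a squared RKHS norm of the difference of mean embeddings, the result is always non-negative.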

Slide 10


When is MMD² minimized?
• Answer: when all data points are prototypes (then MMD² = 0)
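To see why, substitute z_j = x_j and m = n into the MMD² formula; all three terms reduce to the same double sum and cancel:

```latex
\mathrm{MMD}^2
  = \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j)
  - \frac{2}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j)
  + \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j)
  = 0
```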

Slide 11


MMD² behaviors (figure)

Slide 12


Advantages
• We can focus on typical cases and edge cases
• Remember Google's image classifier problem
• Participants performed better when the sets showed prototypes and criticisms instead of random images of the class
• This suggests that prototypes and criticisms are informative examples
• MMD-critic works with any type of data and any type of machine learning model

Slide 13


Examples: dog breed classification
• Left prototypes: dog faces
• Left criticisms: images without dog faces, or in different colors
• Right prototypes: outdoor images of dogs
• Right criticisms: dogs in costumes

Slide 14


Examples: MNIST
• Prototypes: various ways of writing the digits
• Criticisms: unusually thick or thin strokes, and unrecognizable digits
• (Note: the search was not done with a fixed number per class)

Slide 15


Disadvantages
• Hard to choose the number of prototypes and criticisms
• Perhaps the elbow method is useful?
• Hard to choose the kernel and its scaling parameter
• Disregards the fact that some features might not be relevant