
Interpretable Machine Learning 6.3 - Prototypes and Criticisms

himkt
July 06, 2019


Transcript

  1. Interpretable Machine Learning 6.3 Prototypes and Criticisms Makoto Hiramatsu <@himkt> * Figures are taken from the book
  2. Prototypes and Criticisms • A prototype is a data instance that is representative of all the data • A criticism is a data instance that is not well represented by the set of prototypes
  3. Prototypes and Criticisms

  4. Prototypes and Criticisms • Prototypes can be selected manually • There are many approaches to finding prototypes • How do we find criticisms? MMD-critic [Kim+, 2016] • Combines prototypes and criticisms in a single framework
  5. MMD-critic • Maximum Mean Discrepancy: the discrepancy between two distributions 1. Select the number of prototypes and criticisms 2. Find prototypes (greedily) • Selected so that the distribution of the prototypes is close to the data distribution 3. Find criticisms (greedily) • Selected so that the distribution of the criticisms differs from the data distribution
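As a rough illustration of step 2, the greedy prototype search can be sketched in NumPy. This is a minimal sketch, not the paper's optimized algorithm: the function names are my own, a Gaussian RBF kernel with an assumed bandwidth gamma=1.0 is used, and the full MMD² objective is recomputed at every step instead of updated incrementally.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def mmd2(X, Z, gamma=1.0):
    """Squared maximum mean discrepancy between data X and prototypes Z."""
    return rbf(Z, Z, gamma).mean() - 2 * rbf(Z, X, gamma).mean() + rbf(X, X, gamma).mean()

def greedy_prototypes(X, m, gamma=1.0):
    """Greedily pick m data points whose distribution stays close to X's."""
    chosen, remaining = [], list(range(len(X)))
    for _ in range(m):
        # Add the candidate that most reduces MMD^2 to the full data set.
        best = min(remaining, key=lambda i: mmd2(X, X[chosen + [i]], gamma))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

On data with two well-separated clusters, this sketch places one prototype in each cluster rather than two in the same one, since a one-sided prototype set leaves a large discrepancy to the data distribution.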
  6. MMD-critic: Ingredients • Kernel function: estimates the data density • Witness function: tells us how different two distributions are at a particular data point • Search strategy: greedy

  witness(x) = \frac{1}{n} \sum_{i=1}^{n} k(x, x_i) - \frac{1}{m} \sum_{j=1}^{m} k(x, z_j)

  Gaussian RBF: k(x, x') = \exp(-\gamma \|x - x'\|^2), \quad \gamma > 0
  7. MMD-critic: Ingredients • The Gaussian RBF kernel is bounded: 0 \le k(x, x') \le 1, approaching 0 as the two points move infinitely far apart and reaching 1 when they are equal • What would happen to the formula if we used all n data points as prototypes?
  8. Witness function • Positive value at point x: the prototype distribution underestimates the data distribution • Negative value at point x: the prototype distribution overestimates the data distribution • We look for extreme values of the witness function in both the negative and the positive direction
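The sign behaviour above can be checked numerically. A minimal sketch, with my own helper names and a Gaussian RBF kernel with an assumed bandwidth gamma=1.0:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def witness(points, X, Z, gamma=1.0):
    """witness(x) = mean_i k(x, x_i) - mean_j k(x, z_j):
    data density estimate minus prototype density estimate at each point."""
    return rbf(points, X, gamma).mean(axis=1) - rbf(points, Z, gamma).mean(axis=1)
```

Evaluated at a region where the data has mass but the prototypes do not, the witness is positive (prototypes underestimate); where the prototypes concentrate more mass than the data, it is negative.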
  9. MMD² (squared)

  MMD^2 = \frac{1}{m^2} \sum_{i,j=1}^{m} k(z_i, z_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(z_i, x_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(x_i, x_j)

  (first term: proximity between prototypes; second term: proximity between data points and prototypes; third term: proximity between data points)
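The three terms map directly onto means of kernel matrices. A minimal NumPy sketch, with my own function names and a Gaussian RBF kernel with an assumed bandwidth gamma=1.0:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def mmd2(X, Z, gamma=1.0):
    """MMD^2 = prototype/prototype proximity
             - 2 * prototype/data proximity
             + data/data proximity."""
    return rbf(Z, Z, gamma).mean() - 2 * rbf(Z, X, gamma).mean() + rbf(X, X, gamma).mean()
```

A prototype set Z whose distribution matches the data X more closely yields a smaller MMD² value.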
  10. When is MMD² minimized? • Answer: when all data points are prototypes
  11. MMD² behaviors

  12. Advantages • We can focus on typical/edge cases • Remember Google's image classifier problem • Participants performed better when the sets showed prototypes and criticisms instead of random images of a class • This suggests that prototypes and criticisms are informative examples • MMD-critic works with any type of data and any type of machine learning model
  13. Examples: dog breed classification • Left prototypes: dog faces • Left criticisms: images without dog faces, or in different colors • Right prototypes: outdoor images of dogs • Right criticisms: dogs in costumes
  14. Examples: MNIST • Prototypes: various ways of writing the digits • Criticisms: unusually thick or thin strokes, or unrecognizable digits • (Note: not searched with a fixed number per class)
  15. Disadvantages • Hard to choose the number of prototypes and criticisms • The elbow method may be useful • Hard to choose the kernel and its scaling parameter • Disregards the fact that "some features might not be relevant"