
Interpretable Machine Learning 6.3 - Prototypes and Criticisms

himkt
July 06, 2019


Transcript

  1. Interpretable Machine Learning 6.3 Prototypes and Criticisms Makoto Hiramatsu <@himkt> * Figures are taken from the book
  2. Prototypes and Criticisms • A prototype is a data instance that is representative of all the data • A criticism is a data instance that is not well represented by the set of prototypes
  3. Prototypes and Criticisms

  4. Prototypes and Criticisms • Prototypes can be selected manually • There are many approaches to finding prototypes • How do we find criticisms? MMD-critic [Kim+, 2016] • Combines prototypes and criticisms in a single framework
  5. MMD-critic • Maximum Mean Discrepancy: the discrepancy between two distributions 1. Select the number of prototypes and criticisms 2. Find prototypes (greedily) • Selected so that the distribution of the prototypes is close to the data distribution 3. Find criticisms (greedily) • Selected so that the distribution of the criticisms differs from the data distribution
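As a rough illustration of step 2, the greedy prototype search can be sketched in NumPy. This is a minimal sketch, not the paper's optimized algorithm: the function names are my own, a Gaussian RBF kernel with an assumed bandwidth gamma=1.0 is used, and the full MMD² objective is recomputed at every step instead of updated incrementally.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def mmd2(X, Z, gamma=1.0):
    """Squared maximum mean discrepancy between data X and prototypes Z."""
    return rbf(Z, Z, gamma).mean() - 2 * rbf(Z, X, gamma).mean() + rbf(X, X, gamma).mean()

def greedy_prototypes(X, m, gamma=1.0):
    """Greedily pick m data points whose distribution stays close to X's."""
    chosen, remaining = [], list(range(len(X)))
    for _ in range(m):
        # Add the candidate that most reduces MMD^2 to the full data set.
        best = min(remaining, key=lambda i: mmd2(X, X[chosen + [i]], gamma))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

On data with two well-separated clusters, this sketch places one prototype in each cluster rather than two in the same one, since a one-sided prototype set leaves a large discrepancy to the data distribution.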
  6. MMD-critic: Ingredients • Kernel function: estimates the data density • Witness function: tells us how different two distributions are at a particular data point • Search strategy: greedy

  witness(x) = \frac{1}{n} \sum_{i=1}^{n} k(x, x_i) - \frac{1}{m} \sum_{j=1}^{m} k(x, z_j)

  Gaussian RBF: k(x, x') = \exp(-\gamma \|x - x'\|^2), \quad \gamma > 0
  7. MMD-critic: Ingredients • The Gaussian RBF kernel is bounded: 0 \le k(x, x') \le 1, approaching 0 as the two points move infinitely far apart and reaching 1 when they are equal • What would happen to the formula if we used all n data points as prototypes?
  8. Witness function • Positive value at point x: the prototype distribution underestimates the data distribution • Negative value at point x: the prototype distribution overestimates the data distribution • We look for extreme values of the witness function in both the negative and the positive direction
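The sign behaviour above can be checked numerically. A minimal sketch, with my own helper names and a Gaussian RBF kernel with an assumed bandwidth gamma=1.0:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def witness(points, X, Z, gamma=1.0):
    """witness(x) = mean_i k(x, x_i) - mean_j k(x, z_j):
    data density estimate minus prototype density estimate at each point."""
    return rbf(points, X, gamma).mean(axis=1) - rbf(points, Z, gamma).mean(axis=1)
```

Evaluated at a region where the data has mass but the prototypes do not, the witness is positive (prototypes underestimate); where the prototypes concentrate more mass than the data, it is negative.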
  9. MMD² (squared)

  MMD^2 = \frac{1}{m^2} \sum_{i,j=1}^{m} k(z_i, z_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(z_i, x_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(x_i, x_j)

  (first term: proximity between prototypes; second term: proximity between data points and prototypes; third term: proximity between data points)
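The three terms map directly onto means of kernel matrices. A minimal NumPy sketch, with my own function names and a Gaussian RBF kernel with an assumed bandwidth gamma=1.0:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def mmd2(X, Z, gamma=1.0):
    """MMD^2 = prototype/prototype proximity
             - 2 * prototype/data proximity
             + data/data proximity."""
    return rbf(Z, Z, gamma).mean() - 2 * rbf(Z, X, gamma).mean() + rbf(X, X, gamma).mean()
```

A prototype set Z whose distribution matches the data X more closely yields a smaller MMD² value.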
  10. When is MMD² minimized? • Answer: when all data points are prototypes
  11. MMD² behaviors

  12. Advantages • We can focus on typical/edge cases • Remember Google's image classifier problem • Participants performed better when the sets showed prototypes and criticisms instead of random images of a class • This suggests that prototypes and criticisms are informative examples • MMD-critic works with any type of data and any type of machine learning model
  13. Examples: dog breed classification • Left prototypes: dog faces • Left criticisms: images without dog faces, or in different colors • Right prototypes: outdoor images of dogs • Right criticisms: dogs in costumes
  14. Examples: MNIST • Prototypes: various ways of writing the digits • Criticisms: unusually thick or thin strokes, or unrecognizable digits • (Note: not searched with a fixed number per class)
  15. Disadvantages • Hard to choose the number of prototypes and criticisms • The elbow method may be useful • Hard to choose the kernel and its scaling parameter • Disregards the fact that "some features might not be relevant"