Been Kim

Don't trust your eyes: on the (un)reliability of feature visualizations

Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau, Wieland Brendel, Been Kim
[ICML 2024]

Concept-based Understanding of Emergent Multi-Agent Behavior

Niko Grupen, Natasha Jaques, Been Kim, Shayegan Omidshafiei
[arxiv]

Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero

TL;DR: Pushing the frontier of human knowledge by developing interpretability tools that teach humans something new. This work gives quantitative evidence that learning from something only machines know (the M-H space) is possible: we discover super-human chess strategies in AlphaZero and teach them to four amazing grandmasters. We measure each grandmaster's baseline performance on positions that invoke a concept; after teaching (being shown AZ's moves), they solve unseen positions that invoke the concept better.

Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim
[arxiv 2023]

Impossibility Theorems for Feature Attribution

TL;DR: We prove theoretically that just because a popular attribution method assigns some attribution X to a feature doesn't mean you can conclude anything about the model's actual behavior.

Blair Bilodeau, Natasha Jaques, Pang Wei Koh, Been Kim
[PNAS 2023]

Socially intelligent machines that learn from humans and help humans learn

TL;DR: We need AI systems that can consider human minds so that they can learn more effectively from humans (as learners) and even help humans learn (as teachers).

Model evaluation for extreme risks

TL;DR: Model evaluation is critical for addressing extreme risks.

Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
[arxiv]

Gaussian Process Probes (GPP) for Uncertainty-Aware Probing

TL;DR: A probing method that also provides epistemic and aleatoric uncertainty about its probes.

Zi Wang, Alexander Ku, Jason Baldridge, Thomas L. Griffiths, Been Kim
[Neurips 2023]
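
A minimal sketch of the general idea (not the paper's implementation, and it does not separate the two kinds of uncertainty): probe frozen representations with a Gaussian process classifier so the probe returns a posterior predictive distribution rather than a point estimate. All data below are stand-ins.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    reps = rng.normal(size=(200, 32))          # stand-in for frozen layer activations
    labels = (reps[:, 0] > 0).astype(int)      # stand-in for the probed property

    probe = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
    probe.fit(reps[:150], labels[:150])
    proba = probe.predict_proba(reps[150:])    # posterior predictive probabilities
    # Low confidence here can come from noise in the property (aleatoric) or from
    # too little probing data (epistemic); GPP separates the two, which this sketch does not.
    print(proba[:5])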

State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding

TL;DR: Protégé effect: use a joint embedding model to 1) inform RL reward shaping and 2) provide explanations that improve users' task performance.

Devleena Das, Sonia Chernova, Been Kim
[Neurips 2023]

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

TL;DR: Surprisingly, localization (where a fact is stored) in an LLM has no correlation with editing success.

Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
[Neurips 2023]

On the Relationship Between Explanation and Prediction: A Causal View

TL;DR: There isn't much of one.

Amir-Hossein Karimi, Krikamol Muandet, Simon Kornblith, Bernhard Schölkopf, Been Kim
[ICML 2023]

Subgoal-based explanations for unreliable intelligent decision support systems

TL;DR: Even when explanations are not perfect, some types of explanations (subgoal-based) can be helpful for training humans in complex tasks.

Devleena Das, Been Kim, Sonia Chernova
[IUI 2023]

Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis

TL;DR: Treat neural networks as if they were a new species in the wild. Conduct an observational study to learn the emergent behaviors of the multi-agent RL system.

Shayegan Omidshafiei, Andrei Kapishnikov, Yannick Assogba, Lucas Dixon, Been Kim
[Neurips 2022]

Mood board search & CAV Camera

TL;DR: Together with artists, designers, and ML experts, we experiment with ways in which machine learning can inspire creativity, especially in photography. We open-sourced the back end and published an Android app.

Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation

TL;DR: If you know what types of spurious correlations your model may have, you can test for them using existing methods. But if you don't know what they are in advance, you can't test for them, and many existing interpretability methods can't help you detect them either.

Julius Adebayo, Michael Muelly, Hal Abelson, Been Kim
[ICLR 2022]

DISSECT: Disentangled Simultaneous Explanations via Concept Traversals

TL;DR: Can we automatically learn concepts that are relevant to a prediction (e.g., pigmentation), and generate a new set of images that follow the concept trajectory (more or less of the concept)? Yes.

Asma Ghandeharioun, Been Kim, Chun-Liang Li, Brendan Jou, Brian Eoff, Rosalind W. Picard
[ICLR 2022]

Acquisition of Chess Knowledge in AlphaZero

TL;DR: How does AlphaZero, the super-human self-taught chess-playing machine, learn to play chess, and what can we learn about chess from it? We investigate the emergence of human concepts in AlphaZero and the evolution of its play over training.

Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Demis Hassabis, Been Kim, Ulrich Paquet, Vladimir Kramnik
[PNAS] [visualization]

Machine Learning Techniques for Accountability

TL;DR: Pros and cons of accountability methods

Been Kim, Finale Doshi-Velez
[PDF]

Do Neural Networks Show Gestalt Phenomena? An Exploration of the Law of Closure

TL;DR: They do. And this might be related to how NNs generalize.

On Completeness-aware Concept-Based Explanations in Deep Neural Networks

TL;DR: Let's find a set of concepts that is "sufficient" to explain predictions.

Chih-Kuan Yeh, Been Kim, Sercan O. Arik, Chun-Liang Li, Tomas Pfister, Pradeep Ravikumar
[Neurips 20]

Debugging Tests for Model Explanations

TL;DR: Sanity Checks, part 2.

Julius Adebayo, Michael Muelly, Ilaria Liccardi, Been Kim
[Neurips 20]

Concept Bottleneck Models

TL;DR: Build a model with concepts built in so that you can control the concepts that influence its predictions.

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang
[ICML 20] [Featured at Google Research review 2020]
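
A minimal sketch of the idea (not the authors' code; the sizes and names are made up): the network first predicts human-interpretable concepts, and the label is computed only from those concepts, so you can inspect a concept or override it at test time.

    import torch
    import torch.nn as nn

    class ConceptBottleneck(nn.Module):
        def __init__(self, n_features=64, n_concepts=8, n_classes=3):
            super().__init__()
            self.to_concepts = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                             nn.Linear(32, n_concepts))
            self.to_label = nn.Linear(n_concepts, n_classes)

        def forward(self, x, concept_override=None):
            concepts = torch.sigmoid(self.to_concepts(x))
            if concept_override is not None:   # intervene on the concepts at test time
                concepts = concept_override
            return self.to_label(concepts), concepts

    model = ConceptBottleneck()
    logits, concepts = model(torch.randn(4, 64))
    # training uses a concept loss on `concepts` plus a label loss on `logits`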

Explaining Classifiers with Causal Concept Effect (CaCE)

TL;DR: Make TCAV causal.

Yash Goyal, Amir Feder, Uri Shalit, Been Kim
[arxiv]

Towards Automatic Concept-based Explanations

TL;DR: Automatically discover high-level concepts that explain a model's prediction.

Amirata Ghorbani, James Wexler, James Zou, Been Kim
[Neurips 19] [code]
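
A rough sketch of the recipe (details differ from the paper; the embeddings below are stand-ins): segment the class images, embed each segment with the model, and cluster the embeddings; each cluster is a candidate concept that can then be scored, e.g. with TCAV.

    import numpy as np
    from sklearn.cluster import KMeans

    segment_embeddings = np.random.default_rng(0).normal(size=(500, 128))  # stand-in
    clusters = KMeans(n_clusters=10, n_init=10).fit_predict(segment_embeddings)
    # segments sharing a cluster id form one discovered concept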

BIM: Towards Quantitative Evaluation of Interpretability Methods with Ground Truth

TL;DR: Datasets, models, and metrics to quantitatively evaluate your interpretability methods against ground truth. We compare many widely used methods and report their rankings.

Sherry Yang, Been Kim
[arxiv] [code]

Visualizing and Measuring the Geometry of BERT

TL;DR: Studying the geometry of BERT's representations to gain insight into its impressive performance.

Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, Martin Wattenberg
[Neurips 19] [blog post]

Evaluating Feature Importance Estimates

TL;DR: One way to evaluate attribution methods: remove the inputs they deem important, retrain the model, and measure how much accuracy drops (ROAR).

Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim
[Neurips 19]
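
A rough sketch of the remove-and-retrain step, assuming this is the ROAR protocol (the function name and constants are made up): blank out the most-attributed features, retrain from scratch, and compare the accuracy drop against removing random features.

    import numpy as np

    def degrade(X, attributions, frac=0.3, fill=0.0):
        """Replace each example's top-`frac` most-attributed features with a constant."""
        X = X.copy()
        k = int(frac * X.shape[1])
        top = np.argsort(-np.abs(attributions), axis=1)[:, :k]  # per-example top-k features
        np.put_along_axis(X, top, fill, axis=1)
        return X  # retrain a fresh model on this data and measure the accuracy drop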

Human Evaluation of Models Built for Interpretability

TL;DR: Which factors of an explanation matter for interpretability, and in what settings? A large-scale human study to answer this question.

Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel Gershman and Finale Doshi-Velez
[HCOMP 19] (best paper honorable mention)

Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making

TL;DR: A tool that helps doctors navigate medical images using medically relevant similarities. This work uses part of the TCAV idea to sort images by concepts.

Carrie J. Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S. Corrado, Martin C. Stumpe, Michael Terry
CHI 2019 (best paper honorable mention)
[pdf]

Interpreting Black Box Predictions using Fisher Kernels

TL;DR: Answering "which training examples are most responsible for a given set of predictions?" A follow-up to MMD-critic [NIPS 16]; the difference is that now we pick examples informed by how the classifier sees them!

Rajiv Khanna, Been Kim, Joydeep Ghosh, Oluwasanmi Koyejo
[AISTATS 2019]

To Trust Or Not To Trust A Classifier

TL;DR: A very simple method that tells you whether to trust a prediction or not, and that happens to also have nice theoretical properties!

Heinrich Jiang, Been Kim, Melody Guan, Maya Gupta
[Neurips 2018] [code]
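
A minimal sketch of the core ratio (the paper filters to high-density points and adds theory; the data below are stand-ins): compare the distance to the predicted class against the distance to the nearest other class, and distrust predictions with a small ratio.

    import numpy as np

    def trust_score(x, train_X, train_y, predicted_label):
        d_pred = np.linalg.norm(train_X[train_y == predicted_label] - x, axis=1).min()
        d_other = np.linalg.norm(train_X[train_y != predicted_label] - x, axis=1).min()
        return d_other / (d_pred + 1e-12)   # small => the prediction is suspect

    rng = np.random.default_rng(0)
    train_X, train_y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
    print(trust_score(rng.normal(size=5), train_X, train_y, predicted_label=1))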

Human-in-the-Loop Interpretability Prior

TL;DR: Ask humans which models are more interpretable DURING the model training. This gives us a more interpretable model for the end-task.

Isaac Lage, Andrew Slavin Ross, Been Kim, Samuel J. Gershman, Finale Doshi-Velez
[Neurips 2018]

Sanity Checks for Saliency Maps

TL;DR: Saliency maps are popular post-training interpretability methods that claim to show the 'evidence' behind predictions. It turns out some of them have little to do with the model's predictions: saliency maps produced from a trained network and from a randomly initialized network (whose predictions are random) can be visually indistinguishable.

Julius Adebayo, Justin Gilmer, Ian Goodfellow, Moritz Hardt, Been Kim
[Neurips 18]
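
A minimal sketch of the model-randomization test (not the paper's exact protocol; the tiny model is a stand-in): compute a gradient saliency map for the trained model and for a copy with re-initialized weights, then compare them. High similarity means the map is insensitive to what the model learned.

    import copy
    import torch
    import torch.nn as nn

    def input_gradient(model, x, target):
        x = x.clone().requires_grad_(True)
        model(x)[0, target].backward()
        return x.grad.abs()

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in "trained" model
    random_model = copy.deepcopy(model)
    for p in random_model.parameters():
        nn.init.normal_(p)                   # randomize the weights

    x = torch.randn(1, 1, 28, 28)
    sal_trained = input_gradient(model, x, target=3)
    sal_random = input_gradient(random_model, x, target=3)
    print(torch.nn.functional.cosine_similarity(
        sal_trained.flatten(), sal_random.flatten(), dim=0))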

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

TL;DR: We can learn to represent human concepts in any layer of an already-trained neural network, and then ask how important those concepts were for a prediction.

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, Rory Sayres
[ICML 18] [code] [bibtex] [slides]

Sundar Pichai (CEO of Google) presented TCAV as a tool for building AI for everyone in his keynote at Google I/O 2019 [video]
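
A minimal sketch of the TCAV recipe (the activations and gradients are stand-ins, and the statistical testing against random concepts is omitted): learn a linear boundary between activations of concept examples and random examples; its normal vector is the concept activation vector (CAV), and the TCAV score is the fraction of class inputs whose class logit increases in the CAV direction.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    concept_acts = rng.normal(1.0, 1.0, size=(100, 64))   # stand-in layer activations
    random_acts = rng.normal(0.0, 1.0, size=(100, 64))

    clf = LogisticRegression(max_iter=1000).fit(
        np.vstack([concept_acts, random_acts]),
        np.array([1] * 100 + [0] * 100))
    cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])      # the CAV

    # grad_logit[i] = d(class logit)/d(activation) for class example i,
    # obtained from the network by backprop (random stand-ins here)
    grad_logit = rng.normal(size=(50, 64))
    tcav_score = float(np.mean(grad_logit @ cav > 0))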

The (Un)reliability of saliency methods

TL;DR: Existing saliency methods can be unreliable; we can make them show whatever we want by simply adding a constant shift to the input (not even an adversarial one!).

Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, Been Kim
[NIPS workshop 2017 on Explaining and Visualizing Deep Learning] [bibtex]

SmoothGrad: removing noise by adding noise

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, Martin Wattenberg
[ICML workshop on Visualization for deep learning 2017] [code]
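
A minimal sketch of the method (the stand-in model and constants are made up): average gradient saliency over several noisy copies of the input, which visually denoises the map.

    import torch
    import torch.nn as nn

    def smoothgrad(model, x, target, n_samples=25, noise_std=0.15):
        grads = torch.zeros_like(x)
        for _ in range(n_samples):
            noisy = (x + noise_std * torch.randn_like(x)).requires_grad_(True)
            model(noisy)[0, target].backward()
            grads += noisy.grad.abs()
        return grads / n_samples

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in model
    saliency = smoothgrad(model, torch.randn(1, 3, 32, 32), target=0)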

QSAnglyzer: Visual Analytics for Prismatic Analysis of Question Answering System Evaluations

Nan-chen Chen and Been Kim

[VAST 2017]

Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez and Been Kim

Springer Series on Challenges in Machine Learning: "Explainable and Interpretable Models in Computer Vision and Machine Learning" [pdf]

Examples are not Enough, Learn to Criticize! Criticism for Interpretability

Been Kim, Rajiv Khanna and Sanmi Koyejo

[NIPS 16] [NIPS oral slides] [talk video] [code]

Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction

Been Kim, Finale Doshi-Velez and Julie Shah
[NIPS 15] [variational inference in gory detail]

iBCM: Interactive Bayesian Case Model Empowering Humans via Intuitive Interaction

Been Kim, Elena Glassman, Brittney Johnson and Julie Shah
[Chapter X in thesis] [demo video]

Bayesian Case Model:
A Generative Approach for Case-Based Reasoning and Prototype Classification

Been Kim, Cynthia Rudin and Julie Shah

[NIPS 14] [poster] This work was featured on MIT news and MIT front page spotlight.

Scalable and interpretable data representation for high-dimensional complex data

Been Kim, Kayur Patel, Afshin Rostamizadeh and Julie Shah

[AAAI 15]

A Bayesian Generative Modeling with Logic-Based Prior

Learning about Meetings

Inferring Robot Task Plans from Human Team Meetings:
A Generative Modeling Approach with Logic-Based Prior

Been Kim, Caleb Chacha and Julie Shah

[AAAI 13] [video] This work was featured in:
"Introduction to AI" course at Harvard (COMPSCI 182) by Barbara J. Grosz.
[Course website]
"Human in the loop planning and decision support" tutorial at AAAI 15 by Kartik Talamadupula and Subbarao Kambhampati.
[slides from the tutorial]

PhD Thesis: Interactive and Interpretable Machine Learning Models for Human Machine Collaboration

Multiple Relative Pose Graphs for Robust Cooperative Mapping

Been Kim, Michael Kaess, Luke Fletcher, John Leonard, Abraham Bachrach, Nicholas Roy, and Seth Teller

[ICRA 2010] [video]