Evaluating Attribution Methods in Machine Learning Interpretability

Document Type

Conference Proceeding

Publication Date



Interpretability is key to broadening the informed adoption of machine learning models in domains involving safety, security, and fairness. One approach to making complex machine learning models interpretable is to explain their outcomes through input feature attribution. Attribution scores the features of an input instance according to how important each feature value is, for that fixed instance, in obtaining a specific classification outcome from the model. The literature defines several attribution methods, some designed for specific machine learning models (e.g., neural networks) and others that are model-agnostic (i.e., can interpret any machine learning model). Attribution is particularly appreciated because its interpretation, the attribution itself, is easy to understand. In domains involving safety, security, and fairness, properties of the explanation such as precision and generality are crucial to establishing human trust in machine learning interpretability and, in turn, in the machine learning model itself. However, although precision and generality are clearly defined for rule-based interpretation models, they are neither defined nor measured for attribution models. In this work, we propose a general methodology to estimate the degree of precision and generality of attribution methods. In addition, we propose a way to measure the consistency between two attribution methods. Our experiments focus on the two most popular model-agnostic attribution methods, SHAP and LIME, which we evaluate on two real applications in the field of attack detection. In these experiments, our methodology shows that both SHAP and LIME lack precision, generality, and consistency, and that further investigation in the attribution research field is required.