
Degree Title

Master of Science in Computer Science


Department

Computer Science

Major Advisor

Michael D. Ekstrand, Ph.D.


Maria Soledad Pera, Ph.D.


Hoda Mehrpouyan, Ph.D.


Recommender systems are software applications deployed on the Internet to help people find useful items (e.g., movies, books, music, products) by providing recommendation lists. Before deploying recommender systems online, researchers and practitioners generally conduct offline evaluations to compare the accuracy of top-N recommendation lists among candidate algorithms using users’ historical consumption data. These offline evaluations typically use metrics and methodologies borrowed from machine learning and information retrieval, and they have several well-known biases that affect the validity of their results, including popularity bias and other biases arising from the missing-not-at-random nature of the data used. The existence of these biases is well-established, but their extent and impact are not as well-studied. In this work, we employ controlled simulations with varying assumptions about the distribution and structure of users’ preferences and the rating process to estimate the distributions of the errors in recommender experiment outcomes that result from these biases. We calibrate our simulated datasets to mimic key statistics of existing public datasets in different domains, and use the simulated data to assess the error incurred when estimating true accuracy from observable rating data. We find that evaluation metric scores, and the order in which those metrics rank recommendation algorithms, are inconsistent between the synthetic true-preference data and the observed rating data. Our simulation results show that offline evaluations are sometimes fooled by effects intrinsic to the data generation process into ranking algorithms incorrectly; the extent of this effect is sensitive to the simulation assumptions.
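The core idea of the abstract's methodology can be illustrated with a minimal sketch (not the thesis's actual simulation code; the preference model, observation model, and candidate algorithms below are simplified stand-ins): generate synthetic "true" preferences, derive an observed dataset through a popularity-skewed, missing-not-at-random sampling process, and then compare an accuracy metric computed on the observed data against the same metric computed on the true preferences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 500, 200

# Synthetic "true" preferences: item appeal follows a long-tailed
# (popularity-skewed) distribution, a simplifying assumption.
appeal = rng.dirichlet(np.full(n_items, 0.1))
true_pref = rng.random((n_users, n_items)) < appeal * 40  # boolean matrix

# Missing-not-at-random observation: popular items are more likely
# to be rated, so the observed data under-represents the long tail.
obs_prob = np.clip(appeal * 60, 0.0, 1.0)
observed = true_pref & (rng.random((n_users, n_items)) < obs_prob)

def hit_rate_at_k(scores, relevance, k=10):
    """Fraction of users whose top-k list contains >= 1 relevant item."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = relevance[np.arange(len(relevance))[:, None], topk].any(axis=1)
    return hits.mean()

# Two toy "algorithms": a most-popular recommender (scored from the
# observed data, as an offline evaluation would) and a random one.
pop_scores = np.tile(observed.sum(axis=0), (n_users, 1)).astype(float)
rand_scores = rng.random((n_users, n_items))

for name, scores in [("popularity", pop_scores), ("random", rand_scores)]:
    print(name,
          "observed:", round(hit_rate_at_k(scores, observed), 3),
          "true:", round(hit_rate_at_k(scores, true_pref), 3))
```

Because the observed preferences are a biased subset of the true ones, the metric computed on observed data can diverge from the metric on true preferences, and the gap differs by algorithm; that per-algorithm divergence is what can lead an offline evaluation to mis-rank candidates.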