Title

Data Clustering Using MapReduce

Type of Culminating Activity

Graduate Student Project

Graduation Date

5-2009

Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Amit Jain

Abstract

MapReduce is a software framework for solving certain kinds of parallelizable or distributable problems involving large data sets on computing clusters. It is attractive because it abstracts parallel and distributed computing concepts so that novice programmers can take advantage of cluster computing without needing to be familiar with the associated complexities, such as data dependency, mutual exclusion, replication, and reliability. The challenge is that a problem must be expressed in a form that MapReduce can solve. This often involves carefully designing the inputs and outputs of each MapReduce job, because the output of one job is frequently used as the input to another.
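The chaining of jobs described above can be sketched in a few lines of plain Python. This is only an in-memory illustration and does not use the Hadoop API: the helper run_mapreduce, the mapper and reducer names, and the toy ratings are all hypothetical and are not taken from the project. The first job computes a mean rating per movie; its output records are then fed directly to a second job that counts movies per rounded mean.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorted keys stand in for the shuffle/sort phase
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: (movie, user, rating) -> (movie, mean rating)
def map_ratings(record):
    movie, _user, rating = record
    yield movie, rating


def reduce_mean(movie, ratings):
    yield movie, sum(ratings) / len(ratings)


# Job 2: (movie, mean rating) -> (rounded mean, number of movies)
def map_bucket(record):
    _movie, mean = record
    yield round(mean), 1


def reduce_count(bucket, ones):
    yield bucket, sum(ones)


# Hypothetical toy data, not drawn from the Netflix data set.
ratings = [
    ("m1", "u1", 5), ("m1", "u2", 4), ("m1", "u3", 5),
    ("m2", "u1", 2),
    ("m3", "u2", 4), ("m3", "u3", 4),
    ("m4", "u1", 5),
]

movie_means = run_mapreduce(ratings, map_ratings, reduce_mean)
histogram = run_mapreduce(movie_means, map_bucket, reduce_count)  # job 1 output is job 2 input
print(histogram)
```

The point of the sketch is the design constraint: job 2 only works because job 1 emits records in exactly the shape that job 2's mapper expects.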

Data clustering is a common computing task that often involves large data sets, which makes MapReduce an attractive means to a solution. This report presents a case study of clustering Netflix movie data with the K-means, Greedy Agglomerative, and Expectation Maximization clustering algorithms on the Apache Hadoop MapReduce framework. Netflix is a large online DVD rental service. A major part of Netflix's revenue generation can be directly attributed to providing movie recommendations to customers based on movies they have seen and rated in the past. As part of an ongoing effort to improve this movie recommendation system, Netflix is sponsoring a competition for the best movie rating predictor. Netflix provides competition participants with a data set containing over 100 million ratings, which we use in our MapReduce clustering case study. Root Mean Square Error (RMSE) is used to compare actual ratings against the predictions made on this data set. Netflix's predictor achieves an RMSE of 0.9525 on the provided data set. We demonstrate how the resulting clustered data can be used to create a simple movie rating predictor that achieves an RMSE of 1.0269, which is within 8% of Netflix's predictor. The demonstrated predictor also highlights an interesting collaboration between MapReduce and database management systems, namely Hadoop and MySQL.
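For readers unfamiliar with the metric, the following is a minimal sketch of how RMSE is computed and of the arithmetic behind the "within 8%" comparison. The function name rmse and the toy rating lists are assumptions made for illustration; only the two RMSE values, 0.9525 and 1.0269, come from the abstract.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error / len(actual))


# Hypothetical toy ratings, not drawn from the Netflix data set.
actual = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]
toy_rmse = rmse(actual, predicted)

# Relative gap between the two RMSE values reported in the abstract.
netflix_rmse = 0.9525
cluster_rmse = 1.0269
relative_gap = (cluster_rmse - netflix_rmse) / netflix_rmse

print(f"toy RMSE: {toy_rmse:.4f}")
print(f"relative gap: {relative_gap:.2%}")
```

A lower RMSE means predictions lie closer to the actual ratings, so the cluster-based predictor's error is roughly 7.8% higher than that of Netflix's predictor.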