Type of Culminating Activity

Graduate Student Project

Graduation Date

8-2012

Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Amit Jain

Abstract

High-performance, cost-effective data management solutions such as Hadoop exist for Big Data analysis, but small and medium businesses with moderate-sized data sets also want low-budget data management systems that perform well on existing data and scale as the amount of accumulated data grows. Parallel database management systems may provide a high-performance solution, but they are expensive and complex to implement. The purpose of this project was to compare the scalability of open-source relational database management systems and distributed data management systems for small and medium data sets. To make this comparison, a business intelligence case study was investigated using three data management solutions: MySQL, Hadoop MapReduce, and Hive. The case study was a payment history analysis that considers customer, account, and transaction data for predictive analytics. Experiments were executed on data sets ranging from 200 MB to 10 GB. The results show that the single-server MySQL solution performs best for data set sizes from 200 MB to 1 GB but does not scale well beyond that. MapReduce outperforms MySQL on data sets larger than 1 GB, and Hive outperforms MySQL on data sets larger than 2 GB. These results indicate that MapReduce and Hive are viable options for small and medium businesses that want to implement scalable data management.
