Publication Date

8-2018

Date of Final Oral Examination (Defense)

6-19-2018

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Amit Jain, Ph.D.

Advisor

Tim Andersen, Ph.D.

Advisor

Steven Cutchin, Ph.D.

Abstract

As storage needs continually increase and network file systems become more common, the need arises for tools that can copy to and from these file systems efficiently. Traditional copy tools like the Linux cp utility were created for traditional storage systems, where storage is managed by a single host machine. cp uses a single-threaded approach to copying files; a multi-threaded approach would likely provide little advantage on such a system, since disk access is the bottleneck for this type of operation. In a distributed file system, disk accesses are spread across multiple hosts, and many accesses can be served simultaneously. Volumes in a distributed file system still appear to the operating system of a client machine as a single storage device, so traditional tools like cp can still be used, but they cannot take advantage of the performance increase offered by a distributed, multi-node approach. While some research has been done in this area and some tools have been created, shortcomings remain, particularly in scalability and robustness to process failure. The research presented here attempts to address these shortcomings.

The software created in this project was tested in a variety of environments and compared to other tools that have attempted to make improvements in these areas. In the area of scalability, the software performed well on systems ranging from a small cluster to a large high-performance computing cluster. In comparisons to other tools, its performance was comparable to or better than that of the other tools measured.

The unique contribution of this project is in the area of robustness, which other tools have not attempted to address. The software allows the user to specify a minimum number of processes that can fail without losing progress or requiring a restart of the copy job. This means that process death can be tolerated, or resources can be diverted, after the start of the copy job, which is especially helpful in very large, long-running jobs. Many clusters are built with low-cost, consumer-grade hardware that is susceptible to failure. A study entitled "Failure Trends in a Large Disk Drive Population" [2], published in the proceedings of the USENIX FAST '07 conference, examined the failure rate of disk drives across Google's various services. It found that a large number of drives fail within a few years. The failure rate was somewhat high within the first few months, weeding out the lower-quality drives, but at the two- and three-year marks as many as eight percent of drives failed per year. This high rate shows the importance of software like this that can tolerate these types of failures.

DOI

10.18122/td/1444/boisestate
