Type of Culminating Activity

Graduate Student Project

Graduation Date


Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Amit Jain

Abstract

Parallel distributed file systems are increasingly used on clusters to deliver greater data throughput to the many compute nodes. They are also an effective way to store massive amounts of data. However, the standard core utility cp does not make good use of the parallelism these file systems offer, and using multiple cp commands has inherent problems of its own.

Two utilities were created to help recursively copy directories containing large amounts of data on parallel distributed file systems. They were tested on two data sets: one contains very many files, and the other contains large files. The first utility is a C program that submits a single job on a user-specified number of nodes. The work of copying the files is dynamically distributed among those nodes using MPI communications, and multiple threads are used to traverse the directories. It attained speedups of 9.57 and 7.36 for the many-files set and the large-files set, respectively. The second utility is written in Java. It also uses multiple threads to traverse the directories, but it performs the copying by creating Bash scripts and submitting them to the job scheduler. The work is balanced among those scripts, and the number of jobs is specified by the user. It reached speedups of 3.67 and 7.32 for the same two data sets. Both utilities can also be used to track the progress of the jobs they have submitted.
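
The abstract states that the Java utility balances the copy work among a user-specified number of job scripts, but it does not say how. As a minimal sketch of one way such balancing could be done, the example below assigns files to jobs greedily: the largest remaining file always goes to the job with the fewest bytes so far. The class name `CopyBalancer`, the `FileEntry` and `Bucket` types, and the largest-first heuristic are all assumptions made for illustration, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: split a list of files into N roughly equal-sized
// groups, one group per copy job. Not taken from the project itself.
public class CopyBalancer {

    /** A file to copy; both fields are illustrative. */
    public static final class FileEntry {
        public final String path;
        public final long sizeBytes;

        public FileEntry(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    /** The share of work assigned to one job. */
    public static final class Bucket {
        public final List<FileEntry> files = new ArrayList<>();
        public long totalBytes = 0;
    }

    /**
     * Greedy largest-first assignment (an assumed heuristic): sort files
     * by size, descending, then give each file to the currently lightest job.
     */
    public static List<Bucket> balance(List<FileEntry> entries, int jobs) {
        if (jobs <= 0) {
            throw new IllegalArgumentException("jobs must be positive");
        }

        List<Bucket> buckets = new ArrayList<>();
        // Min-heap ordered by bytes assigned so far.
        PriorityQueue<Bucket> lightest =
            new PriorityQueue<>((a, b) -> Long.compare(a.totalBytes, b.totalBytes));
        for (int i = 0; i < jobs; i++) {
            Bucket b = new Bucket();
            buckets.add(b);
            lightest.add(b);
        }

        // Copy before sorting so the caller's list is left untouched.
        List<FileEntry> sorted = new ArrayList<>(entries);
        sorted.sort((a, b) -> Long.compare(b.sizeBytes, a.sizeBytes));

        for (FileEntry f : sorted) {
            // Remove the bucket before changing its key, then re-insert it,
            // so the heap ordering stays valid.
            Bucket b = lightest.poll();
            b.files.add(f);
            b.totalBytes += f.sizeBytes;
            lightest.add(b);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<FileEntry> files = new ArrayList<>();
        long[] sizes = {8, 7, 6, 5, 4};
        for (int i = 0; i < sizes.length; i++) {
            files.add(new FileEntry("file" + i, sizes[i]));
        }
        List<Bucket> plan = balance(files, 2);
        for (int j = 0; j < plan.size(); j++) {
            Bucket b = plan.get(j);
            System.out.println("job " + j + ": " + b.files.size()
                + " files, " + b.totalBytes + " bytes");
        }
    }
}
```

In a utility of the kind described, each resulting group could then be written out as one Bash script and submitted to the scheduler. A static split like this contrasts with the C utility's approach, where work is handed out dynamically while the job runs.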