Publication Date


Date of Final Oral Examination (Defense)


Type of Culminating Activity


Degree Title

Master of Science in Biology



Major Advisor

Sven Buerki, Ph.D.


James Smith, Ph.D.


Stephen Novak, Ph.D.


Humans have become a major factor in reshaping the Earth’s biosphere. One of the major effects of human-driven environmental change is an increase in the rate of species extinction relative to background rates. Biodiversity hotspots are areas whose species assemblages are very rich (50% of the world’s plants and 42% of land vertebrates) yet highly threatened with extinction (> 70% habitat destruction), and which ought to be foci for conservation efforts. The intense peril facing the flora of these endangered regions requires an equally intense response from the scientific community. This study investigated the benefits of adding genomic information to voucher specimens to alleviate the Linnaean (lack of species description), Wallacean (lack of data on species distribution) and Darwinian (lack of data on species evolution) shortfalls.

An open-source bioinformatic pipeline written in R was developed to determine the percentage of vascular plant species present in biodiversity hotspots with at least one reproducible DNA sequence deposited in GenBank. Reproducible DNA sequences were defined as those underpinned by traceable material and methods and by accurate taxonomic identifications. A vascular plant species checklist for the 36 biodiversity hotspots was inferred from 32,914,892 GBIF occurrences, comprising 204,044 species. A total of 736,532 GenBank accessions (representing DNA barcodes) were downloaded for those species. Associated abstracts and metadata were mined from 3,127 publications indexed in PubMed to assess DNA sequence reproducibility. The reproducibility of each study was tested with a sentiment analysis (a natural language processing technique).
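The coverage metric described above (the share of checklist species with at least one reproducible sequence) can be illustrated with a minimal sketch. This is not the thesis pipeline, which was written in R; it is a Python illustration under stated assumptions. The function name `reproducible_coverage`, the placeholder species labels, and the boolean reproducibility flag are all hypothetical, and the toy records are invented purely to show the calculation.

```python
def reproducible_coverage(checklist, accessions):
    """Fraction of checklist species with at least one reproducible accession.

    checklist  -- iterable of species names (hypothetical placeholder labels here)
    accessions -- iterable of (species, is_reproducible) pairs; the boolean flag
                  stands in for the outcome of a reproducibility assessment
    """
    species = set(checklist)
    if not species:
        return 0.0
    # A species counts once, however many reproducible accessions it has.
    covered = {
        name
        for name, is_reproducible in accessions
        if is_reproducible and name in species
    }
    return len(covered) / len(species)


# Toy data, invented for illustration only.
checklist = ["species_a", "species_b", "species_c", "species_d"]
accessions = [
    ("species_a", True),
    ("species_a", False),
    ("species_b", False),
    ("species_c", True),
    ("species_x", True),  # not on the checklist, so ignored
]

coverage = reproducible_coverage(checklist, accessions)
print(f"{coverage:.0%} of checklist species have a reproducible sequence")
```

Counting species rather than accessions matters here: a heavily sequenced species contributes many accessions but only one unit of coverage.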

Overall, the analyses indicated that the reproducibility crisis also extends to the realm of biodiversity. There was a significant shortfall in genetic information available for biodiversity hotspots: 80.3% of the sequences produced (591,431) were not reproducible. Only 19.7% of sequences, representing just 37,637 species (18% of the total), were reproducible. This phenomenon was named the Wu-Meyersian shortfall to recognize that we are critically lacking DNA sequence data for threatened biodiversity. The name honors Ray Wu (the father of DNA sequencing; 1928-2008) and Norman Meyers (a pioneer in establishing biodiversity hotspots; 1934-2019). Addressing this shortfall could help alleviate the Linnaean, Wallacean and Darwinian shortfalls and support conservation. Information was particularly lacking in tropical biodiversity hotspots, and no biodiversity hotspot other than Japan had > 50% of its flora reproducibly sequenced. Older biodiversity hotspots were less well known than those established more recently. This is concerning because the older hotspots are among the most diverse and threatened (e.g. Madagascar, Sundaland). From a DNA region perspective, ITS (23,422 species), matK (17,164 species), and rbcL (16,509 species) were the most commonly used barcodes. From a lineage perspective, gymnosperms (N=895) were exceptionally well sequenced, with three quarters of their species reproducibly sequenced. Angiosperms were comparatively poorly sequenced (18%), but this may be explained by their extreme diversity (N=195,433). Ferns and their allies (N=7,716) were also poorly sequenced (22%). This is especially troubling because extinction of these species would represent the loss of hundreds of millions of years of unique evolutionary history. Finally, this study proposed best practices to maximize the reproducibility of DNA sequences produced by the scientific community.
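The headline percentages can be cross-checked from the counts reported in this abstract. The short Python sketch below uses only figures stated above; the variable names are illustrative, and the block is an arithmetic check, not part of the original R pipeline.

```python
# Counts as reported in the abstract.
total_species = 204_044
total_accessions = 736_532
non_reproducible_accessions = 591_431
reproducible_species = 37_637
lineage_counts = {
    "gymnosperms": 895,
    "angiosperms": 195_433,
    "ferns and allies": 7_716,
}

# Share of accessions that were not reproducible.
share_non_repro = non_reproducible_accessions / total_accessions

# Share of checklist species with at least one reproducible sequence.
share_species = reproducible_species / total_species

# The three lineage counts should add up to the full checklist.
lineage_total = sum(lineage_counts.values())

print(f"Non-reproducible accessions: {100 * share_non_repro:.1f}%")
print(f"Reproducibly sequenced species: {100 * share_species:.1f}%")
print(f"Lineage counts sum to checklist size: {lineage_total == total_species}")
```

The species share comes out slightly above 18%, consistent with the rounded figure quoted in the text, and the lineage counts sum exactly to the 204,044 species in the checklist.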

The bioinformatic pipeline can be applied to systems at multiple geographical scales and to any taxonomic group, and is therefore appealing to a wide range of stakeholders. We recommend using it periodically to monitor progress towards alleviating the Wu-Meyersian shortfall.