Computer Science Faculty Publications and Presentations

Statistical Unigram Analysis for Source Code Repository

Weifeng Xu, Bowie State University
Dianxiang Xu, Boise State University
Abdulrahman Alatawi, Bowie State University
Omar El Ariss, Texas A&M University
Yunkai Liu, Gannon University

Document Type

Article

Publication Date

6-2018

Abstract

The unigram is a fundamental element of n-gram models in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical properties regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. We describe a probabilistic model that relies on these properties to solve a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. Our empirical study shows that using unigrams extracted from a source code repository outperforms using a natural language corpus by 21% when solving such domain-specific problems.
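
To make the abbreviation-expansion problem concrete, the following is a minimal illustrative sketch, not the model described in the paper. It assumes that a candidate expansion must start with the same letter as the abbreviation and contain the abbreviation's letters in order, and it ranks candidates by their relative unigram frequency. The function names (`is_subsequence`, `expand`) and the counts in `toy_counts` are hypothetical and invented for illustration; they are not data from the study.

```python
def is_subsequence(abbr, word):
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    # 'ch in remaining' consumes the iterator up to and including the match,
    # so each abbreviation letter must be found after the previous one.
    return all(ch in remaining for ch in abbr)


def expand(abbr, unigram_counts):
    """Rank candidate expansions of abbr by relative unigram frequency.

    Assumption (not from the paper): a candidate is any longer word that
    shares the first letter and contains abbr as an ordered subsequence.
    """
    abbr = abbr.lower()
    if not abbr:
        return []
    candidates = {
        word: count
        for word, count in unigram_counts.items()
        if len(word) > len(abbr)
        and word[0] == abbr[0]
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    if total == 0:
        return []
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / total) for word, count in ranked]


# Hypothetical unigram counts, made up for this example only.
toy_counts = {
    "string": 90,
    "stream": 30,
    "button": 50,
    "buffer": 30,
    "index": 40,
}

if __name__ == "__main__":
    for abbreviation in ("str", "btn", "buf"):
        print(abbreviation, expand(abbreviation, toy_counts))
```

In this sketch the only place domain knowledge enters is the frequency table, which is why the source of the unigram counts (source code versus natural language text) would change the ranking.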

Publication Information

Xu, Weifeng; Xu, Dianxiang; Alatawi, Abdulrahman; El Ariss, Omar; and Liu, Yunkai. (2018). "Statistical Unigram Analysis for Source Code Repository". International Journal of Semantic Computing, 12(2), 237-260. https://doi.org/10.1142/S1793351X18400123
