Computer Science Faculty Publications and Presentations

Statistical Unigram Analysis for Source Code Repository

Document Type

Contribution to Books

Publication Date

2017

Abstract

The unigram is the basic building block of n-gram models in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we collected 1.01 billion unigrams from 0.7 million open source projects hosted on GitHub.com. By analyzing these unigrams, we discovered statistical patterns in (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are essential resources for solving domain-specific problems.
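
The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a candidate expansion is any longer word that starts with the same letter as the abbreviation and contains its letters in order, and that candidates are ranked by their relative unigram frequency. The function names (`is_subsequence`, `expand`) and the toy counts are hypothetical and invented for illustration; they are not data from the paper.

```python
from collections import Counter


def is_subsequence(abbr: str, word: str) -> bool:
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later letters must be found after earlier ones.
    return all(ch in remaining for ch in abbr)


def expand(abbr: str, unigram_counts: Counter) -> list[tuple[str, float]]:
    """Rank candidate expansions of abbr by relative unigram frequency.

    Assumption (not from the paper): a candidate is a longer word that
    shares the first letter and contains the abbreviation as a subsequence.
    """
    abbr = abbr.lower()
    candidates = {
        word: count
        for word, count in unigram_counts.items()
        if len(word) > len(abbr)
        and word[0] == abbr[0]
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    # Sort by descending count, then alphabetically for stable ties.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count / total) for word, count in ranked]


# Toy unigram counts, invented for illustration only.
toy_counts = Counter(
    {"message": 60, "messaging": 20, "manager": 30, "merge": 10, "string": 50, "msg": 40}
)
ranking = expand("msg", toy_counts)
```

With these toy counts, `expand("msg", toy_counts)` keeps only `message` and `messaging` as candidates and ranks `message` first. A corpus-scale version would replace `toy_counts` with unigram counts gathered from real source code, which is the resource the abstract argues for.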

Publication Information

Xu, Weifeng; Xu, Dianxiang; El Ariss, Omar; Liu, Yunkai; and Alatawi, Abdulrahman. (2017). "Statistical Unigram Analysis for Source Code Repository". Proceedings: 2017 IEEE Third International Conference on Multimedia Big Data: BigMM 2017, 1-8. https://doi.org/10.1109/BigMM.2017.13
