Publication Date

8-2020

Date of Final Oral Examination (Defense)

6-24-2020

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Major Advisor

Francesca Spezzano, Ph.D.

Advisor

Edoardo Serra, Ph.D.

Advisor

Steven Cutchin, Ph.D.

Abstract

Wikipedia is a free online encyclopedia built through open collaboration. The website has millions of pages that are maintained by thousands of volunteer editors. Wikipedia’s fundamental principles require that pages be written from a neutral point of view and maintained by volunteer editors for free, following well-defined guidelines to avoid or disclose any conflict of interest. However, there have been several known incidents in which editors intentionally violated these guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing it.

This thesis addresses, for the first time, the problem of identifying undisclosed paid articles in Wikipedia. We propose a machine learning-based framework that uses a set of features based on both the content of the articles and the edit-history patterns of the users who create them. To test our approach, we collected and curated a new dataset from English Wikipedia with ground truth on undisclosed paid articles and the edit histories of the users who created them. Our experimental evaluation shows that we can identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. Moreover, our approach outperforms ORES, a scoring system currently used by Wikipedia to automatically detect damaging content, in identifying undisclosed paid articles.
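
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how AUROC and average precision can be computed from a classifier's scores. It is not the thesis's evaluation code: the function names, the toy labels, and the toy scores are all illustrative assumptions, and the average-precision routine assumes there are no tied scores.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@rank taken at the rank of each positive example.

    Simplification: assumes scores are distinct (no tie handling).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Hypothetical toy data: 1 = undisclosed paid article, 0 = regular article.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

AUROC measures how well the scores rank positives above negatives overall, while average precision weights the top of the ranking more heavily, which matters when positives are a small share of the data.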

We further propose recurrent neural network-based frameworks, variants of Long Short-Term Memory (LSTM), that use a set of features based on the patterns of users' edit histories. Our experimental evaluation shows that we can identify undisclosed paid editors with an AUROC of 0.93 and an average precision of 0.90, outperforming existing approaches. We also show that our proposed approaches can be used to address other similar tasks, achieving a maximum AUROC of 0.96, average precision of 0.97, and accuracy of 0.90. Finally, our approaches outperform other baseline approaches in the early detection of both undisclosed paid editors and Wikipedia vandal editors, surpassing their performance scores with as few as two edits.
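
One plausible way to set up early detection for a sequence model is to show it only the first k edits of each user. The sketch below illustrates that preprocessing step under stated assumptions: the helper name `prefix_pad`, the zero-padding scheme, and the two-feature edit vectors are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence


def prefix_pad(
    history: Sequence[Sequence[float]], k: int, n_features: int
) -> List[List[float]]:
    """Keep the first k edits of one user's history.

    Each edit is a feature vector of length n_features. Histories shorter
    than k are zero-padded at the end so every user yields a k-step sequence.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    clipped = [list(edit) for edit in history[:k]]
    for edit in clipped:
        if len(edit) != n_features:
            raise ValueError("every edit must have n_features values")
    padding = [[0.0] * n_features for _ in range(k - len(clipped))]
    return clipped + padding


# Hypothetical per-edit features, e.g. (bytes added, minutes since last edit).
long_history = [[120.0, 0.0], [45.0, 3.5], [300.0, 60.0]]
short_history = [[80.0, 0.0]]

first_two_long = prefix_pad(long_history, k=2, n_features=2)
first_two_short = prefix_pad(short_history, k=2, n_features=2)
```

Evaluating a model on such prefixes for increasing k gives a curve of detection quality against the number of observed edits, which is the kind of comparison the two-edit result refers to.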

This thesis extends our work published at WWW '20: The Web Conference 2020, held in Taipei, Taiwan, in April 2020. Wikipedia has shown significant interest in our published work, and we are currently collaborating on a possible deployment of our system directly on its platform.

DOI

10.18122/td/1712/boisestate
