Leveraging Machine Learning for Automatically Classifying Fake News in the COVID-19 Outbreak

Additional Funding Sources

The project described was partially supported by the National Science Foundation under Award No. 1943370.

Abstract

Fake news spreads disinformation and is a plague on modern journalism and the media. Because it undermines the reliability of sources, automated accuracy detection is necessary. In this research, we use machine learning to automatically classify the validity of COVID-19-related news headlines and to identify the headline features that matter most in determining accuracy. We used a dataset crawled from Politifact.com between March and June 2020 that contained 299 fake and 100 truthful news items, as determined by the website's fact-checkers. We extracted a range of features from the news headlines, including features from the Linguistic Inquiry and Word Count (LIWC) engine, and used them in several machine learning models from the scikit-learn API. The model with the highest average precision was the Decision Tree Classifier, which achieved 79% under five-fold cross-validation. The top features used by the classification model were the number of motion words, the number of relativity words, the number of prepositions in the headline, the authenticity of the headline's tone, and the word count. Fake news outlets commonly put more description in their headlines to convince users that a headline is true, which explains the higher counts of prepositions, motion words, and relativity words, and the higher overall word count, in fake news.
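The evaluation described above can be illustrated with a minimal sketch. It is not the authors' code: the feature matrix is synthetic (LIWC is a proprietary tool and the dataset is not reproduced here), the feature names are hypothetical stand-ins for the top features named in the abstract, and the abstract does not say exactly how "average precision" was computed, so the sketch assumes precision for the fake class averaged over five stratified folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for the headline features named in the abstract.
FEATURE_NAMES = ["motion", "relativity", "prepositions", "authentic", "word_count"]


def make_synthetic_headline_features(n_fake=299, n_true=100, seed=0):
    """Build a synthetic feature matrix with the abstract's class sizes.

    The values are random draws, NOT real LIWC output; fake headlines are
    simply shifted upward so the classes are partly separable.
    """
    rng = np.random.default_rng(seed)
    X_fake = rng.normal(loc=1.0, scale=1.0, size=(n_fake, len(FEATURE_NAMES)))
    X_true = rng.normal(loc=0.0, scale=1.0, size=(n_true, len(FEATURE_NAMES)))
    X = np.vstack([X_fake, X_true])
    # Label 1 = fake, label 0 = truthful.
    y = np.concatenate([np.ones(n_fake, dtype=int), np.zeros(n_true, dtype=int)])
    return X, y


def evaluate_decision_tree(X, y, seed=0):
    """Return per-fold precision and feature importances for a decision tree."""
    model = DecisionTreeClassifier(random_state=seed)
    # Stratified folds keep the roughly 3:1 class ratio in every split.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="precision")
    # Refit on all data only to inspect which features the tree relies on.
    model.fit(X, y)
    importances = dict(zip(FEATURE_NAMES, model.feature_importances_))
    return scores, importances


X, y = make_synthetic_headline_features()
scores, importances = evaluate_decision_tree(X, y)
print("precision per fold:", np.round(scores, 3))
print("mean precision:", round(float(scores.mean()), 3))
for name, weight in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
```

Because the data are synthetic, the printed scores and importance ranking say nothing about the reported 79% or about the real feature ranking; the sketch only shows the shape of the pipeline. Stratified splitting is a design choice assumed here because the dataset is imbalanced (299 fake versus 100 truthful items).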
