Machine Learning Feature Selection for Similar Data
Using machine learning to simplify complex tasks gives researchers more time for analysis and less time spent constructing models. This project focuses on how a well-selected feature can drastically improve the performance of machine learning algorithms. The data used is binary matrix data from steganographic images, although the techniques apply to a wide range of data. Steganography is the art of concealing information inconspicuously. It differs from cryptography in that its goal is to communicate a secret message to a recipient without arousing suspicion that a secret message is being sent at all. The project focuses on classifying data encoded using least significant bit (LSB) encoding. In this method the data, in this case an image, is converted into a matrix of binary values. The bit in each value that changes the appearance of the picture the least (the least significant bit) is then altered to encode the desired information. The message is hidden within a large amount of other data and relies on security through obscurity. Certain features are unique to this data, but not all features are equally interesting or useful. Several methods of feature selection will be tested alongside algorithms including support vector machines and decision trees. While the mean may appear to be a good candidate for a broad overview of the data, it performs poorly as a feature in this case. When training transitioned from one image of test data to two images, accuracy dropped from 87.4% with a decision tree algorithm to 60.4% with a linear support vector machine algorithm. Features must therefore be selected so that they remain robust to this kind of change in the training data.
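As a rough illustration of the LSB encoding described above, the following minimal Python sketch writes message bits into the least significant bit of each pixel value and reads them back. The function names (embed_lsb, extract_lsb, mean) and the sample pixel values are hypothetical, chosen only for illustration; they are not taken from the project itself, which may encode and extract data differently.

```python
def embed_lsb(pixels, message_bits):
    """Return a copy of pixels with each message bit written into the LSB.

    pixels: list of integers in 0..255 (hypothetical flattened image data).
    message_bits: list of 0/1 values, no longer than pixels.
    """
    if len(message_bits) > len(pixels):
        raise ValueError("message does not fit in the cover data")
    stego = list(pixels)
    for i, bit in enumerate(message_bits):
        # Clear the least significant bit, then set it to the message bit.
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_lsb(pixels, n_bits):
    """Read the first n_bits least significant bits back out."""
    return [p & 1 for p in pixels[:n_bits]]


def mean(values):
    """Arithmetic mean, used here as a candidate feature."""
    return sum(values) / len(values)


# Hypothetical cover data and message, for illustration only.
cover = [200, 201, 54, 55, 128, 17, 90, 255]
bits = [1, 0, 1, 1, 0, 0, 1, 0]

stego = embed_lsb(cover, bits)
recovered = extract_lsb(stego, len(bits))
mean_shift = abs(mean(stego) - mean(cover))
```

Because each pixel value changes by at most 1, the mean of the stego data can differ from the mean of the cover data by at most 1, which is one plausible reason the mean carries little signal as a feature for this kind of data.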
Huelsenbeck, Rob, "Machine Learning Feature Selection for Similar Data" (2016). 2016 Undergraduate Research and Scholarship Conference. Paper 62.