Faculty Mentor Information
Dr. Jyh-Haw Yeh (Mentor), Boise State University
Additional Funding Sources
Supported by National Science Foundation award #2244596.
Abstract
Generative models can be used to augment the training data for robust image-based malware classification models. For this approach to be effective, the synthetic malware images must be of high enough quality to serve as plausible training data. Our research provides direction for assessing and improving the quality of malware images synthesized by generative models. We train generative adversarial networks (GANs) and diffusion models on the SOREL-20M dataset to synthesize malware images in various image formats. In addition to evaluating Inception Score and Fréchet Inception Distance for these synthetic malware images, we employ metrics from more recent image generation literature that have yet to be applied in the cybersecurity domain, including generative precision, generative recall, GAN-test, and neural network divergence. We identify which metrics correlate most strongly with improved performance of malware classification models trained on synthetic malware images. We further identify which generative models and image formats achieve the best results, highlighting avenues for future work on improving malware image generation methods.
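As an illustration of one of the metrics named above, the following is a minimal sketch of the Fréchet distance that underlies Fréchet Inception Distance. It is not the pipeline used in this research. It assumes feature vectors have already been extracted from real and synthetic images (normally with an Inception network); the function name `frechet_distance` and the random arrays standing in for features are hypothetical.

```python
import numpy as np


def frechet_distance(real_feats, synth_feats):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each argument is an array of shape (n_samples, n_features).
    In FID the features come from an Inception network; here they
    are assumed to be precomputed (hypothetical inputs).
    """
    mu_r = real_feats.mean(axis=0)
    mu_s = synth_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_s = np.cov(synth_feats, rowvar=False)

    # Tr((cov_r cov_s)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s, which are real and non-negative in
    # exact arithmetic. Clip small negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))   # stand-in for real-image features
    synth = real + 2.0                 # stand-in for synthetic-image features
    print(frechet_distance(real, real))
    print(frechet_distance(real, synth))
```

A lower value indicates that the synthetic feature distribution is closer to the real one. Shifting every feature by a constant leaves the covariance unchanged, so in the usage above only the squared distance between the means contributes.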
Assessing and Improving the Quality of AI-Generated Images for Malware Classification
Comments
Supervised by Md Mashrur Arifin (Boise State University) and Dr. Jyh-Haw Yeh (Boise State University)