Faculty Mentor Information

Dr. Jyh-Haw Yeh (Mentor), Boise State University

Additional Funding Sources

Supported by National Science Foundation award #2244596.

Abstract

Generative models can be used to augment the training data for robust image-based malware classification models. For this method to be effective, the synthetic malware images must be of high enough quality to serve as plausible training data. Our research provides direction for assessing and improving the quality of malware images synthesized by generative models. We train generative adversarial networks (GANs) and diffusion models on the SOREL-20M dataset to synthesize malware images in various image formats. In addition to evaluating Inception Score and Fréchet Inception Distance for these synthetic malware images, we employ metrics from more recent image generation literature that have yet to be applied to the cybersecurity domain, including generative precision, generative recall, GAN-test, and neural network divergence. We identify which metrics correlate most strongly with improved performance of malware classification models trained on synthetic malware images. We further identify which generative models and image formats achieve the best results, highlighting future avenues for improving malware image generation methods.
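
As an illustration of the generative precision and recall metrics named in the abstract, the following is a minimal sketch only. It assumes the k-nearest-neighbour manifold formulation from the image generation literature; the abstract does not state which formulation was used. It also assumes that real and synthetic images have already been mapped to feature vectors by some embedding network. The function names and the default k=3 are hypothetical choices for this example, not taken from the research.

```python
import numpy as np


def knn_radii(feats, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    # After sorting, column 0 is the zero distance of a point to itself,
    # so column k holds the distance to the k-th neighbour.
    return np.sort(d, axis=1)[:, k]


def coverage(query, support, k=3):
    """Fraction of query points that fall inside at least one k-NN ball
    centred on a support point (an approximation of the support manifold)."""
    radii = knn_radii(support, k)
    d = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))


def precision_recall(real, fake, k=3):
    """Precision: share of synthetic samples lying on the real manifold.
    Recall: share of real samples lying on the synthetic manifold."""
    return coverage(fake, real, k), coverage(real, fake, k)


if __name__ == "__main__":
    # Placeholder random features standing in for image embeddings (assumption).
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 16))
    fake = rng.normal(loc=0.5, size=(200, 16))
    print(precision_recall(real, fake))
```

Under this formulation, high precision with low recall would suggest synthetic images that look plausible but cover only part of the real distribution, while the reverse would suggest broad coverage with many implausible samples. The pairwise distance matrices here use memory quadratic in the number of samples, so a real evaluation would likely need batching.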

Comments

Supervised by Md Mashrur Arifin (Boise State University) and Dr. Jyh-Haw Yeh (Boise State University)

Assessing and Improving the Quality of AI-Generated Images for Malware Classification
