Android Malware Identification and Polymorphic Evolution via Graph Representation Learning

Document Type

Conference Proceeding

Publication Date



Developing techniques to identify malware is critical. The polymorphic nature of malware makes it difficult to detect, especially when detection relies on hash-based techniques. Image-based binary representations have been shown to be more robust to popular polymorphic obfuscation techniques. In contrast to image-based techniques, this paper employs a graph-based technique that extracts control flow graphs from Android APK binaries. To process the resulting graphs, we use a procedure that combines a new graph representation learning method, called Inferential SIR-GN for Graph representation, which preserves graph structural similarities, with XGBoost, a standard machine learning model. We then apply this procedure to MALNET, a publicly available cybersecurity database that provides image-based and graph-based representations for a total of 1,262,024 Android APK binaries spanning 47 types and 696 families. Experimental results show that this graph-based procedure is even more accurate than the image-based approach. Moreover, this paper provides a procedure that, by leveraging Inferential SIR-GN, creates representations of malware polymorphic evolution for use during XGBoost training. These representations strengthen malware classification when the training and test datasets are split temporally according to binary creation date. This means that our procedure can predict malware polymorphic evolution.
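The temporal evaluation setting mentioned above can be illustrated with a minimal sketch. The abstract states only that the training and test datasets are split by binary creation date; the cutoff-date mechanism, the Sample record, its field names, and the temporal_split helper below are all hypothetical and are not taken from the paper or from MALNET.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    # Hypothetical record; these field names are illustrative, not MALNET's schema.
    sha: str
    family: str
    created: date


def temporal_split(samples, cutoff):
    """Split samples so that every training sample predates every test sample.

    Samples created strictly before `cutoff` go to the training set;
    samples created on or after `cutoff` go to the test set.
    """
    train = [s for s in samples if s.created < cutoff]
    test = [s for s in samples if s.created >= cutoff]
    return train, test


if __name__ == "__main__":
    # Toy data with made-up identifiers and family labels.
    samples = [
        Sample("a1", "family_x", date(2018, 3, 1)),
        Sample("b2", "family_x", date(2019, 7, 15)),
        Sample("c3", "family_y", date(2020, 1, 1)),
        Sample("d4", "family_x", date(2021, 5, 9)),
    ]
    train, test = temporal_split(samples, cutoff=date(2020, 1, 1))
    print(len(train), "training samples,", len(test), "test samples")
```

Unlike a random split, this arrangement means that later variants of a family appear only in the test set, so a classifier is scored on samples that postdate everything it was trained on.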