Publication Date

5-2023

Date of Final Oral Examination (Defense)

3-10-2023

Type of Culminating Activity

Thesis

Degree Title

Master of Science in Computer Science

Department

Computer Science

Supervisory Committee Chair

Edoardo Serra, Ph.D.

Supervisory Committee Member

Francesca Spezzano, Ph.D.

Supervisory Committee Member

Liljana Babinkostova, Ph.D.

Abstract

Detecting malicious behavior is becoming increasingly crucial as the internet becomes more prevalent. This problem can be formulated as an anomaly detection task on provenance data, where attacks appear as anomalies in the behavior of the system. System-level data is far less available than network data, and research on system-level logs is correspondingly limited. However, monitoring the operating system's processes during program execution and identifying anomalous behavior in system calls can provide broad coverage and generality, since a variety of malicious applications could be identified. Furthermore, logs of system processes and events are provenance data: a graph that describes the relationships among all the elements that contributed to the creation of the data, which makes a Graph Neural Network (GNN) well suited to the task. Moreover, such data may contain metadata that tends to be complex and makes feature engineering more difficult, so these features see limited use.

In this thesis, we address these issues in two ways. First, we utilize the graph-like structure of logs, in which processes enact events and generate additional processes, and use a graph neural network to create an unsupervised representation of each event that encodes information about its neighboring events. Second, we make use of complex features such as command arguments, which vary widely and cannot be used in their raw form as features in typical machine learning algorithms. Once these features are encoded by a system composed of transformer and Variational Autoencoder (VAE) models, they can be used in other algorithms such as a GNN or an anomaly detector. Combined, these two approaches improve anomaly detection AUC-ROC on the BETH dataset by around 8 percent compared to the manually engineered features alone.
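
As a rough illustration of the graph-like structure described above, the following minimal sketch links log events through process and parent-process relationships and then performs a single, untrained neighbor-averaging step. It is not the thesis's model: the record fields (`id`, `pid`, `ppid`, `features`), the edge rules, and the mean aggregation are all assumptions chosen for illustration, and a real GNN would learn its weights rather than simply averaging.

```python
from collections import defaultdict

# Hypothetical log records; the field names are assumptions for illustration only.
events = [
    {"id": 0, "pid": 1, "ppid": 0, "features": [1.0, 0.0]},
    {"id": 1, "pid": 2, "ppid": 1, "features": [0.0, 1.0]},
    {"id": 2, "pid": 2, "ppid": 1, "features": [0.0, 3.0]},
    {"id": 3, "pid": 3, "ppid": 2, "features": [2.0, 2.0]},
]


def build_event_graph(events):
    """Connect events of the same process, and events of a process
    to the events of its parent process (undirected edges)."""
    by_pid = defaultdict(list)
    for e in events:
        by_pid[e["pid"]].append(e["id"])

    neighbors = defaultdict(set)
    for e in events:
        # Events enacted by the same process are neighbors.
        for other in by_pid[e["pid"]]:
            if other != e["id"]:
                neighbors[e["id"]].add(other)
        # Events of the parent process are neighbors of this event.
        if e["ppid"] != e["pid"]:
            for parent_event in by_pid.get(e["ppid"], []):
                neighbors[e["id"]].add(parent_event)
                neighbors[parent_event].add(e["id"])
    return neighbors


def aggregate(events, neighbors):
    """One untrained message-passing step: concatenate each event's own
    features with the mean of its neighbors' features."""
    feats = {e["id"]: e["features"] for e in events}
    dim = len(events[0]["features"])
    embeddings = {}
    for e in events:
        nbrs = sorted(neighbors.get(e["id"], ()))
        if nbrs:
            mean = [sum(feats[n][k] for n in nbrs) / len(nbrs) for k in range(dim)]
        else:
            mean = [0.0] * dim  # isolated event: no neighborhood information
        embeddings[e["id"]] = feats[e["id"]] + mean
    return embeddings


neighbors = build_event_graph(events)
embeddings = aggregate(events, neighbors)
```

Each resulting vector carries the event's own features alongside a summary of its neighborhood, which is the kind of context-aware representation that could then be passed to an anomaly detector.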

DOI

https://doi.org/10.18122/td.2065.boisestate

Available for download on Thursday, May 01, 2025
