M.Sc. Zafar Hussain will defend his doctoral thesis "Learning Command Syntax and Detecting Similarities for Enhanced Cybersecurity through Data Analysis" on Friday, 14 February 2025, at 13:00 in the Exactum building of the University of Helsinki, Auditorium B123 (Pietari Kalmin katu 5, 1st floor). His opponent is Professor Juha Röning (University of Oulu), and the custos is Professor Jukka K. Nurminen (University of Helsinki). The defence will be held in English.
Zafar Hussain's thesis is part of the research carried out in the Department of Computer Science and in the Empirical Software Engineering group at the University of Helsinki. His supervisors have been Professor Jukka K. Nurminen (University of Helsinki) and Professor Tommi Mikkonen (University of Jyväskylä).
Learning Command Syntax and Detecting Similarities for Enhanced Cybersecurity through Data Analysis
In cybersecurity, accurately distinguishing between legitimate and malicious command-line commands is crucial for protecting computer systems. This research addresses the intricate challenge of analyzing command syntax and structure, a task complicated by the diverse and constantly evolving nature of command formats. Understanding command syntax is vital for detecting malicious activities, but the absence of a universally accepted standardized syntax makes this task particularly challenging. Leveraging the vast amounts of data available to cybersecurity organizations, we propose a hybrid approach that combines rule-based systems with machine learning models to detect command similarities. To further enhance our understanding of command syntax and structure, we explored deterministic and probabilistic methods, including regular expressions, Markov models, and large language models.
To understand the syntactic and semantic meanings of command-line commands, we developed a rule-based system built on expert opinion. It classified commands into similar and not-similar classes, giving the data binary labels. We trained a logistic regression model along with two deep learning models (a document classifier and a sentence-pair classifier), and then evaluated their performance using the Matthews Correlation Coefficient (MCC). The logistic regression model achieved an MCC score of 0.85, while both deep learning models scored above 0.90 on unseen data. Our proposed hybrid approach effectively addresses the complex challenge of detecting command similarities in the absence of a definitive ground truth.
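For readers unfamiliar with the metric, the MCC of a binary classifier is computed from the four confusion-matrix counts and ranges from -1 to 1. The following is a minimal Python sketch of that calculation, not the evaluation code used in the thesis; the function name `mcc` and the toy labels are illustrative assumptions.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = similar, 0 = not similar)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: report 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(round(mcc(y_true, y_pred), 2))  # 0.33
```

Unlike plain accuracy, MCC uses all four counts, so it stays informative when the similar and not-similar classes are imbalanced.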
To learn the syntax and structure of commands more thoroughly, we assessed three approaches: a language model fine-tuned on command data, a second-order Markov Model, and a regular expression-based system. The evaluation demonstrated the superiority of the fine-tuned language model over both the Markov Model and the regular expression system. The regular expression-based approach struggled with the unique values in each command and lacked appropriate expressions to match these random values. While the Markov Model showed some improvement by detecting certain random tokens, it still had difficulty recognizing diverse patterns. By employing clustering algorithms such as DBSCAN, HDBSCAN, and OPTICS, we categorized command-line commands based on their syntactic similarities, revealing the language model’s ability to comprehend sequences and detect syntax with minimal noise. Statistical analyses of command syntax, coupled with BERTScore assessments, consistently yielded metrics exceeding 0.90 for precision, recall, and F1-score. These results confirm the fine-tuned language model’s high accuracy and effectiveness in learning command syntax. This model is particularly valuable for cybersecurity organizations that handle millions of commands, as it can be trained and retrained on massive datasets likely encompassing all possible commands. While the other methods performed reasonably well, they have clear limitations: regular expressions require manual effort, and the Markov Model struggles with long expressions. This makes the language model the more robust solution.
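To illustrate what a second-order Markov model over command tokens looks like, the sketch below estimates the probability of a token given the two tokens before it. It is a simplified illustration under stated assumptions, not the model built in the thesis: whitespace tokenization, the `<s>` and `</s>` padding markers, the function names, and the example commands are all hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical padding markers


def train(commands):
    """Count how often each token follows each pair of preceding tokens."""
    counts = defaultdict(Counter)
    for command in commands:
        tokens = [START, START] + command.split() + [END]
        for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
            counts[(first, second)][third] += 1
    return counts


def next_token_prob(counts, first, second, third):
    """Maximum-likelihood estimate of P(third | first, second); 0.0 if unseen."""
    context = counts.get((first, second))
    if not context:
        return 0.0
    return context[third] / sum(context.values())


# Invented example commands, for illustration only.
model = train(["ls -la /tmp", "ls -la /home", "ls -l /tmp"])

print(next_token_prob(model, START, "ls", "-la"))       # seen in 2 of 3 commands
print(next_token_prob(model, "ls", "-la", "/tmp"))      # 0.5
print(next_token_prob(model, "ls", "-la", "/var/x91"))  # 0.0, value never seen
```

The last call shows one reason unique or random values are hard for this kind of model: without smoothing or token normalization, any value absent from the training data gets zero probability.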
Our research has significant implications for cybersecurity, especially in contexts requiring a deep understanding of command syntax. The methods explored enhance the ability to distinguish between legitimate and malicious commands, define user groups by unique syntax patterns, and identify prevalent execution paths. By enhancing command analysis capabilities, our research contributes to the development of more robust and resilient systems that are better equipped to safeguard against a wide range of threats. These advancements have the potential to significantly bolster cybersecurity defenses and mitigate the risks posed by malicious actors.
Availability of the dissertation
An electronic version of the doctoral dissertation will be available in the University of Helsinki open repository Helda at http://urn.fi/URN:ISBN:978-952-84-0768-3.
Printed copies will be available on request from Zafar Hussain: zafar.hussain@helsinki.fi.