The digital transformation of electric power and energy systems (EPES) provides opportunities for improved monitoring, automation, and control of these systems, while simultaneously introducing additional interdependencies and a wide range of cybersecurity threats. Using the example of detecting domain generation algorithm (DGA) activity, we explain how machine learning can be used to combat these threats while considering the problem of sensitive training data.
The digitalisation of electric power and energy systems (EPES), as well as of other industry and infrastructure environments, has sparked a drastic increase in the use of digital systems and in their interconnectedness. While this development provides rich opportunities (for example in monitoring, automation, and control), it also introduces a wide range of cybersecurity threats to these systems. To combat these threats efficiently, machine learning has gained significant popularity, since it can leverage the large-scale data collections made accessible through digitalisation. As one example of machine-learning-based detection of cybersecurity threats, we consider the detection of domain generation algorithm (DGA) activity.
A specific scenario: Domain generation algorithms
In a malicious context, DGAs are used to evade blocklists in botnet communication. Once an internet-connected device (such as an IoT device or server) is compromised, it uses a DGA to generate domain names that it tries to connect to. The desired communication counterpart, the command-and-control server, is registered by the adversary under a domain name that is likely to be generated by the DGA used by the compromised device. DGAs consequently enable compromised devices to communicate with the adversary via non-static domains. In the context of EPES, smart meters are attractive targets for malicious actors due to their large-scale roll-out, their limited computing power to run security software, and their connection to critical infrastructure systems.
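To illustrate the mechanism, the following is a deliberately simple, toy DGA sketch (not code from any real malware family): because the infected device and the adversary share a seed, both can derive the same list of domains without any static rendezvous point.

```python
import hashlib

def simple_dga(seed: str, day: int, count: int, tld: str = ".com") -> list[str]:
    """Toy DGA: derive `count` pseudo-random domain names from a shared
    seed and a date component. Both the compromised device and the
    adversary can compute the same list, so no static domain is needed.
    """
    domains = []
    state = f"{seed}-{day}".encode()
    for i in range(count):
        digest = hashlib.sha256(state + str(i).encode()).hexdigest()
        # Map hex characters of the hash to lowercase letters to form a label
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + tld)
    return domains
```

The adversary registers only one of the generated domains per period, while the compromised device walks through the list until a connection succeeds.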
Using machine learning for DGA detection
While the properties of a DGA make a collision between an algorithmically generated domain (AGD) and the domain name under which the command-and-control server is registered likely, there is no guarantee that communication can be established with the first AGD. The compromised device is therefore expected to generate several AGDs and attempt to connect to each of them, leading to a series of non-existent (NX) domain requests.
To detect compromised machines in this setting, machine-learning-based classification of NX domain requests has been proposed. For this, a classifier is trained on benign NX domain requests (as benign training samples) and on AGDs generated by DGAs for seeds known to be used by malware (as malicious training samples). In an operational environment, the trained model can be used to detect malicious AGDs among the NX domain requests observed in the monitored network, thereby identifying potentially compromised machines. These machines can then be isolated before they successfully establish a connection to the adversary, preventing further damage and/or malicious use of the machines.
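The classification idea can be sketched with simple character-level features and a minimal classifier. This is an illustrative stand-in, not the architecture used in the cited work: the feature choices (length, character entropy, vowel and digit ratios) and the nearest-centroid classifier are our simplifications.

```python
import math
from collections import Counter

def features(domain: str) -> list[float]:
    """Character-level features often useful for telling AGDs from
    human-chosen names (illustrative selection, not the cited models')."""
    label = domain.split(".")[0]
    counts = Counter(label)
    n = len(label)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    vowel_ratio = sum(label.count(v) for v in "aeiou") / n
    digit_ratio = sum(ch.isdigit() for ch in label) / n
    return [n, entropy, vowel_ratio, digit_ratio]

class CentroidClassifier:
    """Minimal nearest-centroid classifier as a stand-in for the
    (typically neural) DGA classifiers discussed in the article."""
    def fit(self, X, y):
        self.centroids = {}
        for cls in set(y):
            rows = [x for x, lbl in zip(X, y) if lbl == cls]
            self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self
    def predict(self, x):
        dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
        return min(self.centroids, key=lambda c: dist(self.centroids[c], x))
```

Trained on a handful of benign names versus AGD-like strings, such a model already separates high-entropy, vowel-poor labels from dictionary-like ones; real deployments use far richer features and models.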
The problem of privacy-sensitive training data
The use of NX domain requests as benign samples, however, also introduces privacy concerns when considering the sharing of training data or of a trained classifier. As such requests can reveal information about browsing behaviour, misconfigured devices, and security software in a network, the use of privacy-enhancing approaches becomes relevant. As there is no one-size-fits-all solution in privacy, the choice of a specific privacy-enhancing technology commonly depends on the specifics of the considered scenario.
In work initiated in the H2020 project SAPPAN and extended in the CyberSEAS project, we have evaluated privacy-enhancing technologies targeting the classification process in a classification-as-a-service scenario using secure multiparty computation (SMPC) and homomorphic encryption (HE) [1], as well as the obfuscation of training samples via similarity-preserving Bloom encodings in a scenario for sharing training data [2].
An evaluation of privacy-enhancing technologies
While classifiers for DGA detection have been shown to provide good performance when trained without measures to explicitly enhance the privacy of the training samples, the results of our evaluations revealed that the application of privacy measures does impact utility.
In the evaluation of cryptographic approaches, the inference latency and communication overhead were the primary concerns. By applying modifications to the model under an acceptable accuracy penalty, we managed to reduce the inference latency by up to 95% and the volume of transmitted data by up to 97% for experiments using SMPC. For a hybrid approach using SMPC and HE, we achieved a reduction in inference latency of up to 84% and a reduction in the volume of transmitted data of up to 94%. Nevertheless, the resulting inference latency and communication overhead remain in a range considered impractical for real-world applications.
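Where this communication overhead comes from can be seen in a minimal additive-secret-sharing sketch (a generic two-party SMPC building block, not the protocol of the cited paper): secure multiplication via Beaver triples requires the parties to open masked values, and these openings, repeated for every multiplication in a neural network, drive the transmitted data volume.

```python
import random

P = 2**61 - 1  # large prime modulus for additive sharing (illustrative choice)

def share(x: int) -> tuple[int, int]:
    """Split x into two additive shares; neither share alone reveals x."""
    r = random.randrange(P)
    return r, (x - r) % P

def reconstruct(s0: int, s1: int) -> int:
    return (s0 + s1) % P

def beaver_triple():
    """Precomputed triple (a, b, c) with c = a*b, held in shared form."""
    a, b = random.randrange(P), random.randrange(P)
    return share(a), share(b), share((a * b) % P)

def secure_mul(x_shares, y_shares):
    """Multiply two shared secrets. The openings of d and e are the
    per-multiplication communication that SMPC inference pays for."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    d = reconstruct((x_shares[0] - a0) % P, (x_shares[1] - a1) % P)
    e = reconstruct((y_shares[0] - b0) % P, (y_shares[1] - b1) % P)
    # z = c + d*b + e*a + d*e  ==  x*y  (the d*e term is added by one party only)
    z0 = (c0 + d * b0 + e * a0 + d * e) % P
    z1 = (c1 + d * b1 + e * a1) % P
    return z0, z1
```

Each multiplication consumes one triple and two openings; reducing the number of multiplications in the model is one way such protocols cut transmitted data.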
In contrast to this, the obfuscation of training samples via Bloom encodings does not suffer from issues related to communication complexity, but it provides notably weaker privacy guarantees. Our evaluation showed that a privacy-aware choice of parameter values and the use of randomisation as an additional obfuscation measure take a toll on the classification performance of DGA classifiers trained on the encodings. While the precision of models trained on the encodings held up very well despite a high degree of added noise, we observed a notable reduction in classification accuracy and recall under stronger privacy settings.
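The intuition behind similarity-preserving Bloom encodings can be shown in a few lines. The sketch below (parameter names and values are ours, not those of the cited paper) hashes a domain's character bigrams into a fixed-size bit array, so similar strings set similar bit positions and some utility for classification survives; random bit flipping adds the kind of extra obfuscation whose privacy/utility trade-off the evaluation examined.

```python
import hashlib
import random

def bloom_encode(domain: str, m: int = 64, k: int = 2,
                 flip_prob: float = 0.0, rng=None) -> list[int]:
    """Encode the character bigrams of a domain label into an m-bit
    Bloom filter with k hash functions; flip_prob adds randomised
    bit-flipping noise as an additional obfuscation measure."""
    rng = rng or random.Random(0)
    bits = [0] * m
    label = domain.split(".")[0]
    for gram in (label[i:i + 2] for i in range(len(label) - 1)):
        for j in range(k):
            h = hashlib.sha256(f"{j}:{gram}".encode()).digest()
            bits[int.from_bytes(h[:4], "big") % m] = 1
    return [b ^ (rng.random() < flip_prob) for b in bits]

def dice(a: list[int], b: list[int]) -> float:
    """Dice similarity of two bit arrays: 1.0 for identical encodings."""
    inter = sum(x & y for x, y in zip(a, b))
    return 2 * inter / ((sum(a) + sum(b)) or 1)
```

Because similar domains share bigrams, their encodings remain close under the Dice coefficient, which is what lets a classifier learn from the encodings; increasing `flip_prob` blurs exactly this signal.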
This motivates further research into approaches that improve cybersecurity while protecting user privacy, aiming for an acceptable trade-off between these two goals, a trade-off we expect to only gain in relevance going forward. Acknowledging this, the CyberSEAS project dedicates one of its strategic objectives to the protection of consumers against personal data breaches and cyber attacks.
[1] Arthur Drichel, Mehdi Akbari Gurabi, Tim Amelung, and Ulrike Meyer. 2021. Towards Privacy-Preserving Classification-as-a-Service for DGA Detection. In 2021 18th International Conference on Privacy, Security and Trust (PST), December 13–15, 2021, Auckland, New Zealand. 1–10. https://doi.org/10.1109/PST52912.2021.9647755
[2] Lasse Nitz and Avikarsha Mandal. 2023. DGA Detection Using Similarity-Preserving Bloom Encodings. In European Interdisciplinary Cybersecurity Conference (EICC 2023), June 14–15, 2023, Stavanger, Norway. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3590777.3590795