Safety is about protecting humans from machines, while cybersecurity is about protecting machines from humans. One may dispute this statement, whose source I can no longer find, but we cannot deny that humans and machines are becoming more closely interlinked every day. Moreover, the futurist and inventor Ray Kurzweil predicts that a human-machine civilization is our destiny.

Given that interactions between computers and human (natural) languages are based on language processing, it has become increasingly important to upskill computers in processing and analyzing large amounts of language data. Thus, the role of Natural Language Processing (NLP) in human-machine interaction, and consequently in the cybersecurity world, is gaining traction. So what may be the role and applications of NLP in cybersecurity?

Human-Machine Teaming in Cybersecurity

There are two main drivers nurturing the human-machine teaming in cybersecurity activities, and language is the main tool for both of them:

 

  • Communication: people started communicating with machines through constructed languages (e.g. programming languages) but increasingly use natural language to do so (e.g. chatbots, virtual assistants).
  • Automation efficiency: this may be achieved by implementing technologies such as robotic process automation (RPA) or AI workers; language, whether formal or natural, keeps its role as the main interface.

 

The progress of recent years, and the exploding number of applications and research papers, suggest that machine learning for language processing plays an increasingly important role in human-machine teaming. Its analytical and generative capabilities for speech and text are thus making it an important tool in the hands of all types of actors around cybersecurity.

To stress this idea, here are some NLP tasks that may be considered in creating threat or attack tools, or in defending humans and machines:

 

  • Language Analytics: language modelling, sentiment analysis, text classification, named-entity recognition, natural language inference, relation extraction, semantic parsing, co-reference resolution, entity linking, relational reasoning, semantic composition, language identification and translation, entity and information extraction, intent detection and classification, stance and fake news detection, rumor detection, hate speech detection, clickbait detection, abuse detection.
  • Language Generators: question-answering systems, text and dialogue generation, text summarization, slot filling for knowledge base population tasks, script and programming code generation.

These tasks should generally be considered in a more complex context, where they can (automatically) use one another’s capabilities as well as other AI tools, such as sound and voice processing and generation, image processing, etc.
Language Analytics & Language Generators

As you may see on paperswithcode.com/area/nlp, on statista.com, or in many other sources, the number of NLP applications and their market are constantly growing. Given that any IT system or application must nowadays consider its cybersecurity component, the role of NLP in this domain is also increasing. The plethora of research and algorithms released every day may therefore be used for a suite of cybersecurity applications. The diagram below depicts some of them. It is split into three types of usage, as follows:

  • the yellow boxes represent applications using automated information gathering and generation that may be used to attack or defend the human side of the human-machine teaming;
  • the blue boxes are for applications where NLP may be used by threat actors to strengthen attacks against machines, and by defenders to strengthen the defense posture of their machines;
  • the green boxes depict applications supporting the automation of some core security operations and compliance activities.
 NLP Tooling
Further on, we briefly discuss each of these applications, stressing the contribution of NLP tooling to them.

The yellow boxes:

  • According to CrowdStrike’s CTO Dmitri Alperovitch, “the power of cyber isn’t the ‘cyber Pearl Harbor’ scenario, which we’ve been talking about for 25 years now and hasn’t happened. The real power is in information”. Advances in machine learning and processing power mean it is now possible to process vast amounts of information in order to solve specific practical problems, among them information gathering and information generation. Many big private companies and state agencies are already masters of information gathering (no need to give examples here). Over the past years, we have seen amazing results in generating information such as text, sound or images with machine learning. Very recently, OpenAI has apparently released its GPT-2 language model only partially, “due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale”. OpenAI further states that they “are aware that some researchers have the technical capacity to reproduce and open source our results”. But it probably also lies within the power of machine learning models and algorithms to uncover fake and deceptive information at scale.
  • NLP may act as a force multiplier when combined with acoustics and sound generation. This brings audio and textual impersonation into a whole new era, challenging our human senses and our ability to tackle this type of (dis-)information. One example comes from a Chinese company called iFlyTek, which used AI to create a clip of Donald Trump speaking in English and Mandarin. Moreover, these kinds of technologies may be used together with GAN algorithms and body key points to generate fake videos in which faces speak a given speech.
  • While fake videos are still somewhat challenging to generate, the website thispersondoesnotexist.com uses artificial intelligence to endlessly generate the faces of people who don’t actually exist. This is just one of the existing capabilities for creating identities and private information that do not exist at all. Using NLP, the data created by bots can reach large scale. In an interesting paper called Using Machine Learning to Detect Fake Identities: Bots vs Humans, Estée van der Walt and Jan Eloff conclude, among other things, that “existing features and machine learning models used to detect bot accounts are not suited to detect fake human accounts”. While further work is ongoing in the area of identity deception, there is a need for better mechanisms to automatically identify malicious use of fake identities.
  • (Spear-)phishing occurs when scammers send tailored emails, text messages, or phone calls to swindle victims. With social media as a prime resource, spear phishers are becoming even more efficient, and the NLP toolbox is well suited to supporting these attacks. Using NLP information-gathering tasks, attackers automate the collection of personal information from social media and other sources, making the attack harder to identify. Defenders could, however, use the same tools to defend against these attacks. An interesting MIT paper by Kotson and Schulz shows how NLP can be used to identify particular phishing campaigns against an organization. According to the authors, as long as an analyst is able to obtain text-based profiles characterizing both the phishing identities and their targets, similarity analysis and Latent Semantic Analysis can be applied to any spear-phishing problem to identify the adversaries’ intent and capabilities.
  • When it comes to censorship and disinformation, we can already see two different types of applications. One is the use of NLP at scale to censor “non-compliant content”, particularly in countries where criticism of the government, military, or ruling family is subject to censorship. The other is discussed by Heng Ji and Kevin Knight in their paper Creative Language Encoding under Censorship. According to them, “encoding tools, if successful, will also be able to fast co-evolve over time with the semi-automatic censorship system’s own evolutionary processes and ultimately defeat it. In the meanwhile it will bring new challenges and opportunities for adapting downstream natural language processing techniques to understand coded languages”.
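
To make the similarity analysis mentioned for spear phishing a bit more tangible, here is a minimal sketch in Python. It is not the method from the Kotson and Schulz paper: it only compares text profiles with TF-IDF weights and cosine similarity, and leaves out the Latent Semantic Analysis step (which would add a dimensionality reduction on top). The profiles, the names `sender_profile` and `target_profiles`, and the helper functions are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(documents):
    """Build one sparse TF-IDF vector (a dict term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF so that no weight becomes zero or negative.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(sender, targets):
    """Rank target profiles by their textual similarity to the sender profile."""
    names = list(targets)
    vectors = tf_idf_vectors([sender] + [targets[name] for name in names])
    scores = {
        name: cosine_similarity(vectors[0], vec)
        for name, vec in zip(names, vectors[1:])
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical text profiles: one for a phishing identity, two for possible targets.
sender_profile = "urgent invoice payment wire transfer finance department"
target_profiles = {
    "finance_team": "invoice payment accounting wire transfer budget finance",
    "engineering_team": "kubernetes deployment code review release pipeline",
}

ranking = rank_targets(sender_profile, target_profiles)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy setting the finance profile ranks first because it shares vocabulary with the phishing identity, which is the kind of signal an analyst could use to guess who a campaign is aimed at.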

The blue boxes:

  • A wide array of web attacks and defense mechanisms can use deep learning and NLP, whether they are based on time series or on HTTP requests and responses. While attackers may use information gathering techniques (see above) or adversarial machine learning, defenders may strengthen their posture by using demonstrated applications of NLP, such as seq2seq autoencoders and other models for web security. Arseny Reutov, Irina Stepanyuk, Fedor Sakharov, and Alexandra Murzina have shown how this can be applied to network traffic analysis and the protection of web applications, while other developers have already extended their work with anomaly detection capabilities. Yet another practical example comes from Dominic Puzio, who, during his career at Capital One, created a system to analyze enterprise-scale network traffic in real time, render predictions, and raise alerts for cybersecurity analysts to evaluate. Thus, we can only assume that, despite the ever-growing technical challenges, market leaders in WAF and anomaly detection are considering implementing such capabilities in their products. Applications of anomaly detection and domain classification have also been researched and tested for detecting malicious URLs. A Colombian research team has shown that URL patterns are a good predictor of phishing websites, achieving 95% accuracy with a Long Short-Term Memory (LSTM) network. Researchers Sandeep Yadav and Ranjan propose a methodology for detecting algorithmically generated domain names used for “domain fluxing” by botnets. One more example comes from Walid Daboubi, who open-sourced on GitHub an autoencoder neural network to detect malicious URLs.
  • When it comes to Malware and Code Analysis, the data scientists at Endgame are building upon advanced NLP techniques to better identify and understand malicious code. They have created a framework for malware analysis called Malicious Language Processing. Its goal is to operationalize NLP by automating and expediting the identification of malicious code hidden within benign code. Using the concept of lexical parsing, they treat malicious binaries as a large body of text. This way, machines could “understand” code without needing to execute it. Similarly, NLP techniques can be extended towards Vulnerability Assessment by means of vulnerability extrapolation. According to Fabian Yamaguchi, Felix Lindner, and Konrad Rieck, machine learning may be applied to find functions similar to vulnerable ones, based on application-specific API usage patterns. Therefore, starting from a known vulnerability, such patterns can be exploited to guide the auditing of code and to identify potentially vulnerable code with similar characteristics.
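
The idea of treating domain names and URLs as text can be illustrated with a very small Python sketch. It is only a character-bigram heuristic of my own for this article, not the methodology of the researchers cited above and far simpler than an LSTM or autoencoder: it measures how many character bigrams of a domain label also occur in a reference list of ordinary words. The word list `BENIGN_WORDS`, the threshold and the function names are assumptions chosen for the example.

```python
def char_bigrams(text):
    """Return the set of two-character substrings of a lowercased string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Tiny, hypothetical reference vocabulary; a real system would use a large
# corpus of known-good domain names or dictionary words.
BENIGN_WORDS = [
    "google", "facebook", "weather", "online", "shop",
    "news", "mail", "bank", "travel", "cloud",
]
BENIGN_BIGRAMS = set().union(*(char_bigrams(word) for word in BENIGN_WORDS))


def benign_bigram_ratio(domain):
    """Fraction of the first label's bigrams that also occur in the reference set.

    Simplification: only the part before the first dot is inspected.
    """
    label = domain.split(".")[0]
    grams = char_bigrams(label)
    if not grams:
        return 0.0
    return len(grams & BENIGN_BIGRAMS) / len(grams)


def looks_generated(domain, threshold=0.5):
    """Flag a domain whose bigrams rarely appear in ordinary words."""
    return benign_bigram_ratio(domain) < threshold


for candidate in ["onlinenews.com", "xkqzvbtwpj.net"]:
    print(candidate, round(benign_bigram_ratio(candidate), 3), looks_generated(candidate))
```

Word-like names share most of their bigrams with the reference vocabulary, while random-looking strings share few or none. Published approaches compare character distributions in more robust ways, and the threshold here would have to be tuned on real data.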

The green boxes:

  • Probably the most important role of Machine Learning is to increase human-machine efficiency. In cybersecurity, this is the Event Detection and Prediction domain, where humans try to understand what is happening and what might happen in a given IT environment. Additionally, NLP deep analytics, classification and co-reference resolution support inference and orchestration. One interesting solution in this area comes from Empow Security and Elastic. They use NLP to integrate advanced analytics into a rules-free SIEM solution. Using NLP correlation capabilities, their solution performs cause-and-effect analytics covering all stages of an attack lifecycle. In terms of overall cybersecurity operational solutions, the Atos SOC team went a step further by implementing a prescriptive analytics solution, in which data scientists work together with cybersecurity personnel. According to them, there is a need to go beyond predictive analytics by specifying both the actions necessary to achieve predicted outcomes and the interrelated effects of each decision.
  • A quick look at the Gartner Peer Insights page on the Threat Intelligence market shows a rather crowded market, with at least 24 products currently available. What you may not have known is that many of them implement natural language processing to enhance their capabilities. Josefin Ondrus from Recorded Future and their team explain how NLP is used to understand 350 facts per second. Using NLP, threat intelligence products not only read and understand the meaning of words and technical data in multiple languages, but also use billions of data points to identify patterns. With ever better accuracy, the machine learns the language of threats to cut through the noise of potential false positives. By using NLP, machines are able to crunch more and more data sources, in different languages, and automatically highlight notable intelligence to us, the humans.
  • NLP-based ontologies support the automation and integration of IT risk management and cyber resiliency. Given that risk analysis and risk assessment also involve textual information, IT risk management products using NLP can correlate the requirements of different frameworks and methodologies against aggregated data from an IT environment. NLP can enhance product capabilities that help ensure regulatory and legal compliance, and it may even facilitate communication with regulators. It can support the execution of qualitative risk assessments and provide hints on the business consequences of some risks. Furthermore, NLP’s content analytics capabilities can efficiently track changes to regulatory requirements and support the evaluation of compliance-related costs. NLP can also be used to reduce cybersecurity model risk through improved risk models.
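
As a small illustration of the entity and information extraction that threat intelligence tooling builds on, the following Python sketch pulls a few indicator types out of free text. It uses plain regular expressions rather than a trained NLP model, so it is only a stand-in for what commercial products do. The pattern set `IOC_PATTERNS`, the function name and the sample report are invented for this example, and the IPv4 pattern does not validate octet ranges.

```python
import re

# Simplified patterns for a few indicator types (assumptions for this sketch).
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def extract_indicators(text):
    """Return the unique matches per indicator type, sorted alphabetically."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in IOC_PATTERNS.items()
    }


# Invented report text; the IP address is from a range reserved for documentation.
report = (
    "The actor exploited CVE-2017-0144 and contacted 203.0.113.7. "
    "Dropped file hash: " + "ab" * 32 + "."
)

indicators = extract_indicators(report)
for kind, values in indicators.items():
    print(kind, values)
```

A learned named-entity recognizer would go beyond such fixed patterns, for instance by recognizing actor names or malware families from context, but the output structure would look similar.
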
One category that I have left out of the diagram above is script and code generation. Back in 2015, Andrej Karpathy, the head of AI at Tesla, took all the source and header files found in the Linux repo on GitHub, concatenated them into a single giant file and trained an LSTM model, which eventually generated some syntactically correct, though not executable, code. I wonder what a machine would generate, let’s say in Python, if the model were trained on all the Python code available on GitHub. What if we went a step further and trained it on the entire malware corpus already available on the web? We might still be some way from that point, but accelerating change may allow us to reach it in the not-so-distant future.
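
To give a feel for character-level generation, here is a deliberately simple Python sketch. It is not an LSTM and not Karpathy’s code: it is a character-level Markov chain that records which character followed each four-character context in a tiny made-up corpus and then samples from those observations. The function names, the corpus and the `order` parameter are assumptions for illustration.

```python
import random
from collections import defaultdict


def train_char_model(corpus, order=4):
    """Map each `order`-character context to the characters seen right after it."""
    model = defaultdict(list)
    for i in range(len(corpus) - order):
        model[corpus[i:i + order]].append(corpus[i + order])
    return model


def generate(model, seed_text, length=80, rng=None):
    """Extend `seed_text` one character at a time by sampling observed followers.

    `seed_text` must be exactly as long as the order used for training.
    """
    rng = rng or random.Random(0)
    order = len(seed_text)
    output = seed_text
    for _ in range(length):
        followers = model.get(output[-order:])
        if not followers:
            break  # context never seen during training
        output += rng.choice(followers)
    return output


# Tiny stand-in corpus; a real experiment would use many megabytes of source code.
corpus = (
    "def add(a, b):\n    return a + b\n\n"
    "def sub(a, b):\n    return a - b\n\n"
    "def mul(a, b):\n    return a * b\n\n"
)

model = train_char_model(corpus, order=4)
sample = generate(model, "def ", length=60, rng=random.Random(42))
print(sample)
```

Even this toy model produces text that looks like code, because every short window of the output was seen in the training data. A recurrent network goes further by learning longer-range structure such as matching brackets and indentation.
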
Machine Learning
We have seen that Machine Learning in general, and Natural Language Processing in particular, are tools used with increasing intensity in the cybersecurity world. Whether we are talking about different actor types, risk scenarios such as threats and events, security operations or compliance, these tools bring new capabilities that support both attackers and defenders in their activities.
My final point uses two more citations. Peter J. Denning and Craig Martell state in Great Principles of Computing: “The principles of a field are actually a set of interwoven stories about the structure and behavior of field elements”. Gilles Laurent, associate professor of biology and computational and neural systems at the California Institute of Technology, observes in the Nature article What does ‘understanding’ mean?: “In most cases, a system’s collective behavior is very difficult to deduce from knowledge of its components”. The NLP analytics or generator techniques may be interesting, but if they are taken only as singular tasks, the story might not be complete. Should they be combined in each of the cybersecurity applications given as examples, and of course way beyond these examples, then the behavior becomes more complex and, as such, more difficult to deduce from knowledge of its components alone. As the volume of research, the power of computing, and the available data keep increasing, I am persuaded that NLP will continue to bring breakthroughs to the cybersecurity area. Given that attackers never miss an opportunity to enhance their capabilities, it rests in the hands of the defenders and their providers to adapt as well and to see NLP as a possible solution to their challenges. At least until machines learn and adapt by themselves…