ML for Cybersecurity


Ever since we started CYR3CON, we have been continually surprised by the hype cycle surrounding the application of machine learning.



Many marketers opportunistically see it as another angle to entice buyers. When we talk to CSOs, their frustration with the hype, equivocation, and machine learning marketing charlatans is palpable.

That said, there are some great technologies out there. We’ve been doing machine learning for a while, so we thought a blog post about what to look for might help separate the hype from the cool stuff.


 1. Where is machine learning being used?

There are many claims in cybersecurity marketing about the use of machine learning. Yet many of these products either don’t use machine learning or use it in a trivial manner. The vendor should be able to clearly explain where machine learning is being used. For example, visualizing the counts of data, searching through large amounts of data, and matching signatures are not machine learning — they are data mining (still useful in many cases). There are also some solutions where machine learning is “bolted on” in a side feature — and not really contributing to the overall value. Ask to be shown exactly where and how machine learning is being used.


 2. Is the technology peer-reviewed?

We demand peer review from the medical industry because we want our medical devices and drugs to be independently validated by experts. Further, such validation depends on metrics, and the experiments that produce those metrics need to be transparent. Machine learning should be no different, as there are many ways to produce precision, recall, accuracy, and false positive rates that are not relevant to cybersecurity operations. For example, an MIT Lincoln Labs study showed that earlier work on predicting exploitation of vulnerabilities faced serious barriers to deployment. The previous studies had important methodological shortcomings.
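To make the point about metrics concrete, here is a minimal sketch in Python. The numbers are made up for illustration (1,000 vulnerabilities, of which 20 are exploited) and are not drawn from any study. It shows how a "model" that never predicts exploitation can still report higher accuracy than a model that is actually useful.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, where 1 means 'exploited'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return the four metrics named in the text as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical ground truth: 20 exploited vulnerabilities, 980 never exploited.
y_true = [1] * 20 + [0] * 980

# A "model" that never predicts exploitation.
never_flags = [0] * 1000

# A model that catches 15 of the 20 exploited ones, with 30 false alarms.
useful = [1] * 15 + [0] * 5 + [1] * 30 + [0] * 950

trivial_metrics = metrics(y_true, never_flags)
useful_metrics = metrics(y_true, useful)

for name, result in (("never flags", trivial_metrics), ("useful", useful_metrics)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With rare events, accuracy alone rewards the model that does nothing, which is why it is worth asking which metrics were reported and how the experiment was set up.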


 3. Is the data feeding the solution really indicative of future events?

If you work for a dairy company, you may be interested in software to predict the consumption of cheese. But would you buy a tool to make such a prediction based on the number of people who die by becoming tangled in their bedsheets? Regardless of how fancy an algorithm or piece of software is, it’s making the prediction based on some piece of data — and you should ask the vendor what that is and ask him or her why it makes sense.


[Figure: graph from Spurious Correlations, used under the Creative Commons License]
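The cheese example can be reproduced in spirit with a few lines of Python. This is a rough sketch using two made-up yearly series (not the data behind the chart) that have nothing to do with each other except that both happen to rise over time. The function names are ours, chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def yearly_changes(series):
    """Year-over-year differences, which remove a shared upward trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Made-up numbers: two unrelated quantities that both grow over ten years.
series_a = [10, 12, 11, 14, 15, 17, 16, 19, 21, 20]
series_b = [100, 130, 150, 140, 180, 190, 230, 220, 260, 270]

r_levels = pearson(series_a, series_b)
r_changes = pearson(yearly_changes(series_a), yearly_changes(series_b))

print("correlation of raw values:", round(r_levels, 2))
print("correlation of yearly changes:", round(r_changes, 2))
```

The raw values correlate strongly only because both series trend upward. Once the trend is removed, the strong positive relationship is gone, so a high correlation by itself says little about whether one data source can sensibly predict the other.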


 4. Can we look inside the “black box”?

Much of machine learning today is a black box, meaning that it makes a prediction and you are left to either accept it or not. However, when you get into the realm of security, it becomes important to have some intuition of how the software reached a given prediction. In enterprise networks, new software is constantly being added, new business units get acquired, and policy changes often occur. On the other side, the adversary is constantly changing and adapting. So, when a machine learning solution for cyber provides a seemingly strange insight, it becomes important to understand the logic used to arrive there and how that plays in the context of the larger enterprise. With a black box, you will never get that insight. No machine learning system will function indefinitely without some tuning or refreshing, so when false positives inevitably occur, it will be important to understand why.
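As one illustration of what "looking inside" can mean, here is a minimal sketch of a linear risk score that reports how much each input contributed to the result. The feature names, weights, and bias are all hypothetical, and this is only one of many ways a vendor might expose its reasoning.

```python
# Hypothetical weights of a linear risk score; names and values are made up.
WEIGHTS = {
    "failed_logins_last_hour": 0.4,
    "new_device": 1.5,
    "off_hours_access": 0.8,
    "privileged_account": 1.2,
}
BIAS = -3.0


def explain(event):
    """Return the score and each feature's contribution, largest first."""
    contributions = {
        name: weight * event.get(name, 0) for name, weight in WEIGHTS.items()
    }
    score = BIAS + sum(contributions.values())
    ranked = sorted(
        contributions.items(), key=lambda item: abs(item[1]), reverse=True
    )
    return score, ranked


# A hypothetical login event that the score flags as risky.
event = {
    "failed_logins_last_hour": 6,
    "new_device": 1,
    "off_hours_access": 1,
    "privileged_account": 0,
}

score, ranked = explain(event)
print("score:", round(score, 2))
for name, contribution in ranked:
    print(f"  {name}: {contribution:+.2f}")
```

When a prediction looks strange, a breakdown like this lets an analyst see which input drove it and judge whether that input still makes sense after a network or policy change.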


 5. Is the model updated?

We’re doing security, which means the adversary is constantly adapting. In some cases, the adversary may even adapt to the vendor’s method. From a machine learning standpoint, such changes mean that the underlying distribution of the data may also change, and this is something that can stop a machine learning algorithm dead in its tracks. The remedy is for the vendor to refresh the model used to make the predictions on a regular basis.
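A change in the underlying distribution can be checked for directly. The sketch below compares a feature's values at training time with recent values using the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical distributions. The sample values and the 0.3 threshold are assumptions made for illustration, not a recommended setting.

```python
from bisect import bisect_right


def ks_statistic(reference, recent):
    """Largest gap between two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    rec = sorted(recent)
    gap = 0.0
    for x in ref + rec:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_rec = bisect_right(rec, x) / len(rec)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap


# Hypothetical threshold; a real deployment would need to choose its own.
DRIFT_THRESHOLD = 0.3


def needs_refresh(reference, recent):
    """Flag the model for retraining when the feature has drifted too far."""
    return ks_statistic(reference, recent) > DRIFT_THRESHOLD


# Made-up values of one feature at training time and in two later windows.
training = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
similar_window = [1, 2, 3, 3, 3, 4, 4, 4, 5, 6]
shifted_window = [5, 6, 6, 7, 7, 8, 8, 9, 9, 10]

print("similar window needs refresh:", needs_refresh(training, similar_window))
print("shifted window needs refresh:", needs_refresh(training, shifted_window))
```

A check like this does not replace regular retraining, but it gives a concrete signal for when the data the model sees no longer resembles the data it was built on.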


If the vendor you are dealing with is doing proper machine learning that truly delivers value, then these should be fairly easy questions. On the other hand, if “machine learning” is being used as a shallow marketing term, you may find that the vendor struggles to provide an answer that makes sense. A poor answer to any of the above questions should definitely be cause for concern and additional follow-up questions.