Enterprises around the globe are creating “data science” teams to apply machine learning to cybersecurity. This is a positive development as the use of machine learning and other artificial intelligence techniques will continue to be instrumental in setting a more proactive cybersecurity posture. In this article, we discuss three tips to improve such programs.
- Aggressively root out temporal intermixing
In many machine learning applications, the goal is to predict some event. Clearly, the prediction should use only data that was known before the event occurred, and this is rather uncontroversial. In practice, however, this is difficult to ensure when evaluating an algorithm on historical data. Part of the difficulty is data sparsity: modelers often do not have a large amount of data. They are then tempted to work around the shortage by assuming temporal independence where none exists, for example by shuffling records from different time periods into the same training and test sets. This produces very good results during training and evaluation, but models built this way perform poorly in practice.
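One way to guard against temporal intermixing is to split historical data at a cutoff time and then check that no training record is as late as the earliest test record. The sketch below is illustrative only: the `Event` type, the function names, and the sample dates are all hypothetical, and a real pipeline would also need to check that each feature was known at prediction time.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: datetime  # when the observation was recorded (hypothetical field)
    label: int           # 1 if the event being predicted occurred


def temporal_split(events, cutoff):
    """Train on events strictly before `cutoff`, test on events at or after it."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    train = [e for e in ordered if e.timestamp < cutoff]
    test = [e for e in ordered if e.timestamp >= cutoff]
    return train, test


def has_temporal_leak(train, test):
    """True if any training event is at or after the earliest test event."""
    if not train or not test:
        return False
    latest_train = max(e.timestamp for e in train)
    earliest_test = min(e.timestamp for e in test)
    return latest_train >= earliest_test


# Made-up data: one event per day for ten days.
events = [Event(datetime(2018, 1, day), day % 2) for day in range(1, 11)]

# Time-ordered split: train on days 1-7, test on days 8-10.
train, test = temporal_split(events, cutoff=datetime(2018, 1, 8))
print(len(train), len(test), has_temporal_leak(train, test))

# Interleaved split (a stand-in for random shuffling): future data ends up in training.
mixed_train, mixed_test = events[::2], events[1::2]
print(has_temporal_leak(mixed_train, mixed_test))
```

The interleaved split is the kind of shortcut that looks harmless when data is scarce, which is why an explicit leak check is worth running on every evaluation set.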
- Understand what the confidence score returned by the algorithm really means
Most machine learning algorithms provide a “confidence score” with each result. In general, the higher the score, the more likely the result is to be true. However, for nearly all machine learning algorithms, the confidence score does not map directly to precision (the fraction of returned results that are correct). Further, the relationship between confidence score and precision is often not linear. For example, a confidence score of 0.3 may correspond to 10% precision while a score of 0.4 corresponds to 60% precision. A user can make better decisions, such as trading off true positives against false positives, if the confidence score is mapped to something more useful, such as the precision observed at that score.
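A simple way to see what a confidence score means is to bin scores on held-out data and measure the precision inside each bin. The following is a minimal sketch under stated assumptions: scores lie in [0, 1], labels are 0 or 1, and the function name and sample data are hypothetical (the data is made up to mirror the 10% and 60% example above).

```python
def precision_by_bin(scores, labels, n_bins=10):
    """Return (low, high, precision, count) for each non-empty score bin.

    Assumes scores are in [0, 1] and labels are 1 (correct) or 0 (incorrect).
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, label in zip(scores, labels):
        i = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        totals[i] += 1
        hits[i] += label
    return [
        (i / n_bins, (i + 1) / n_bins, hits[i] / totals[i], totals[i])
        for i in range(n_bins)
        if totals[i] > 0
    ]


# Made-up held-out results: ten results scored 0.35, five scored 0.45.
scores = [0.35] * 10 + [0.45] * 5
labels = [1] + [0] * 9 + [1, 1, 1, 0, 0]

table = precision_by_bin(scores, labels)
for low, high, precision, count in table:
    print(f"score {low:.1f}-{high:.1f}: precision {precision:.0%} over {count} results")
```

A table like this lets a user pick a score threshold by the precision they are willing to accept rather than by the raw score. Bins with few results give noisy estimates, so the count is reported alongside each precision.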
- Analyze lead time
Metrics such as precision, recall, true positives, and false positives are all very important. However, in an application like cybersecurity, understanding the lead time, the interval between a prediction and the event it predicts, is also important. A prediction in cybersecurity often tells us there is work to do: patch a vulnerability, block a port, and so on. These actions take time, and it is important to know how to prioritize them, for example whether a prediction justifies asking someone to work on a weekend. Further, once lead time is measured, engineers can work to maximize it, which leads to more useful predictions.
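Lead time can be measured directly from historical true positives by comparing when each prediction was made with when the predicted event occurred. The sketch below assumes such pairs are available; the function names, the sample dates, and the 48-hour remediation window are all hypothetical.

```python
from datetime import datetime
from statistics import median


def lead_times_hours(pairs):
    """Lead time in hours for each (predicted_at, occurred_at) pair.

    A negative value means the prediction arrived after the event.
    """
    return [
        (occurred_at - predicted_at).total_seconds() / 3600
        for predicted_at, occurred_at in pairs
    ]


def fraction_actionable(leads, hours_needed):
    """Share of predictions that left at least `hours_needed` to respond."""
    return sum(lead >= hours_needed for lead in leads) / len(leads)


# Made-up true positives: (time of prediction, time the event occurred).
pairs = [
    (datetime(2018, 3, 1, 9), datetime(2018, 3, 2, 9)),   # 1 day of warning
    (datetime(2018, 3, 5, 9), datetime(2018, 3, 8, 9)),   # 3 days of warning
    (datetime(2018, 3, 10, 9), datetime(2018, 3, 20, 9)), # 10 days of warning
]

leads = lead_times_hours(pairs)
print("median lead time (hours):", median(leads))
print("fraction with at least 48 hours:", fraction_actionable(leads, hours_needed=48))
```

Comparing the lead-time distribution with how long a patch or firewall change actually takes shows which predictions can be handled during normal working hours and which cannot.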
Machine learning is not easy and there are many facets one must consider when applying it to a complex application like cybersecurity. Continue to follow our blog to learn more about the nuances of this exciting new area.