Thinking about automation: network debugging

I’m hoping my generic model of a security automat (Levers, Data, Malice, Controls, Signals) will help me think about how tools can contribute to network security and operations. It produces the ideas I’d expect when applied to areas that I already know about, but the acid test is what happens when I use it to think about new applications. So here are some thoughts prompted by an automat that helps identify and debug wifi (and other) network problems: Juniper’s Mist AI. This was chosen purely because a colleague mentioned their Networkshop presentation: I wasn’t there, so the following is based on the published documentation.

Levers. Mist AI seems to produce two distinct kinds of output (noticing that got me thinking about different modes of automation). In a few cases – upgrading firmware being the main one highlighted in the product videos – it offers to actually perform the task for the human operator. But more often it suggests to the human what the root cause of a problem might be – “this DHCP server seems to be linked to a lot of other failures” – and relies on human expertise to diagnose and fix the problem. So humans make, or at least approve, all changes. There are interesting comments in one of the videos that more automation would be possible: “it’s not a technical issue, more one of building trust” and “we want no false positives”. The latter seems a very high bar, since most automats will be based on statistics and correlations, but the trust point is really important. Any new source of “authority” (see the introduction of printing!) faces a challenging balance: humans need to trust it enough that it actually saves time and improves quality, but not so much that they follow it blindly even when its recommendations – for example in a situation it wasn’t designed for – don’t make sense.
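
To make the two modes concrete, here is a minimal sketch in Python – with entirely invented names, not Mist AI’s API – of an automat output that is either a suggestion for the human to investigate or an action that still waits for explicit approval:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Output:
    """One output from the automat: a root-cause suggestion, plus (optionally)
    an action the automat could carry out itself, e.g. a firmware upgrade."""
    description: str                          # e.g. "DHCP server X linked to most failures"
    confidence: float                         # 0.0 - 1.0
    action: Callable[[], None] | None = None  # None => suggest-only

def handle(output: Output, human_approved: bool) -> str:
    if output.action is None:
        # Suggest-only mode: point at a likely root cause; diagnosis and
        # fix stay with the human expert.
        return f"suggest ({output.confidence:.0%}): {output.description}"
    if human_approved:
        output.action()                       # act mode: still gated on explicit approval
        return f"applied: {output.description}"
    return f"awaiting approval: {output.description}"
```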

Data. Mist AI takes data from a huge range of sources: pretty much any compatible network equipment. The one obvious gap is the computer/laptop/phone/printer/thing that is trying to connect to and use the network. Given the challenges of installing agents on such devices, that’s probably sensible. But it is a gap the system’s operators need to be aware of: when the output amounts to “the most likely cause is a misconfigured device, but I can’t see it”, that’s a design choice, not a system failure. Otherwise that huge range of input data enables another mode of human/machine partnership: helping humans visualise and navigate. The interface provides timelines and other visual representations to help humans spot patterns, as well as the ability to “pivot” through different sources – “show me what else is going on at that time/that device/etc.” – an ability I’ve seen used in many of the most impressive examples of security diagnosis.
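
The “pivot” idea is simple to sketch: given a stream of events, filter to a time window and, optionally, a device. This is a toy illustration with made-up field names, not how Mist AI actually stores or queries its data:

```python
from datetime import datetime, timedelta

# Each event is a dict such as:
#   {"time": datetime(...), "device": "ap-12", "kind": "dhcp_timeout"}
def pivot(events: list[dict], around: datetime,
          device: str | None = None,
          window: timedelta = timedelta(minutes=5)) -> list[dict]:
    """'Show me what else is going on at that time / on that device.'"""
    return [e for e in events
            if abs(e["time"] - around) <= window
            and (device is None or e["device"] == device)]
```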

Malice. The few fully-automated actions envisaged for Mist AI seem to give an attacker little opportunity, though anything with administrator access needs to be designed carefully. There’s a steady trickle of attacks that use auto-update functions to either downgrade security or install entirely new software. Perhaps more interesting is whether an automat could detect malicious activity at the network level: sending requests with odd parameters to get more than your fair share of bandwidth, or to disconnect others; or perhaps trying to confuse or overload network devices to redirect traffic or hide malicious activity in a storm of noise. This probably isn’t something you’d want to respond to automatically (that could easily hand a disruptive attacker a useful tool), but an automat that alerts when things simply don’t make sense could be a useful function.
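
As a sketch of that “alert, don’t act” idea – thresholds and names invented, nothing taken from the product – an automat might flag a client taking far more than its fair share of bandwidth while deliberately leaving any response to a human:

```python
def unfair_share_alerts(share_by_client: dict[str, float],
                        threshold: float = 0.5) -> list[str]:
    """Flag clients using more than `threshold` of total bandwidth.

    Deliberately returns alerts rather than disconnecting anyone: an
    automatic response here would itself hand a disruptive attacker a
    handy tool. A human decides what, if anything, to do.
    """
    return [client for client, share in share_by_client.items()
            if share > threshold]
```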

Controls. The executive powers of the current Mist AI system are very limited, and human operators have to approve individual actions (e.g. a firmware upgrade). One reason that works is that the system is addressing the kind of problems that users often assume will “go away”: network slow, took longer than usual to connect to wifi, etc. Indeed one stated aim of Mist is to help operators fix problems (or contact the user) before they are reported. On that time-scale, human-in-the-loop is fine, and worth it for the confidence it adds. Given the aim to fix problems before they are noticed, I wondered whether the system could check that the fix did actually reduce the problem: if not, the operator might want to automatically revert an approved change, or go back to a “last known good” state. Perhaps undoing a change is more readily automatable than doing it? And, since there is “machine learning” going on, maybe invoking those controls could be input to the next learning cycle (“OK, that didn’t work out”)?
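
A rough sketch of that apply, check, and (if necessary) revert loop – hypothetical function names, and glossing over how long to wait and how the problem is measured – might look like this:

```python
import time
from typing import Callable

def apply_with_rollback(apply_change: Callable[[], None],
                        revert_change: Callable[[], None],
                        problem_rate: Callable[[], float],
                        record_outcome: Callable[[bool], None],
                        observe_seconds: int = 1800) -> bool:
    """Apply an approved change, check the original problem actually reduced,
    and go back to the last known good state if it didn't."""
    baseline = problem_rate()      # e.g. DHCP failures per hour before the fix
    apply_change()
    time.sleep(observe_seconds)    # let the "after" picture accumulate
    if problem_rate() >= baseline:
        revert_change()            # undoing may be easier to automate than doing
        record_outcome(False)      # "OK, that didn't work out" -> next learning cycle
        return False
    record_outcome(True)
    return True
```

The point of the sketch is that the revert path, and the feedback into the next learning cycle, look just as automatable as the original change.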

Signals. One of the interesting displays I spotted in the video was strength-of-correlation: “100% of DHCP failures involve this server” suggests we have found the root cause; at 30% I might do a bit more digging. Confidence is a useful signal in pretty much any recommendation system, not least to highlight occasions when the machine, programmed to find something it can recommend, has searched ever more widely and ended up with a suggestion that’s little better than a random guess. Any fully-automated system should probably have a confidence threshold below which it will still seek human confirmation. And, as in Controls above, the timeline displays in Mist AI got me thinking whether we could use automated before/after comparisons to evaluate the effectiveness of changes: did the original problem reduce? Did anything else start happening? And I like the option to “list clients experiencing problems”: sometimes talking to a person will be the quickest way to work out what is going on – and it shows we aren’t completely mechanical, too!
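
As one last sketch (thresholds entirely made up), routing a recommendation according to the strength of the correlation behind it might look like:

```python
def route(recommendation: str, confidence: float,
          strong: float = 0.95, weak: float = 0.30) -> str:
    """Decide how much weight to put on a recommendation, given the
    strength of the correlation behind it."""
    if confidence >= strong:
        return f"likely root cause, queue for approval: {recommendation}"
    if confidence >= weak:
        return f"worth more digging: {recommendation}"
    # Little better than a random guess: say so and ask a human, rather
    # than recommend something just because something must be recommended.
    return f"low confidence ({confidence:.0%}), ask a human: {recommendation}"
```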

By Andrew Cormack

I'm Chief Regulatory Advisor at Jisc, responsible for keeping an eye out for places where our ideas, services and products might raise regulatory issues. My aim is to fix either the product or service, or the regulation, before there's a painful bump!
