Re-architecting a traditional IT Automation strategy to take advantage of ML involves three important steps.
The first step is to have AIOPS learning algorithms, rather than specific event signals, trigger downstream automations. There are two reasons for this. First, only a small percentage of the billions of possible events are specific enough to identify a single automation path, so the automation potential of event-based triggers is highly limited. Second, any condition that such an event reflects is likely to be so far along in its lifecycle that customers have already been affected for some time. These limitations are minimized, if not eliminated, by triggering automation from a properly identified, machine-recognized pattern (e.g., a link-down in Asia).
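As a rough illustration of pattern-based triggering, the Python sketch below fires an automation only when several correlated events form a recognizable pattern, never on a single event. The `recognize_pattern` function is a simple counting stand-in for a learned model, and the event fields, the `min_support` threshold, and the `reroute_asia_traffic` automation name are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # device or service that emitted the event
    region: str   # hypothetical field: where the source lives
    kind: str     # hypothetical field: normalized event type


# Hypothetical mapping from a recognized pattern label to an automation name.
AUTOMATIONS = {"link-down:asia": "reroute_asia_traffic"}


def recognize_pattern(events, min_support=3):
    """Stand-in for a learned model: return a pattern label when at least
    `min_support` events share the same (kind, region), otherwise None."""
    counts = Counter((e.kind, e.region) for e in events)
    if not counts:
        return None
    (kind, region), n = counts.most_common(1)[0]
    return f"{kind}:{region}" if n >= min_support else None


def trigger(events):
    """Return the automation to run for the recognized pattern, if any."""
    pattern = recognize_pattern(events)
    return AUTOMATIONS.get(pattern)


if __name__ == "__main__":
    window = [
        Event("router-1", "asia", "link-down"),
        Event("router-2", "asia", "link-down"),
        Event("router-3", "asia", "link-down"),
        Event("db-1", "europe", "disk-full"),
    ]
    print(trigger(window))      # pattern recognized from three correlated events
    print(trigger(window[:1]))  # a single event is not enough to trigger
```

In a real deployment the counting rule would be replaced by whatever correlation or anomaly model the AIOPS platform provides; the point of the sketch is only that the automation is keyed on the pattern label and not on any individual event.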
The second step is to ensure that the AIOPS decision context contains all of the parameters and information needed to choose the best action. Rule-based diagnostic workflows, for example, should be avoided in incident response because they are rigid and become error-prone over time as configurations drift from baseline. Instead, a classification model should be trained to identify the type of incident, its importance, and the appropriate service-restoration or preventative response.
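To make the classification idea concrete, here is a minimal sketch of a text classifier that maps an incident description to a label made up of incident type, importance, and response. It uses a toy multinomial naive Bayes model with add-one smoothing, chosen only because it fits in a few lines; the training examples, label values, and class name are invented for illustration and do not come from any particular AIOPS product.

```python
import math
from collections import Counter, defaultdict


class IncidentClassifier:
    """Toy multinomial naive Bayes over whitespace-separated tokens."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        """`examples` is an iterable of (description, label) pairs."""
        for text, label in examples:
            tokens = text.lower().split()
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        """Return the most probable label, or None if the model is unfitted."""
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Add-one (Laplace) smoothing so unseen tokens do not zero out a label.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                score += math.log((self.token_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled incidents: (description, (type, importance, response)).
TRAINING = [
    ("bgp session flap link down on core router",
     ("network-link-down", "high", "reroute_traffic")),
    ("interface down packet loss on uplink",
     ("network-link-down", "high", "reroute_traffic")),
    ("disk usage above 90 percent on database host",
     ("disk-pressure", "medium", "expand_volume")),
    ("filesystem almost full log volume growing",
     ("disk-pressure", "medium", "expand_volume")),
]

if __name__ == "__main__":
    model = IncidentClassifier().fit(TRAINING)
    print(model.predict("uplink interface down in asia"))
```

Unlike a hand-written diagnostic rule, the model is retrained from labeled incidents, so it can follow the environment as configurations drift.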
The third step is to model response selection as a sequence of candidate procedures. Each step in the sequence has a stop condition (success: the required condition has been met; failure: a critical error has occurred and a human is alerted). Sequence selection should itself be learned, with a learning algorithm predicting the sequence to run from the outcomes of prior sequences. Labeled incident types are then clustered by the response sequences that (1) take the fewest steps and (2) end in success.
A labeling schema for response variants that covers multiple action sequences should be developed.
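The third step can be sketched roughly as follows. This is one possible reading, not a definitive design: it assumes a step can report "resolved" (stop, success), "critical" (stop, alert a human), or "continue" (try the next procedure), and it replaces the learning algorithm with a simple ranking of past outcomes by success rate and then by mean number of steps. The names `Outcome`, `run_sequence`, and `SequenceSelector`, along with the sequence labels in the usage example, are hypothetical.

```python
from collections import defaultdict
from enum import Enum


class Outcome(Enum):
    RESOLVED = "resolved"   # required condition met: stop with success
    CONTINUE = "continue"   # no effect yet: try the next procedure
    CRITICAL = "critical"   # critical error: stop and alert a human


def run_sequence(steps, alert_human):
    """Run procedures in order; return (succeeded, steps_executed)."""
    for i, step in enumerate(steps, start=1):
        outcome = step()
        if outcome is Outcome.RESOLVED:
            return True, i
        if outcome is Outcome.CRITICAL:
            alert_human(f"step {i} hit a critical error")
            return False, i
    alert_human("sequence exhausted without restoring service")
    return False, len(steps)


class SequenceSelector:
    """Stand-in for a learned selector: ranks sequences by past outcomes."""

    def __init__(self):
        # incident type -> sequence label -> list of (succeeded, steps_executed)
        self.history = defaultdict(lambda: defaultdict(list))

    def record(self, incident_type, sequence_label, succeeded, steps_executed):
        self.history[incident_type][sequence_label].append(
            (succeeded, steps_executed)
        )

    def choose(self, incident_type):
        """Prefer the highest success rate, then the fewest mean steps."""
        candidates = self.history.get(incident_type)
        if not candidates:
            return None

        def score(item):
            _, runs = item
            success_rate = sum(ok for ok, _ in runs) / len(runs)
            mean_steps = sum(n for _, n in runs) / len(runs)
            return (-success_rate, mean_steps)

        return min(candidates.items(), key=score)[0]


if __name__ == "__main__":
    alerts = []
    steps = [lambda: Outcome.CONTINUE, lambda: Outcome.RESOLVED]
    ok, n = run_sequence(steps, alerts.append)

    selector = SequenceSelector()
    selector.record("network-link-down", "restart_then_reroute", ok, n)
    selector.record("network-link-down", "reroute_only", True, 1)
    print(selector.choose("network-link-down"))
```

The sequence labels recorded here play the role of the labeling schema for response variants: each label names one action sequence, so outcomes can be compared across variants for the same incident type.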
For a really simple explanation of sequence learning, check out this link: https://www.youtube.com/watch?v=kqSzLo9fenk.