A hot topic within AIOps is without a doubt the promised land of self-healing, where an AIOps solution assists engineers and SREs with automatic actions. But just how effective is self-healing technology? Can it be relied upon, or is it merely a buzzword with little to no practical use? This introductory article from Einar & Partners covers the art of self-healing and what you can expect of it.
History and background
An engineer or SRE typically has a busy job. Being available at unusual hours and fighting outages while unhappy end-users wait for updates makes it one of the more intense positions within a company. How often have we heard the joke “never push to production on a Friday afternoon”, only for it to be followed by a weekend of technical troubleshooting and pulling out hair in frustration? A horror scenario indeed, but an increasingly common one as organizations have to be extremely agile.
As the name indicates, an SRE, or site reliability engineer, has to be on call and fix issues as they arise while ensuring the business runs smoothly. Statistics indicate that an SRE spends at best 50% of their time fixing issues (as at Google), and at most organizations significantly more. But zooming in on those statistics, the question asked by big organizations like Google and Amazon is: how is that time spent?
The answer is quite simple: most of the “troubleshooting” time is unfortunately spent on a concept called “TOIL”.
TOIL & DevOps – What is TOIL?
Toil is the repetitive, the mundane, the tedious and unproductive work that an SRE has to execute daily. In other words, the tasks that can be automated and that create the most significant overhead in terms of time investment for an organization. Some examples include fetching log files, rebooting services, running scripts, finding information, service checks, applying configurations or copying and pasting commands from a playbook.
Unless an engineering department is careful, too much TOIL can easily result in burnout due to its dull and repetitive nature. While engineers deal with TOIL overload, they are simultaneously expected to contribute to the development and code base of applications and services. This situation can easily create confusion about what an SRE is supposed to do: fight endless fires, or contribute to design and optimization?
Why is TOIL bad?
- Slows down innovation & progress
- Reduction of quality due to manual work
- Never-ending list of manual tasks that takes a long time to teach new hires
- High OPEX due to low efficiency
“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value and that scales linearly as a service grows.” - Vivek Rau, Google
Best strategy for automatic remediation?
With the previous introduction in mind, how can organizations leverage modern solutions and technology to reduce TOIL? In a DevOps world where any given application may have hundreds of microservices, states to keep track of, and endless dependencies, automation is vital.
A common misconception is that self-healing and automatic remediation will replace the in-depth troubleshooting that SREs and Ops perform. This is not the case, as fixing more complicated issues will require skilled engineers for the foreseeable future. Implementing auto-remediation has a different focus and concentrates on automating the many small tasks rather than the few big ones.
Auto remediation and realistic use cases in AIOps
The philosophy of auto-remediation and self-healing is to move away from the model in which every alert from an application starts with a human response. It flips this equation in favor of AIOps as the first point of contact rather than a person. Most applications and alerts have a set of standard steps to resolve a given issue. Sometimes the steps are simple, like restarting a service or gathering data. Other times the fix can be to change a configuration or start a workflow (think of a decision tree).
On top of this, Ops teams and SREs have the normally expected tasks, like acknowledging alerts, categorizing issues, prioritizing incidents and updating tickets. Individually the tasks are very small, but put them together and you end up with most of the time spent just repeating the same steps. Over and over again. Self-healing aims to remove the element of repetition from the equation.
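To make the idea of "automation as the first point of contact" concrete, here is a minimal Python sketch. It assumes a hypothetical `Alert` shape, made-up alert types such as "service_down", and placeholder steps that only record what they would do. It is not taken from any particular AIOps product; it simply shows known alert types being mapped to their standard steps, with a human paged only when no runbook exists.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    service: str
    alert_type: str  # hypothetical label, e.g. "service_down"
    log: List[str] = field(default_factory=list)  # actions taken so far


# A step receives the alert and records what it did.
Step = Callable[[Alert], None]


def gather_diagnostics(alert: Alert) -> None:
    # Placeholder: a real step would fetch logs or metrics here.
    alert.log.append(f"collected logs for {alert.service}")


def restart_service(alert: Alert) -> None:
    # Placeholder: a real step would call the orchestrator or init system.
    alert.log.append(f"restarted {alert.service}")


def escalate_to_human(alert: Alert) -> None:
    alert.log.append("paged on-call engineer")


# Assumed mapping from alert type to its standard steps (the "runbook").
RUNBOOKS: Dict[str, List[Step]] = {
    "service_down": [gather_diagnostics, restart_service],
    "disk_full": [gather_diagnostics],
}


def first_response(alert: Alert) -> List[str]:
    """Run the standard steps for a known alert type, otherwise escalate."""
    steps = RUNBOOKS.get(alert.alert_type, [escalate_to_human])
    for step in steps:
        step(alert)
    return alert.log
```

Calling `first_response(Alert("checkout", "service_down"))` runs the two standard steps without involving a person, while an unknown alert type falls through to the on-call engineer.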
How organizations can save time (for real)
The data suggests that a significant portion of the time that engineers spend is related to repetitive tasks. As such, AIOps and automatic remediation are about relieving the pressure of these types of tasks. That way SREs and Ops teams can focus on what really is essential: the troubleshooting and investigations where AI and automation fall short. The work only fit for the eyes and brain of a person.
Auto-remediation is therefore merely another tool in the toolbox of engineers. The right AIOps solution should analyze how issues were resolved historically, see what worked, and suggest appropriate actions to the engineers. With enough confidence (based on data), AIOps can start automatic workflows and trigger actions to assist engineers in their work.
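One way to picture "suggest first, automate with enough confidence" is the sketch below. It assumes a hypothetical history of `(alert_type, action, resolved)` records and two illustrative cut-offs, `AUTO_THRESHOLD` and `MIN_SAMPLES`, which would need tuning in practice. A plain success rate stands in here for whatever model a real AIOps platform uses.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical history records: (alert_type, action_taken, resolved)
History = List[Tuple[str, str, bool]]

AUTO_THRESHOLD = 0.9  # assumed success rate required to act without a human
MIN_SAMPLES = 5       # assumed minimum number of past attempts


def recommend(alert_type: str, history: History) -> Optional[Tuple[str, float, str]]:
    """Return (action, success_rate, mode) for the best past action.

    mode is "auto" when the evidence is strong enough, otherwise "suggest".
    Returns None when this alert type has never been seen before.
    """
    attempts: Counter = Counter()
    successes: Counter = Counter()
    for past_type, action, resolved in history:
        if past_type != alert_type:
            continue
        attempts[action] += 1
        if resolved:
            successes[action] += 1
    if not attempts:
        return None
    # Prefer the highest success rate; break ties by amount of evidence.
    best = max(attempts, key=lambda a: (successes[a] / attempts[a], attempts[a]))
    rate = successes[best] / attempts[best]
    enough_evidence = rate >= AUTO_THRESHOLD and attempts[best] >= MIN_SAMPLES
    return best, rate, "auto" if enough_evidence else "suggest"
```

With only a handful of past incidents the function stays in "suggest" mode and leaves the decision to the engineer; automatic execution is only proposed once an action has a track record.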
This way, engineers are allowed to focus on the work which matters and free up headspace from the manual tasks. Ultimately this will enable organizations to lower operational expenditure and have a more innovative workforce. The time saved on automating can be re-invested in further automation, creating a positive feedback loop.
Risks and pitfalls with self-healing
Unfortunately, not everything is as picture-perfect as the hypothetical world that AIOps often suggests. To fully realize the value of automatic remediation, several pre-conditions must be met, such as:
- Having core data connected to the AIOps platform. Without historical knowledge of how incidents were resolved, what solution worked, and the relation to infrastructure changes, AIOps will have a difficult time suggesting actions.
- Connecting monitoring data. Having alerts and monitoring data feeding the AIOps system is crucial to reduce alert volume and to correlate which remediation fits which alert type.
- Culture of automation. The cultural aspect of automation must not be forgotten. Allowing employees to dedicate time to create automation workflows that can be used by AIOps is crucial.
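As a small illustration of why connected monitoring data matters for reducing volume, the sketch below collapses repeated alerts into groups. The alert shape `(timestamp_seconds, service, alert_type)` and the 300-second window are assumptions made for the example; real platforms correlate on far richer signals.

```python
from typing import Dict, List, Tuple

# Hypothetical alert shape: (timestamp_seconds, service, alert_type)
RawAlert = Tuple[int, str, str]


def group_alerts(alerts: List[RawAlert], window: int = 300) -> List[Dict]:
    """Collapse repeats of the same (service, alert_type) seen within `window` seconds."""
    groups: List[Dict] = []
    open_group: Dict[Tuple[str, str], Dict] = {}
    for ts, service, alert_type in sorted(alerts):
        key = (service, alert_type)
        group = open_group.get(key)
        if group is not None and ts - group["last_seen"] <= window:
            # Same problem still firing: count it instead of raising a new alert.
            group["count"] += 1
            group["last_seen"] = ts
        else:
            group = {
                "service": service,
                "alert_type": alert_type,
                "first_seen": ts,
                "last_seen": ts,
                "count": 1,
            }
            open_group[key] = group
            groups.append(group)
    return groups
```

Each resulting group carries a single alert type, which is the unit a remediation can then be matched against.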
In the end, automatic remediation is about managing expectations about what it can and can’t do. We’re not quite at the stage yet where it replaces the role of an operator completely. What an organization should expect is for AIOps to help significantly with the workload and to reduce operational tasks.
Always keep in mind that “anything a human can do, a machine can also do.”
Moving beyond self-healing
So far we’ve covered the concept of TOIL and how it relates to self-healing. But what comes after automatic remediation? There are many paths a successful rollout of AIOps can take, but the holy grail (at least in the year 2021) will be in anomaly detection and proactive alerts. Ideally, SREs and operators should focus on proactive alerts rather than reactive alerts, meaning that anomalies and deviant behaviors are detected early in logs, metrics, and infrastructure through machine learning. Hopefully before a P1 ticket has been created.
Anomaly detection is not just a buzzword but one of the few areas where machine learning can be applied in the real world. Detecting outliers based on historical patterns is almost impossible for a human operator, as the sheer volume of metrics and alerts is simply too high. When SREs move from just reacting to alerts to proactively observing the state and behavior of applications and services, a technical wonder is in the making.
Getting to that stage is a maturity process just like anything else, and one which more often than not starts with the organization and culture. If the mindset around how SREs spend their time does not change, and if TOIL is allowed to wreak havoc, the tools are of little importance at the end of the day.
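Detecting outliers from historical patterns can be illustrated with a very simple statistical stand-in: a z-score check against past values of a metric. This is only a sketch, with an assumed threshold of three standard deviations; production anomaly detection typically also has to account for seasonality, trends and many metrics at once.

```python
from statistics import mean, stdev
from typing import List


def is_anomaly(history: List[float], value: float, threshold: float = 3.0) -> bool:
    """Flag `value` if it lies more than `threshold` standard deviations
    away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

For a latency metric hovering around 100 ms, a reading of 101 ms passes quietly while a reading of 140 ms is flagged, ideally well before anyone opens a ticket.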
Is automatic remediation a bit of hype? It depends.
Focusing on the real-world use cases and managing expectations accordingly, one can quickly see that there is also truth to the story. Self-healing was never about replacing the complex and intricate nature of human troubleshooting abilities. It is about freeing up time to allow people to focus on what matters.
Starting to automate basic tasks such as gathering information and fetching data is a significant first step to self-healing. With connected monitoring systems and core data (like incidents, changes and problems), AIOps can form a better contextual awareness to automate remediation. Sometimes much better than what an operator ever could on their own.
With the right investment, SRE teams’ costs can be significantly reduced and a culture of continuous improvement and automation is allowed to flourish. Spending time on reducing alert fatigue and TOIL will have resounding positive effects, both in terms of employee satisfaction and performance.
A happy SRE team is a team that is allowed to be innovative and creative. Creative brains are a valuable, limited resource. They shouldn’t be wasted on re-inventing the wheel when there are so many fascinating new problems waiting out there.
Wouldn’t you agree?