Operational Excellence. The term is not new, and you have certainly heard it several times before. But before introducing concepts and definitions of Operational Excellence, it is worth quoting some testimonials from people who adopted DevOps technologies without being prepared for them.
- “I bought DevOps and have not seen great results yet.”
- “I’ve adopted a proper monitoring solution for this new world and now I have one more dashboard to take care of. I do not know where to look!”
- “My operating team is overwhelmed with the number of alerts and notifications that all these monitoring tools are generating!”
- “I have it all and it does not improve anything.”
- “Incidents now take a long time to resolve because there is a lot of information, greater complexity, and poor visibility of what is really happening in the environment.”
I suppose this happens, or has already happened, in your work environment (or a friend’s), and the question that remains is: what is the problem?
Years of dealing with our customers’ problems and talking to people who have experienced this kind of situation have shown that this is because companies first seek solutions through tools and forget that those who operate such tools and platforms are people.
Although some presentations and some vendors might try to convince you otherwise, the truth is that tools are not magic. They exist to make people’s lives better if, and only if, there is a clear goal to be achieved and there is a well-defined plan for doing so.
OK, but what is Operational Excellence?
Well, it’s a method to make systems (and, why not, services) better and more sustainable. It’s part of your mission to make them more reliable and scalable by optimizing architectures, processes and, of course, the people who build and operate them.
Make no mistake, this state will not be reached by accident. The roadmap to get there involves learning a new set of everyday skills, updating processes and tools, and embracing a mindset built on three fundamental principles, which will allow you to “dive into this pool of warm water”.
Principle #1 — Drift into failure
If you own (or work in) a business that cannot stop, you’ve probably asked yourself, “What happens when our service is not available?” or “What happens when users cannot interact online with our application?”. Events that make your services unavailable have a serious impact, both financially and on the image and credibility of your company.
On the other hand, due to the complexities of the systems as well as the constant changes that take place day by day, we also know that failures will always exist. The discussion here is not about IF there will be any problem in your application that will leave it unavailable, slow or with errors, but WHEN.
As Benjamin Treynor Sloss (VP 24×7 Engineering at Google) says: “100% is the wrong reliability target for basically everything”. The focus is not on avoiding failures at any cost. What should be constantly asked is:
- How do you prepare for failure beforehand?
- How do you mitigate risk?
- How fast can your system recover?
Lesson: Expect failures. A component may fail or be interrupted at any time. Dependent components may fail at any time. There will be network failures. Disks will run out of space.
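One common way to prepare for a dependency that fails from time to time is to retry the call with exponential backoff. The snippet below is only a minimal sketch of that idea, not a prescription: `call_with_retries` and `FlakyDependency` are hypothetical names invented for this illustration, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Between attempts, wait base_delay * 2**attempt plus a little random
    jitter, so that many clients do not retry at exactly the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:  # also covers ConnectionError and TimeoutError
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


class FlakyDependency:
    """Hypothetical dependency that fails a fixed number of times, then recovers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("dependency unavailable")
        return "ok"


dependency = FlakyDependency(failures=2)
result = call_with_retries(dependency, base_delay=0.01)
print(result, "after", dependency.calls, "calls")
```

Retries only help with transient failures. When the dependency stays down, the last error is re-raised so that the caller can degrade gracefully or alert someone.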
Discover the weaknesses of your system / process / operation, reveal the risks, be part of the blameless culture and most importantly:
Principle #2 — Data is the most valuable asset
We already agreed that systems are always failing, right? But the fact that they fail does not mean you should demand services that are perfect 100% of the time. The question that matters is a different one.
Are your users happy?
It is at this point that the SLI / SLO / Error Budget concepts come into play. If you haven’t read our article on this topic, please stop here, read it and then come back. No problem, I’ll wait. Read it here.
Basically, these measures make it possible to know whether the quality of service is good enough, that is, whether the service is in a good state from the end user’s point of view.
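To make the arithmetic concrete, here is a small sketch of how an availability SLI can be compared with an SLO to obtain an error budget. The function name, the request counts and the 99.9% target are all assumptions chosen for illustration, and the SLI is taken to be the fraction of good requests, which is only one of several possible definitions.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    allowed_failures: float  # error budget, expressed in requests
    budget_remaining: float  # fraction of the budget still unspent (negative if overspent)

    @property
    def slo_met(self) -> bool:
        return self.budget_remaining >= 0


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a measured SLI with an SLO target over one time window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        # A target of 1.0 (100%) would leave no error budget at all.
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_requests / total_requests
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    budget_remaining = 1 - actual_failures / allowed_failures
    return SloReport(sli, allowed_failures, budget_remaining)


# Illustrative numbers only: 1,000,000 requests, 600 of them failed.
report = evaluate_slo(good_requests=999_400, total_requests=1_000_000, slo_target=0.999)
print(f"SLI: {report.sli:.4%}")
print(f"Error budget left: {report.budget_remaining:.0%}")
print(f"SLO met: {report.slo_met}")
```

With these numbers the service is imperfect, yet it is still within its budget. That is the practical meaning of not chasing 100%.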
What is your users’ behavior?
In addition to knowing whether or not a service is in a reliable state, you must also track what each user does in your system. For each step, you should measure duration, result, operation, logs, errors, arguments, segments, and any other details relevant to the execution of your product.
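As a rough illustration of measuring each step, the sketch below wraps a user action in a context manager that records its duration, result, arguments and error. Everything here is an assumption made for the example: `measured_step` is a hypothetical helper, and the in-memory `EVENTS` list merely stands in for whatever telemetry backend you actually use.

```python
import time
from contextlib import contextmanager

EVENTS = []  # in-memory stand-in for a real telemetry backend (assumption)


@contextmanager
def measured_step(user_id, operation, **arguments):
    """Record duration, result and error details for one user step."""
    event = {"user_id": user_id, "operation": operation, "arguments": arguments}
    start = time.perf_counter()
    try:
        yield event  # the caller may attach extra details, such as a segment
        event["result"] = "success"
    except Exception as exc:
        event["result"] = "error"
        event["error"] = f"{type(exc).__name__}: {exc}"
        raise  # instrumentation must never hide the failure
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        EVENTS.append(event)


# A successful step, enriched with a user segment.
with measured_step("user-42", "checkout", cart_items=3) as event:
    event["segment"] = "mobile"

# A failing step: the event is still recorded, and the error still propagates.
try:
    with measured_step("user-42", "payment", method="card"):
        raise TimeoutError("payment gateway did not answer")
except TimeoutError:
    pass

for recorded in EVENTS:
    print(recorded["operation"], recorded["result"], recorded.get("error"))
```

Recording the event in the `finally` clause guarantees that failed steps are measured too, and those are usually the ones you most want to see.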
With this set of data you can talk about the errors that your users receive most often, thus facilitating your conversation with the business areas of your company, and also creating a process of continuous improvements or incremental innovation of your company’s product.
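Finding the errors that users hit most often can be as simple as counting them. The following sketch assumes each recorded event is a dictionary with `operation`, `result` and `error` keys. That shape, the sample data and the function name are hypothetical and only serve the example.

```python
from collections import Counter


def most_frequent_errors(events, top_n=3):
    """Return the `top_n` most common (operation, error) pairs among failed events."""
    failures = Counter(
        (event["operation"], event["error"])
        for event in events
        if event.get("result") == "error"
    )
    return failures.most_common(top_n)


# Made-up sample data, for illustration only.
sample_events = [
    {"operation": "checkout", "result": "error", "error": "PaymentTimeout"},
    {"operation": "login", "result": "success"},
    {"operation": "checkout", "result": "error", "error": "PaymentTimeout"},
    {"operation": "search", "result": "error", "error": "IndexUnavailable"},
]

for (operation, error), count in most_frequent_errors(sample_events):
    print(f"{operation}: {error} x{count}")
```

A ranking like this gives the business areas something concrete to prioritize, instead of a general feeling that "the system has errors".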
Lesson: Instrument everything. Use production data to find problems in production. As James Hamilton put it, “Quality assurance of a critical system is a matter of data mining and visualization.”
Principle #3 — Ordering the alphabet soup
You have probably read in articles, heard at events or even picked up in a bar conversation terms like DevOps, DevSecOps, BizDevOps, Lean, Agile, SRE, UX, Chaos Engineering, Kubernetes, Serverless, Microservices, Big Data Systems, Immutable Infrastructure, Infra as Code, NoSQL, ChatOps, Operation as a Service, etc. It is important to stress that all these movements, practices, approaches, disciplines and technologies are very recent. And all this sounds very good to our ears.
But there is something that is often not admitted: building and maintaining complex systems in production is very difficult, time-consuming, and stressful.
In fact, our day to day life is not simple, and much is due to two factors:
- #1 Yak shaving: an endless series of small tasks we have to do before we can work on what really matters
- #2 Endless checklists: the checklists needed to maintain production systems are very long and complex
Those who take care of operations have plenty to learn from those who develop a system (and vice versa). Managing all this day-to-day work is getting better and better thanks to one trend: increasingly treating infrastructure as code.
Lesson: Automate your repetitive tasks. People make mistakes, need sleep, and forget things. Some of the benefits of turning your operation / infrastructure into code are:
- Less need for manual intervention
- More people can perform tasks
- Increased reuse
- Reliability in process and execution
- Version control
- Everything is documented
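The core idea behind treating infrastructure as code can be sketched in a few lines: describe the desired state as data, compute the difference from the actual state, and apply only that difference. The example below is a toy model under stated assumptions. The service names, fields and the `plan` / `apply` functions are hypothetical and do not represent the API of any real tool.

```python
# Desired state, kept in version control next to the application code.
DESIRED_STATE = {
    "web": {"replicas": 3, "version": "1.4.2"},
    "worker": {"replicas": 2, "version": "1.4.2"},
}


def plan(desired, actual):
    """Return the list of actions needed to move `actual` to `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name, None))
    return actions


def apply(actions, actual):
    """Apply the actions to a copy of `actual` and return the new state."""
    new_state = dict(actual)
    for action, name, spec in actions:
        if action == "delete":
            new_state.pop(name)
        else:
            new_state[name] = dict(spec)
    return new_state


# Hypothetical current environment, drifted from what is declared above.
actual_state = {
    "web": {"replicas": 1, "version": "1.4.1"},
    "legacy-cron": {"replicas": 1, "version": "0.9.0"},
}

actions = plan(DESIRED_STATE, actual_state)
for action, name, _ in actions:
    print(action, name)

new_state = apply(actions, actual_state)
print("second run needs", len(plan(DESIRED_STATE, new_state)), "actions")
```

Because the second run finds nothing to do, the procedure is idempotent: anyone on the team can run it at any time, and the file under version control doubles as documentation of the environment.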
Keep in mind that automation should be seen as a team partner, not a substitute for the activities / actions of human beings.
Thiago Maciel, Tech specialist Inmetrics
Want to know more about it? Contact us.