
Life Cycle Of A System Issue In Production

Issues, and possible downtime, happen in cloud, on-prem, and hybrid systems, whether they are built as microservices, monoliths, or a hybrid architecture. All of them may suffer issues in production and go through downtime.

New microservices cloud architectures should be designed for failure, so if an issue makes the system unavailable, that is almost always the software’s responsibility: if a region fails, the system should keep running in another region or a different datacenter; if a service fails, it should fail requests fast and gracefully, fall back to an alternative service, or keep running with reduced functionality; if the load increases, it should scale horizontally; if a bottleneck affects a feature, it should keep running slowly, with increased response times, rather than crash. Well … it is the software’s responsibility … almost always. What if we have an expired certificate? In some cases, there is little we can do in software to keep working, short of accepting insecure communications.
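To make the “fail fast and degrade gracefully” idea concrete, here is a minimal sketch in Python. The endpoint, the timeout value, and the canned fallback response are made up for illustration; the point is only that a call to a sick dependency should fail quickly and the service should answer with reduced functionality instead of crashing.

```python
import requests

# Hypothetical endpoint and fallback payload, for illustration only.
PRIMARY_URL = "https://recs.internal.example/api/v1/recommendations"
FALLBACK_RECOMMENDATIONS = {"items": [], "source": "static-fallback"}

def fetch_recommendations(user_id: str) -> dict:
    """Fail fast on a sick dependency and degrade gracefully instead of crashing."""
    try:
        # Short timeout: fail the call quickly rather than letting requests pile up.
        resp = requests.get(PRIMARY_URL, params={"user": user_id}, timeout=0.5)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Degraded but available: serve a canned response while the dependency recovers.
        return FALLBACK_RECOMMENDATIONS
```

A real implementation would typically add a circuit breaker and some caching on top of this, but the shape of the decision is the same: answer something useful quickly, rather than hang or crash.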

What is certainly true is that architectures not designed for failure will always have a tight dependency on the infrastructure. The infrastructure can fail, but with the right design we can keep our service, and our business, running.

Anyway, issues in our systems will always happen: software not designed for every possible failure, bugs, configuration errors, infrastructure failures, certificate errors, etc. We need to monitor our systems to detect the issue as soon as possible, identify the root cause as soon as possible, and solve the issue … also as soon as possible.

Every issue has a life cycle. Differentiating its phases helps us understand how current application performance monitoring and network performance monitoring tools help us.

An issue in our system goes through the following life cycle:

Introduction

This is the moment an issue is introduced. It could be due to:

  • A design or coding bug.
  • A design not considering an environment change. For example, a failover from one region to another.
  • A configuration error.
  • Other

If you are in any way responsible for product quality, this is the moment when you lost an argument with the developer or the scrum master, when you have to suffer the consequences of not having automated tests, of having to deliver a feature fast, or when … well, shitssue happens …

Dormant period

During this time, the issue exists but goes unnoticed because the right sequence of events to trigger it has not occurred yet. It may show up in small events that don’t trigger any alarm in our monitoring system or don’t draw enough attention from DevOps. It could also trigger an alarm that gets lost among hundreds of other alarms, without causing a major incident.

If you are in any way responsible for product quality, this is usually the time when the developer, the scrum master, or anybody else tells you “Do you see? That was not a problem … “. Anyway, it’s the time to enjoy the glory before worse times come.

Emergence period

This is the time when the issue causes noticeable damage:

  • Data inconsistency affecting a large number of users
  • System unavailable
  • Unreliable state
  • Performance deterioration
  • Other

Detection and acknowledgement

This is the phase when the issue is noticed and Operations/DevOps/SRE acknowledge that it is an issue that needs to be solved.

The goal is to shift this phase left, making the time between emergence and acknowledgement tend to zero. Ideally, we detect an issue while it is still in its dormant period. Even better would be not introducing the issue (bug, etc.) at all, but I don’t think that is possible.

Once the issue is introduced, it is essential to have the proper monitoring tools and processes in place.
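As a concrete example of catching an issue while it is still dormant, here is a small sketch, using only the Python standard library, that measures how many days are left before a host’s TLS certificate expires; the host name and the 30-day threshold are placeholders. A check like this, run periodically, turns the expired-certificate scenario mentioned earlier into an alarm weeks before it can cause an outage.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Return the number of days left before the host's TLS certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # getpeercert() returns a dict with a 'notAfter' field such as
            # 'Jun  1 12:00:00 2025 GMT'.
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    remaining = days_until_cert_expiry("example.com")  # placeholder host
    if remaining < 30:
        # In practice this would page someone or raise an alert in the monitoring system.
        print(f"WARNING: certificate expires in {remaining:.0f} days")
```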

If you are in any way responsible for product quality, never give in to the temptation of saying “I told you so”. This is a time for firefighting together, hopefully followed by a reasonable post-mortem, lessons learned, and a root cause analysis report once the crisis is solved.

Root cause analysis

This period begins after we detect the issue, acknowledge it, and decide to devote the required effort. During this period, DevOps/Operations/SRE look at the data provided by observability tools, usually following a top-down methodology, trying to understand the problem and identify the root cause.
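As an illustration of that top-down drill-down, here is a sketch that assumes the metrics are exposed through a Prometheus-compatible API; the server URL, the http_requests_total metric, and its handler/status labels are conventional placeholders, not a prescription. It first confirms the service-level symptom (the overall error ratio) and then breaks the same ratio out per endpoint to narrow down where the problem lives.

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal.example:9090"  # placeholder

def instant_query(promql: str) -> list[dict]:
    """Run an instant PromQL query and return the raw result vector."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Step 1: confirm the symptom at the service level (overall 5xx ratio).
overall = instant_query(
    'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
)

# Step 2: drill down one level, breaking the same ratio out per endpoint,
# to see which handler is actually responsible.
per_endpoint = instant_query(
    'sum by (handler) (rate(http_requests_total{status=~"5.."}[5m]))'
    ' / sum by (handler) (rate(http_requests_total[5m]))'
)

for series in sorted(per_endpoint, key=lambda s: float(s["value"][1]), reverse=True):
    print(series["metric"].get("handler", "unknown"), series["value"][1])
```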

Root cause identification

This is the desired end of the root cause analysis period. Ops/DevOps/SRE people have gone through a lot of data, used their experience to identify the relevant pieces and dig into them, formulated hypotheses, maybe run some tests to evaluate them, checked different components, and finally identified the root cause.

We want the time between issue detection and root cause identification to tend to zero.

Solution phase

The solution will take the system back to a correct state, but it could also be a sub-optimal one, taking the system to an acceptable state where the most important functionality still works while the complete solution is implemented.

The solution could be a temporary patch until a complete fix is ready and deployed, or simply a deployment rollback, if we have the proper continuous deployment design, processes, and tools in place.
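One common shape for such a temporary patch is a kill switch around the faulty functionality. The sketch below is purely illustrative: the RECS_ENABLED flag, the handler, and the helper function are made-up names, and in practice the flag would live in a feature-flag service or configuration store that can be flipped without a redeploy.

```python
import os

def recommendations_enabled() -> bool:
    """Kill switch read from the environment (or any config source) at request time."""
    # RECS_ENABLED is a hypothetical flag; flipping it disables the faulty code path
    # without waiting for a new build or deployment.
    return os.getenv("RECS_ENABLED", "true").lower() == "true"

def handle_request(user_id: str) -> dict:
    if not recommendations_enabled():
        # Degraded mode while the real fix is prepared and rolled out.
        return {"items": [], "note": "recommendations temporarily disabled"}
    return {"items": compute_recommendations(user_id)}

def compute_recommendations(user_id: str) -> list:
    # Placeholder for the code path that contains the faulty behaviour.
    return []
```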

After the solution, a post-mortem should take place to discuss the root cause and future actions to avoid similar issues.

Reducing Mean Time To Repair (MTTR)

Today we have great observability products that provide almost all the system data we need to analyze our systems, understand what is failing, and hopefully find the root cause of an incident.

We can count on New Relic, Datadog, open source Prometheus and Grafana dashboards, OpenTelemetry standard tools and SDKs, and a plethora of other products.
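As a taste of what instrumenting with the OpenTelemetry Python SDK looks like, here is a minimal sketch that creates a span around a unit of work and attaches an attribute to it. It exports spans to the console to stay self-contained; a real setup would point an OTLP exporter at a collector or at one of the vendors above, and charge_order is of course a made-up function.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One-time setup: export spans to the console; a real deployment would use an
# OTLP exporter pointing at a collector or a vendor backend instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_order(order_id: str) -> None:
    # Each unit of work becomes a span; attributes carry the context you will
    # want during root cause analysis (who, what, how long).
    with tracer.start_as_current_span("charge_order") as span:
        span.set_attribute("order.id", order_id)
        # ... call the payment provider here ...
```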

These tools give us all, or almost all, the data we need to analyze an issue and find its root cause; more data than we could analyze in years. However, when you have an incident in production, the first problem is figuring out where to look. There are logs and metrics for everything, and even when you’re experienced, many times you won’t know where to begin; you will struggle to determine the right data to look at and analyze. Also, more often than not, you won’t know whether the data represents a normal state or is telling you about the issue that just emerged. And all of this assumes the very first requirement is met: that an alarm is triggered on time and recognized before a customer calls or the CEO himself experiences the issue.

This is where we all struggle, even though we, and our company, made a significant investment to set up the proper tools and collect more data than we can ever handle. That investment may mean buying a product, paying the monthly subscription to monitor our systems, configuring every alarm we can, and always having somebody watching those alarms.

This is the area where Wayaga is committed to innovating and improving the current state of the art: finding these issues while they are still invisible to current tools, and producing a tool that automatically detects an issue, acknowledges it, investigates it, and identifies its root cause as soon as it becomes noticeable.

