Root Cause Analysis with 5 Whys and Fishbone Diagram for Software Architects
One of the critical responsibilities of a software architect is to solve the right business problems and to solve them effectively. Peer developers and other architects propose solutions to technical problems, which you, as a seasoned software architect, should evaluate, communicate, and negotiate before even thinking about implementation. The project sponsor, customer, or product owner states problems and sometimes even defines solutions, so that all that’s left for the development team is to implement them as stated. Sometimes the proposed solution works, and the implemented option satisfies the development team, product people, and users. But sometimes the implemented option doesn’t solve the right problem, or solves the problem incorrectly. The result is rework towards a better solution, which frustrates users, product people, and development teams. These negative scenarios can be mitigated by questioning the problem and the proposed solution before implementation. Root cause analysis is a technique that helps a software architect apply their expertise to investigate and analyze a problem and its potential solutions in a structured way.
What is Root Cause Analysis?
Root cause analysis (RCA) is the process of identifying and understanding the challenges, problems, or issues that occur within a software system. The goal of RCA is to go beyond the proposed solutions and the symptoms of problems and instead focus on the actual reasons and issues that led to those solutions and symptoms.
In general, RCA takes the following steps:
- Identify the problem. Clearly identify the problem that you need to address.
- Gather data. Collect relevant information related to the problem. Possible sources include project stakeholders, common domain knowledge, observed system behavior, available documentation, source code, etc.
- Analyze effects. Analyze the effects of the problem to gain a clear understanding of what is happening.
- Identify root causes. Determine the primary factors contributing to the problem.
- Recommend solutions. Propose appropriate solutions to the problem.
This process starts when a stakeholder challenges you with a problem and, sometimes, immediately proposes a solution to it. You interview the stakeholder and gather more information about the situation to understand the stakeholder’s point of view and needs. Then, once you have enough information, you perform the analysis and identify the actual root causes of the problem. The results of the analysis and the identified root causes help you either recommend a better solution to the stakeholder or confirm that the solution the stakeholder proposed works.
Putting the organization of stakeholder interviews aside for now, analyzing the effects of the problem and identifying its root causes efficiently are the most challenging parts of the RCA process. Different techniques, such as Fault Tree Analysis or Root Cause Mapping, support effect analysis and root cause identification. However, the combination of the 5 Whys approach and a Fishbone diagram arguably offers the best ratio of expressiveness to complexity.
What is the “5 Whys” Approach?
The “5 Whys” approach is a simple but powerful technique for digging deep into a problem or situation until you reach the core reasons behind it. You start by asking the stakeholder exploratory questions like “Why is that a problem?” or “Why are we not in the desired situation today?”. You continue asking “why” questions until you completely understand the problem and its causes. The number five is a rule of thumb rather than a limit. Often this conversation unveils hidden causes that the stakeholder wouldn’t have thought of initially.
What is the Fishbone Diagram?
The Fishbone diagram helps with the identification, sorting, categorization, and visualization of possible causes of a problem. The Fishbone diagram is also known as the Ishikawa diagram. It got its name from its form, which resembles a fish skeleton. It displays the problem or effect at the head of the fish. Major categories of causes form the large “bones” of the fish. The smaller “bones” that branch off the large “bones” of the fish list the causes contributing to the categories.
Example: Application Stability Problem
Suppose you support an e-commerce web application that serves thousands of users and that you and your team developed in .NET or Java for a large retail company. It implements the features typical of an e-commerce product: product catalog, discounts, ordering, delivery tracking, order returns, etc. It works well most of the time. However, users occasionally complain that they see incorrect order totals. You investigated and found that the cause is corrupted state in the data context used to access the database. The state gets corrupted because the same context instance is shared across the whole application instead of being scoped to the current request, so multiple request threads interfere with the same data. You would like to find out why this happened and what to do to avoid this kind of issue in the future.
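To make the failure mode concrete, here is a minimal sketch of how a shared context can leak one user's data into another user's order total. It is written in Python for brevity, although the application in this example is built in .NET or Java. The `DataContext` class and the `handle_requests` function are hypothetical stand-ins, not the API of any real data access library, and the two requests are interleaved by hand instead of running on real threads so that the outcome is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class DataContext:
    """Hypothetical stand-in for a data context that tracks pending order lines."""
    pending_lines: list = field(default_factory=list)

    def add_line(self, user: str, price_cents: int) -> None:
        self.pending_lines.append((user, price_cents))

    def order_total(self) -> int:
        # Sums everything the context tracks, regardless of who added it.
        return sum(price for _, price in self.pending_lines)


def handle_requests(context_factory) -> int:
    """Simulate two interleaved requests and return Alice's order total.

    context_factory decides the context lifetime: it is called once per request.
    """
    alice_ctx = context_factory()
    bob_ctx = context_factory()
    alice_ctx.add_line("alice", 1000)
    bob_ctx.add_line("bob", 9900)  # Bob's request runs in between Alice's steps
    alice_ctx.add_line("alice", 500)
    return alice_ctx.order_total()


# Per-application lifetime: every request receives the same instance.
shared = DataContext()
total_shared = handle_requests(lambda: shared)

# Per-request lifetime: every request receives a fresh instance.
total_per_request = handle_requests(DataContext)

print(total_shared)       # Bob's line is included in Alice's total
print(total_per_request)  # only Alice's lines are counted
```

With the shared instance, Alice's total includes Bob's order line. With one instance per request, it does not. In a real application the interleaving depends on thread timing, which is why the defect shows up only occasionally.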
A possible exploration of this stability problem with your development team could go like this:
Q: Why do users experience incorrect application state?
A: Because we share the context to access the database for all web requests.
Q: Why is sharing the context a problem?
A: Because now multiple requests from different users change the same pieces of data. Changes from different users are not isolated anymore from each other.
Q: Why has it become a problem now?
A: Because we did a minor refactoring last sprint and accidentally changed the lifetime of the data context from Per-Request to Per-Application.
Q: Why didn’t we discover the issue before the change was committed to the team’s code repository?
A: Because this piece of code missed our formal code review.
Q: Why did the change miss the code review?
A: Because it was a side effect of a quick fix of another issue implemented by our technical lead, who committed it directly to the repository.
Q: Why didn’t our quality assurance team discover the problem?
A: Because the problem is triggered only when multiple users operate on the same piece of data. Our quality assurance team doesn’t simulate such a situation today.
Analyzing the answers, you can reveal reasons from multiple categories that caused the problem with the application stability:
- Technical aspects. A developer extended the lifetime of the data context component, which shared the context’s state across requests and degraded data isolation. This might have happened because the developer didn’t take time to think carefully about the change, or because of gaps in the developer’s knowledge.
- Development process. It turned out that code review is optional in some cases. There may also be issues in the release management process, as the team wasn’t available right after the release.
- Quality assurance process. The team’s test setup doesn’t correspond to how the application is used in production. It doesn’t cover the tricky cases that arise when multiple users work with the application at the same time.
You can visualize the results of the analysis as a Fishbone diagram, with the stability problem at the head and the three categories above as the main bones.
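The same structure can also be captured as a plain-text outline, which is handy for tickets or meeting notes. The sketch below is a minimal Python illustration that assumes a simple dictionary layout. The `fishbone` dictionary and the `render_fishbone` function are hypothetical names, and the cause labels are short paraphrases of the analysis above.

```python
# Causes from the analysis, grouped by category (the large "bones").
fishbone = {
    "problem": "Users see incorrect order totals",
    "categories": {
        "Technical aspects": [
            "Data context lifetime changed from per-request to per-application",
            "Shared context state degrades data isolation",
        ],
        "Development process": [
            "Code review skipped for a quick fix",
            "Direct commit to the repository",
        ],
        "Quality assurance process": [
            "Test setup does not simulate concurrent users",
        ],
    },
}


def render_fishbone(diagram: dict) -> str:
    """Render the diagram as an indented outline: the effect first, then the bones."""
    lines = [f"EFFECT: {diagram['problem']}"]
    for category, causes in diagram["categories"].items():
        lines.append(f"  +-- {category}")
        for cause in causes:
            lines.append(f"  |     - {cause}")
    return "\n".join(lines)


print(render_fishbone(fishbone))
```

Keeping the causes in a data structure makes it easy to add a category or regroup causes as the conversation with the team continues.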
Now you’re ready to work on a mitigation plan, which you can then communicate to other stakeholders.
Example: Making an Architecture Decision
Suppose you are working on the high-level architecture of a multi-tenant, event-driven e-commerce web application. The application functionality is similar to what we discussed in the previous example. You decided to host your application on a public cloud such as Amazon Web Services. A fellow architect proposes storing all the data from different users in a DynamoDB database. DynamoDB is a high-performance key-value storage offering from Amazon. You wonder about the reasons for this decision and would like to explore this solution in detail.
Q: Why do you think DynamoDB would be a good option to keep users’ data?
A: Because DynamoDB is a highly available and scalable database offering from AWS.
Q: Why are the high availability and scalability quality attributes important for our e-commerce web application?
A: Because we expect thousands of active users during business hours.
Q: Why is DynamoDB a good option to model and query the application data?
A: Because it works well for the majority of the access patterns for our application.
Q: Why should we care about exact access patterns to the data?
A: Because DynamoDB doesn’t work well for ad-hoc queries to the data, especially if they depend on multiple data joins.
Q: Why are ad-hoc queries a problem for DynamoDB?
A: Because it is not flexible on data indexing and relies on data pre-joining and pre-aggregation, which requires careful thinking about future data access patterns.
Q: Why is flexibility of data querying not critical in the future for our domain?
A: Flexible data querying is critical for analytics and reporting, so it is critical for us. However, we are developing an event-driven application to meet scalability and availability demands. So, we can use DynamoDB as our scalable, performant, and highly available event store.
Looking at the answers, you can identify two types of concerns:
- Quality attributes. DynamoDB meets all requirements for the solution qualities: scalability, performance, and high availability.
- Data modeling. Modeling a domain for DynamoDB requires careful thinking about the data access patterns of the application. It might be difficult to find a data model or models that completely cover all application use cases. However, we mitigate this by using DynamoDB as an event store, which can be described by a small set of possible access scenarios.
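The data modeling concern can be made concrete with a toy model of an event store. The sketch below is an in-memory Python illustration, not DynamoDB code: the `InMemoryEventStore` class, its method names, and the event fields are assumptions made for this example. The stream id plays the role of a partition key and the sequence number the role of a sort key, which is one common way to lay out an event store in a key-value database.

```python
from collections import defaultdict


class InMemoryEventStore:
    """Toy model of a key-value event store keyed by (stream_id, sequence)."""

    def __init__(self) -> None:
        self._streams = defaultdict(list)

    def append(self, stream_id: str, event: dict) -> int:
        events = self._streams[stream_id]
        events.append(event)
        return len(events)  # 1-based sequence number of the new event

    def read_stream(self, stream_id: str, from_sequence: int = 1) -> list:
        # Planned access pattern: one key lookup plus a range over the sequence.
        return self._streams.get(stream_id, [])[from_sequence - 1:]

    def scan(self, predicate) -> list:
        # Ad-hoc access pattern: every stream has to be visited.
        return [
            event
            for events in self._streams.values()
            for event in events
            if predicate(event)
        ]


store = InMemoryEventStore()
store.append("order-1", {"type": "OrderPlaced", "total_cents": 1500})
store.append("order-1", {"type": "OrderShipped"})
store.append("order-2", {"type": "OrderPlaced", "total_cents": 700})

print(store.read_stream("order-1"))
print(store.scan(lambda event: event["type"] == "OrderPlaced"))
```

Appending to a stream and reading a stream back are the only access scenarios the key design supports directly. Any question that cuts across streams, such as "all placed orders", falls back to visiting everything, which is the kind of ad-hoc query the answers above warn about.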
Now you can visualize the results of the analysis as a Fishbone diagram, with the proposed decision at the head and the two types of concerns above as the main bones.
You can find more concerns if you dig deeper into the problem. However, the exploratory analysis we’ve just done provides you with a solid starting point.
Conclusions
In this post, we looked into two separate but highly related techniques for exploratory analysis. Their beauty lies in their simplicity. As the architect, you can apply them both one after another to quickly attack a problem space, get initial results and present them to your stakeholders in a graphical form for better understanding.
You can improve the efficiency of both techniques by getting some prior knowledge about the problem and its context. Developing your critical thinking skills also helps.
You can use the “5 Whys” approach and the Fishbone diagram for communication and facilitation both outside and inside the development team. They can be great tools for facilitating a brainstorming session or discussing code review results.