By Boštjan Kaluža
A report by W. Cappelli, published by Gartner in October 2015, states that “although availability and performance data volumes have increased over the last ten years, enterprises find data in their possession insufficiently actionable. Root causes of performance problems have taken an average of seven days to diagnose, compared to eight days in 2005. Furthermore, only three percent of incidents were predicted, compared to two percent in 2005.”
One of the main reasons root cause analysis in IT Operations (ITO) has not progressed in ten years is that IT still operates in silos. Different components of the infrastructure are monitored with different tools, and the resulting data (APM metrics, logs, deployment records, and change requests) is stored in physically separate units and under separate information schemas at the application layer.
Bridging the Gap Between Silos
To process information effectively, we need to combine data sources and answer several questions: What is the relationship between the data? How are they correlated? How do they affect each other? What are the interesting patterns and insights? How do you identify root causes?
Combining data into a common information schema is an engineering task, whereas understanding data and extracting actionable insights is an analytical task.
The Flaw of the Existing Correlation Approach
In the past, a common correlation technology, referred to as an event correlation engine, handled event filtering, aggregation, and masking. Another technique uses statistical analysis and signal processing to compare different time series and detect correlated activities. More recently, machine learning algorithms based on clustering analysis and self-organizing maps have been used for smart filtering that can identify incident-related event storms.
While these techniques are useful and reduce the number of events, they do not identify root causes: knowing which events are correlated does not reveal the source of the issue. To proceed, we need to understand the cause-effect relationships between data sources.
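To illustrate the limitation, here is a minimal Python sketch of the time-series comparison described above. It computes the Pearson correlation coefficient for two metrics; the sample values and the names cpu_percent and latency_ms are hypothetical and not taken from any real system. The coefficient is the same in both directions, so on its own it cannot say which metric drives the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

# Hypothetical per-minute samples from two separate monitoring tools.
cpu_percent = [20, 22, 25, 60, 85, 90, 88]
latency_ms = [110, 115, 120, 300, 420, 450, 440]

r_forward = pearson(cpu_percent, latency_ms)
r_reverse = pearson(latency_ms, cpu_percent)

# Both directions give the same value: correlation is symmetric and
# carries no information about which series is the cause.
print(f"r(cpu, latency) = {r_forward:.3f}")
print(f"r(latency, cpu) = {r_reverse:.3f}")
```

A strong coefficient here only says the two series move together; deciding whether load caused latency, or both followed from a third event such as a deployment, needs additional context.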
There is context in data! Why don’t we use it?
IT data silos are based on the typical application lifecycle management process. For instance, to deploy a new service, a new change request is opened and executed via an automated deployment script. Once the application is up and running, performance and availability are monitored through logs, network activity, and key APM metrics.
These silos can generally be organized as follows:
1. The IT Context of introducing changes to a system.
2. Changes which cause a system to work differently.
3. Monitoring change impact and reporting symptoms for analysis.
The change in the system is the link between the data sources in the IT Context and the symptoms. It serves as the correlation anchor that helps identify the cause and impact of a change.
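The idea of a change acting as a correlation anchor can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (id, host, time, what), the identifiers such as CHG-1, and the one-hour window are all hypothetical, and a real system would need richer matching than a shared host name.

```python
from datetime import datetime, timedelta

# Hypothetical change records and symptom events; field names are illustrative.
changes = [
    {"id": "CHG-1", "host": "web-01", "time": datetime(2016, 5, 2, 9, 0)},
    {"id": "CHG-2", "host": "db-01", "time": datetime(2016, 5, 2, 9, 30)},
]
symptoms = [
    {"host": "web-01", "time": datetime(2016, 5, 2, 9, 20), "what": "latency spike"},
    {"host": "db-01", "time": datetime(2016, 5, 2, 8, 0), "what": "slow query"},
    {"host": "web-01", "time": datetime(2016, 5, 3, 9, 0), "what": "HTTP 500s"},
]

def anchor_to_change(symptom, changes, window=timedelta(hours=1)):
    """Return the latest change on the same host that precedes the symptom
    by no more than `window`, or None if there is no such change."""
    candidates = [
        c for c in changes
        if c["host"] == symptom["host"]
        and timedelta(0) <= symptom["time"] - c["time"] <= window
    ]
    return max(candidates, key=lambda c: c["time"], default=None)

links = []
for s in symptoms:
    change = anchor_to_change(s, changes)
    links.append((s["what"], change["id"] if change else None))
    print(s["what"], "->", change["id"] if change else "no recent change")
```

Only the latency spike is anchored to a change. The slow query predates the change on its host, and the HTTP 500s fall outside the assumed window, so both are left unlinked rather than attributed to a change by guesswork.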
Gartner estimates that approximately 85 percent of performance incidents can be traced back to recent environmental changes.
This helps establish a direct cause-effect relationship between IT Context and observed symptoms. But how do you automatically establish these relationships?
Machine Learning Can Help
Machine learning studies the design of algorithms that learn from observed data. It has traditionally been used to discover new insights in data, to build systems that automatically adapt and customize themselves, and to reduce complexity and expense; search engines and self-driving cars are familiar examples.
The ITO domain is a good fit for machine learning because of the large amounts of data available for analysis. Given the growth of machine learning theory, algorithms, and on-demand computational resources, it is understandable that more machine learning applications are being developed in ITO analytics.
Causal Analysis
Effective root cause analysis depends on establishing relationships between data sources. Correlating events, tickets, alerts, and changes can identify cause-effect relationships. However, when dealing with unstructured data, the linking process is not obvious. Machine learning infers relationships among different data sources and determines how to link them to environments. Algorithms include fuzzy matching rules, association rules identifying events which frequently occur at the same time, linguistic analysis of data in natural language, and prediction models estimating system change effects. This process yields a set of data samples semantically annotated across silos.
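As a rough illustration of the fuzzy matching step, the following Python sketch links a free-text ticket to the most similar change note using Jaccard token overlap. Jaccard similarity is one simple stand-in chosen for this example, not necessarily the matching rule used in practice; the function names, the 0.2 threshold, and the sample records are all hypothetical.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens of a free-text record."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token overlap between two texts: shared tokens / all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(ticket, change_notes, threshold=0.2):
    """Return (change_id, score) for the most similar change note, or
    (None, score) when even the best score is below the threshold."""
    scored = [(jaccard(ticket, note), cid) for cid, note in change_notes.items()]
    score, cid = max(scored)
    return (cid, score) if score >= threshold else (None, score)

# Hypothetical records from two silos: change management and ticketing.
change_notes = {
    "CHG-1": "Upgrade payment service to version 2.4 on web-01",
    "CHG-2": "Rotate TLS certificates on load balancer",
}
ticket = "Payment service on web-01 returns errors after upgrade"

matched_id, matched_score = best_match(ticket, change_notes)
print(matched_id, round(matched_score, 3))
```

The threshold keeps weak overlaps, such as a single shared word like "on", from being treated as links. A production matcher would likely also weight rare tokens more heavily and normalize host names and version strings.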
The final step is to establish an environmental dependency model leveraging topology and component/configuration dependencies to produce causal reasoning for effective root cause analysis. Suppressing unrelated elements from analysis is critical. The dependency diagram can be modeled after the probabilistic Bayesian network, which utilizes probabilities of error propagation, influence, and spillover detection. Using machine learning and vast amounts of data, we can automatically estimate the required probabilities for root cause and update these on the fly.
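The probabilistic reasoning can be sketched with Bayes' rule. The Python example below is a deliberate simplification: it treats symptoms as independent given the cause (a naive Bayes model, which is the simplest two-layer Bayesian network) and ignores topology. The candidate causes, priors, and likelihoods are invented for illustration; in practice they would be estimated from historical incidents.

```python
def rank_root_causes(priors, likelihoods, observed):
    """Rank candidate causes by posterior P(cause | observed symptoms).

    Assumes symptoms are conditionally independent given the cause, so the
    unnormalized posterior is prior * product of P(symptom | cause).
    """
    unnormalized = {}
    for cause, prior in priors.items():
        p = prior
        for symptom in observed:
            p *= likelihoods[cause][symptom]
        unnormalized[cause] = p
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed symptoms are impossible under every cause")
    posterior = {cause: p / total for cause, p in unnormalized.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical prior probability of each candidate root cause.
priors = {"config_change": 0.5, "code_deploy": 0.3, "hardware_fault": 0.2}

# Hypothetical P(symptom | cause) for two symptom types.
likelihoods = {
    "config_change": {"latency_spike": 0.6, "error_burst": 0.2},
    "code_deploy": {"latency_spike": 0.7, "error_burst": 0.8},
    "hardware_fault": {"latency_spike": 0.3, "error_burst": 0.1},
}

ranked = rank_root_causes(priors, likelihoods, ["latency_spike", "error_burst"])
for cause, prob in ranked:
    print(f"{cause}: {prob:.3f}")
```

Although a configuration change has the highest prior in this made-up data, observing both symptoms moves the code deployment to the top of the list. Updating the estimates on the fly would amount to re-estimating the priors and likelihoods as new incidents are resolved.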
Conclusion
Leveraging context in cross-silo data establishes cause-effect relationships automatically via machine learning. Root cause analysis gains new perspectives by accessing data previously stored in different silos as well as semantically annotated event relations. This analysis significantly limits the “short list” of possible root causes using probabilistic matching, fuzzy logic, linguistic correlation, and frequent pattern mining. This process provides insight into probable root causes factoring in the environmental dependency structure and previous historical incidents.
Boštjan Kaluža, PhD, is the chief data scientist at Evolven Software. He has published numerous articles in professional journals and delivered conference papers. He is the author of Instant Weka How-to and is working on his second book, Machine Learning in Java. Boštjan is also an author of and contributor to numerous patents in the areas of anomaly detection and pattern recognition.
May 2016, Software Magazine