Anatomy of an Outage: Awareness


Awareness is defined as “knowledge or perception of a situation or fact”. Knowing that something happened doesn’t necessarily mean people understand what happened or why. That’s why it’s important to assess the entire situation. Many things can cause an outage: human error, application problems, hardware failures, network interruptions, cyber attacks, natural disasters and data corruption.

Why is having a complete view into the incident so important? It provides perspective, because you will soon need to perform a root cause analysis. Without an understanding of the whole situation, at the very least you discover an outage only once it is already well underway. At worst, downstream consequences spread to other systems before anyone notices.

What’s the Worst that Could Happen?

Case in point: a critical application server bluescreens. Eventually that failure brings down the entire mission-critical application. At a minimum, you find out that your system is going offline when users start to call the help desk. By then you’re probably many minutes into the event. At worst, you don’t see the downstream consequences, and other systems are affected by this single event, causing a much broader outage. It could take anywhere from minutes to hours to assess the full extent of the outage before any steps can be taken to recover the service.

Ideally, you would want automation or application awareness to test for and detect these types of events and initiate a fast resolution. This type of intelligence can shorten the time to recovery by removing humans from the mix as much as possible. Billions have been spent on infrastructure monitoring software that can detect issues like this, but in most cases it still takes human intervention to move to the next step in the Anatomy of an Outage.
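As a rough illustration of what automated detection could look like, here is a minimal Python sketch. It is not how any particular monitoring product works. The names `tcp_probe` and `OutageDetector`, the failure threshold, and the dependency map are all hypothetical choices made for this example. The idea is to probe each service, raise an alert only after several consecutive failures, and list the downstream systems that rely on the failed one.

```python
import socket
from typing import Callable, Dict, List


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class OutageDetector:
    """Hypothetical detector: probes services and reports downstream impact."""

    def __init__(self,
                 probes: Dict[str, Callable[[], bool]],
                 dependents: Dict[str, List[str]],
                 threshold: int = 3) -> None:
        self.probes = probes            # service name -> health check callable
        self.dependents = dependents    # service name -> services that rely on it
        self.threshold = threshold      # consecutive failures before alerting
        self.failures = {name: 0 for name in probes}

    def poll(self) -> Dict[str, List[str]]:
        """Run each probe once; return {failed service: downstream services at risk}."""
        alerts: Dict[str, List[str]] = {}
        for name, probe in self.probes.items():
            if probe():
                self.failures[name] = 0  # healthy again, reset the counter
                continue
            self.failures[name] += 1
            if self.failures[name] >= self.threshold:
                alerts[name] = self.downstream(name)
        return alerts

    def downstream(self, name: str) -> List[str]:
        """Walk the dependency map to list every service that relies on name."""
        seen: List[str] = []
        stack = [name]
        while stack:
            for dep in self.dependents.get(stack.pop(), []):
                if dep not in seen:
                    seen.append(dep)
                    stack.append(dep)
        return seen
```

In practice `poll()` would run on a schedule, with probes such as `lambda: tcp_probe("app01.example.com", 443)` (a placeholder host), and any alerts would be handed to whatever recovery process the organization uses. Requiring several consecutive failures is one simple way to avoid alerting on a single dropped connection.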

In an outage, having the best possible awareness of the situation enables organizations to react more efficiently and can feed process improvements in the retrospective after the event. What comes next? Once you have awareness of a situation, you can start the resolution process. This is the subject of the next eLesson series.

For more information on how Neverfail can add awareness to your continuity strategy, please reach out to the Neverfail Sales Team at sales@neverfail.com or call us directly at US Sales: +1 (888) 988-8647 or UK Sales: +44 (0870) 777-1500.

