
October 5th - 6th Incident Report

This week the GBC server was unavailable for an extended period. We understand how much you rely on it, and we consider the availability of our service one of the core features we offer.

Over the last ten years we have made considerable progress towards ensuring that you can depend on GBC to be there for you and your customers/clients, but this week we failed to maintain the level of uptime you rightfully expect.

We are deeply sorry for this, and would like to share with you the events that took place and the steps we’re taking to ensure you're able to access GBC servers and services.

The Event

The primary server suffered a severe service outage that began on Thursday, October 5, 2017, and persisted intermittently until approximately 20:00 on Friday, October 6. For most of the outage the Apache web service was unavailable, and clients were unable to reach any of the associated HTTP/TCP services.

Root cause:

After hours of investigation and auditing, we discovered that a ModSecurity log file had grown extremely large due to a network flood and bot crawls, which in turn caused the resource usage issues. We cleaned out the log file, and after restarting the service the load average dropped below 1. We have also updated the Apache configuration for this log file so that its data is rotated or purged before it can grow to that size again.
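
To illustrate the kind of safeguard this involves, the sketch below rotates an audit log once it passes a size threshold. The log path and threshold are assumptions made for the example, not our exact production values, and the actual fix was applied through the Apache/ModSecurity configuration rather than a standalone script.

```python
#!/usr/bin/env python3
"""Minimal sketch: rotate a ModSecurity audit log once it grows too large.

Assumptions (not our production values): the audit log lives at
/var/log/modsec_audit.log and 500 MB is the rotation threshold.
"""
import gzip
import os
import shutil
import time

LOG_PATH = "/var/log/modsec_audit.log"   # assumed location
MAX_BYTES = 500 * 1024 * 1024            # assumed 500 MB threshold


def rotate_if_oversized(path: str, max_bytes: int) -> bool:
    """Compress the current log to a timestamped archive and truncate it."""
    if not os.path.exists(path) or os.path.getsize(path) < max_bytes:
        return False

    archive = f"{path}.{time.strftime('%Y%m%d-%H%M%S')}.gz"
    with open(path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Truncate in place rather than deleting, so the process writing the
    # log keeps a valid file handle.
    with open(path, "r+b") as fh:
        fh.truncate(0)
    return True


if __name__ == "__main__":
    if rotate_if_oversized(LOG_PATH, MAX_BYTES):
        print("audit log rotated")
```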

Action steps taken during the incident:

We isolated the incident to the Apache process and focused our efforts on narrowing down the root cause. We re-routed most services to the backup server to balance the load and keep a bare minimum of operations running. Our staff worked with the hosting provider to identify the problem, which led us to the root cause and, ultimately, to a solution.

Remediation efforts to avoid similar incidents in the future:

Because ModSecurity inspects traffic for most of our applications, its log files can grow very large when network traffic spikes. Apache must also purge persistent data from the ModSecurity data directory (SecDataDir) in order to process its protection rules, and scanning very large files costs the server extra time and resources.

Because this protection is a crucial feature of our service, we opted to leave ModSecurity in place and instead tuned its configuration to match the traffic our clients actually receive, taking DDoS attacks and network flooding into account.
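
Part of that tuning is housekeeping: keeping the ModSecurity data directory from accumulating stale state that Apache would otherwise have to scan. The sketch below illustrates the idea, assuming a data directory of /var/cache/modsecurity and a seven-day retention window; neither value reflects our exact production settings.

```python
#!/usr/bin/env python3
"""Minimal sketch: remove stale files from the ModSecurity data directory.

Assumptions (not our production values): SecDataDir is /var/cache/modsecurity
and files untouched for more than 7 days are considered stale.
"""
import os
import time

DATA_DIR = "/var/cache/modsecurity"   # assumed SecDataDir location
MAX_AGE_DAYS = 7                      # assumed retention window


def purge_stale_files(directory: str, max_age_days: int) -> int:
    """Delete files whose last modification is older than the retention window."""
    if not os.path.isdir(directory):
        return 0
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed += 1
    return removed


if __name__ == "__main__":
    print(f"removed {purge_stale_files(DATA_DIR, MAX_AGE_DAYS)} stale files")
```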

The new configuration has been implemented on the servers, and all services have been operational since 20:03 on Friday, October 6.

The hours that followed were spent confirming that all systems were performing normally and verifying that no data was lost during this incident. We are grateful that the disaster-mitigation work our engineers had put in place succeeded in keeping all of your website data, databases, and other critical data safe and secure.

Future Work

Complex systems are defined by the interaction of many discrete components working together to achieve an end result. Understanding the dependencies between those components is important, but unless the dependencies are rigorously tested, systems can fail in unique and novel ways. Over the past hours we have devoted significant time and effort to understanding the nature of the cascading failure that left GBC unavailable.

Not only do we believe it is possible to prevent the events that took down a large part of our services, we will also take steps to ensure that recovery happens quickly and reliably, and to mitigate the impact of any such events on our users.
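
One concrete step in that direction is faster detection, since the sooner we know a service has stopped responding, the sooner recovery can begin. The sketch below shows a minimal availability probe of that kind; the URL, timeout, and alerting hook are illustrative placeholders rather than a description of our production monitoring.

```python
#!/usr/bin/env python3
"""Minimal sketch: probe a web endpoint and flag it when it stops responding.

The URL, timeout, and check interval are illustrative placeholders only.
"""
import time
import urllib.error
import urllib.request

URL = "https://example.com/health"   # placeholder endpoint
TIMEOUT_SECONDS = 5
CHECK_INTERVAL_SECONDS = 60


def is_available(url: str, timeout: float) -> bool:
    """Return True if the endpoint answers with a successful HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    while True:
        if not is_available(URL, TIMEOUT_SECONDS):
            # In a real setup this would page the on-call engineer.
            print("ALERT: service is not responding")
        time.sleep(CHECK_INTERVAL_SECONDS)
```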


In Conclusion

We realize how important GBC is to the cycle that enables your websites, services, and products to succeed. All of us at GBC would like to apologize for the impact of this outage. We will continue to analyze the events leading up to this incident and the steps we took to restore service. This work will guide us as we improve the systems and processes that power GBC's servers.