Five steps every developer team should take to combat cloud service outages

Across the industry, software development teams continue to adopt cloud-based toolchains. This trend makes a lot of sense for companies trying to boost productivity, efficiency, and speed of development in the age of hybrid and asynchronous working. But as we saw with the recent Jira outage, relying on a cloud-based technology stack creates risks. I’m not pointing fingers here. My own company offers a cloud-based productivity platform, and we, like all other cloud providers, have experienced outages. These events are unavoidable. As we become increasingly reliant on the cloud-based software model to run our businesses, it’s critical that teams understand the steps to take when outages occur.
Not all outages are created equal. Jira’s was high in severity but low in the number of customers affected. The reverse might be true for the next one you encounter. This is why it is essential to consider the possibility of failures when selecting your software vendors. There are several important considerations to keep in mind; we’ve narrowed them down to five.
Prepare for the inevitable
If you’re using a cloud-based solution, you know an outage is inevitable, but there’s no telling when, so make a plan. Internally, this means designating a single point of contact – an incident manager – who coordinates activity during the event, documents important information, and so on. Get buy-in from all stakeholders in your organization ahead of time, so that when an outage occurs everyone already agrees on the next steps to resolve the issue as quickly as possible.
Have a workaround (if possible)
Having a viable alternative available in the event of a failure is nice but obviously not always possible. Still, striving to preserve some level of productivity will, at the very least, mitigate some of the progress lost during an outage. Speaking from personal experience, my team has had to deal with GitHub outages several times. Knowing this will happen again, we are working on a workaround that lets our team keep doing something in the meantime. Before the next outage happens, you should also ask yourself whether there is a self-hosted option that gives you the benefits of the tool without relying on someone else’s infrastructure.
Choose a cloud-based provider that communicates status updates clearly and regularly
Due to the nature of cloud-based software, it is probably impossible to choose a company that will never experience an outage. However, you can examine how companies have handled outages in the past, the reliability of their software, and their typical response time. The SaaS industry is small, so be sure to ask your network about their experience with different companies and how those companies handle outages. Favor organizations that acknowledge and document an outage quickly, provide regular and transparent updates, and take these service interruptions seriously.
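To make that comparison concrete, here is a minimal Python sketch of the arithmetic. It assumes you have copied a vendor's past incidents by hand from their public status history. The `PastOutage` record, the `summarize` helper, and the sample dates are all hypothetical, and the calculation assumes the incidents do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class PastOutage:
    """One past incident, as read off a vendor's public status history."""
    started: datetime
    first_update: datetime  # first public acknowledgement by the vendor
    resolved: datetime


def summarize(outages: List[PastOutage], window: timedelta) -> dict:
    """Summarize reliability and responsiveness over an observation window.

    Assumes the outages do not overlap and all fall inside the window.
    """
    downtime = sum((o.resolved - o.started for o in outages), timedelta())
    summary = {
        "uptime_pct": 100.0 * (1 - downtime / window),
        "mean_minutes_to_first_update": None,
        "mean_hours_to_resolve": None,
    }
    if outages:
        ack = sum((o.first_update - o.started for o in outages), timedelta())
        summary["mean_minutes_to_first_update"] = ack.total_seconds() / 60 / len(outages)
        summary["mean_hours_to_resolve"] = downtime.total_seconds() / 3600 / len(outages)
    return summary


if __name__ == "__main__":
    # Made-up history for a hypothetical vendor over a 30-day window.
    history = [
        PastOutage(datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 9, 10),
                   datetime(2022, 1, 3, 11, 0)),
        PastOutage(datetime(2022, 1, 20, 14, 0), datetime(2022, 1, 20, 14, 30),
                   datetime(2022, 1, 20, 18, 0)),
    ]
    print(summarize(history, timedelta(days=30)))
```

Numbers like these are only a starting point. How transparent the updates were, and what your network says about the vendor, matter just as much.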
Communicate status updates clearly and regularly to internal stakeholders
In addition to your own team, internal stakeholders and upstream managers also need to understand what is happening with the outage. They shouldn’t have to ask your team whether there’s a problem when something isn’t working as it should. They may occasionally be the first to notice, but more often than not, the team experiencing the outage should be the first to communicate what is happening. There should be a single source of truth that delivers all of your official communications about the event. It’s OK to use multiple channels, but every message should come from one source to ensure consistency and accuracy of information.
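One way to picture "multiple channels, one source" is a small Python sketch in which each update is written once and then handed, unchanged, to every channel. Everything here is illustrative: `StatusUpdate`, `StatusBoard`, and the channel names are hypothetical, and each channel is just a function that accepts text. In practice it might wrap a chat tool or a mailing list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class StatusUpdate:
    """One official update about an outage (field names are illustrative)."""
    service: str
    state: str  # e.g. "investigating", "identified", "resolved"
    summary: str
    posted_at: datetime

    def render(self) -> str:
        stamp = self.posted_at.strftime("%Y-%m-%d %H:%M UTC")
        return f"[{stamp}] {self.service}: {self.state.upper()} - {self.summary}"


class StatusBoard:
    """Single source of truth: every channel receives the same rendered text."""

    def __init__(self) -> None:
        self.history: List[StatusUpdate] = []
        self._channels: Dict[str, Callable[[str], None]] = {}

    def add_channel(self, name: str, send: Callable[[str], None]) -> None:
        self._channels[name] = send

    def publish(self, service: str, state: str, summary: str,
                now: Optional[datetime] = None) -> StatusUpdate:
        update = StatusUpdate(service, state, summary,
                              now or datetime.now(timezone.utc))
        self.history.append(update)  # keep a record for the later write-up
        text = update.render()
        for send in self._channels.values():
            send(text)  # identical wording on every channel
        return update


if __name__ == "__main__":
    board = StatusBoard()
    chat, email = [], []  # stand-ins for real delivery mechanisms
    board.add_channel("chat", chat.append)
    board.add_channel("email", email.append)
    board.publish("Issue tracker", "investigating", "Vendor outage confirmed")
    print(chat[0])
```

Because the wording is produced in one place, stakeholders reading different channels never see conflicting accounts, and the stored history doubles as the incident manager's timeline.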
Write down what you would do differently
Dealing with an outage that negatively impacts your team’s productivity can be frustrating, especially if all you can do is wait for it to be fixed. However, these outages provide a great opportunity to think about what your business would do in the event of its own outage. As we mentioned earlier, outages are a hazard of doing business in the SaaS industry, and we can learn a lot from how our peers handle these situations. Whether the experience is good or bad, take notes on how you felt as a customer navigating the situation, and apply them when your own product suffers an outage.
Good luck!
Hopefully these points will make you and your team more resilient to the next outage. Although some of them may seem obvious, I’ve always found it helpful to make implicit advice explicit, especially since it’s useful to have specific steps to take in the face of chaos. It reduces confusion, calms nerves, and paves the way for productivity.