Effective Strategies for Reducing the Blast Radius in Deployments

Shay Dratler
Feb 7, 2024

NightCafe (nuclear explosion coming out of a computer keyboard in a corporate office)

Your company has suddenly informed you that you need to make a critical change, such as moving from a data center to a cloud solution, moving between clouds, or changing authentication and authorization providers.

In other words, big changes that might impact your applications and services.

So what options do you have on the table, and how can you mitigate the risks?

Say it out loud

Before taking any action, it is crucial to tell your customers that maintenance is planned. This is the most critical step in building credibility with your customers, and failing to do so could damage your reputation. Even if you are confident that there will not be any downtime, neglecting to inform customers of potential risks can be a costly mistake. As we all know, Murphy’s Law dictates that if anything can go wrong, it will, and usually at the worst possible time. Therefore, communicating transparently and proactively with your customers must be a top priority.

Deployment strategies

There are two primary deployment strategies when it comes to upgrading systems. The first is an immediate approach, often referred to as “Big Bang.” This strategy involves deploying everything at once, without any incremental changes. The second approach is a gradual deployment, where changes are made incrementally over time. Which one you choose usually depends on how urgent the upgrade is. It is important to note that, regardless of the chosen strategy, it is always possible to reduce the impact or “blast radius” of any issues that arise during deployment. Now, let’s examine both strategies more closely.

Going with a Big Bang: doing it in one shot

While deploying all at once is not the ideal option, sometimes it’s necessary due to time constraints or because it’s critical to follow a specific path. If you find yourself in this situation, there are several steps you can take to reduce the risk involved:

  1. Tests: Make your test coverage as complete as you can, including edge cases, regression tests, and smoke tests. This limits the blast radius and gives you some confidence in the release.
  2. Proactive Alerts: Set up alerts that provide advance warning of any issues, including both technical and business alerts. For example, if there’s a daily process, create an alert to ensure that it’s running correctly.
  3. Proactive Monitoring: Implement proactive monitoring to quickly respond to any issues that may arise. This will help reduce risks and ensure that any problems are addressed in a timely manner.
  4. Behavior Analytics: Measure your application/system behavior after deployment and pay attention to customer feedback. This will help you identify possible issues or regressions and address them before they become bigger problems.
  5. Listen to Your Customers: If you have informed your customers of maintenance and they experience issues after deployment, reach out to them through support channels such as Slack or Discord. This will allow you to quickly address any unexpected challenges and maintain your reputation and credibility.
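
The daily-process alert from step 2 could be sketched roughly as follows. This is only an illustration: the function name `daily_job_is_stale`, the 26-hour window, and the idea that the job records a "last success" timestamp somewhere are all assumptions, not part of any specific monitoring tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def daily_job_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: 24h plus some slack
) -> bool:
    """Return True when the daily job has not succeeded within max_age."""
    if last_success is None:
        # The job never reported a success, so treat it as an alert condition.
        return True
    return now - last_success > max_age


# Hypothetical usage: in practice last_success would come from your job store.
now = datetime(2024, 2, 7, 12, 0, tzinfo=timezone.utc)
last_success = now - timedelta(hours=30)
if daily_job_is_stale(last_success, now):
    print("ALERT: the daily job has not completed in the expected window")
```

In a real setup the check would run on a schedule and page someone instead of printing, but the core idea stays the same: alert on the absence of an expected event, not only on errors.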

However, even with all of these steps in place, it’s important to keep in mind that it may take time for shockwaves to settle after a big deployment. Be prepared for the possibility of days or even weeks of potential issues and be ready to respond quickly and efficiently.

Splitting it into small pieces

Opting for a gradual deployment approach can provide you with more flexibility and reduce the risks involved. While it’s still important to follow all the steps of the “big bang” approach, there are additional techniques you can use to ensure that the deployment process runs smoothly:

  1. Gradual Deployment: Split the changes into several sub-deployments to reduce noise. For instance, if you need to transition to a cloud solution from a data center, move some traffic to the cloud and measure its performance.
  2. Wide Deployment: Notify the customers in one region and deploy the changes there first. If there are issues, you can still roll back with minimal impact. This approach works best if you have multiple clusters located in different regions.
  3. Blue-Green Deployment: Duplicate the current system, make the necessary alterations in the duplicate environment, and then direct traffic from the old environment to the new one. This approach reduces downtime and allows for quick rollback in the event of any issues.
  4. Canary Release: As an alternative to a Blue-Green deployment, release the changes to a small subset of users first and then gradually roll them out to the majority of users. This approach makes it easy to detect issues and fix them before they affect the entire user base.
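
One common way to pick the "small subset of users" for a canary release is to hash each user ID into a stable bucket. The sketch below assumes string user IDs; the names `in_canary` and `pick_backend` and the `salt` value are hypothetical, and real setups often do this in a load balancer or feature-flag service instead.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int, salt: str = "release-1") -> bool:
    """Deterministically decide whether a user belongs to the canary group."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


def pick_backend(user_id: str, canary_percent: int) -> str:
    """Route canary users to the new system and everyone else to the old one."""
    return "new" if in_canary(user_id, canary_percent) else "old"


# Hypothetical usage: start at 5% and raise the percentage as confidence grows.
print(pick_backend("user-42", canary_percent=5))
```

Because the bucket depends only on the salt and the user ID, a user who is in the canary at 5% stays in it at 20%, so nobody flips back and forth between the old and new system while the percentage grows.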

However, it’s imperative to remember that even with smaller risks, big changes can have significant impacts and still go wrong. Be prepared for unforeseen challenges and have a contingency plan in place in case things do not go as planned.
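
A contingency plan for a gradual rollout can be as simple as "stop and roll back when the error rate crosses a threshold." The following is a minimal sketch of that idea. The step percentages, the 1% threshold, and the `error_rate_at` callback are all assumptions for illustration; in practice the callback would shift traffic, wait, and read the error rate from your monitoring system.

```python
from typing import Callable, Iterable, List, Tuple


def staged_rollout(
    steps: Iterable[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,  # assumed threshold: 1% of requests failing
) -> Tuple[int, List[str]]:
    """Increase traffic step by step and roll back to 0% on a bad reading.

    Returns the final traffic percentage on the new system and a log of steps.
    """
    current = 0
    log: List[str] = []
    for percent in steps:
        rate = error_rate_at(percent)
        if rate > max_error_rate:
            log.append(f"{percent}%: error rate {rate:.2%} too high, rolling back")
            return 0, log
        current = percent
        log.append(f"{percent}%: error rate {rate:.2%} ok")
    return current, log


# Hypothetical usage with a fake metric that degrades at 50% traffic.
final, log = staged_rollout([1, 10, 50, 100], lambda p: 0.05 if p >= 50 else 0.001)
print(final)
print("\n".join(log))
```

The useful property is that the rollback decision is made before the deployment starts, not improvised in the middle of an incident.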

It’s a wrap!

After deploying your changes, it is critical to remain vigilant. Stay aware of any customer issues that may arise, whether they are communicated through Slack, Discord, phone calls, or other means. Be sure to actively listen and take note of any concerns, as crucial details may have been overlooked during the deployment process. Additionally, it is essential to check dashboards, alerting systems, and monitoring tools in the following days. Significant changes sometimes need time to show side effects, so it is imperative to be prepared. Ideally, the upgrade succeeds on the first attempt, but even if a second attempt is required, remain calm and positive. Wishing you good luck and happy coding!