How an OpEx mindset can give you a better night’s sleep

Nov 3, 2022

You are a talented software developer who is on-call for the night. It’s past midnight, and you are suddenly notified that your system just hit a fatal error and hasn’t recovered. “How did we get here,” you think to yourself. “What have I done wrong?” You finally fix the issue, and in the morning your manager asks you what went wrong, but you’re not quite sure yourself. What steps can you take to stop the middle-of-the-night wake-up calls?

My name is Shay. I’m a software engineer with over 10 years of experience and a big champion of operational excellence (or OpEx). OpEx can help you sleep at night, and when you do have to wake up in the middle of the night, at least you won’t be clueless.

From personal experience, I’ve gathered some useful OpEx tips for improving system stability and system resiliency. These can help you resolve many issues and prevent many more.

Alerting and monitoring

I know this is a broad category, so let’s go over the basics and then dive into some aspects of alerting and monitoring that require a little more thought.

Memory and CPU

These basic metrics will help you understand whether your application needs more resources. They are low-hanging fruit and include things like checking whether your garbage collection is working as expected or whether you need to increase the number of pods. Conversely, if you use only 10% CPU and 12% memory on a virtual machine, you can also look at using smaller pod types to save money.

Let’s review this example: We can see we have a system that reaches ±94% of memory usage. It’s clear that this is not sustainable, and our system is at risk. But what is the pattern, and how can we understand what went wrong? To get a better understanding, we need to zoom in for more insights.

On this screenshot, we can see a pattern of sharp turns. This could be caused by a couple of things:

We are getting massive traffic, and the system needs to react.
We have a code issue, such as objects not being released, and we need to do some memory dump analysis.

From what I see on this graph, it’s more likely to be a code issue, and I’ll explain why.

The curve between Point A and Point B is too sharp; it’s like hitting a wall. Plus, Point C to Point D memory is not being released, meaning something or someone is locking up the memory.
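The two signals described above, a wall-like jump and memory that never comes back down, can be checked automatically. This is only a minimal sketch: the function names, the 20-point step, the 90% threshold, and the sample values are assumptions for illustration, not numbers taken from the graph.

```python
from typing import Sequence


def sharp_jumps(samples: Sequence[float], max_step: float = 20.0) -> list[int]:
    """Return the indexes where usage rose by more than max_step points in one sample."""
    return [i for i in range(1, len(samples)) if samples[i] - samples[i - 1] > max_step]


def stays_high(samples: Sequence[float], threshold: float = 90.0, min_run: int = 3) -> bool:
    """Return True if usage stays at or above threshold for min_run consecutive samples."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_run:
            return True
    return False


# Hypothetical memory usage samples, in percent.
memory_percent = [41.0, 43.0, 42.0, 94.0, 94.0, 93.0, 94.0]
print(sharp_jumps(memory_percent))  # [3]
print(stays_high(memory_percent))   # True
```

A jump followed by a plateau is a hint that memory is being held, not proof; it tells you when to reach for a memory analysis tool.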

Why? This is something that can be better understood with other tools for memory analysis.
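As one example of such a tool, and assuming a Python service (the article does not name a language), the standard library's tracemalloc module can compare two snapshots and show which source lines are holding on to the most new memory. The `retained` list below is a stand-in for objects that are never released.

```python
import tracemalloc


def top_growth(before, after, limit=3):
    """Return the source lines whose allocated memory changed the most between two snapshots."""
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()
retained = [bytes(1024) for _ in range(10_000)]  # stand-in for objects that are never released
after = tracemalloc.take_snapshot()
tracemalloc.stop()

for stat in top_growth(before, after):
    print(stat)
```

In a real investigation you would take the snapshots minutes or hours apart, around the points where the graph stops coming back down.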

Latency and lag metrics

These metrics express symptoms of system behavior, and problems with these metrics can lead to deeper issues. For example, inefficient code combined with major traffic issues can turn milliseconds into seconds; the CPU and memory might be fine, but the lag will increase.

Lag or latency shows up in response-time spans: the time it took to dequeue messages from a message queue, or how long it took an API to handle a request. The duration can be related to many issues, from connectivity and queue problems to unoptimized database queries and requests to remote servers. Tracking is the easy part; optimizing can be a lot trickier.
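To show how easy the tracking part can be, here is a minimal sketch of a timing decorator. The `timed` decorator, the in-memory `durations_ms` store, and the `handle_request` function are hypothetical; a real service would export these values to whatever metrics backend it already uses.

```python
import time
from functools import wraps

# Hypothetical in-memory store, keyed by metric name.
durations_ms: dict[str, list[float]] = {}


def timed(name):
    """Record how long each call to the wrapped function takes, in milliseconds."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are measured too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                durations_ms.setdefault(name, []).append(elapsed_ms)
        return wrapper
    return decorator


@timed("handle_request")
def handle_request(payload: dict) -> dict:
    time.sleep(0.01)  # stand-in for real work
    return {"ok": True, "echo": payload}


handle_request({"id": 1})
print(durations_ms["handle_request"])
```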

We can see here that we have two lines that overlap almost all the way. The purple is for messages received, and the blue is for messages sent. The system is working fine: there are no major gaps between the two lines, which means the dequeue process is keeping up and there is no lag.
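The gap between the two lines can also be turned into a number. The sketch below assumes that "received" counts messages entering the queue and "sent" counts messages processed out of it per interval; the function name and the sample counts are made up for illustration.

```python
def queue_lag(received: list[int], sent: list[int]) -> list[int]:
    """Backlog after each interval: cumulative messages received minus cumulative messages sent."""
    lag, total_in, total_out = [], 0, 0
    for r, s in zip(received, sent):
        total_in += r
        total_out += s
        lag.append(total_in - total_out)
    return lag


healthy = queue_lag([100, 120, 90], [100, 118, 92])
falling_behind = queue_lag([100, 120, 90], [80, 70, 60])
print(healthy)         # [0, 2, 0]
print(falling_behind)  # [20, 70, 100]
```

A backlog that hovers near zero matches the overlapping lines described above; a backlog that keeps growing is the lag you want an alert for.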

Here we can see latency ranging from 0 to 30 seconds. From here, we need to check whether we have timeout issues, inefficient code, or a bad query.

Understanding errors and exceptions

I consider these metrics to be the most important piece in understanding how your application works and what the user experiences. Let me explain: If you have a system that is not a human-oriented service, like an API, and you are seeing a lot of client errors (4xx HTTP error codes), then something is possibly wrong with your documentation. If you’re getting a lot of time-out errors, then something might be wrong with your query, or perhaps your database reached a tipping point where it cannot provide answers quickly enough. The errors will only increase over time, so don’t ignore them.

The other aspect of understanding errors is knowing where they come from. For example, let’s say you get a SQL error indicating that the value that was sent is forbidden for your query. This implies that your validation and sanitization need to be refined.
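As a rough illustration of refining validation, the sketch below rejects unexpected values before they reach the database and passes the accepted ones as bound parameters. The `users` table, the `ALLOWED_STATUSES` set, and the `find_users` function are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

ALLOWED_STATUSES = {"active", "suspended"}  # hypothetical allow-list


def find_users(conn: sqlite3.Connection, status: str) -> list[tuple]:
    """Validate the input first, then pass it as a bound parameter instead of formatting it into the SQL."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported status: {status!r}")
    query = "SELECT name FROM users WHERE status = ? ORDER BY name"
    return conn.execute(query, (status,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, status TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("ana", "active"), ("bo", "suspended")])
print(find_users(conn, "active"))  # [('ana',)]
```

With this shape, a bad value fails fast with a clear error in your own code, which is much easier to diagnose at night than a database error.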

This screenshot shows error distributions. The errors are there; they just need some attention and understanding, and they can shed new light on things.

We can also see that we have one error that holds 23%, and it’s a 403 error code (Forbidden). So, we need to understand what it is and why.
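A distribution like this can be computed from raw status codes in a few lines. This is a minimal sketch; the function name and the sample codes are invented and do not reproduce the 23% figure above.

```python
from collections import Counter


def error_distribution(status_codes: list[int]) -> dict[int, float]:
    """Return each error code's share of all errors (4xx and 5xx), in percent, largest first."""
    errors = [code for code in status_codes if code >= 400]
    if not errors:
        return {}
    counts = Counter(errors)
    return {code: round(100 * n / len(errors), 1) for code, n in counts.most_common()}


# Hypothetical sample of response codes.
codes = [200, 403, 403, 500, 404, 200, 403, 504]
distribution = error_distribution(codes)
print(distribution)
```

The code at the top of the result is the first one to investigate.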

Backup and recovery

It doesn’t matter whether you are using an on-premises service or a cloud-based one. The expectation is that you’re available 99% of the time. If you use Amazon Web Services, Google Cloud Platform, or Azure, this is easier to achieve: all of these services have regions where you can move your backend services and pods. Performing drills on how to move the services from one region to another will provide you with some great insights on where you can make improvements.
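It helps to translate an availability target into the downtime it actually allows. The small calculation below is plain arithmetic; the function name is made up, and the 99.9% line is included only for comparison.

```python
def allowed_downtime_hours(availability_percent: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted in the period for a given availability target."""
    return period_hours * (1 - availability_percent / 100)


print(round(allowed_downtime_hours(99.0), 1))  # 87.6 hours per year
print(round(allowed_downtime_hours(99.9), 2))  # 8.76 hours per year
```

At 99%, a year leaves room for about 87.6 hours of downtime, so the duration of a region-move drill is a number worth measuring against that budget.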

Relying on third-party services

You have your backups, and you’ve fixed many known issues, but suddenly, one of your critical third-party vendors crashes. You have a couple of options to fix the outage.

Critical flow

You may have a real-time service that provides information to other parts of the organization and relies on a third-party API. If it crashes or is not responding, then you need to ask yourself if you can use another API. If not, can you monitor the API and notify the API owners and downstream users in case it goes down? This will reduce the error rate and improve customer satisfaction. Plus, these actions can be automated.
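Both options, switching to another API and notifying people, can live in one small wrapper. This is a sketch under assumptions: `call_with_fallback`, the two vendor functions, and the list used as a notification channel are all hypothetical stand-ins.

```python
from typing import Callable


def call_with_fallback(
    primary: Callable[[], dict],
    secondary: Callable[[], dict],
    notify: Callable[[str], None],
) -> dict:
    """Try the primary vendor; on any failure, notify and fall back to the secondary."""
    try:
        return primary()
    except Exception as exc:
        notify(f"primary vendor failed: {exc}")
        return secondary()


def flaky_vendor() -> dict:
    raise TimeoutError("no response")  # simulates the third-party outage


def backup_vendor() -> dict:
    return {"rate": 1.08, "source": "backup"}


alerts: list[str] = []  # stand-in for a pager, chat, or email channel
result = call_with_fallback(flaky_vendor, backup_vendor, alerts.append)
print(result)
print(alerts)
```

If there is no second vendor, the same wrapper still gives you the notification half: let `secondary` return a clear "temporarily unavailable" response for downstream users.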

Non-critical flow

Some APIs are important to the business, but it’s not critical to get a response from them immediately. In this case, you can queue the requests and store the ones that fail in a dead-letter queue. Once the API recovers, you can then send the backlog to the API.
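The park-and-replay idea can be sketched with an in-memory queue. Everything here is illustrative: the function names, the `vendor` state dictionary that simulates an outage, and the use of `deque` in place of a real message broker.

```python
from collections import deque
from typing import Callable


def send_or_park(message: str, send: Callable[[str], None], dead_letters: deque) -> bool:
    """Try to deliver a message; if the API is unreachable, park it for later."""
    try:
        send(message)
        return True
    except ConnectionError:
        dead_letters.append(message)
        return False


def replay(dead_letters: deque, send: Callable[[str], None]) -> int:
    """Retry each parked message once; anything that fails again is parked again."""
    delivered = 0
    for _ in range(len(dead_letters)):
        message = dead_letters.popleft()
        if send_or_park(message, send, dead_letters):
            delivered += 1
    return delivered


vendor = {"up": False}  # hypothetical switch that simulates the outage
delivered_log: list[str] = []


def send(message: str) -> None:
    if not vendor["up"]:
        raise ConnectionError("vendor API is down")
    delivered_log.append(message)


dlq: deque = deque()
for msg in ["a", "b", "c"]:
    send_or_park(msg, send, dlq)

vendor["up"] = True  # the API recovers
replayed = replay(dlq, send)
print(replayed, delivered_log)  # 3 ['a', 'b', 'c']
```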

For both flows, I recommend adding metrics and KPIs to measure your customers’ impact. This can open up some new pathways you hadn’t considered, including moving to other third-party vendors.

Documentation and runbooks

So far, we’ve focused on prevention, but systems will eventually fail regardless, and you will be paged at night at some point in your career as a developer. So, now, let’s focus on troubleshooting.

As I see it, it’s important to ensure that someone other than you can handle a malfunction quickly and get the system back up and running in no time. When creating runbooks for other developers at your organization, you need to be operational. Don’t tell a story; just write out instructions step by step, as if you were writing out a recipe.

Your API documentation should define the input for your application, system, or API. The main idea is to help the user send you the correct input and, therefore, reduce the number of errors and issues your application will encounter.

There is a phrase I truly believe: “Eat your own dog food.” This means that you need to write your documentation as clearly as you would if you were the reader! And be sure to add a screenshot.

Automation

Many (and I mean many!) actions that are usually done manually can and should be done automatically to reduce human error. It all comes down to confidence in writing the right flow for recovery. Have you included every small detail?

If your answer is yes, then start implementing an automated flow, but do so gradually as you monitor it. Gradual implementation will help you catch errors before they cause a major issue. This is critical because any small error can jeopardize the application, making the recovery flow longer and more complex.
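One possible way to introduce an automated flow gradually is to run it in a dry-run mode first, so you can review what it would do before letting it act. This is an assumption about how to stage the rollout, not a prescribed method, and the step names below are hypothetical.

```python
from typing import Callable


def run_recovery(steps: list[tuple[str, Callable[[], None]]], dry_run: bool = True) -> list[str]:
    """Run recovery steps in order; in dry-run mode, only report what would happen."""
    log = []
    for name, action in steps:
        if dry_run:
            log.append(f"DRY RUN: would {name}")
            continue
        action()
        log.append(f"done: {name}")
    return log


performed: list[str] = []  # records which hypothetical actions really ran
steps = [
    ("restart the worker", lambda: performed.append("restart")),
    ("replay the backlog", lambda: performed.append("replay")),
]

print(run_recovery(steps))                 # nothing is executed
print(run_recovery(steps, dry_run=False))  # the steps actually run
```

Once the dry-run log matches what an engineer would have done by hand a few times in a row, switching `dry_run` off is a much smaller leap of faith.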

Some final words

I know I’ve shared only a small subset of the OpEx mindset, but I hope it can help your company save money by optimizing your services, using the right database(s), and reducing unnecessary workloads.

When building a system, designing it is far easier than actually maintaining it. Getting up at night due to a malfunction is annoying, and it can be limited, but that takes time and understanding. Once you’ve taken on this OpEx mindset, however, you and your team will have better nights with more sleep and more confidence that you’re using the right tools.

Happy coding!