On monitoring from a (slightly) different point of view
You’ve just launched an incredible service, taking precautions to debug it, but something’s not right. You’ve read all the logs, leveraged your Kubernetes logging or serverless solution, and you still can’t figure out where it all fell apart or, worse, why. So, what’s your next move?
As a software engineer with more than 10 years of industry experience, I’ve learned over time that it’s critical to check what’s being monitored, how it’s being monitored, and, most importantly, why.
What is being monitored?
It’s not enough to monitor only system infrastructure and business metrics. It’s also important to monitor system resources such as databases, queues, and so on. You may monitor a database for resource and connection usage, but do you have a clear picture of how many connections each part of the service is using, and for how long? These factors are mission-critical: they give you a better understanding of the status of the system and let you pinpoint issues more easily when they occur.
It may also allow you to identify unwanted or unexpected anomalies. Port scanning, for instance, can reveal other services using the same cluster, which can lead to connection draining and other issues. These days, almost every resource can integrate with almost every monitoring tool, and the leading ORMs expose APIs that work with Micrometer, Dropwizard, or whichever application-metrics library you prefer.
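As a rough illustration of per-component connection monitoring, here is a minimal Micrometer sketch. The OrderRepositoryPool class and its activeConnections() method are hypothetical stand-ins for whatever pool your ORM or driver actually exposes.

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ConnectionPoolMetrics {

    // Hypothetical pool wrapper; substitute your real connection pool here.
    static class OrderRepositoryPool {
        int activeConnections() { return 7; }
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        OrderRepositoryPool pool = new OrderRepositoryPool();

        // One gauge per component, tagged so you can slice connection usage
        // by the part of the service that is holding the connections.
        Gauge.builder("db.connections.active", pool, OrderRepositoryPool::activeConnections)
             .tag("component", "order-repository")
             .register(registry);

        System.out.println(registry.get("db.connections.active").gauge().value());
    }
}
```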
We’ve talked about getting your system up, but keeping it up and running is a little more involved. If you’re not measuring system latency and success rates, you can’t know what’s happening under the hood, and you can’t correlate multiple metrics to arrive at the right values.
For example, let’s say we have an asynchronous architecture: one service exposes a RESTful API, while the services behind it communicate through a message queue for persistence. The REST API might respond with a 2XX status (the OK family), while the work still fails in one of the downstream services. If those calls aren’t correlated by a key, you won’t know what went wrong, and your customer experience will suffer.
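One common way to keep that correlation is to attach a correlation ID to every message. The sketch below is only illustrative: the QueueClient interface and its publish method are hypothetical placeholders for your real broker client (Kafka, RabbitMQ, SQS, and so on).

```java
import java.util.Map;
import java.util.UUID;

public class OrderPublisher {

    // Hypothetical broker abstraction; replace with your real client.
    interface QueueClient {
        void publish(String topic, Map<String, String> headers, String payload);
    }

    private final QueueClient queue;

    OrderPublisher(QueueClient queue) {
        this.queue = queue;
    }

    // The REST handler can return 2XX immediately, but the correlation ID
    // travels with the message so downstream failures can be tied back
    // to this specific request.
    public String acceptOrder(String orderJson) {
        String correlationId = UUID.randomUUID().toString();
        queue.publish("orders", Map.of("correlationId", correlationId), orderJson);
        return correlationId; // return it to the caller and log it on every hop
    }
}
```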
Finally, the most important step is to collect and classify the errors you encounter along the way. Some errors might be inconsequential, but others can indicate major flaws in the service. Invalid user input, for instance, could indicate one of two major problems:
- There may be a mistake at the user interface level, such as a poorly written web page or API, which can ultimately result in the loss of clients.
- You may be dealing with poor data sanitization and validation, the severity of which depends on when the error occurs and why.
In addition to taking these approaches, it’s important to consider what reflects the system’s success rate, where the system encounters errors, and what types of errors you’re dealing with, as the sketch below illustrates.
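As a rough sketch of that classification, assuming Micrometer and error categories of my own choosing (they don’t come from any particular framework), you could count errors by type and by the layer where they occur:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ErrorClassifier {

    private final MeterRegistry registry;

    ErrorClassifier(MeterRegistry registry) {
        this.registry = registry;
    }

    // Classify each error by type and by the layer it occurred in, so dashboards
    // can separate harmless input mistakes from fatal service flaws.
    public void record(String errorType, String layer) {
        Counter.builder("service.errors")
               .tag("type", errorType)   // e.g. "validation", "io", "fatal"
               .tag("layer", layer)      // e.g. "api", "persistence", "queue"
               .register(registry)
               .increment();
    }

    public static void main(String[] args) {
        ErrorClassifier classifier = new ErrorClassifier(new SimpleMeterRegistry());
        classifier.record("validation", "api");
        classifier.record("fatal", "persistence");
    }
}
```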
Monitoring collection techniques and tool belts
There are two types of monitoring tools: log-collection tools and application-metrics tools. They differ greatly.
The market leaders in application-metrics tools are Micrometer and Dropwizard. Both can be used with a Prometheus server or another backend to expose metrics. These metrics are “live,” arriving with very little delay, and they help catch issues in near real time, such as spikes in errors or trends in system usage, performance, and more.
To keep the set of metrics lean, you can add metadata, such as tags, to a single metric and create data slices from it. These metrics should be watched continuously so you can be alerted to any major flaw. Here is a minimal slice-and-dice example:
http_server_requests_seconds_count{caller="xxxxx",exception="None",method="POST",status="200",uri="/fake"} 12.0
http_server_requests_seconds_count{caller="yyyy",exception="None",method="POST",status="200",uri="/fake"} 1.0
http_server_requests_seconds_count{caller="zzzzz",exception="FatalHTTPException",method="PUT",status="500",uri="/fake"} 10.0
http_server_requests_seconds_count{caller="xxxxx",exception="HTTPException",method="GET",status="500",uri="/fake"} 2.0
Here, you can see a single metric with several tags. This metric is exposed by most HTTP frameworks; it carries five tags here, and it shows that different callers are having different experiences.
Let’s break it down. The URI “/fake” has three callers:
- Caller zzzzz invoked the PUT method 10 times and failed every time with FatalHTTPException.
- Caller xxxxx made 12 successful POST calls and failed twice on GET with HTTPException.
- Caller yyyy made a single POST call, which succeeded.
The same metric with different tags can provide insights into user experience, system abuse, and more.
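If your framework doesn’t add a caller tag for you, a minimal Micrometer sketch of recording something similar yourself might look like this. The metric name and tag set mirror the sample above; in a real service the caller value would come from authentication or a request header, and the Prometheus registry would render a timer named "http.server.requests" as http_server_requests_seconds_count and friends.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class RequestMetrics {

    private final MeterRegistry registry = new SimpleMeterRegistry();

    // Record one timed observation per request, tagged the same way as the
    // exposition sample above: caller, method, status, uri, exception.
    public void record(String caller, String method, int status,
                       String uri, String exception, Duration elapsed) {
        Timer.builder("http.server.requests")
             .tag("caller", caller)
             .tag("method", method)
             .tag("status", Integer.toString(status))
             .tag("uri", uri)
             .tag("exception", exception)
             .register(registry)
             .record(elapsed);
    }

    public static void main(String[] args) {
        new RequestMetrics().record("xxxxx", "POST", 200, "/fake", "None", Duration.ofMillis(42));
    }
}
```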
Log-collection tools, on the other hand, provide deeper insight into service flow, helping you make sense of where clients spend their time in a service and why. Since this data does not flow in real time, though, you can’t react to errors as quickly as you can with metrics.
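To make those logs easier to correlate afterwards, one common approach, sketched here with SLF4J’s MDC (the correlationId key is my own naming), is to attach the same correlation ID used earlier to every log line:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderConsumer {

    private static final Logger log = LoggerFactory.getLogger(OrderConsumer.class);

    // Put the correlation ID from the message headers into the logging context,
    // so every line from this flow can be grouped in your log-collection tool.
    public void handle(String correlationId, String payload) {
        MDC.put("correlationId", correlationId);
        try {
            log.info("processing order payload of {} bytes", payload.length());
            // ... business logic ...
        } catch (Exception e) {
            log.error("order processing failed", e);
            throw e;
        } finally {
            MDC.remove("correlationId");
        }
    }
}
```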
Self-reflection
After understanding what you’re monitoring and which tools you have at your disposal, consider what else you can monitor and what benefits additional monitoring will provide. Take your time, write it all down, and share your insights with colleagues. This is the most critical part of the process, as it will provide you with a solid base for any new metrics and features you plan to add. By understanding what you’ve missed, you will be able to avoid the same mistakes in the future.
Monitor wisely, and happy coding!