The Challenges That Come with Monitoring for Distributed Systems
Monitoring a system starts out simple enough. Let’s say you’re running a simple web application backed by a single database. You can monitor it easily with a checklist of three questions:
Is the web server up?
Is the database up?
Can the two communicate?
As long as the answer to all three questions is yes, your application is running. However, if even one of the answers is no, your system has failed, and you need to investigate and resolve the problem.
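As a minimal sketch, the three-question checklist can be expressed as a set of named checks that must all pass. The check names and the lambda stand-ins below are hypothetical; in a real setup each one would be replaced by an actual probe, such as an HTTP request or a database ping.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check and record whether it passed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed check.
            results[name] = False
    return results


def is_healthy(results: Dict[str, bool]) -> bool:
    """The system counts as healthy only if every check passed."""
    return all(results.values())


# Hypothetical stand-ins for real probes.
checks = {
    "web_server_up": lambda: True,
    "database_up": lambda: True,
    "web_can_reach_database": lambda: False,
}

results = run_checks(checks)
for name, passed in results.items():
    print(f"{name}: {'yes' if passed else 'no'}")
print("healthy" if is_healthy(results) else "failed, needs investigation")
```

A single failing answer is enough to mark the whole system as failed, which mirrors the checklist above.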
Things get more complicated as a system grows, however, and the most complex case you are likely to meet today is the distributed system. To put it simply, a distributed system is one where many different processes make up the application, and they run on different machines. Some of them can run on-site while others run in the cloud. Some of them can run on physical machines while others run on virtual machines. For some of these processes, the specific machine they run on doesn’t even matter, such as when you’re dealing with a platform-as-a-service. Monitoring a distributed system is about making sure all these processes are not just running but running well.
When your system is spread over multiple processes, the network of communication pathways between them grows much faster than the number of processes: the number of possible pairs grows quadratically. If you’re dealing with five processes, for example, any two of them may need a reliable channel of communication. That means there could be up to 10 point-to-point connections, and any of these could potentially fail. When you first deploy the system, of course, some of these connections may be more important than others. However, in time, as the system grows, the connections that get used most often will likely change as well.
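The figure of 10 connections for five processes comes from counting every possible pair, which for n processes is n(n-1)/2. A short sketch (the process names are placeholders) shows how quickly that number climbs:

```python
from itertools import combinations


def channel_count(n: int) -> int:
    """Number of distinct point-to-point channels among n processes."""
    return n * (n - 1) // 2


# Placeholder process names, used only to enumerate the pairs.
processes = ["A", "B", "C", "D", "E"]
channels = list(combinations(processes, 2))

print(len(channels))  # 10 pairs for five processes
for n in (5, 10, 50, 100):
    print(n, "processes ->", channel_count(n), "possible channels")
```

Doubling the number of processes roughly quadruples the number of channels that could fail, which is why monitoring each link by hand stops being practical.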
As it turns out, complexity isn’t the only challenge you will deal with when you’re monitoring a distributed system. There are other patterns and techniques common to distributed systems that you should consider when you think about monitoring your own system.
When You’re Not Sure What a Failure Looks Like
When you design a distributed system, you want to build it in such a way that it can tolerate failure. Usually, there is a queue that holds messages between processes so that they can communicate with each other without all of them having to be available and online at the same time. This makes your system more resilient. However, it may also make some problems harder to detect.
If one of your servers needs to apply operating system updates and restart, the rest of your system carries on just fine. The server will do its thing and come back online after a while. If a process isn’t running at a given moment, that doesn’t necessarily mean it has failed. It may be undergoing maintenance. However, it could also mean there is a problem that needs your attention.
If you’re not monitoring your system, especially if you don’t have appropriate distributed system monitoring software, a process could stay offline longer than it should without anyone noticing. Your website could be online and taking your customers’ orders while your backend isn’t charging their credit and debit cards. It might be a while before you realize you haven’t been receiving any money.
Sometimes, these problems happen at an even more granular level than the process: the messages themselves fail, for example because a message is malformed or because of a programming error.
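One common way to tell a planned outage from a real problem is to track when each process last reported in and to alert only when the silence exceeds what is expected. The sketch below assumes each process sends a periodic heartbeat and that maintenance windows are registered in advance; the `ProcessState` type, the process names, and the 60-second threshold are all illustrative choices rather than part of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessState:
    last_heartbeat: float                      # time of the last heartbeat, in seconds
    maintenance_until: Optional[float] = None  # end of a planned maintenance window, if any


def classify(state: ProcessState, now: float, max_silence: float) -> str:
    """Return 'ok', 'maintenance', or 'alert' for a single process."""
    silent_for = now - state.last_heartbeat
    if silent_for <= max_silence:
        return "ok"
    if state.maintenance_until is not None and now <= state.maintenance_until:
        # Silent, but inside a planned window: expected downtime.
        return "maintenance"
    # Silent for too long with no planned window: someone should look.
    return "alert"


now = 1000.0
states = {
    "web": ProcessState(last_heartbeat=990.0),
    "database": ProcessState(last_heartbeat=800.0, maintenance_until=1100.0),
    "billing": ProcessState(last_heartbeat=400.0),
}

for name, state in states.items():
    print(name, classify(state, now, max_silence=60.0))
```

Here the hypothetical billing process has been silent for ten minutes with no maintenance window, so it is the one flagged, even though the website itself still looks healthy.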
Large Distributed Systems
The way modern distributed systems are built, they should scale easily. With the surge in virtualization and cloud infrastructure, spinning up new processes has become a lot easier than commissioning expensive servers for a single system. To know which part of your system to scale, when to scale it, and by how much, you need to collect and analyze a large amount of data.
Once you achieve the scale you want, the amount of data you collect and analyze grows too. Not only do you have to collect data from many more servers, you also have to bring that data together so that your analysis treats your system as a single whole, rather than a collection of parts.
The key to monitoring a distributed system is partly being able to collect large amounts of data, but it is also knowing how to query that data and then interpret it to reach the right conclusions.
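To illustrate why the data has to be combined before it is analyzed, here is a small sketch that merges per-server counters into one system-wide view. The server names, the field names, and the numbers are made up for the example.

```python
from typing import Dict, List


def aggregate(samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-server request and error counts into system-wide totals."""
    total_requests = sum(s["requests"] for s in samples)
    total_errors = sum(s["errors"] for s in samples)
    # Guard against division by zero when no traffic was recorded.
    error_rate = total_errors / total_requests if total_requests else 0.0
    return {
        "requests": total_requests,
        "errors": total_errors,
        "error_rate": error_rate,
    }


# Made-up samples from two servers with very different traffic levels.
samples = [
    {"server": "app-1", "requests": 900, "errors": 9},
    {"server": "app-2", "requests": 100, "errors": 41},
]

per_server_rates = [s["errors"] / s["requests"] for s in samples]
naive_average = sum(per_server_rates) / len(per_server_rates)
system = aggregate(samples)

print("per-server error rates:", per_server_rates)
print("average of per-server rates:", naive_average)
print("system-wide error rate:", system["error_rate"])
```

Averaging the two per-server rates gives a much higher figure than the share of requests that actually failed across the system, because the busier server handled most of the traffic. Looking at each part in isolation and looking at the whole can lead to different conclusions.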
With all that complexity, you don’t want to do it alone. It is better to get a distributed system monitoring service to do the heavy lifting for you. That way, you can focus on what matters, which is making the system work for your customers.