Building scalable systems has become more accessible over the past decade thanks to immutable infrastructure, containers, and orchestration platforms such as Kubernetes. As the complexity of these applications continues to grow, the industry will need to embrace a culture of observability in order to keep these systems operating efficiently and effectively. That means developing software with modern frameworks, practices, and tools that allow monitoring systems to gather the data needed for troubleshooting problems and analyzing performance.
This is the final post in our Scaling Microservices series, and if you’ve been following along, you’re keenly aware of the tremendous challenges of managing the performance of distributed, microservice-based systems. To understand the motivation behind building scalable systems, we need only ask one question: “Are my users happy?”
We will unpack this question by looking at the problem from the server side (API) in this post and the client side (browser) in the next, sharing along the way our perspectives on what it takes to monitor modern cloud-native applications.
Are my users happy?
To answer that question, we must be able to ask whether the scalable systems we’re building meet the needs of the business and its customers. The past several posts in this series have outlined many of the best technical approaches to building scalable software, but we haven’t conveyed why we do it or how we measure the success of our efforts.
If you’re responsible for the management and scaling of those systems, you already know the reason why systems are decomposed and distributed in the first place:
- Individual components carry less responsibility and fewer shared internal dependencies, making iterative changes and testing easier to execute
- Scalable systems enable the business to serve more customers and features without a degradation in performance
Two distinct problems emerge from these requirements:
- How do we manage the complexity of microservices and the rate of change when the system has become decoupled and distributed?
- How do we ensure new functionality and scale-out/up events are not impacting the customer experience?
Let’s dive into the first problem statement: How do we understand and manage the complexity of microservices when the system has been designed to be decoupled and distributed?
The answer to this question is surprisingly simple: instrumentation! The concept isn’t new or foreign – it’s been around since the 1960s, and a great example comes from how NASA managed to put a spacecraft on the Moon with a computer that had roughly the processing power of an Apple II. When engineers set out to process telemetry from the hundreds of sensors on the spacecraft, they discovered the problem wasn’t as difficult as it first appeared. Here’s what they came up with:
- The amount of data needed to fly the spacecraft turned out to be relatively manageable, even for the flight computer onboard the Apollo spacecraft. It turns out that in the vacuum of space, the environment is consistent enough to keep the required sample rates manageable.
- Telemetry deemed non-critical wasn’t processed on board; it was sent back to Earth for processing and analysis.
Let’s jump forward 50 years or so to the present and see how control theory applies today. Control theory tells us that in order to understand a system we must analyze the data gathered from its inputs and outputs. In modern distributed systems, that can be achieved with distributed tracing, which injects correlation IDs into request headers at run time. Because data is gathered from every transaction in your application, advanced monitoring tools, such as Instana, can build an entity-based model of the monitored environment in real time. To learn more about distributed system tracing, be sure to check out our documentation on how distributed tracing works.
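To make the mechanism concrete, here is a minimal sketch of correlation-ID propagation. The `X-Correlation-ID` header name and the plain-dict representation of headers are illustrative assumptions, not Instana’s actual implementation:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # illustrative header name

def inject_correlation_id(headers):
    """Ensure an outgoing request carries a correlation ID.

    If the incoming request already had one, propagate it unchanged so
    every hop in the call chain can be stitched into a single trace.
    """
    if CORRELATION_HEADER not in headers:
        headers[CORRELATION_HEADER] = str(uuid.uuid4())
    return headers

# An upstream service starts a trace...
outgoing = inject_correlation_id({})
# ...and a downstream service propagates the same ID instead of minting a new one.
propagated = inject_correlation_id(dict(outgoing))
assert propagated[CORRELATION_HEADER] == outgoing[CORRELATION_HEADER]
```

In practice this injection is done for you by the tracing agent or middleware; the point is that a single ID rides along with the request across every service boundary.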
How do I know if the users are happy?
The task of instrumenting your applications can range from trivial to astonishingly complex, depending on your technology stack and your monitoring solution. In an ideal world, you’re running a modern language with mechanisms that allow dynamic loading of instrumentation libraries. Instana’s AutoTrace functionality allows our agent to instrument your applications at runtime for several supported languages. Once your applications are properly instrumented and dispatching trace and metric data, we can get down to organizing and analyzing that data.
Using the data generated from the inputs and outputs of our system, we can not only build a visual representation of the relationships between individual entities but also examine key performance indicators to determine the health and responsiveness of those interactions.
Visual representation of the interactions between dependencies in a production system
Examining latency of interactions on downstream components
In the above examples, we can quickly spot areas where increases in latency or error rates may be causing poor experiences for customers. Extracting this level of detail from a monitoring solution goes back to control theory: we must collect every individual interaction that occurs between the many different services over an application’s lifetime, then aggregate and analyze those interactions to deliver actionable insights.
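The aggregation step can be sketched in a few lines. The span records and field names below are hypothetical stand-ins for whatever a real trace pipeline emits; the rollup into per-service call counts, mean latency, and error rate is the idea being illustrated:

```python
from collections import defaultdict

# Hypothetical span records, one per service-to-service interaction.
spans = [
    {"service": "checkout", "duration_ms": 120, "error": False},
    {"service": "checkout", "duration_ms": 450, "error": True},
    {"service": "payments", "duration_ms": 80, "error": False},
    {"service": "payments", "duration_ms": 95, "error": False},
]

def aggregate(spans):
    """Roll individual interactions up into per-service KPIs."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["service"]].append(span)
    kpis = {}
    for service, items in grouped.items():
        durations = [s["duration_ms"] for s in items]
        kpis[service] = {
            "calls": len(items),
            "mean_latency_ms": sum(durations) / len(durations),
            "error_rate": sum(s["error"] for s in items) / len(items),
        }
    return kpis

print(aggregate(spans))
# checkout: 2 calls, mean 285.0 ms, 50% errors; payments: 2 calls, mean 87.5 ms, 0% errors
```

A production backend does this continuously over millions of spans, but the shape of the computation is the same.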
This turns out to be a tremendous amount of data, all of which must be processed, analyzed, and stored. It’s a good opportunity to call back to the Apollo missions, where the majority of the data collected wasn’t needed by the onboard flight systems. Most data was sent back to mission control, where it was aggregated and analyzed by far more powerful systems.
One of the more powerful uses of this data is to analyze the metrics aggregated from these transactions with machine learning, applying change-point detection to spot anomalies. This frees the SRE team from having to specify an alert for every service and endpoint, a near-impossible task when microservice environments often include hundreds of services and thousands of endpoints, with more added regularly.
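To give a feel for change-point detection, here is a toy version that compares the mean latency of a recent window against a trailing baseline; the `window` and `threshold` values are illustrative, and real monitoring backends use far more sophisticated statistical models:

```python
import statistics

def detect_change_point(latencies, window=5, threshold=2.0):
    """Return the start index of the first window whose mean latency
    deviates from the preceding baseline window by more than
    `threshold` standard deviations, or None if no shift is found.
    """
    for i in range(window * 2, len(latencies) + 1):
        baseline = latencies[i - 2 * window : i - window]
        recent = latencies[i - window : i]
        mean, stdev = statistics.mean(baseline), statistics.pstdev(baseline)
        if stdev == 0:
            continue  # perfectly flat baseline; nothing to compare against
        if abs(statistics.mean(recent) - mean) > threshold * stdev:
            return i - window  # start of the window containing the shift
    return None

# Steady ~100 ms latencies followed by a sudden regression to ~300 ms.
series = [101, 99, 100, 102, 98, 100, 99, 101, 300, 310, 305, 298, 302]
print(detect_change_point(series))  # flags the window containing the jump
```

No per-endpoint threshold had to be configured; the detector learns the baseline from the data itself, which is what makes the approach tractable across thousands of endpoints.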
Instana processes and delivers interactive analytics dashboards based on non-sampled trace aggregates
Many solutions sample in order to alleviate the processing and storage burden brought on by the sheer volume of data collected from distributed tracing. Sampling may be deterministic, using pre-determined sample rates that require additional input from operators, or adaptive, analyzing qualities of the payload or feedback from the collecting system to determine sample rates.
P99s are lost with solutions which rely on sampling
Sampling always involves a trade-off: the resolution at which you see a highly dynamic and volatile environment is diminished. If we discard 999 out of every 1,000 transactions, we lose details that may have been important, such as a sudden spike in latency or fallbacks being triggered. Just because a transaction is non-erroneous doesn’t mean it’s boring or worthless.
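A contrived but concrete example shows how a tail latency can vanish under sampling. The traffic stream and the head-based 1-in-1,000 sampler below are illustrative assumptions, and the percentile uses the simple nearest-rank method:

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of latencies."""
    ranked = sorted(values)
    k = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[k]

# 100,000 transactions: 2% spike to 1,500 ms, the rest take 100 ms.
full = [1500 if i % 50 == 7 else 100 for i in range(100_000)]

# A head-based 1-in-1,000 sampler keeps only every 1,000th transaction
# and, in this stream, happens to miss every spike.
sampled = full[::1000]

print(percentile(full, 99))     # 1500 -- the spike is visible in the full data
print(percentile(sampled, 99))  # 100  -- the spike has vanished from the sample
```

The spikes are real and customer-visible in the full data, yet the sampled view reports a perfectly healthy P99.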
While distributed tracing gives you deep insight into the behavior of your applications, there is tremendous value in collecting metrics as well and correlating them with the data collected by distributed tracing. Ultimately, metrics generated by your runtimes, such as CPU usage, I/O wait, and garbage collection, play a huge role in the performance of your applications and can help you understand why performance begins to degrade in a distributed system.
With complex systems, patchwork monitoring and alerting simply isn’t enough to manage and understand the tremendous number of interactions that occur in even a single user request. This can be addressed by applying machine learning to the data collected from every request, including metrics such as throughput, latency, and error rate. Instana has implemented AI-based anomaly detection through the aggregation and analysis of metrics at the service layer.
It’s important to measure the impact of a slow request on the user and to understand why the transaction was slow in the first place. With both metric and tracing data collected by Instana, you can create custom dashboards that visualize the performance of your application endpoints alongside the underlying metrics from the backend components. You can also aggregate your error and warning logs to easily visualize how often these events occur and to investigate the transactions in which they occurred.
In summary, ask yourself whether your monitoring solution can do the following. Ideally, you will have all of these capabilities in place for your production systems before day-one operations.
- Can you easily define an application based on a group of services, endpoints, and their backends and measure the performance as an aggregate?
- Can you quickly find what the throughput (QPM), error rate, and P99/P50/mean latencies are for every one of your services and their endpoints?
- Can you easily jump from viewing service and endpoint aggregates to the underlying infrastructure metrics of those services and endpoints?
- Can you quickly view error and warning logs for a given application service or endpoint and analyze the trace details or infrastructure metrics for those log artifacts?
- Can you depend on a library of over 200 pre-defined health checks for popular languages, frameworks, databases, and message queues?
- Does your solution leverage machine learning for anomaly detection on data collected from metrics and traces?
- Does your solution automatically generate interactive dashboards that surface important outliers and allow you to drill down, slice and dice, and quickly understand the context of the information being delivered?
Having the right tooling in place to support scalable microservice architectures is a requirement with modern application stacks. Without proper monitoring, it’s impossible to quickly diagnose and understand problems in complex distributed systems. Instana is the only solution capable of answering the problems modern software operators confront when monitoring their systems. Sign up for a free trial today at https://staging.instana.com/trial