Life of an SRE at Instana – Handling AWS EC2 Instance Retirements


This is the second post in a series on the Life of an SRE at Instana. Check out the first post here.

Handling AWS EC2 Instance Retirements

Most people running a production system on AWS have at some point received an email from AWS informing them of an “Instance Retirement”. These emails typically have the following format:

"AWS Amazon EC2 Instance Retirement [AWS Account ID: XXXXXXXXXXXX]"

Hello,

EC2 has detected degradation of the underlying hardware hosting your 
Amazon EC2 instance (instance-ID: i-08c21960e74b7a276) associated 
with your AWS account (AWS Account ID: XXXXXXXXXXXX) in the eu-west-1 region. 
Due to this degradation your instance could already be unreachable. 
We will stop your instance after 2020-03-06.

...
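The fields we care about in such an email (instance ID, region, and stop date) can also be pulled out programmatically. The following is a minimal sketch, not part of our actual tooling: the names `RetirementNotice`, `parse_retirement_email`, and `days_until_stop` are hypothetical, and the patterns assume the exact wording of the sample email above, which AWS may change.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class RetirementNotice(NamedTuple):
    """Hypothetical container for the fields of a retirement email."""
    instance_id: str
    region: str
    stop_after: date


# Assumption: these patterns mirror the wording of the sample email only.
INSTANCE_RE = re.compile(r"instance-ID:\s*(i-[0-9a-f]+)")
REGION_RE = re.compile(r"in the ([a-z]{2}(?:-[a-z]+)+-\d) region")
DATE_RE = re.compile(r"stop your instance after (\d{4}-\d{2}-\d{2})")

SAMPLE_BODY = """
EC2 has detected degradation of the underlying hardware hosting your
Amazon EC2 instance (instance-ID: i-08c21960e74b7a276) associated
with your AWS account (AWS Account ID: XXXXXXXXXXXX) in the eu-west-1 region.
Due to this degradation your instance could already be unreachable.
We will stop your instance after 2020-03-06.
"""


def parse_retirement_email(body: str) -> Optional[RetirementNotice]:
    """Return the parsed notice, or None if any field is missing."""
    # Collapse line breaks so phrases wrapped across lines still match.
    text = " ".join(body.split())
    instance = INSTANCE_RE.search(text)
    region = REGION_RE.search(text)
    stop = DATE_RE.search(text)
    if not (instance and region and stop):
        return None
    return RetirementNotice(
        instance_id=instance.group(1),
        region=region.group(1),
        stop_after=date.fromisoformat(stop.group(1)),
    )


def days_until_stop(notice: RetirementNotice, today: date) -> int:
    """Number of days left before AWS stops the instance."""
    return (notice.stop_after - today).days
```

Calling `parse_retirement_email(SAMPLE_BODY)` yields the instance ID, region, and date from the email above, and `days_until_stop` tells us how much time is left to schedule a manual reboot.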

At Instana we run about 1,000 EC2 instances, which results in 1-2 retirement emails per week across multiple AWS regions. There can be many reasons for such incidents:

  • cloud hardware failures
  • network component failures
  • software upgrades

As SREs in charge of a SaaS platform, we typically have a few questions that we want answered right away when we receive one of these emails:

  1. Show impacted host (it is hard to keep AWS instance IDs such as i-08c21960e74b7a276 in your head).
  2. Was the host already rebooted?
    1. Depending on the answer, we either check various metrics to see the impact or schedule a manual reboot before the deadline specified in the email.
  3. What impact does this EC2 instance retirement have on my system?

Instana can help answer all of these questions, and much more. Here is the typical flow we follow once we receive an instance retirement email from AWS.

Step 1: Show impacted host

We copy the instance ID into the search bar on the infrastructure map. Within a few seconds the EC2 instance appears on the map, and we can drill down to all relevant information.
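For readers who want to script the same lookup outside the Instana UI, the sketch below maps an instance ID to a human-readable name via the EC2 `describe_instances` call. It is an illustration under assumptions, not our workflow: the function name `describe_host` is hypothetical, the host is assumed to carry a `Name` tag, and the client is passed in so that any object with a compatible `describe_instances` method works.

```python
from typing import Any, Dict, Optional


def describe_host(ec2_client: Any, instance_id: str) -> Optional[Dict[str, str]]:
    """Return a few identifying fields for an instance, or None if absent.

    `ec2_client` is expected to behave like boto3.client("ec2").
    Note: the real API raises an error for unknown instance IDs, so the
    None branch mainly covers an empty result set.
    """
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Tags arrive as a list of {"Key": ..., "Value": ...} dicts.
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            placement = instance.get("Placement", {})
            return {
                "instance_id": instance["InstanceId"],
                # Assumption: hosts are named through a "Name" tag.
                "name": tags.get("Name", "(no Name tag)"),
                "private_ip": instance.get("PrivateIpAddress", "unknown"),
                "availability_zone": placement.get("AvailabilityZone", "unknown"),
            }
    return None
```

With boto3 installed and credentials configured, a call might look like `describe_host(boto3.client("ec2", region_name="eu-west-1"), "i-08c21960e74b7a276")`, using the region named in the email.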

Infrastructure Map

Step 2: Was the host already rebooted?

Looking at the host dashboard, we can easily see whether the instance was already rebooted (uptime) and what the current CPU load is. Once an instance gets rebooted, the Instana Agent starts monitoring the host again and detects all running processes. This allows us to quickly see whether all processes are up and running.
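The uptime check itself is simple arithmetic: if the host booted after the retirement email arrived, it has already been restarted. The sketch below shows that comparison; the function names `rebooted_since` and `read_uptime_seconds` are hypothetical, and reading `/proc/uptime` assumes a Linux host.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the host uptime in seconds (Linux only)."""
    with open(path) as handle:
        # The first field of /proc/uptime is the uptime in seconds.
        return float(handle.read().split()[0])


def rebooted_since(
    notice_received: datetime,
    uptime_seconds: float,
    now: Optional[datetime] = None,
) -> bool:
    """True if the host booted at or after the time the email arrived."""
    if now is None:
        now = datetime.now(timezone.utc)
    boot_time = now - timedelta(seconds=uptime_seconds)
    return boot_time >= notice_received
```

For example, with an email received on 2020-03-01 and an uptime of one hour measured a day later, `rebooted_since` returns `True`, so we move on to checking the impact instead of scheduling a reboot.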

Host Dashboard

By selecting the timeframe you are interested in, you can go through the product and check other charts and metrics.

CPU usage

Step 3: What impact does this EC2 instance retirement have on my system?

In our case, the instance retirement was for a Cassandra node. As a first step we check whether the Cassandra node is running fine, whether latencies and compactions are in good shape, and whether the Cassandra cluster was negatively impacted.

Cassandra Node Dashboard

Summary

After a few checks in Instana, we are confident that the reboot has not caused any outages. Therefore, we can archive the email and continue with our daily work. Instana’s 1-second metrics resolution and auto-detection of components make it easy to get answers to the most pressing questions when hosts get rebooted without prior notice.

