Skip to main content
Version: v2.0-v2.4

Cluster Metrics

Available as of v2.2.0

Cluster metrics display the hardware utilization for all nodes in your cluster, regardless of its role. They give you a global monitoring insight into the cluster.

Some of the biggest metrics to look out for:

  • CPU Utilization

    High load either indicates that your cluster is running efficiently or that you're running out of CPU resources.

  • Disk Utilization

    Be on the lookout for increased read and write rates on nodes nearing their disk capacity. This advice is especially true for etcd nodes, as running out of storage on an etcd node leads to cluster failure.

  • Memory Utilization

    Deltas in memory utilization usually indicate a memory leak.

  • Load Average

    Generally, you want your load average to match your number of logical CPUs for the cluster. For example, if your cluster has 8 logical CPUs, the ideal load average would be 8 as well. If you load average is well under the number of logical CPUs for the cluster, you may want to reduce cluster resources. On the other hand, if your average is over 8, your cluster may need more resources.

Finding Node Metrics

  1. From the Global view, navigate to the cluster that you want to view metrics.

  2. Select Nodes in the navigation bar.

  3. Select a specific node and click on its name.

  4. Click on Node Metrics.

Get expressions for Cluster Metrics

Etcd Metrics

Note: Only supported for Rancher launched Kubernetes clusters.

Etcd metrics display the operations of the etcd database on each of your cluster nodes. After establishing a baseline of normal etcd operational metrics, observe them for abnormal deltas between metric refreshes, which indicate potential issues with etcd. Always address etcd issues immediately!

You should also pay attention to the text at the top of the etcd metrics, which displays leader election statistics. This text indicates if etcd currently has a leader, which is the etcd instance that coordinates the other etcd instances in your cluster. A large increase in leader changes implies etcd is unstable. If you notice a change in leader election statistics, you should investigate them for issues.

Some of the biggest metrics to look out for:

  • Etcd has a leader

    etcd is usually deployed on multiple nodes and elects a leader to coordinate its operations. If etcd does not have a leader, its operations are not being coordinated.

  • Number of leader changes

    If this statistic suddenly grows, it usually indicates network communication issues that constantly force the cluster to elect a new leader.

Get expressions for Etcd Metrics

Kubernetes Components Metrics

Kubernetes components metrics display data about the cluster's individual Kubernetes components. Primarily, it displays information about connections and latency for each component: the API server, controller manager, scheduler, and ingress controller.

Note: The metrics for the controller manager, scheduler and ingress controller are only supported for Rancher launched Kubernetes clusters.

When analyzing Kubernetes component metrics, don't be concerned about any single standalone metric in the charts and graphs that display. Rather, you should establish a baseline for metrics considered normal following a period of observation, e.g. the range of values that your components usually operate within and are considered normal. After you establish this baseline, be on the lookout for large deltas in the charts and graphs, as these big changes usually indicate a problem that you need to investigate.

Some of the more important component metrics to monitor are:

  • API Server Request Latency

    Increasing API response times indicate there's a generalized problem that requires investigation.

  • API Server Request Rate

    Rising API request rates usually coincide with increased API response times. Increased request rates also indicate a generalized problem requiring investigation.

  • Scheduler Preemption Attempts

    If you see a spike in scheduler preemptions, it's an indication that you're running out of hardware resources, as Kubernetes is recognizing it doesn't have enough resources to run all your pods and is prioritizing the more important ones.

  • Scheduling Failed Pods

    Failed pods can have a variety of causes, such as unbound persistent volume claims, exhausted hardware resources, non-responsive nodes, etc.

  • Ingress Controller Request Process Time

    How fast ingress is routing connections to your cluster services.

Get expressions for Kubernetes Component Metrics

Rancher Logging Metrics

Although the Dashboard for a cluster primarily displays data sourced from Prometheus, it also displays information for cluster logging, provided that you have configured Rancher to use a logging service.

Get expressions for Rancher Logging Metrics

Finding Workload Metrics

Workload metrics display the hardware utilization for a Kubernetes workload. You can also view metrics for deployments, stateful sets and so on.

  1. From the Global view, navigate to the project that you want to view workload metrics.

  2. From the main navigation bar, choose Resources > Workloads. In versions before v2.3.0, choose Workloads on the main navigation bar.

  3. Select a specific workload and click on its name.

  4. In the Pods section, select a specific pod and click on its name.

    • View the Pod Metrics: Click on Pod Metrics.
    • View the Container Metrics: In the Containers section, select a specific container and click on its name. Click on Container Metrics.

Get expressions for Workload Metrics