Thursday, May 23, 2024

Open source observability for AWS Inferentia nodes within Amazon EKS clusters

Recent developments in machine learning (ML) have led to increasingly large models, some of which require hundreds of billions of parameters. Although they're more powerful, training and inference on those models require significant computational resources. Despite the availability of advanced distributed training libraries, it's common for training and inference jobs to need hundreds of accelerators (GPUs or purpose-built ML chips such as AWS Trainium and AWS Inferentia), and therefore tens or hundreds of instances.

In such distributed environments, observability of both instances and ML chips becomes key to model performance fine-tuning and cost optimization. Metrics allow teams to understand workload behavior and optimize resource allocation and utilization, diagnose anomalies, and increase overall infrastructure efficiency. For data scientists, ML chip utilization and saturation are also relevant for capacity planning.

This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips, used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, with data plane nodes based on Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.

The pattern is part of the AWS CDK Observability Accelerator, a set of opinionated modules to help you set up observability for Amazon EKS clusters. The AWS CDK Observability Accelerator is organized around patterns, which are reusable units for deploying multiple resources. The open source observability set of patterns instruments observability with Amazon Managed Grafana dashboards, an AWS Distro for OpenTelemetry collector to collect metrics, and Amazon Managed Service for Prometheus to store them.

Solution overview

The following diagram illustrates the solution architecture.

This solution deploys an Amazon EKS cluster with a node group that includes Inf1 instances.

The AMI type of the node group is AL2_x86_64_GPU, which uses the Amazon EKS optimized accelerated Amazon Linux AMI. In addition to the standard Amazon EKS-optimized AMI configuration, the accelerated AMI includes the NeuronX runtime.

To access the ML chips from Kubernetes, the pattern deploys the AWS Neuron device plugin.

Metrics are exposed to Amazon Managed Service for Prometheus by the neuron-monitor DaemonSet, which deploys a minimal container with the Neuron tools installed. Specifically, the neuron-monitor DaemonSet runs the neuron-monitor command piped into the neuron-monitor-prometheus.py companion script (both commands are part of the container):

neuron-monitor | neuron-monitor-prometheus.py --port <port>

The command uses the following components:

  • neuron-monitor collects metrics and stats from the Neuron applications running on the system and streams the collected data to stdout in JSON format
  • neuron-monitor-prometheus.py maps and exposes the telemetry data from JSON format into Prometheus-compatible format
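To make that mapping concrete, here is a minimal Python sketch of the JSON-to-Prometheus translation; the field names in the sample record are illustrative, not the exact neuron-monitor schema:

```python
import json

def to_prometheus(record: str) -> str:
    """Map an illustrative neuron-monitor-style JSON record into
    Prometheus text exposition format (one line per metric)."""
    data = json.loads(record)
    labels = ",".join(f'{k}="{v}"' for k, v in data.get("labels", {}).items())
    lines = []
    for name, value in data.get("metrics", {}).items():
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines)

sample = '{"labels": {"instance_type": "inf1.2xlarge"}, "metrics": {"neuroncore_utilization": 0.42}}'
print(to_prometheus(sample))
# neuroncore_utilization{instance_type="inf1.2xlarge"} 0.42
```

The real script does considerably more (aggregation, multiple metric groups), but the core idea is the same: JSON records in, `name{labels} value` lines out.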

Data is visualized in Amazon Managed Grafana by the corresponding dashboard.

The rest of the setup to collect and visualize metrics with Amazon Managed Service for Prometheus and Amazon Managed Grafana is similar to that used in other open source based patterns, which are included in the AWS Observability Accelerator for CDK GitHub repository.


Prerequisites

You need the following to complete the steps in this post:

Set up the environment

Complete the following steps to set up your environment:

  1. Open a terminal window and run the following commands:
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export AWS_REGION=<YOUR AWS REGION>

  2. Retrieve the workspace IDs of any existing Amazon Managed Grafana workspace:
aws grafana list-workspaces

The following is our sample output:

  "workspaces": [
      "authentication": {
        "providers": [
      "created": "2023-06-07T12:23:56.625000-04:00",
      "description": "accelerator-workspace",
      "endpoint": "",
      "grafanaVersion": "9.4",
      "id": "g-XYZ",
      "modified": "2023-06-07T12:30:09.892000-04:00",
      "title": "accelerator-workspace",
      "notificationDestinations": [
      "standing": "ACTIVE",
      "tags": {}

  3. Assign the values of id and endpoint to the following environment variables:
export COA_AMG_WORKSPACE_ID="<<YOUR-WORKSPACE-ID, like the above g-XYZ, without quotation marks>>"
export COA_AMG_ENDPOINT_URL="<<https://YOUR-WORKSPACE-URL, including protocol (i.e. https://), without quotation marks, like the above>>"

COA_AMG_ENDPOINT_URL needs to include https://.
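If you prefer to script these assignments, the following sketch shows the extraction logic against the output shape shown above (the endpoint value here is illustrative):

```python
import json

# Illustrative output from `aws grafana list-workspaces`
raw = '''
{
  "workspaces": [
    {
      "id": "g-XYZ",
      "endpoint": "g-XYZ.grafana-workspace.us-east-1.amazonaws.com",
      "status": "ACTIVE",
      "name": "accelerator-workspace"
    }
  ]
}
'''

ws = json.loads(raw)["workspaces"][0]
workspace_id = ws["id"]
# The endpoint field omits the protocol, but COA_AMG_ENDPOINT_URL must include https://
endpoint_url = f"https://{ws['endpoint']}"
print(workspace_id, endpoint_url)
```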

  4. Create a Grafana API key from the Amazon Managed Grafana workspace:
export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id $COA_AMG_WORKSPACE_ID \
  --query key \
  --output text)

  5. Set up a secret in AWS Systems Manager:
aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" \
  --type "SecureString" \
  --value $AMG_API_KEY \
  --region $AWS_REGION

The secret will be accessed by the External Secrets add-on and made available as a native Kubernetes secret in the EKS cluster.

Bootstrap the AWS CDK environment

The first step of any AWS CDK deployment is bootstrapping the environment. You use the cdk bootstrap command in the AWS CDK CLI to prepare the environment (a combination of AWS account and AWS Region) with resources required by AWS CDK to perform deployments into that environment. AWS CDK bootstrapping is needed for each account and Region combination, so if you already bootstrapped AWS CDK in a Region, you don't need to repeat the bootstrapping process.

cdk bootstrap aws://$ACCOUNT_ID/$AWS_REGION

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the cdk-aws-observability-accelerator repository and install the dependency packages. This repository contains AWS CDK v2 code written in TypeScript.
git clone
cd cdk-aws-observability-accelerator

The actual settings for the Grafana dashboard JSON files are expected to be specified in the AWS CDK context. You need to update the context in the cdk.json file, located in the current directory. The location of the dashboard is specified by the fluxRepository.values.GRAFANA_NEURON_DASH_URL parameter, and neuronNodeGroup is used to set the instance type, number, and Amazon Elastic Block Store (Amazon EBS) size used for the nodes.

  2. Enter the following snippet into cdk.json, replacing context:
"context": {
    "fluxRepository": {
      "title": "grafana-dashboards",
      "namespace": "grafana-operator",
      "repository": {
        "repoUrl": "",
        "title": "grafana-dashboards",
        "targetRevision": "important",
        "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
      "values": {
        "GRAFANA_NODES_DASH_URL" : "",
      "kustomizations": [
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/neuron"
     "neuronNodeGroup": {
      "instanceClass": "inf1",
      "instanceSize": "2xlarge",
      "desiredSize": 1, 
      "minSize": 1, 
      "maxSize": 3,
      "ebsSize": 512

You can replace the Inf1 instance type with Inf2 and change the size as needed. To check availability in your chosen Region, run the following command (amend Values as you see fit):

aws ec2 describe-instance-type-offerings \
  --filters Name=instance-type,Values="inf1*" \
  --query "InstanceTypeOfferings[].InstanceType" \
  --region $AWS_REGION
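The query returns a JSON list of instance types; the following sketch shows how you might narrow it down to the instanceClass/instanceSize pairs used in cdk.json (the offerings list below is illustrative):

```python
import json

# Illustrative shape of the --query "InstanceTypeOfferings[].InstanceType" output
offerings = json.loads('["inf1.xlarge", "inf1.2xlarge", "inf1.6xlarge", "inf1.24xlarge"]')

# Split each instance type into the (instanceClass, instanceSize) pair used in cdk.json
pairs = [tuple(t.split(".", 1)) for t in offerings]
sizes = [size for cls, size in pairs if cls == "inf1"]
print(sizes)  # ['xlarge', '2xlarge', '6xlarge', '24xlarge']
```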

  3. Install the project dependencies:
npm install

  4. Run the following commands to deploy the open source observability pattern:
make build
make pattern single-new-eks-inferentia-opensource-observability deploy

Validate the solution

Complete the following steps to validate the solution:

  1. Run the update-kubeconfig command. You should be able to get the command from the output message of the previous command:
aws eks update-kubeconfig --name single-new-eks-inferentia-opensource... --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks-....

  2. Verify the resources you created:

The following screenshot shows our sample output.

  3. Make sure the neuron-device-plugin-daemonset DaemonSet is running:
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system

The following is our expected output:

neuron-device-plugin-daemonset   1         1         1       1            1           <none>          2h

  4. Confirm that the neuron-monitor DaemonSet is running:
kubectl get ds neuron-monitor --namespace kube-system

The following is our expected output:

neuron-monitor   1         1         1       1            1           <none>          2h
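Both DaemonSet checks amount to verifying that every scheduled pod is ready. A small sketch of that check against `kubectl get ds ... -o json`-style status (desiredNumberScheduled and numberReady are standard DaemonSetStatus fields):

```python
import json

def daemonset_ready(status_json: str) -> bool:
    """Return True when at least one pod is scheduled and all
    scheduled DaemonSet pods report ready."""
    status = json.loads(status_json)["status"]
    return (status["desiredNumberScheduled"] > 0
            and status["desiredNumberScheduled"] == status["numberReady"])

# Trimmed-down shape of `kubectl get ds neuron-monitor -n kube-system -o json`
sample = '{"status": {"desiredNumberScheduled": 1, "currentNumberScheduled": 1, "numberReady": 1}}'
print(daemonset_ready(sample))  # True
```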

  5. To verify that the Neuron devices and cores are visible, run the neuron-ls and neuron-top commands from, for example, your neuron-monitor pod (you can get the pod's name from the output of kubectl get pods -A):
kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-ls"

The following screenshot shows our expected output.

kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-top"

The following screenshot shows our expected output.

Visualize data using the Grafana Neuron dashboard

Log in to your Amazon Managed Grafana workspace and navigate to the Dashboards panel. You should see a dashboard named Neuron / Monitor.

To see some interesting metrics on the Grafana dashboard, we apply the following manifest:

curl | kubectl apply -f -

This is a sample workload that compiles the torchvision ResNet50 model and runs repetitive inference in a loop to generate telemetry data.

To verify the pod was successfully deployed, run the following code:

You should see a pod named pytorch-inference-resnet50.

After a few minutes, looking at the Neuron / Monitor dashboard, you should see the gathered metrics, similar to the following screenshots.

Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically.

Clean up

You can delete the whole AWS CDK stack with the following command:

make pattern single-new-eks-inferentia-opensource-observability destroy


Conclusion

In this post, we showed you how to introduce observability, with open source tooling, into an EKS cluster featuring a data plane running EC2 Inf1 instances. We started by selecting the Amazon EKS-optimized accelerated AMI for the data plane nodes, which includes the Neuron container runtime, providing access to AWS Inferentia and Trainium Neuron devices. Then, to expose the Neuron cores and devices to Kubernetes, we deployed the Neuron device plugin. The actual collection and mapping of telemetry data into Prometheus-compatible format was achieved via neuron-monitor and neuron-monitor-prometheus.py. Metrics were sourced from Amazon Managed Service for Prometheus and displayed on the Neuron dashboard of Amazon Managed Grafana.

We recommend that you explore additional observability patterns in the AWS Observability Accelerator for CDK GitHub repo. To learn more about Neuron, refer to the AWS Neuron Documentation.

About the Author

Riccardo Freschi is a Sr. Solutions Architect at AWS, focusing on application modernization. He works closely with partners and customers to help them transform their IT landscapes in their journey to the AWS Cloud by refactoring existing applications and building new ones.


