Failed to execute command group with error API queried with a bad parameter: {"message":"unknown or invalid runtime name: nvidia"} for my azureml compute instance type Standard_NC16as_T4_v3

Damarla, Lokesh 0 Reputation points
2024-11-05T10:29:13.46+00:00

I am creating a custom Azure ML pipeline using the Python SDK. This is my code:

from azureml.core import Environment, Workspace, ComputeTarget, Experiment
from azureml.pipeline.core import Pipeline, StepSequence
from azureml.pipeline.steps import PythonScriptStep
from azureml.core.runconfig import RunConfiguration
# Define the custom environment backed by the image in ACR
env = Environment("stage-env")
env.docker.base_image = "<acr_name>.azurecr.io/sw-predc-stage-env:0.2"
env.python.user_managed_dependencies = True
env.docker.base_image_registry.address = "<acr_name>.azurecr.io"
env.docker.base_image_registry.username = "<registry_name>"
env.docker.base_image_registry.password = "<Password>"
# Set up the workspace
ws = Workspace.from_config()
# Define the compute target
compute_target = ComputeTarget(workspace=ws, name="gpu-compute1")
# Set up RunConfiguration with the new environment
run_config = RunConfiguration()
run_config.environment = env
run_config.target = compute_target
# Define the PythonScriptStep for Cayuga Prophet model training
cayuga_prophet_step = PythonScriptStep(
    name="Cayuga Prophet Model Training",
    script_name="Cayuga_prophet_Model.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/mnt/batch/tasks/shared/LS_root/mounts/clusters/gpu-compute1/",
    allow_reuse=True
)
# Define the PythonScriptStep for Cayuga RandomForest model training
cayuga_rforest_step = PythonScriptStep(
    name="Cayuga RForest Model Training",
    script_name="Cayuga_randomforest_Model.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/mnt/batch/tasks/shared/LS_root/mounts/clusters/gpu-compute1/",
    allow_reuse=True
)
# Define the PythonScriptStep for Cayuga DeepAR model training
cayuga_deepar_step = PythonScriptStep(
    name="Cayuga DeepAr Model Training",
    script_name="Cayuga_DeepAR_Model.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/mnt/batch/tasks/shared/LS_root/mounts/clusters/gpu-compute1/",
    allow_reuse=True
)
# Define the PythonScriptStep for the Cayuga by-hospital step
cayuga_by_hospital_step = PythonScriptStep(
    name="Cayuga By Hospital",
    script_name="Cayuga_By_Hospital.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/mnt/batch/tasks/shared/LS_root/mounts/clusters/gpu-compute1/",
    allow_reuse=True
)
# Define the PythonScriptStep for model selection and moving inference files
cayuga_model_selection_step = PythonScriptStep(
    name="Model Selection and Moving Inference Files",
    script_name="model_selection_accuracy_comparision.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/mnt/batch/tasks/shared/LS_root/mounts/clusters/gpu-compute1/",
    allow_reuse=True
)
# Make the steps run sequentially
step_sequence = StepSequence(steps=[cayuga_prophet_step, cayuga_rforest_step, cayuga_deepar_step, cayuga_by_hospital_step, cayuga_model_selection_step])
# Define the pipeline
pipeline = Pipeline(workspace=ws, steps=step_sequence)
# Submit the pipeline
experiment = Experiment(workspace=ws, name="Cayuga-Models-Training-and-Selection-Pipeline")
pipeline_run = experiment.submit(pipeline, tags={"pipeline_name": "Cayuga Unified Model Training Pipeline"})
pipeline_run.wait_for_completion(show_output=True)

My Dockerfile looks like this:

# Azure ML GPU base image (OpenMPI 4.1.0, CUDA 11.1, cuDNN 8, Ubuntu 20.04) with pre-installed Miniconda
FROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04
# Set up working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*
# Add conda to PATH (optional if not already in PATH)
ENV PATH /opt/miniconda/bin:$PATH
# Initialize conda
RUN conda init bash
# Copy the environment.yml file into the container
COPY pdc_dev_env.yml /app/environment.yml
# Create the conda environment based on environment.yml
RUN conda env create -f /app/environment.yml
# Set the default shell to bash to allow 'conda activate' to work
SHELL ["/bin/bash", "--login", "-c"]
# Activate the environment by default
RUN echo "conda activate pdc_dev_env" >> ~/.bashrc
# Ensure the conda environment stays activated
ENV PATH /opt/miniconda/envs/pdc_dev_env/bin:$PATH
# Copy the project files into the working directory
COPY . /app
# Default entry point: run commands inside the conda environment
ENTRYPOINT ["conda", "run", "-n", "pdc_dev_env"]

and my YAML (conda environment) file:

name: pdc_dev_env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pip
  - scikit-learn
  - scipy
  - pandas
  - pip:
    - azureml-core
    - plotly
    - kaleido
    - azure-ai-ml
    - azureml
    - inference-schema[numpy-support]==1.3.0
    - mlflow==2.8.0
    - mlflow-skinny==2.8.0
    - azureml-mlflow==1.51.0
    - psutil>=5.8,<5.9
    - tqdm>=4.60
    - ipykernel~=6.0
    - matplotlib
    - prophet
    - azure-storage-blob
    - darts==0.30.0
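
A quick sanity check for the GPU image built from this Dockerfile is to run nvidia-smi inside it from the compute instance terminal. This is a minimal sketch, assuming the image tag used in the pipeline code above and a Docker CLI new enough to support the --gpus flag (which itself depends on the NVIDIA Container Toolkit being set up on the host):

# Log in to the registry first, e.g. az acr login --name <acr_name>
# Pull the GPU image and run nvidia-smi inside it, bypassing the conda entrypoint
docker pull <acr_name>.azurecr.io/sw-predc-stage-env:0.2
docker run --rm --gpus all --entrypoint nvidia-smi <acr_name>.azurecr.io/sw-predc-stage-env:0.2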

So I tried a CPU compute instance with the Dockerfile below:

# Azure ML CPU base image (OpenMPI 4.1.0, Ubuntu 20.04) with pre-installed Miniconda
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
# Set up working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*
# Add conda to PATH (optional if not already in PATH)
ENV PATH /opt/miniconda/bin:$PATH
# Initialize conda
RUN conda init bash
# Copy the environment.yml file into the container
COPY pdc_dev_env.yml /app/environment.yml
# Create the conda environment based on environment.yml
RUN conda env create -f /app/environment.yml
# Set the default shell to bash to allow 'conda activate' to work
SHELL ["/bin/bash", "--login", "-c"]
# Activate the environment by default
RUN echo "conda activate pdc_dev_env" >> ~/.bashrc
# Ensure the conda environment stays activated
ENV PATH /opt/miniconda/envs/pdc_dev_env/bin:$PATH
# Copy the project files into the working directory
COPY . /app
# Default entry point: run commands inside the conda environment
ENTRYPOINT ["conda", "run", "-n", "pdc_dev_env"]

With the same YAML file and the same pipeline code, the pipeline ran successfully on the CPU compute instance. But when I used the GPU compute instance with the GPU-based Dockerfile and image, the pipeline failed with the error below:

Failed to execute command group with error API queried with a bad parameter: {"message":"unknown or invalid runtime name: nvidia"}
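
The error message suggests that the Docker daemon on the GPU compute instance has no container runtime registered under the name nvidia. A minimal diagnostic sketch, run from the compute instance terminal, to see which runtimes Docker knows about and how the daemon is configured:

# List the container runtimes registered with the Docker daemon; 'nvidia' should appear here
docker info --format '{{json .Runtimes}}'
# If present, the nvidia runtime is usually declared in the daemon configuration
cat /etc/docker/daemon.json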


2 answers

  1. Damarla, Lokesh 0 Reputation points
    2024-11-13T09:00:12.2666667+00:00
    1. Created a GPU-based Docker image.
    2. Checked whether the NVIDIA driver is available on the GPU compute instance by running nvidia-smi.
    3. Got the output below:
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
    | N/A   32C    P8               9W /  70W |      2MiB / 16384MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
    
    
    4. nvidia-smi works, which suggests the GPU driver is properly installed.
    5. The NVIDIA Container Toolkit, however, was not installed, so Docker had no nvidia runtime to launch containers with.
    6. Installed it with sudo apt-get install -y nvidia-container-toolkit (see the terminal sketch after this list).
    7. Restarted Docker.
    8. The pipeline now runs successfully.
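
    Putting the fix together as a terminal sketch (the nvidia-ctk step is an assumption taken from the standard NVIDIA Container Toolkit setup; it is only needed if the nvidia runtime still does not show up after installing the package and restarting Docker):

    # Install the NVIDIA Container Toolkit on the compute instance
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    # Register the nvidia runtime with the Docker daemon (assumption: standard toolkit configuration step)
    sudo nvidia-ctk runtime configure --runtime=docker
    # Restart Docker so it picks up the new runtime
    sudo systemctl restart docker
    # Verify the nvidia runtime is now registered
    docker info --format '{{json .Runtimes}}'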

  2. santoshkc 9,715 Reputation points Microsoft Vendor
    2024-11-13T09:35:58.26+00:00

    Hi @Damarla, Lokesh,

    I'm glad to hear that your issue has been resolved, and thank you for sharing the information; it may benefit other community members reading this thread. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your response as an answer in case you'd like to accept it. This will help other users with a similar query find the solution more easily.

    Query: Failed to execute command group with error API queried with a bad parameter: {"message":"unknown or invalid runtime name: nvidia"} for my azureml compute instance type Standard_NC16as_T4_v3

    Solution:

    1. Created a GPU-based Docker image.
    2. Checked whether the NVIDIA driver is available on the GPU compute instance by running nvidia-smi.
    3. Got the output below:
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
    | N/A   32C    P8               9W /  70W |      2MiB / 16384MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
    
    
    4. nvidia-smi works, which suggests the GPU driver is properly installed.
    5. The NVIDIA Container Toolkit, however, was not installed, so Docker had no nvidia runtime to launch containers with.
    6. Installed it with sudo apt-get install -y nvidia-container-toolkit.
    7. Restarted Docker.
    8. The pipeline now runs successfully.

    If you have any further questions or concerns, please don't hesitate to ask. We're always here to help.


    Please click Accept Answer and Yes if this answer was helpful.

