Azure Kubernetes Service (AKS) ノードの Node Problem Detector (NPD)

[アーティクル]
08/02/2024

Node Problem Detector (NPD) は、ノード関連の問題を検出して報告するオープンソースの Kubernetes コンポーネントです。クラスター内の各ノードで systemd サービスとして実行され、CPU 使用率、ディスク使用量、ネットワーク接続などのさまざまなメトリックとシステム情報を収集します。問題を検出すると、"イベントまたはノードの状態 (あるいはその両方)" を生成します。 Azure Kubernetes Service (AKS) は、NPD を使用して、Azure クラウドプラットフォーム上で実行されている Kubernetes クラスター内のノードを監視および管理します。 AKS Linux 拡張機能では、既定で NPD が有効になります。

Note

NPD へのアップグレードは、ノードイメージと Kubernetes バージョンのアップグレードプロセスに依存しません。ノードプールが異常な場合 (つまり障害状態)、新しい NPD バージョンはインストールされません。

ノードの状態

ノードの状態では、ノードが使用できない原因となっている永続的な問題が示されます。 AKS では、NPD が生成した次のノードの状態を使用して、ノードの永続的な問題を明らかにします。 NPD は、対応する Kubernetes イベントも出力します。

問題のあるデーモンの種類	NodeCondition	理由
CustomPluginMonitor	FilesystemCorruptionProblem	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletProblem	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeProblem	ContainerRuntimeIsDown
CustomPluginMonitor	VMEventScheduled	VMEventScheduled
CustomPluginMonitor	FrequentUnregisterNetDevice	UnregisterNetDevice
CustomPluginMonitor	FrequentKubeletRestart	FrequentKubeletRestart
CustomPluginMonitor	FrequentContainerdRestart	FrequentContainerdRestart
CustomPluginMonitor	FrequentDockerRestart	FrequentDockerRestart
SystemLogMonitor	KernelDeadlock	DockerHung
SystemLogMonitor	ReadonlyFilesystem	FilesystemIsReadOnly

イベント

NPD は、基になる問題の診断に役立つ関連情報を含むイベントを生成します。

問題のあるデーモンの種類	理由
CustomPluginMonitor	EgressBlocked
CustomPluginMonitor	FilesystemCorruptionDetected
CustomPluginMonitor	KubeletIsDown
CustomPluginMonitor	ContainerRuntimeIsDown
CustomPluginMonitor	FreezeScheduled
CustomPluginMonitor	RebootScheduled
CustomPluginMonitor	RedeployScheduled
CustomPluginMonitor	TerminateScheduled
CustomPluginMonitor	PreemptScheduled
CustomPluginMonitor	DNSProblem
CustomPluginMonitor	PodIPProblem
SystemLogMonitor	OOMKilling
SystemLogMonitor	TaskHung
SystemLogMonitor	UnregisterNetDevice
SystemLogMonitor	KernelOops
SystemLogMonitor	DockerSocketCannotConnect
SystemLogMonitor	KubeletRPCDeadlineExceeded
SystemLogMonitor	KubeletRPCNoSuchContainer
SystemLogMonitor	CNICannotStatFS
SystemLogMonitor	PLEGUnhealthy
SystemLogMonitor	KubeletStart
SystemLogMonitor	DockerStart
SystemLogMonitor	ContainerdStart

場合によっては、ワークロードの中断を最小限に抑えるために、AKS によってノードが自動的に切断およびドレインされます。イベントとアクションの詳細については、「ノードの自動ドレイン」を参照してください。

ノードの状態とイベントを確認する

kubectl describe node コマンドを使用して、ノードの状態とイベントを確認します。

kubectl describe node my-aks-node

出力は、次の要約された出力例のようになります。

...
...

Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  VMEventScheduled              False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoVMEventScheduled              VM has no scheduled event
  FrequentContainerdRestart     False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentContainerdRestart     containerd is functioning properly
  FrequentDockerRestart         False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentDockerRestart         docker is functioning properly
  FilesystemCorruptionProblem   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsOK                  Filesystem is healthy
  FrequentUnregisterNetDevice   False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
  ContainerRuntimeProblem       False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:40 +0000   ContainerRuntimeIsUp            container runtime service is up
  KernelDeadlock                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KernelHasNoDeadlock             kernel has no deadlock
  FrequentKubeletRestart        False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
  KubeletProblem                False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   KubeletIsUp                     kubelet service is up
  ReadonlyFilesystem            False   Thu, 01 Jun 2023 19:14:25 +0000   Thu, 01 Jun 2023 03:57:41 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
  NetworkUnavailable            False   Thu, 01 Jun 2023 03:58:39 +0000   Thu, 01 Jun 2023 03:58:39 +0000   RouteCreated                    RouteController created a route
  MemoryPressure                True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 19:16:50 +0000   KubeletHasInsufficientMemory    kubelet has insufficient memory available
  DiskPressure                  False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:22 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Thu, 01 Jun 2023 19:16:50 +0000   Thu, 01 Jun 2023 03:57:23 +0000   KubeletReady                    kubelet is posting ready status. AppArmor enabled
...
...
...
Events:
  Type    Reason                   Age                  From     Message
  ----    ------                   ----                 ----     -------
  Normal  NodeHasSufficientMemory  94s (x176 over 15h)  kubelet  Node aks-agentpool-40622340-vmss000009 status is now: NodeHasSufficientMemory

これらのイベントは、KubeEvents を介して Container Insights でも使用できます。

メトリック

NPD は、ノードの問題に基づく Prometheus メトリックも公開します。これを監視とアラートに使用できます。これらのメトリックはノード IP のポート 20257 で公開され、Prometheus でそれらをスクレイピングできます。

次の YAML の例は、Azure Managed Prometheus アドオンでデーモンセットとして使用できるスクレイピング構成を示しています。

kind: ConfigMap
apiVersion: v1
metadata:
  name: ama-metrics-prometheus-config-node
  namespace: kube-system
data:
  prometheus-config: |-
    global:
      scrape_interval: 1m
    scrape_configs:
    - job_name: node-problem-detector
      scrape_interval: 1m
      scheme: http
      metrics_path: /metrics
      relabel_configs:
      - source_labels: [__metrics_path__]
        regex: (.*)
        target_label: metrics_path
      - source_labels: [__address__]
        replacement: '$NODE_NAME'
        target_label: instance
      static_configs:
      - targets: ['$NODE_IP:20257']

次の例は、スクレイピングされたメトリックを示しています。

problem_gauge{reason="UnregisterNetDevice",type="FrequentUnregisterNetDevice"} 0
problem_gauge{reason="VMEventScheduled",type="VMEventScheduled"} 0

次のステップ

NPD の詳細については、「kubernetes/node-problem-detector」を参照してください。

次の方法で共有