This baseline reference architecture provides workload-agnostic guidance and recommendations for configuring Azure Stack HCI 23H2 and later infrastructure to ensure a reliable platform that can deploy and manage highly available virtualized and containerized workloads. This architecture describes the resource components and cluster design choices for the physical nodes that provide local compute, storage, and networking features. It also describes how to use Azure services to simplify and streamline the day-to-day management of Azure Stack HCI.
For more information about workload architecture patterns that are optimized to run on Azure Stack HCI, see the Azure Stack HCI workloads navigation menu.
This architecture is a starting point for how to use the storage switched network design to deploy a multinode Azure Stack HCI cluster. The workload applications that you deploy on an Azure Stack HCI cluster should be well architected: deploy multiple instances of any critical workload services for high availability, and put appropriate business continuity and disaster recovery (BCDR) controls in place, such as regular backups and disaster recovery failover capabilities. To focus on the HCI infrastructure platform, these workload design aspects are intentionally excluded from this article.
For more information about guidelines and recommendations for the five pillars of the Azure Well-Architected Framework, see the Azure Stack HCI Well-Architected Framework service guide.
Article layout
Architecture | Design decisions | Well-Architected Framework approach
---|---|---
▪ Architecture ▪ Potential use cases ▪ Scenario details ▪ Platform resources ▪ Platform-supporting resources ▪ Deploy this scenario | ▪ Cluster design choices ▪ Physical disk drives ▪ Network design ▪ Monitoring ▪ Update management | ▪ Reliability ▪ Security ▪ Cost optimization ▪ Operational excellence ▪ Performance efficiency
Tip
The Azure Stack HCI 23H2 cluster reference implementation demonstrates how to use an Azure Resource Manager template (ARM template) and parameter file to deploy a switched multi-server deployment of Azure Stack HCI. Alternatively, the Bicep example demonstrates how to use a Bicep template to deploy an Azure Stack HCI cluster and its prerequisite resources.
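As a hedged illustration, the following sketch shows how an ARM template deployment of this kind might be run from PowerShell with the Azure CLI. The resource group name, location, and file names are placeholder assumptions rather than values from the reference implementation.

```powershell
# Sign in and select the target subscription (placeholders).
az login
az account set --subscription "<subscription-id>"

# Create a resource group for the cluster resources (assumed name and region).
az group create --name "rg-hci-cluster" --location "eastus"

# Deploy the ARM template with its parameter file. The file names are
# hypothetical; substitute the template and parameters from the reference
# implementation repository.
az deployment group create `
    --resource-group "rg-hci-cluster" `
    --template-file ".\azuredeploy.json" `
    --parameters ".\azuredeploy.parameters.json"
```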
Architecture
For more information, see Related resources.
Potential use cases
Typical use cases for Azure Stack HCI include running high availability (HA) workloads in on-premises or edge locations to address workload requirements. You can:
Provide a hybrid cloud solution that's deployed on-premises to address data sovereignty, regulation and compliance, or latency requirements.
Deploy and manage HA-virtualized or container-based edge workloads that are deployed in a single location or in multiple locations. This strategy enables business-critical applications and services to operate in a resilient, cost-effective, and scalable manner.
Lower the total cost of ownership (TCO) by using solutions that are certified by Microsoft, cloud-based deployment, centralized management, and monitoring and alerting.
Provide a centralized provisioning capability by using Azure and Azure Arc to deploy workloads across multiple locations consistently and securely. Tools like the Azure portal, Azure CLI, and infrastructure as code (IaC) templates drive automation and repeatability, whether you use Kubernetes for containerization or traditional workload virtualization.
Adhere to strict security, compliance, and audit requirements. Azure Stack HCI is deployed with a hardened security posture configured by default, or secure-by-default. Azure Stack HCI incorporates certified hardware, Secure Boot, Trusted Platform Module (TPM), virtualization-based security (VBS), Credential Guard, and enforced Windows Defender Application Control policies. It also integrates with modern cloud-based security and threat-management services like Microsoft Defender for Cloud and Microsoft Sentinel.
Scenario details
The following sections provide more information about the scenarios and potential use cases for this reference architecture. These sections include a list of business benefits and example workload resource types that you can deploy on Azure Stack HCI.
Use Azure Arc with Azure Stack HCI
Azure Stack HCI directly integrates with Azure by using Azure Arc to lower the TCO and operational overhead. Azure Stack HCI is deployed and managed through Azure, which provides built-in integration of Azure Arc through deployment of the Azure Arc resource bridge component. This component is installed during the HCI cluster deployment process. Azure Stack HCI cluster nodes are enrolled with Azure Arc for servers as a prerequisite to initiate the cloud-based deployment of the cluster. During deployment, mandatory extensions are installed on each cluster node, such as Lifecycle Manager, Microsoft Edge Device Management, and Telemetry and Diagnostics. You can use Azure Monitor and Log Analytics to monitor the HCI cluster after deployment by enabling Azure Stack HCI Insights. Feature updates for Azure Stack HCI are released periodically to enhance the customer experience. Updates are controlled and managed through Azure Update Manager.
You can deploy workload resources such as Azure Arc virtual machines (VMs), Azure Arc-enabled Azure Kubernetes Service (AKS), and Azure Virtual Desktop session hosts by using the Azure portal and selecting an Azure Stack HCI cluster custom location as the target for the workload deployment. These components provide centralized administration, management, and support. If you have active Software Assurance on your existing Windows Server Datacenter core licenses, you can reduce costs further by applying Azure Hybrid Benefit to Azure Stack HCI, Windows Server VMs, and AKS clusters. This optimization helps manage costs effectively for these services.
Azure and Azure Arc integration extend the capabilities of Azure Stack HCI virtualized and containerized workloads to include:
Azure Arc VMs for traditional applications or services that run in VMs on Azure Stack HCI.
AKS on Azure Stack HCI for containerized applications or services that benefit from using Kubernetes as their orchestration platform.
Azure Virtual Desktop to deploy your session hosts for Azure Virtual Desktop workloads on Azure Stack HCI (on-premises). You can use the control and management plane in Azure to initiate the host pool creation and configuration.
Azure Arc-enabled data services for containerized Azure SQL Managed Instance or an Azure Database for PostgreSQL server that runs on Azure Arc-enabled AKS hosted on Azure Stack HCI.
The Azure Arc-enabled Azure Event Grid extension for Kubernetes to deploy the Event Grid broker and Event Grid operator components. This deployment enables capabilities such as Event Grid topics and subscriptions for event processing.
Azure Arc-enabled machine learning with an AKS cluster that's deployed on Azure Stack HCI as the compute target to run Azure Machine Learning. You can use this approach to train or deploy machine learning models at the edge.
Azure Arc-connected workloads provide enhanced Azure consistency and automation for Azure Stack HCI deployments, like automating guest OS configuration with Azure Arc VM extensions or evaluating compliance with industry regulations or corporate standards through Azure Policy. You can activate Azure Policy through the Azure portal or IaC automation.
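As a hedged sketch of activating Azure Policy through automation, the following assigns a policy definition at a resource group scope that contains Azure Arc-enabled machines. The assignment name, scope, and policy identifier are hypothetical placeholders.

```powershell
# Assign a policy definition to a resource group that contains
# Azure Arc-enabled machines (all values are placeholders).
az policy assignment create `
    --name "hci-guest-os-baseline-audit" `
    --scope "/subscriptions/<subscription-id>/resourceGroups/rg-hci-workloads" `
    --policy "<policy-definition-name-or-id>"

# Summarize compliance state for the resource group after evaluation runs.
az policy state summarize --resource-group "rg-hci-workloads"
```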
Take advantage of the Azure Stack HCI default security configuration
The Azure Stack HCI default security configuration provides a defense-in-depth strategy to simplify security and compliance costs. The deployment and management of IT services for retail, manufacturing, and remote office scenarios presents unique security and compliance challenges. Securing workloads against internal and external threats is crucial in environments that have limited IT support or a lack of dedicated datacenters. Azure Stack HCI has default security hardening and deep integration with Azure services to help you address these challenges.
Azure Stack HCI-certified hardware ensures built-in Secure Boot, Unified Extensible Firmware Interface (UEFI), and TPM support. Use these technologies in combination with VBS to help protect your security-sensitive workloads. You can use BitLocker Drive Encryption to encrypt boot disk volumes and Storage Spaces Direct volumes at rest. Server Message Block (SMB) encryption provides automatic encryption of traffic between servers in the cluster (on the storage network) and signing of SMB traffic between the cluster nodes and other systems. SMB encryption also helps prevent relay attacks and facilitates compliance with regulatory standards.
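A minimal verification sketch, run in PowerShell on a cluster node, might look like the following. It only reads the current SMB and BitLocker state; it doesn't change any of the default hardening.

```powershell
# Inspect SMB server hardening state on this node (read-only).
Get-SmbServerConfiguration |
    Select-Object EncryptData, RejectUnencryptedAccess, RequireSecuritySignature

# Inspect BitLocker protection status for mounted volumes (read-only).
Get-BitLockerVolume |
    Select-Object MountPoint, VolumeStatus, ProtectionStatus, EncryptionMethod
```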
You can onboard Azure Stack HCI VMs in Defender for Cloud to activate cloud-based behavioral analytics, threat detection and remediation, alerting, and reporting. Manage Azure Stack HCI VMs in Azure Arc so that you can use Azure Policy to evaluate their compliance with industry regulations and corporate standards.
Components
This architecture consists of physical server hardware that you can use to deploy Azure Stack HCI clusters in on-premises or edge locations. To enhance platform capabilities, Azure Stack HCI integrates with Azure Arc and other Azure services that provide supporting resources. Azure Stack HCI provides a resilient platform to deploy, manage, and operate user applications or business systems. Platform resources and services are described in the following sections.
Platform resources
The architecture requires the following mandatory resources and components:
Azure Stack HCI is a hyperconverged infrastructure (HCI) solution that's deployed on-premises or in edge locations by using physical server hardware and networking infrastructure. Azure Stack HCI provides a platform to deploy and manage virtualized workloads such as VMs, Kubernetes clusters, and other services that are enabled by Azure Arc. Azure Stack HCI clusters can scale from a single-node deployment to a maximum of 16 nodes by using validated, integrated, or premium hardware categories that are provided by original equipment manufacturer (OEM) partners.
Azure Arc is a cloud-based service that extends the management model based on Azure Resource Manager to Azure Stack HCI and other non-Azure locations. Azure Arc uses Azure as the control and management plane to enable the management of various resources such as VMs, Kubernetes clusters, and containerized data and machine learning services.
Azure Key Vault is a cloud service that you can use to securely store and access secrets. A secret is anything that you want to tightly restrict access to, such as API keys, passwords, certificates, cryptographic keys, local admin credentials, and BitLocker recovery keys.
Cloud witness is a feature of Azure Storage that acts as a failover cluster quorum. Azure Stack HCI cluster nodes use this quorum for voting, which ensures high availability for the cluster. The storage account and witness configuration are created during the Azure Stack HCI cloud deployment process. A read-only verification sketch follows this list.
Update Manager is a unified service designed to manage and govern updates for Azure Stack HCI. You can use Update Manager to manage workloads that are deployed on Azure Stack HCI, including guest OS update compliance for Windows and Linux VMs. This unified approach streamlines patch management across Azure, on-premises environments, and other cloud platforms through a single dashboard.
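Because the cloud witness is created for you during cloud deployment, the following PowerShell sketch is a read-only check that the quorum witness is in place. The commented Set-ClusterQuorum line shows the shape of a manual configuration and uses placeholder values.

```powershell
# Read-only: confirm the cluster quorum configuration and witness resource.
Get-ClusterQuorum | Format-List *

# Manual (re)configuration, if ever needed, takes this shape (placeholders):
# Set-ClusterQuorum -CloudWitness -AccountName "<storage-account-name>" -AccessKey "<access-key>"
```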
Platform-supporting resources
The architecture includes the following optional supporting services to enhance the capabilities of the platform:
Monitor is a cloud-based service for collecting, analyzing, and acting on diagnostic logs and telemetry from your cloud and on-premises workloads. You can use Monitor to maximize the availability and performance of your applications and services through a comprehensive monitoring solution. Deploy Azure Stack HCI Insights to simplify the creation of the Monitor data collection rule (DCR) and quickly enable monitoring of Azure Stack HCI clusters.
Azure Policy is a service that evaluates Azure and on-premises resources. Azure Policy evaluates resources through integration with Azure Arc by comparing the properties of those resources to business rules, called policy definitions, to determine compliance. It also provides capabilities that you can use to apply VM guest configuration by using policy settings.
Defender for Cloud is a comprehensive infrastructure security management system. It enhances the security posture of your datacenters and delivers advanced threat protection for hybrid workloads, whether they reside in Azure or elsewhere, and across on-premises environments.
Azure Backup is a cloud-based service that provides a simple, secure, and cost-effective solution to back up your data and recover it from the Microsoft Cloud. Azure Backup Server is used to back up VMs that are deployed on Azure Stack HCI and store them in the Backup service.
Site Recovery is a disaster recovery service that provides BCDR capabilities by enabling business apps and workloads to fail over if there's a disaster or outage. Site Recovery manages replication and failover of workloads that run on physical servers and VMs between their primary site (on-premises) and a secondary location (Azure).
Cluster design choices
It's important to understand the workload performance and resiliency requirements when you design an Azure Stack HCI cluster. These requirements include recovery time objective (RTO) and recovery point objective (RPO) times, compute (CPU), memory, and storage requirements for all workloads that are deployed on the Azure Stack HCI cluster. Several characteristics of the workload affect the decision-making process, including:
Central processing unit (CPU) architecture capabilities, including hardware security technology features, the number of CPUs, the frequency in GHz (speed), and the number of cores per CPU socket.
Graphics processing unit (GPU) requirements of the workload, such as for AI or machine learning, inferencing, or graphics rendering.
The memory per node, or the quantity of physical memory required to run the workload.
The number of physical nodes in the cluster, which scales from 1 to 16 nodes. The maximum number of nodes is three when you use the storage switchless network architecture.
To maintain compute resiliency, you need to reserve at least N+1 nodes worth of capacity in the cluster. This strategy enables node draining for updates or recovery from sudden outages like power outages or hardware failures.
For business-critical or mission-critical workloads, consider reserving N+2 nodes worth of capacity to increase resiliency. For example, if two nodes in the cluster are offline, the workload can remain online. This approach provides resiliency for scenarios in which a node that's running a workload goes offline during a planned update procedure and results in two nodes being offline simultaneously.
Storage resiliency, capacity, and performance requirements:
Resiliency: We recommend that you deploy three or more nodes to enable three-way mirroring, which provides three copies of the data, for the infrastructure and user volumes. Three-way mirroring increases performance and provides maximum reliability for storage.
Capacity: The total required usable storage after fault tolerance, or copies, is taken into consideration. This number is approximately 33% of the raw storage space of your capacity tier disks when you use three-way mirroring. The worked example after this list illustrates the arithmetic.
Performance: The input/output operations per second (IOPS) of the platform, which, when multiplied by the block size of the application, determines the storage throughput capability for the workload.
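The following PowerShell sketch works through the capacity arithmetic with hypothetical drive counts and sizes. Three-way mirroring keeps three copies of all data, so usable capacity is roughly one-third of raw capacity before reserve and file system overhead.

```powershell
# Hypothetical cluster: 4 nodes, 8 capacity drives per node, 7.68 TB per drive.
$nodes         = 4
$drivesPerNode = 8
$driveSizeTB   = 7.68

$rawTB    = $nodes * $drivesPerNode * $driveSizeTB   # 245.76 TB raw
$usableTB = $rawTB / 3                               # three copies of the data

"Raw capacity: {0:N2} TB; approximate usable capacity: {1:N2} TB" -f $rawTB, $usableTB
```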
To design and plan an Azure Stack HCI deployment, we recommend that you use the Azure Stack HCI sizing tool and create a New Project for sizing your HCI clusters. Using the sizing tool requires that you understand your workload requirements. When you determine the number and size of workload VMs that run on your cluster, account for factors such as the number of vCPUs, memory requirements, and necessary storage capacity for the VMs.
The sizing tool Preferences section guides you through questions that relate to the system type (Premier, Integrated System, or Validated Node) and CPU family options. It also helps you select your resiliency requirements for the cluster. Make sure to:
Reserve a minimum of N+1 nodes worth of capacity, or one node, across the cluster.
Reserve N+2 nodes worth of capacity across the cluster for extra resiliency. This option enables the system to withstand a node failure during an update or other unexpected event that affects two nodes simultaneously. It also ensures that there's enough capacity in the cluster for the workload to run on the remaining online nodes.
This scenario requires use of three-way mirroring for user volumes, which is the default for clusters that have three or more physical nodes.
The output from the Azure Stack HCI sizing tool is a list of recommended hardware solution SKUs that can provide the required workload capacity and platform resiliency requirements based on the input values in the Sizer Project. For more information about available OEM hardware partner solutions, see Azure Stack HCI Solutions Catalog. To help rightsize solution SKUs to meet your requirements, contact your preferred hardware solution provider or system integration (SI) partner.
Physical disk drives
Storage Spaces Direct supports multiple physical disk drive types that vary in performance and capacity. When you design an Azure Stack HCI cluster, work with your chosen hardware OEM partner to determine the most appropriate physical disk drive types to meet the capacity and performance requirements of your workload. Examples include spinning hard disk drives (HDDs); solid-state drives (SSDs) and NVMe drives, which are often called flash drives; and persistent memory (PMem) storage, which is known as storage-class memory (SCM).
The reliability of the platform depends on the performance of critical platform dependencies, such as physical disk types. Make sure to choose the right disk types for your requirements. Use all-flash storage solutions such as NVMe or SSD drives for workloads that have high-performance or low-latency requirements. These workloads include but aren't limited to highly transactional database technologies, production AKS clusters, or any mission-critical or business-critical workloads that have low-latency or high-throughput storage requirements. Use all-flash deployments to maximize storage performance. All-NVMe drive or all-SSD drive configurations, especially at a small scale, improve storage efficiency and maximize performance because no drives are used as a cache tier. For more information, see All-flash based storage.
For general purpose workloads, a hybrid storage configuration, like NVMe drives or SSDs for cache and HDDs for capacity, might provide more storage space. The tradeoff is that spinning disks have lower performance if your workload exceeds the cache working set, and HDDs have a lower mean time between failures value compared to NVMe and SSD drives.
The performance of your cluster storage is influenced by the physical disk drive type, which varies based on the performance characteristics of each drive type and the caching mechanism that you choose. The physical disk drive type is an integral part of any Storage Spaces Direct design and configuration. Depending on the Azure Stack HCI workload requirements and budget constraints, you can choose to maximize performance, maximize capacity, or implement a mixed-drive type configuration that balances performance and capacity.
Storage Spaces Direct provides a built-in, persistent, real-time read and write server-side cache that maximizes storage performance. The cache should be sized and configured to accommodate the working set of your applications and workloads. Storage Spaces Direct virtual disks, or volumes, are used in combination with the cluster shared volume (CSV) in-memory read cache to improve Hyper-V performance, especially for unbuffered input access to workload virtual hard disk (VHD) or virtual hard disk v2 (VHDX) files.
Tip
For high-performance or latency-sensitive workloads, we recommend that you use an all-flash storage (all NVMe or all SSD) configuration and a cluster size of three or more physical nodes. Deploying this design with the default storage configuration settings uses three-way mirroring for the infrastructure and user volumes. This deployment strategy provides the highest performance and resiliency. When you use an all-NVMe or all-SSD configuration, you benefit from the full usable storage capacity of each flash drive. Unlike hybrid or mixed NVMe + SSD setups, there's no capacity reserved for caching. This ensures optimal utilization of your storage resources. For more information about how to balance performance and capacity to meet your workload requirements, see Plan volumes - When performance matters most.
Network design
Network design is the overall arrangement of components within the network's physical infrastructure and logical configurations. You can use the same physical network interface card (NIC) ports for all combinations of management, compute, and storage network intents. Using the same NIC ports for all intent-related purposes is called a fully converged networking configuration.
Although a fully converged networking configuration is supported, the optimal configuration for performance and reliability is for the storage intent to use dedicated network adapter ports. Therefore, this baseline architecture provides example guidance for how to deploy a multinode Azure Stack HCI cluster by using the storage switched network architecture with two network adapter ports that are converged for management and compute intents and two dedicated network adapter ports for the storage intent. For more information, see Network considerations for cloud deployments of Azure Stack HCI.
This architecture requires two or more physical nodes and up to a maximum of 16 nodes in scale. Each node requires four network adapter ports that are connected to two Top-of-Rack (ToR) switches. The two ToR switches should be interconnected through multi-chassis link aggregation group (MLAG) links. The two network adapter ports that are used for the storage intent traffic must support Remote Direct Memory Access (RDMA). These ports require a minimum link speed of 10 Gbps, but we recommend a speed of 25 Gbps or higher. The two network adapter ports used for the management and compute intents are converged using switch embedded teaming (SET) technology. SET technology provides link redundancy and load-balancing capabilities. These ports require a minimum link speed of 1 Gbps, but we recommend a speed of 10 Gbps or higher.
Physical network topology
The following physical network topology shows the actual physical connections between nodes and networking components.
You need the following components when you design a multinode storage switched Azure Stack HCI deployment that uses this baseline architecture:
Dual ToR switches:
Dual ToR network switches are required for network resiliency and the ability to service or apply firmware updates to the switches without incurring downtime. This strategy prevents a single point of failure (SPoF).
The dual ToR switches are used for the storage, or east-west, traffic. These switches use two dedicated Ethernet ports that have specific storage virtual local area networks (VLANs) and priority flow control (PFC) traffic classes that are defined to provide lossless RDMA communication.
These switches connect to the nodes through Ethernet cables.
Two or more physical nodes and up to a maximum of 16 nodes:
Each node is a physical server that runs Azure Stack HCI OS.
Each node requires four network adapter ports in total: two RDMA-capable ports for storage and two network adapter ports for management and compute traffic.
Storage uses the two dedicated RDMA-capable network adapter ports that connect with one path to each of the two ToR switches. This approach provides link-path redundancy and dedicated prioritized bandwidth for SMB Direct storage traffic.
Management and compute traffic uses two network adapter ports that provide one path to each of the two ToR switches for link-path redundancy.
External connectivity:
Dual ToR switches connect to the external network, such as your internal corporate LAN, to provide access to the required outbound URLs by using your edge border network device. This device can be a firewall or router. These switches route traffic that goes in and out of the Azure Stack HCI cluster, or north-south traffic.
External north-south traffic connectivity supports the cluster management and compute intents. This is achieved by using two switch ports and two network adapter ports per node that are converged through switch embedded teaming (SET) and a virtual switch within Hyper-V to ensure resiliency. These components work together to provide external connectivity for Azure Arc VMs and other workload resources deployed within the logical networks that are created in Azure Resource Manager by using the Azure portal, Azure CLI, or IaC templates.
Logical network topology
The logical network topology shows an overview of how network data flows between devices, regardless of their physical connections.
A summarization of the logical setup for this multinode storage switched baseline architecture for Azure Stack HCI is as follows:
Dual ToR switches:
Before you deploy the cluster, the two ToR network switches need to be configured with the required VLAN IDs, maximum transmission unit settings, and datacenter bridging configuration for the management, compute, and storage ports. For more information, see Physical network requirements for Azure Stack HCI, or ask your switch hardware vendor or SI partner for assistance.
Azure Stack HCI uses the Network ATC approach to apply network automation and intent-based network configuration.
Network ATC is designed to ensure optimal networking configuration and traffic flow by using network traffic intents. Network ATC defines which physical network adapter ports are used for the different network traffic intents (or types), such as for the cluster management, workload compute, and cluster storage intents.
Intent-based policies simplify the network configuration requirements by automating the node network configuration based on parameter inputs that are specified as part of the Azure Stack HCI cloud deployment process. A sketch of the equivalent Network ATC cmdlets follows this list.
External communication:
When the nodes or workload need to communicate externally by accessing the corporate LAN, internet, or another service, they route using the dual ToR switches. This process is outlined in the previous physical network topology section.
When the two ToR switches act as Layer 3 devices, they handle routing and provide connectivity beyond the cluster to the edge border device, such as your firewall or router.
Management network intent uses the converged SET team virtual interface, which enables the cluster management IP address and control plane resources to communicate externally.
For the compute network intent, you can create one or more logical networks in Azure with the specific VLAN IDs for your environment. The workload resources, such as VMs, use these VLAN IDs to access the physical network. The logical networks use the two physical network adapter ports that are converged by using a SET team for the compute and management intents.
Storage traffic:
The physical nodes communicate with each other by using two dedicated network adapter ports that are connected to the ToR switches to provide high bandwidth and resiliency for storage traffic.
The SMB1 and SMB2 storage ports connect to two separate nonroutable (or Layer 2) networks. Each network has a specific VLAN ID configured that must match the switch ports configuration on the ToR switches' default storage VLAN IDs: 711 and 712.
There's no default gateway configured on the two storage intent network adapter ports within the Azure Stack HCI node OS.
Each node can access Storage Spaces Direct capabilities of the cluster, such as remote physical disks that are used in the storage pool, virtual disks, and volumes. Access to these capabilities is facilitated through the SMB-Direct RDMA protocol over the two dedicated storage network adapter ports that are available in each node. SMB Multichannel is used for resiliency.
This configuration provides sufficient data transfer speed for storage-related operations, such as maintaining consistent copies of data for mirrored volumes.
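To make the intent model concrete, the following PowerShell sketch shows what the equivalent Network ATC configuration looks like. In an Azure Stack HCI 23H2 cloud deployment, you supply these intents as deployment parameters instead of running the cmdlets by hand, and the adapter names here are hypothetical.

```powershell
# Converged intent: management and compute share two adapters in a SET team.
Add-NetIntent -Name "Management_Compute" -Management -Compute `
    -AdapterName "NIC1", "NIC2"

# Dedicated storage intent: two RDMA-capable adapters for east-west SMB traffic.
Add-NetIntent -Name "Storage" -Storage `
    -AdapterName "NIC3", "NIC4"

# Review intent provisioning status across the cluster.
Get-NetIntentStatus |
    Select-Object IntentName, Host, ConfigurationStatus, ProvisioningStatus
```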
Network switch requirements
Your Ethernet switches must meet the different specifications required by Azure Stack HCI and set by the Institute of Electrical and Electronics Engineers Standards Association (IEEE SA). For example, for multinode storage switched deployments, the storage network is used for RDMA via RoCE v2 or iWARP. This process requires IEEE 802.1Qbb PFC to ensure lossless communication for the storage traffic class. Your ToR switches must provide support for IEEE 802.1Q for VLANs and IEEE 802.1AB for the Link Layer Discovery Protocol.
If you plan to use existing network switches for an Azure Stack HCI deployment, review the list of mandatory IEEE standards and specifications that the network switches and configuration must provide. When purchasing new network switches, contact your switch vendor to ensure that the devices meet the Azure Stack HCI IEEE specification requirements, or review the list of hardware vendor-certified switch models that support Azure Stack HCI network requirements.
IP address requirements
In a multinode storage switched deployment, the number of IP addresses needed increases with the addition of each physical node, up to a maximum of 16 nodes within a single cluster. For example, to deploy a two-node storage switched configuration of Azure Stack HCI, the cluster infrastructure requires a minimum of 11 IP addresses. More IP addresses are required if you use microsegmentation or software-defined networking. For more information, see Review two-node storage reference pattern IP address requirements for Azure Stack HCI.
When you design and plan IP address requirements for Azure Stack HCI, remember to account for additional IP addresses or network ranges needed for your workload beyond the ones that are required for the Azure Stack HCI cluster and infrastructure components. If you plan to deploy AKS on Azure Stack HCI, see AKS enabled by Azure Arc network requirements.
Monitoring
To enhance monitoring and alerting, enable Monitor Insights on Azure Stack HCI. Insights can scale to monitor and manage multiple on-premises clusters by using an Azure consistent experience. Insights uses cluster performance counters and event log channels to monitor key Azure Stack HCI features. Logs are collected by the DCR that's configured through Monitor and Log Analytics.
Azure Stack HCI Insights is built using Monitor and Log Analytics, which ensures an always up-to-date, scalable solution that's highly customizable. Insights provides access to default workbooks with basic metrics, along with specialized workbooks created for monitoring key features of Azure Stack HCI. These components provide a near real-time monitoring solution and enable the creation of graphs, customization of visualizations through aggregation and filtering, and configuration of custom resource health alert rules.
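As a hedged sketch, the following shows how a DCR association with the cluster resource might be created by using the Azure CLI. Insights can create the DCR and association for you from the portal; the resource IDs are placeholders, and the flags belong to the monitor-control-service CLI extension, so verify them against the version you install.

```powershell
# Associate an existing data collection rule with the cluster resource
# (all IDs are placeholders; verify flags against your CLI extension version).
az monitor data-collection rule association create `
    --name "hci-insights-dcra" `
    --resource "/subscriptions/<subscription-id>/resourceGroups/rg-hci-cluster/providers/Microsoft.AzureStackHCI/clusters/<cluster-name>" `
    --rule-id "/subscriptions/<subscription-id>/resourceGroups/rg-hci-cluster/providers/Microsoft.Insights/dataCollectionRules/<dcr-name>"
```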
Update management
Azure Stack HCI clusters and the deployed workload resources, such as Azure Arc VMs, need to be updated and patched regularly. By regularly applying updates, you ensure that your organization maintains a strong security posture, and you improve the overall reliability and supportability of your estate. We recommend that you use automatic and periodic manual assessments for early discovery and application of security patches and OS updates.
Infrastructure updates
Azure Stack HCI is continuously updated to improve the customer experience and add new features and functionality. This process is managed through release trains, which deliver new baseline builds quarterly. Baseline builds are applied to Azure Stack HCI clusters to keep them up to date. In addition to regular baseline build updates, Azure Stack HCI is updated with monthly OS security and reliability updates.
Update Manager is an Azure service that you can use to apply, view, and manage updates for Azure Stack HCI. This service provides a mechanism to view all Azure Stack HCI clusters across your entire infrastructure and edge locations by using the Azure portal, which provides a centralized management experience.
It's important to check for new driver and firmware updates regularly, such as every three to six months. If you use a Premier solution category for your Azure Stack HCI hardware, the Solution Builder Extension package updates are integrated with Update Manager to provide a simplified update experience. If you use validated nodes or an integrated system category, you might need to download and run an OEM-specific update package that contains the firmware and driver updates for your hardware. To determine how updates are supplied for your hardware, contact your hardware OEM or SI partner.
Workload guest OS patching
You can enroll Azure Arc VMs that are deployed on Azure Stack HCI in Azure Update Manager (AUM) to provide a unified patch management experience that uses the same mechanism as the Azure Stack HCI cluster physical nodes. You can use AUM to create Guest maintenance configurations. These configurations control settings such as the reboot setting (for example, reboot if necessary), the schedule (dates, times, and repeat options), and either a dynamic (subscription) or static list of the Azure Arc VMs for the scope. These settings control the configuration for when OS security patches are installed inside your workload VM's guest OS.
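A hedged sketch of creating such a maintenance configuration with the Azure CLI follows. The names and schedule are hypothetical, and the flags come from the az maintenance extension, so verify them against your installed version.

```powershell
# Create a guest (InGuestPatch) maintenance configuration: monthly window on
# the second Tuesday, reboot only if required (values are placeholders).
az maintenance configuration create `
    --resource-group "rg-hci-workloads" `
    --resource-name "mc-monthly-guest-patching" `
    --location "eastus" `
    --maintenance-scope "InGuestPatch" `
    --maintenance-window-start-date-time "2025-01-14 03:00" `
    --maintenance-window-duration "03:55" `
    --maintenance-window-recur-every "Month Second Tuesday" `
    --maintenance-window-time-zone "UTC" `
    --reboot-setting "IfRequired" `
    --extension-properties InGuestPatchMode="User"
```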
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.
Reliability
Reliability ensures your application can meet the commitments you make to your customers. For more information, see Overview of the reliability pillar.
Identify the potential failure points
Every architecture is susceptible to failures. You can anticipate failures and prepare mitigations by performing failure mode analysis. The following table describes four examples of potential points of failure in this architecture:
Component | Risk | Likelihood | Effect/mitigation/note | Outage |
---|---|---|---|---|
Azure Stack HCI cluster outage | Power, network, hardware, or software failure | Medium | To prevent a prolonged application outage caused by the failure of an Azure Stack HCI cluster for business or mission-critical use cases, your workload should be architected using HA and DR principles. For example, you can use industry-standard workload data replication technologies to maintain multiple copies of persistent state data that are deployed using multiple Azure Arc VMs or AKS instances that are deployed on separate Azure Stack HCI clusters and in separate physical locations. | Potential outage |
Azure Stack HCI single physical node outage | Power, hardware, or software failure | Medium | To prevent a prolonged application outage caused by the failure of a single Azure Stack HCI node, your Azure Stack HCI cluster should have multiple physical nodes. Your workload capacity requirements during the cluster design phase determine the number of nodes. We recommend that you have three or more nodes. We also recommend that you use three-way mirroring, which is the default storage resiliency mode for clusters with three or more nodes. To prevent a SPoF and increase workload resiliency, deploy multiple instances of your workload by using two or more Azure Arc VMs or container pods that run in multiple AKS worker nodes. If a single node fails, the Azure Arc VMs and workload or application services are restarted on the remaining online physical nodes in the cluster. | Potential outage |
Azure Arc VM or AKS worker node (workload) | Misconfiguration | Medium | Application users are unable to sign in or access the application. Misconfigurations should be caught during deployment. If these errors happen during a configuration update, the DevOps team must roll back changes. You can redeploy the VM if necessary. Redeployment typically takes less than 10 minutes but can take longer depending on the type of deployment. | Potential outage |
Connectivity to Azure | Network outage | Medium | The cluster needs to reach the Azure control plane regularly for billing, management, and monitoring capabilities. If your cluster loses connectivity to Azure, it operates in a degraded state. For example, it wouldn't be possible to deploy new Azure Arc VMs or AKS clusters if your cluster loses connectivity to Azure. Existing workloads that are running on the HCI cluster continue to run, but you should restore the connection within 48 to 72 hours to ensure uninterrupted operation. | None |
For more information, see Recommendations for performing failure mode analysis.
Reliability targets
This section describes an example scenario. A fictitious customer called Contoso Manufacturing uses this reference architecture to deploy Azure Stack HCI. They want to address their requirements and deploy and manage workloads on-premises. Contoso Manufacturing has an internal service-level objective (SLO) target of 99.8% that business and application stakeholders agree on for their services.
An SLO of 99.8% uptime, or availability, results in the following periods of allowed downtime, or unavailability, for the applications that are deployed using Azure Arc VMs that run on Azure Stack HCI:
Weekly: 20 minutes and 10 seconds
Monthly: 1 hour, 26 minutes, and 56 seconds
Quarterly: 4 hours, 20 minutes, and 49 seconds
Yearly: 17 hours, 23 minutes, and 16 seconds
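These figures follow from the downtime budget formula: allowed downtime = period length × (1 − SLO). The following PowerShell sketch reproduces the weekly figure.

```powershell
# Allowed downtime for a 99.8% SLO over one week.
$slo            = 0.998
$minutesPerWeek = 7 * 24 * 60                       # 10,080 minutes
$budgetMinutes  = $minutesPerWeek * (1 - $slo)      # 20.16 minutes

"Weekly downtime budget: {0:N2} minutes (about 20 minutes 10 seconds)" -f $budgetMinutes
```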
To help meet the SLO targets, Contoso Manufacturing implements the principle of least privilege (PoLP) to restrict the number of Azure Stack HCI cluster administrators to a small group of trusted and qualified individuals. This approach helps prevent downtime due to any inadvertent or accidental actions performed on production resources. Furthermore, the security event logs for on-premises Active Directory Domain Services (AD DS) domain controllers are monitored to detect and report any user account group membership changes, known as add and remove actions, for the Azure Stack HCI cluster administrators group by using a security information event management (SIEM) solution. Monitoring increases reliability and improves the security of the solution.
For more information, see Recommendations for identity and access management.
Strict change control procedures are in place for Contoso Manufacturing's production systems. This process requires that all changes are tested and validated in a representative test environment before implementation in production. All changes submitted to the weekly change advisory board process must include a detailed implementation plan (or link to source code), risk level score, a comprehensive rollback plan, post-release testing and verification, and clear success criteria for a change to be reviewed or approved.
For more information, see Recommendations for safe deployment practices.
Monthly security patches and quarterly baseline updates are applied to production Azure Stack HCI clusters only after they're validated in the preproduction environment. Update Manager and the cluster-aware updating feature automate the process of using VM live migration to minimize downtime for business-critical workloads during the monthly servicing operations. Contoso Manufacturing standard operating procedures require that security, reliability, or baseline build updates are applied to all production systems within four weeks of their release date. Without this policy, production systems fall perpetually behind monthly OS and security updates. Out-of-date systems negatively affect platform reliability and security.
For more information, see Recommendations for establishing a security baseline.
Contoso Manufacturing implements daily, weekly, and monthly backups to retain the last six days of daily backups (Monday through Saturday), the last three weekly backups (each Sunday), and three monthly backups, where each week-four Sunday backup is retained to become the month 1, month 2, and month 3 backup, by using a rolling, calendar-based schedule that's documented and auditable. This approach meets Contoso Manufacturing requirements for an adequate balance between the number of data recovery points available and reducing costs for the offsite or cloud backup storage service.
For more information, see Recommendations for designing a disaster recovery strategy.
Data backup and recovery processes are tested for each business system every six months. This strategy provides assurance that BCDR processes are valid and that the business is protected if a datacenter disaster or cyber incident occurs.
For more information, see Recommendations for designing a reliability testing strategy.
The operational processes and procedures described previously in the article, and the recommendations in the Well-Architected Framework service guide for Azure Stack HCI, enable Contoso Manufacturing to meet their 99.8% SLO target and effectively scale and manage Azure Stack HCI and workload deployments across multiple manufacturing sites that are distributed around the world.
For more information, see Recommendations for defining reliability targets.
Redundancy
Consider a workload that you deploy on a single Azure Stack HCI cluster as a locally redundant deployment. The cluster provides high availability at the platform level, but you must deploy the cluster in a single rack. For business-critical or mission-critical use cases, we recommend that you deploy multiple instances of a workload or service across two or more separate Azure Stack HCI clusters, ideally in separate physical locations.
Use industry-standard, high-availability patterns for workloads that provide active/passive replication, synchronous replication, or asynchronous replication such as SQL Server Always On. You can also use an external network load balancing (NLB) technology that routes user requests across the multiple workload instances that run on Azure Stack HCI clusters that you deploy in separate physical locations. Consider using a partner external NLB device. Or you can evaluate the load balancing options that support traffic routing for hybrid and on-premises services, such as an Azure Application Gateway instance that uses Azure ExpressRoute or a VPN tunnel to connect to an on-premises service.
For more information, see Recommendations for designing for redundancy.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.
Security considerations include:
A secure foundation for the Azure Stack HCI platform: Azure Stack HCI is a secure-by-default product that uses validated hardware components with a TPM, UEFI, and Secure Boot to build a secure foundation for the Azure Stack HCI platform and workload security. When deployed with the default security settings, Azure Stack HCI has Windows Defender Application Control, Credential Guard, and BitLocker enabled. To simplify delegating permissions by using the PoLP, use Azure Stack HCI built-in role-based access control (RBAC) roles such as Azure Stack HCI Administrator for platform administrators and Azure Stack HCI VM Contributor or Azure Stack HCI VM Reader for workload operators.
Default security settings: The Azure Stack HCI security defaults apply security settings to your Azure Stack HCI cluster during deployment and enable drift control to keep the nodes in a known good state. You can use the security default settings to manage cluster security, drift control, and secured-core server settings on your cluster.
Security event logs: Azure Stack HCI syslog forwarding integrates with security monitoring solutions by retrieving relevant security event logs to aggregate and store events for retention in your own SIEM platform.
Protection from threats and vulnerabilities: Defender for Cloud protects your Azure Stack HCI clusters from various threats and vulnerabilities. This service helps improve the security posture of your Azure Stack HCI environment and can protect against existing and evolving threats.
Threat detection and remediation: Microsoft Advanced Threat Analytics detects and remediates threats, such as those targeting AD DS, that provide authentication services to Azure Stack HCI cluster nodes and their Windows Server VM workloads.
Network isolation: Isolate networks if needed. For example, you can provision multiple logical networks that use separate VLANs and network address ranges. When you use this approach, ensure that the management network can reach each logical network and VLAN so that Azure Stack HCI cluster nodes can communicate with the VLAN networks through the ToR switches or gateways. This configuration is required for management of the workload, such as allowing infrastructure management agents to communicate with the workload guest OS.
For more information, see Recommendations for building a segmentation strategy.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
Cost optimization considerations include:
Cloud-style billing model for licensing: Azure Stack HCI pricing follows the monthly subscription billing model with a flat rate per physical processor core in an Azure Stack HCI cluster. Extra usage charges apply if you use other Azure services. If you own on-premises core licenses for Windows Server Datacenter edition with active Software Assurance, you might choose to exchange these licenses to activate Azure Stack HCI cluster and Windows Server VM subscription fees.
Automatic VM Guest patching for Azure Arc VMs: This feature helps reduce the overhead of manual patching and the associated maintenance costs. Not only does this action help make the system more secure, but it also optimizes resource allocation and contributes to overall cost efficiency.
Cost monitoring consolidation: To consolidate monitoring costs, use Azure Stack HCI Insights and patch using Update Manager for Azure Stack HCI. Insights uses Monitor to provide rich metrics and alerting capabilities. The lifecycle manager component of Azure Stack HCI integrates with Update Manager to simplify the task of keeping your clusters up to date by consolidating update workflows for various components into a single experience. Use Monitor and Update Manager to optimize resource allocation and contribute to overall cost efficiency.
For more information, see Recommendations for optimizing personnel time.
Initial workload capacity and growth: When you plan your Azure Stack HCI deployment, consider your initial workload capacity, resiliency requirements, and future growth. Consider whether a two-node or three-node storage switchless architecture could reduce costs, such as by removing the need to procure storage-class network switches, which can be an expensive component of new Azure Stack HCI cluster deployments. If your workload capacity and resiliency needs don't scale beyond a three-node configuration, you can use existing switches for the management and compute networks, which simplifies the infrastructure, and deploy the three-node storage switchless architecture.
For more information, see Recommendations for optimizing component costs.
Tip
You can save on costs with Azure Hybrid Benefit if you have Windows Server Datacenter licenses with active Software Assurance. For more information, see Azure Hybrid Benefit for Azure Stack HCI.
Operational excellence
Operational excellence covers the operations processes that deploy an application and keep it running in production. For more information, see Overview of the operational excellence pillar.
Operational excellence considerations include:
Simplified provisioning and management experience integrated with Azure: The cloud-based deployment in Azure provides a wizard-driven interface for creating an Azure Stack HCI cluster. Similarly, Azure simplifies the process of managing Azure Stack HCI clusters and Azure Arc VMs. You can automate the portal-based deployment of the Azure Stack HCI cluster by using the ARM template. This template provides consistency and automation to deploy Azure Stack HCI at scale, specifically in edge scenarios such as retail stores or manufacturing sites that require an Azure Stack HCI cluster to run business-critical workloads.
Automation capabilities for virtual machines: Azure Stack HCI provides a wide range of automation capabilities for managing workloads such as Azure Arc VMs. You can automate the deployment of Azure Arc VMs by using the Azure CLI, ARM templates, or Bicep templates, and automate VM OS updates by using the Azure Arc extension for updates and Azure Update Manager to update each Azure Stack HCI cluster. Azure Stack HCI also provides support for Azure Arc VM management by using the Azure CLI and non-Azure Arc VM management by using Windows PowerShell. You can run Azure CLI commands locally from one of the Azure Stack HCI servers or remotely from a management computer. Integration with Azure Automation and Azure Arc facilitates a wide range of extra automation scenarios for VM workloads through Azure Arc extensions, as shown in the sketch that follows.
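A hedged sketch of the Azure CLI deployment path follows. Every value is a placeholder, and the flags belong to the stack-hci-vm CLI extension, so confirm them against the extension version you install.

```powershell
# Create an Azure Arc VM on the cluster by targeting its custom location
# (all values are placeholders; verify flags against the stack-hci-vm extension).
az stack-hci-vm create `
    --name "vm-app-01" `
    --resource-group "rg-hci-workloads" `
    --custom-location "/subscriptions/<subscription-id>/resourceGroups/rg-hci-cluster/providers/Microsoft.ExtendedLocation/customLocations/<custom-location-name>" `
    --image "<marketplace-or-local-image-id>" `
    --nics "<network-interface-name>" `
    --admin-username "azureuser" `
    --hardware-profile memory-mb="8192" processors="4"
```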
For more information, see Recommendations for using IaC.
Automation capabilities for containers on AKS: Azure Stack HCI provides a wide range of automation capabilities for managing workloads, such as containers, on AKS. You can automate the deployment of AKS clusters by using Azure CLI. Update AKS workload clusters by using the Azure Arc extension for Kubernetes updates. You can also manage Azure Arc-enabled AKS by using Azure CLI. You can run Azure CLI commands locally from one of the Azure Stack HCI servers or remotely from a management computer. Integrate with Azure Arc for a wide range of extra automation scenarios for containerized workloads through Azure Arc extensions.
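The following hedged sketch shows the shape of an AKS cluster deployment with the aksarc CLI extension. The values are placeholders, and flags can vary by extension version.

```powershell
# Create an AKS cluster on Azure Stack HCI through Azure Arc
# (all values are placeholders; verify flags against the aksarc extension).
az aksarc create `
    --name "aks-workload-01" `
    --resource-group "rg-hci-workloads" `
    --custom-location "<custom-location-resource-id>" `
    --vnet-ids "<logical-network-resource-id>" `
    --aad-admin-group-object-ids "<entra-id-group-object-id>" `
    --node-count 3 `
    --generate-ssh-keys
```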
For more information, see Recommendations for enabling automation.
Performance efficiency
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. For more information, see Performance efficiency pillar overview.
Performance efficiency considerations include:
Workload storage performance: Consider using the DiskSpd tool to test the workload storage performance capabilities of an Azure Stack HCI cluster, and evaluate whether the VMFleet tool suits your needs for generating load and measuring the performance of the storage subsystem.
We recommend that you establish a baseline for your Azure Stack HCI cluster's performance before you deploy production workloads. DiskSpd uses various command-line parameters that enable administrators to test the storage performance of the cluster. The main function of DiskSpd is to issue read and write operations and output performance metrics, such as latency, throughput, and IOPS.
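A hedged example invocation might look like the following. The flags are standard DiskSpd options, but the thread count, queue depth, block size, read/write mix, and target path are placeholder assumptions to tune toward your workload profile.

```powershell
# DiskSpd storage baseline (placeholders; tune to resemble your workload):
#   -t4  : 4 threads per target        -o32 : 32 outstanding I/Os per thread
#   -b4K : 4 KiB block size            -r   : random I/O pattern
#   -w30 : 30% writes / 70% reads      -d60 : 60-second test duration
#   -Sh  : disable software and hardware caching
#   -D   : capture IOPS statistics     -L   : capture latency statistics
#   -c10G: create a 10 GiB test file
.\DiskSpd.exe -t4 -o32 -b4K -r -w30 -d60 -Sh -D -L -c10G `
    "C:\ClusterStorage\UserStorage_1\diskspd-test.dat"
```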
For more information, see Recommendations for performance testing.
Workload storage resiliency: Consider the benefits of storage resiliency, usage (or capacity) efficiency, and performance. Planning for Azure Stack HCI volumes includes identifying the optimal balance between resiliency, usage efficiency, and performance. You might find it difficult to optimize this balance because maximizing one of these characteristics typically has a negative effect on one or more of the other characteristics. Increasing resiliency reduces the usable capacity. As a result, the performance might vary, depending on the resiliency type selected. When resiliency and performance are the priority, and when you use three or more nodes, the default storage configuration employs three-way mirroring for both infrastructure and user volumes.
For more information, see Recommendations for capacity planning.
Network performance optimization: Consider network performance optimization. As part of your design, be sure to include projected network traffic bandwidth allocation when determining your optimal network hardware configuration.
To optimize compute performance in Azure Stack HCI, you can use GPU acceleration. GPU acceleration is beneficial for high-performance AI or machine learning workloads that involve data insights or inferencing. These workloads require deployment at edge locations due to considerations like data gravity or security requirements. In a hybrid deployment or on-premises deployment, it's important to take your workload performance requirements, including GPUs, into consideration. This approach helps you select the right services when you design and procure your Azure Stack HCI clusters.
For more information, see Recommendations for selecting the right services.
Deploy this scenario
The following section provides an example list of the high-level tasks or typical workflow used to deploy Azure Stack HCI, including prerequisite tasks and considerations. This workflow list is intended as an example guide only. It isn't an exhaustive list of all required actions, which can vary based on organizational, geographic, or project-specific requirements.
Scenario: There's a project or use case requirement to deploy a hybrid cloud solution in an on-premises or edge location to provide local compute for data processing capabilities and a desire to use Azure-consistent management and billing experiences. More details are described in the potential use cases section of this article. The remaining steps assume that Azure Stack HCI is the chosen infrastructure platform solution for the project.
Gather workload and use case requirements from relevant stakeholders. This strategy enables the project to confirm that the features and capabilities of Azure Stack HCI meet the workload scale, performance, and functionality requirements. This review process should include understanding the workload scale, or size, and required features such as Azure Arc VMs, AKS, Azure Virtual Desktop, Azure Arc-enabled data services, or Azure Arc-enabled Machine Learning. The workload RTO and RPO (reliability) values and other nonfunctional requirements (performance/load scalability) should be documented as part of this requirements gathering step.
Use the Azure Stack HCI sizing tool to create a new project that models the workload type and scale. This project includes the size and number of VMs and their storage requirements. These details are inputted together with choices for the system type, preferred CPU family, and your resiliency requirements for high availability and storage fault tolerance, as explained in the previous Cluster design choices section.
Review the Azure Stack HCI Sizer output for the recommended hardware partner solution. This solution includes details of the recommended physical server hardware (make and model), number of physical nodes, and the specifications for the CPU, memory, and storage configuration of each physical node that are required to deploy and run your workloads.
Contact the hardware OEM or SI partner to further qualify the suitability of the recommended hardware against your workload requirements. If available, use OEM-specific sizing tools to determine OEM-specific hardware sizing requirements for the intended workloads. This step typically includes discussions with the hardware OEM or SI partner for the commercial aspects of the solution. These aspects include quotations, availability of the hardware, lead times, and any professional or value-add services that the partner provides to help accelerate your project or business outcomes.
Deploy two ToR switches for network integration. For high availability solutions, HCI clusters require two ToR switches to be deployed. Each physical node requires four NICs, two of which must be RDMA capable, which provides two links from each node to the two ToR switches. Two NICs, one connected to each switch, are converged for outbound north-south connectivity for the compute and management networks. The other two RDMA capable NICs are dedicated for the storage east-west traffic. If you plan to use existing network switches, ensure that the make and model of your switches are on the approved list of network switches supported by Azure Stack HCI.
Work with the hardware OEM or SI partner to arrange delivery of the hardware. The SI partner or your employees are then required to integrate the hardware into your on-premises datacenter or edge location, such as racking and stacking the hardware, physical network, and power supply unit cabling for the physical nodes.
Perform the Azure Stack HCI cluster deployment. Depending on your chosen solution category (Premier solution, Integrated system, or Validated nodes), either the hardware partner, the SI partner, or your employees can deploy the Azure Stack HCI software. This step starts by onboarding the physical nodes that run the Azure Stack HCI OS as Azure Arc-enabled servers and then starting the Azure Stack HCI cloud deployment process. Customers and partners can raise a support request directly with Microsoft in the Azure portal by selecting the Support + Troubleshooting icon or by contacting their hardware OEM or SI partner, depending on the nature of the request and the hardware solution category.
Tip
The Azure Stack HCI 23H2 cluster reference implementation demonstrates how to deploy a switched multiserver deployment of Azure Stack HCI by using an ARM template and parameter file. Alternatively, the Bicep example demonstrates how to use a Bicep template to deploy an Azure Stack HCI cluster, including its prerequisite resources.
Deploy highly available workloads on Azure Stack HCI by using the Azure portal, Azure CLI, or ARM templates with Azure Arc for automation. Use the custom location resource of the new HCI cluster as the target when you deploy workload resources such as Azure Arc VMs, AKS, Azure Virtual Desktop session hosts, or other Azure Arc-enabled services that you can enable through AKS extensions and containerization on Azure Stack HCI.
Install monthly updates to improve the security and reliability of the platform. To keep your Azure Stack HCI clusters up to date, it's important to install Microsoft software updates and hardware OEM driver and firmware updates. These updates improve the security and reliability of the platform. Update Manager applies the updates and provides a centralized and scalable solution to install updates across a single cluster or multiple clusters. Check with your hardware OEM partner to determine the process for installing hardware driver and firmware updates because this process can vary depending on your chosen hardware solution category type (Premier solution, Integrated system, or Validated nodes). For more information, see Infrastructure updates.
Related resources
- Hybrid architecture design
- Azure hybrid options
- Automation in a hybrid environment
- Azure Automation State Configuration
- Optimize administration of SQL Server instances in on-premises and multicloud environments by using Azure Arc
Next steps
Product documentation:
- Azure Stack HCI, version 23H2 release information
- AKS on Azure Stack HCI
- Azure Virtual Desktop for Azure Stack HCI
- What is Azure Stack HCI monitoring?
- Protect VM workloads with Site Recovery on Azure Stack HCI
- Monitor overview
- Change tracking and inventory overview
- Update Manager overview
- What are Azure Arc-enabled data services?
- What are Azure Arc-enabled servers?
- What is the Backup service?
Product documentation for details on specific Azure services:
- Azure Stack HCI
- Azure Arc
- Key Vault
- Azure Blob Storage
- Monitor
- Azure Policy
- Azure Container Registry
- Defender for Cloud
- Site Recovery
- Backup
Microsoft Learn modules:
- Configure Monitor
- Design your site recovery solution in Azure
- Introduction to Azure Arc-enabled servers
- Introduction to Azure Arc-enabled data services
- Introduction to AKS
- Scale model deployment with Machine Learning anywhere - Tech Community Blog
- Realizing Machine Learning anywhere with AKS and Azure Arc-enabled Machine Learning - Tech Community Blog
- Machine learning on AKS hybrid and Stack HCI using Azure Arc-enabled machine learning - Tech Community Blog
- Introduction to Kubernetes compute target in Machine Learning
- Keep your VMs updated
- Protect your VM settings with Automation state configuration
- Protect your VMs by using Backup