Storage recommendations for AI workloads on Azure infrastructure (IaaS)
This article provides storage recommendations for organizations running AI workloads on Azure infrastructure (IaaS). A storage solution for these workloads must manage the demands of data storage, access, and transfer that are inherent to AI model training and inferencing.
AI workloads require high throughput and low latency for efficient data retrieval and processing. They also need mechanisms for data versioning and consistency to guarantee accurate and reproducible outcomes across distributed environments. When selecting the appropriate storage solution, consider factors such as data transfer times, latency, performance requirements, and compatibility with existing systems.
Use a file system for active data. Implement a file system to store hot, job-specific data that AI jobs actively use or generate. This solution is ideal for real-time data processing because of its low latency and high throughput, both of which are critical to optimizing the performance of AI workflows. Azure has three principal file system solutions to support training and inferencing AI models on Azure infrastructure. To choose the right file system, follow these recommendations:
Use Azure Managed Lustre for the lowest data transfer times and minimal latency. Azure Managed Lustre provides high performance through parallel file system capabilities and simplifies management through Azure integration. It's cost effective, with usage-based storage costs, and supports selective data import from Blob Storage, which optimizes data handling.
Use Azure NetApp Files when you need enterprise-grade features and performance for AI workloads. Azure NetApp Files offers high reliability and performance, making it ideal for mission-critical applications. It's especially beneficial if you have existing investments in NetApp infrastructure, need hybrid cloud capabilities, or want to customize and fine-tune storage configurations.
Use local NVMe/SSD file systems when performance is the top priority. This approach aggregates the local NVMe disks of the compute (worker) nodes into a job-dedicated parallel file system, such as BeeGFS On Demand (BeeOND), that runs directly on the compute nodes to create a temporary, high-performance file system for the duration of the job. These file systems offer ultra-low latency and high throughput, making them ideal for I/O-intensive workloads such as deep learning training and real-time inferencing, as the sketch after these recommendations illustrates.
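With a job-dedicated scratch file system, a training job typically stages its input data from durable storage onto the fast local mount before the compute-heavy phase begins and copies results back before the scratch mount is torn down. The following Python sketch shows that stage-in/stage-out pattern; the mount points /mnt/beeond and /mnt/lustre and the dataset paths are illustrative assumptions, not fixed conventions.

```python
import shutil
from pathlib import Path

# Illustrative paths: adjust to match your actual mount points.
DURABLE_DATA = Path("/mnt/lustre/datasets/imagenet")  # persistent parallel file system
SCRATCH = Path("/mnt/beeond/imagenet")                # job-local scratch mount (for example, BeeOND)

def stage_in() -> Path:
    """Copy the training dataset onto the node-local scratch file system."""
    if not SCRATCH.exists():
        shutil.copytree(DURABLE_DATA, SCRATCH)
    return SCRATCH

def stage_out(results_dir: Path) -> None:
    """Persist job outputs back to durable storage before the scratch mount is torn down."""
    shutil.copytree(results_dir, DURABLE_DATA.parent / "results", dirs_exist_ok=True)

if __name__ == "__main__":
    data_root = stage_in()
    # ... run the I/O-intensive training loop against data_root ...
    stage_out(data_root / "outputs")
```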
Transfer inactive data to Azure Blob Storage. After completing a job, transfer inactive job data from Azure Managed Lustre to Azure Blob Storage for long-term, cost-effective storage. Blob Storage provides scalable options with different access tiers, so inactive or infrequently accessed data is stored efficiently while remaining readily available when needed.
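The following sketch shows one way to move completed job output into Blob Storage with the Python SDK (azure-storage-blob), uploading directly into the Cool tier. The connection string environment variable, container name, and paths are assumptions for illustration.

```python
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient, StandardBlobTier

# Assumed names: the environment variable and container are illustrative.
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("training-archive")

def archive_job_output(local_dir: Path, job_id: str) -> None:
    """Upload completed job data to Blob Storage, landing it in the Cool tier."""
    for path in local_dir.rglob("*"):
        if path.is_file():
            blob_name = f"{job_id}/{path.relative_to(local_dir)}"
            with path.open("rb") as data:
                container.upload_blob(
                    name=blob_name,
                    data=data,
                    overwrite=True,
                    standard_blob_tier=StandardBlobTier.Cool,  # infrequently accessed data
                )

archive_job_output(Path("/mnt/lustre/jobs/job-42"), "job-42")
```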
Implement checkpointing for model training. Set up a checkpointing mechanism that saves the model’s state, including training weights and parameters, at regular intervals such as every 500 iterations. Store this checkpoint data in Azure Managed Lustre to allow restarting the model training from a previously saved state, improving the flexibility and resilience of your AI workflows.
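A minimal PyTorch sketch of this pattern appears below. The placeholder model, the /mnt/lustre checkpoint path, and the 500-iteration interval are illustrative assumptions; adapt them to your training framework and mount layout.

```python
import torch
from torch import nn, optim

# Illustrative mount point for the Azure Managed Lustre file system.
CHECKPOINT_DIR = "/mnt/lustre/checkpoints"
CHECKPOINT_EVERY = 500  # iterations between checkpoints

model = nn.Linear(128, 10)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)

def save_checkpoint(step: int) -> None:
    """Persist model and optimizer state so training can resume from this step."""
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        f"{CHECKPOINT_DIR}/ckpt_{step:07d}.pt",
    )

for step in range(1, 10_001):
    # ... forward pass, loss computation, backward pass, optimizer.step() ...
    if step % CHECKPOINT_EVERY == 0:
        save_checkpoint(step)
```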
Automate data migration to lower-cost storage tiers. Configure Azure Blob Storage lifecycle management policies to automatically migrate older, infrequently accessed data to lower-cost storage tiers, such as the Cool or Archive tiers. This approach optimizes storage costs while ensuring that important data remains accessible when needed.
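Lifecycle rules are expressed as a JSON policy document. The following Python sketch builds a policy that tiers blobs to Cool after 30 days and to Archive after 90 days without modification; the rule name, container prefix, and day thresholds are assumptions you should tune to your retention needs.

```python
import json

# Lifecycle policy in the REST schema: tier blobs down as they age without modification.
policy = {
    "rules": [
        {
            "enabled": True,
            "name": "tier-down-inactive-data",
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["training-archive/"],  # illustrative container prefix
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                    }
                },
            },
        }
    ]
}

with open("policy.json", "w") as f:
    json.dump(policy, f, indent=2)

# Apply the policy to the storage account, for example with the Azure CLI:
#   az storage account management-policy create \
#       --account-name <account> --resource-group <group> --policy @policy.json
```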
Ensure data consistency across distributed environments. Set up synchronization between Azure Managed Lustre and Azure Blob Storage so that all nodes in a distributed AI workload operate on the same, consistent version of the data. This synchronization prevents errors and discrepancies across distributed environments.
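One lightweight way to validate consistency is to compare content hashes between the file system copy and the corresponding blobs. The sketch below assumes uploads recorded an MD5 digest in the blob's content settings (the SDK does this for single-shot uploads, but it can be unset for chunked uploads); the container and path names are illustrative.

```python
import hashlib
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient

# Assumed names: the environment variable, container, and paths are illustrative.
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("training-archive")

def md5_of(path: Path) -> bytes:
    """Compute the MD5 digest of a local file in streaming fashion."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.digest()

def find_inconsistent(local_root: Path, prefix: str) -> list[str]:
    """Return blob names whose content differs from (or is missing in) the local copy."""
    stale = []
    for blob in container.list_blobs(name_starts_with=prefix):
        local_path = local_root / Path(blob.name).relative_to(prefix)
        remote_md5 = blob.content_settings.content_md5
        # Treat a missing digest or missing local file as inconsistent.
        if remote_md5 is None or not local_path.exists() or bytes(remote_md5) != md5_of(local_path):
            stale.append(blob.name)
    return stale

print(find_inconsistent(Path("/mnt/lustre/jobs/job-42"), "job-42"))
```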
Enable data versioning for reproducibility. Enable versioning in Azure Blob Storage to track changes to datasets and models over time. This feature facilitates rollback, enhances reproducibility, and supports collaboration. It maintains a detailed history of modifications to data and models and lets you compare and restore previous versions as needed.
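With versioning enabled on the storage account, the Python SDK can enumerate a blob's versions and retrieve a specific earlier one. In this sketch, the container name, blob name, and version ID are placeholders; substitute a version ID returned by the listing step.

```python
import os

from azure.storage.blob import BlobServiceClient

# Assumed names: the environment variable, container, and blob are illustrative.
# Versioning must be enabled on the account for version_id to be populated.
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("datasets")

# List every version of a dataset blob, flagging the current one.
for blob in container.list_blobs(name_starts_with="train/labels.csv", include=["versions"]):
    marker = "current" if blob.is_current_version else "previous"
    print(blob.name, blob.version_id, marker)

# Download a specific earlier version for comparison or rollback.
old_version = container.get_blob_client("train/labels.csv", version_id="<version-id>")
with open("labels_previous.csv", "wb") as f:
    f.write(old_version.download_blob().readall())
```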