Getting started with AI experimentation

Data scientists and AI engineers collaborate during the experimentation phase of an AI project. They work on exploratory data analysis, prototyping ML approaches, feature engineering, and testing hypotheses.

While hypothesis-driven development and agile approaches are encouraged for ML project management, the focus here is on the engineering aspects of experimentation.

For the data science aspects of experimentation, see Model Experimentation.

💡 Key outcomes of engineering experimentation:

  • Normalization, transformation, and other required pre-processing have been applied to the data (or a subset of the data). Pre-processed data is evaluated for feasibility and suitability for solving the business problem.
  • The data has been enriched or augmented to improve its suitability and may even be partially or fully synthetic.
  • A hypothesis-driven approach has been applied, with centrally tracked experiments and, potentially, centrally managed compute.
  • Experiments are documented, shared, and can be easily reproduced.
  • ML approaches, libraries, and algorithms are tested and evaluated, and the best performing approach is selected. The best performing approach may not necessarily be the most accurate model. It could instead be a trade-off between accuracy and ease of implementation. For example, using AutoML for rapid prototyping and development, or exporting to an ONNX model for deployment to an Edge device.
  • The data distribution is recorded and stored as a reference to measure future drift as the data changes.
  • An automated pipeline has been designed and potentially partially applied to the experiments.

Experimentation topics

Experimentation guidance is provided in the following sections.

Algorithm exploration for AI Projects

Training an AI model is an iterative process. At the beginning of an AI project, we don't know which AI algorithm will produce the best performing model. Based on domain expertise, there is usually a small set of AI algorithms that perform well in a given domain. However, each of these algorithms must be tried and evaluated. Automated machine learning (automated ML or AutoML) is the process of partially automating algorithm exploration.

For more detail on AutoML, refer to Ways to use AutoML in Azure ML.
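As a rough illustration of trying and evaluating a small set of candidate algorithms by hand, the sketch below scores each candidate with a simple cross-validation loop and ranks them. The two candidates, the function names, and the toy data are assumptions made for this example. A real project would plug in its own models and metric, or let AutoML drive the search.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# A "fitter" takes training data and returns a prediction function.
Predictor = Callable[[float], float]
Fitter = Callable[[Sequence[float], Sequence[float]], Predictor]


def fit_mean_baseline(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Always predict the training mean (a sanity baseline)."""
    y_mean = mean(ys)
    return lambda x: y_mean


def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Ordinary least squares for a single feature."""
    x_mean, y_mean = mean(xs), mean(ys)
    var_x = sum((x - x_mean) ** 2 for x in xs)
    if var_x == 0:
        return lambda x: y_mean
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / var_x
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


def cross_validated_mse(fit: Fitter, xs: Sequence[float], ys: Sequence[float],
                        folds: int = 3) -> float:
    """Mean squared error on held-out points, using interleaved folds."""
    errors: List[float] = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        held_out = [i for i in range(len(xs)) if i % folds == k]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((predict(xs[i]) - ys[i]) ** 2 for i in held_out)
    return mean(errors)


def explore(candidates: Dict[str, Fitter], xs: Sequence[float],
            ys: Sequence[float]) -> List[Tuple[str, float]]:
    """Evaluate every candidate and return (name, score), best first."""
    scores = [(name, cross_validated_mse(fit, xs, ys))
              for name, fit in candidates.items()]
    return sorted(scores, key=lambda item: item[1])


# Toy data with an exact linear relationship (illustrative only).
xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 for x in xs]
ranking = explore({"mean_baseline": fit_mean_baseline, "linear": fit_linear}, xs, ys)
```

Keeping the evaluation loop separate from the candidates makes it cheap to add another algorithm later without changing how results are compared.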

Using MLOps during experimentation

The primary goal of the experimentation phase is to find a clear approach (a method, solution, or algorithm) to solve an ML-related problem. Until we understand how to solve the problem, we can't start the model development phase or build an end-to-end ML flow.

The Importance of MLOps

Some engineers believe that it's a good idea to wait until the experimentation phase is done prior to setting up MLOps. There are two challenges to this thinking:

  • The experimentation phase has various outcomes that can be reused in the model development and even in the inferencing phases. Almost all the code, starting from data augmentation up to model training, can be migrated to the end-to-end ML flow.
  • The experimentation phase can overlap with the model development phase, which begins once we know how to solve the ML problem and want to tune parameters and components.

These challenges can be summarized into a common problem statement: How can we share experimentation and model development code?

Ignoring this issue at the beginning of the project can lead to a situation where data scientists and software engineers work on two different code bases and spend significant time syncing code between them. The good news is that MLOps provides a way to solve the problem.

MLOps practices to implement

The following sections summarize good MLOps practices that can help data scientists and software engineers collaborate and share code effectively.

Start working with DevOps engineers as soon as possible

Prior to starting any experiment, it's worthwhile to make sure that these conditions have been met:

  • A development environment is in place: it can be a compute instance (Azure ML), an interactive cluster (Databricks), or any other compute (including a local one) that allows you to run experiments.
  • All data sources are created: it's a good idea to use an ML platform capability to build technology-specific entities like data assets in Azure ML. This lets you work with abstract entities rather than with specific data sources (like Azure Blob).
  • Data is uploaded: it's important to have access to different kinds of data sets to run experiments under different conditions. For example, some data might be used on local devices, some may represent a toy data set for validation, and some may represent a full data set for training.
  • Authentication is in place: data scientists should have access to the right data without having to enter credentials in their code.

All of these conditions are critical and must be managed through collaboration between ML engineers and DevOps engineers.
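One way to keep connection details out of notebook code is to have the environment supply them. The sketch below is a minimal illustration using only the standard library. The variable names `EXPERIMENT_STORAGE_URL` and `EXPERIMENT_CONTAINER` are hypothetical, and the actual authentication mechanism (for example, a platform-managed identity) is something to agree on with the DevOps engineers.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical setting names; real projects define their own.
REQUIRED_SETTINGS = ("EXPERIMENT_STORAGE_URL", "EXPERIMENT_CONTAINER")


@dataclass(frozen=True)
class StorageSettings:
    account_url: str
    container: str


def load_storage_settings(env: Optional[Mapping[str, str]] = None) -> StorageSettings:
    """Read storage settings from the environment instead of the notebook."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_SETTINGS if not env.get(key)]
    if missing:
        # Fail early with a clear message rather than prompting for secrets.
        raise RuntimeError(f"Missing environment settings: {', '.join(missing)}")
    return StorageSettings(
        account_url=env["EXPERIMENT_STORAGE_URL"],
        container=env["EXPERIMENT_CONTAINER"],
    )
```

A notebook would call `load_storage_settings()` once at the top, so the same code runs unchanged on a local machine, a compute instance, or in a pipeline.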

As soon as data scientists have a stable idea, it's important to collaborate with DevOps engineers to make sure all needed code artifacts are in place to start migrating notebooks into pure Python code. The following steps must take place:

  • The number of ML flows should be identified: each ML pipeline has its own operationalization aspects, and DevOps engineers have to find a way to add all of these aspects into CI/CD.
  • Placeholders for ML pipelines should be created: each ML pipeline has its own folder structure and the ability to use some common code in the repo. A placeholder is a folder with some code and configuration files that represents an empty ML pipeline. Placeholders can be added to CI/CD right away, even before migration starts.

The following diagram shows the stages discussed above:

Figure: Collaboration with DevOps Engineers

Start with methods

It doesn't take much effort to wrap your code into methods (or even classes) as soon as it's possible. Later, it will be easy to move the methods from the experimentation notebook into pure Python files.

When implementing methods in the experimentation notebook, make sure that they don't depend on global variables and rely on parameters instead. Relying on parameters makes it possible to port the code outside of the notebook as is.
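The difference can be sketched with a small, made-up example. The function name and the threshold are purely illustrative. The global-dependent version is shown only as a comment for contrast.

```python
from typing import List

# Notebook-style version that silently depends on a global (avoid):
#
#   THRESHOLD = 0.5
#   def keep_confident(scores):
#       return [s for s in scores if s >= THRESHOLD]


def keep_confident(scores: List[float], threshold: float) -> List[float]:
    """Return the scores at or above `threshold`.

    Everything the function needs arrives through its parameters, so it can
    be moved into a .py module without bringing notebook state along.
    """
    return [s for s in scores if s >= threshold]


filtered = keep_confident([0.2, 0.7, 0.5], threshold=0.5)
```

Because the function has no hidden inputs, it is also straightforward to unit test once it moves out of the notebook.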

Move stable code outside notebooks as soon as possible

If there are some methods/classes to share between notebooks or methods that are not going to be affected much in future iterations, it's worthwhile to move the methods into Python files and import them into the notebook.

At this point, it's important to work with a DevOps engineer to make sure that linting and other checks are in place. If you begin moving code from notebooks to scripts inside the experimentation folder, it is okay to only apply linting. However, if you move your code into a main code folder (like src), it's important to follow all other rules, such as unit testing and code coverage requirements.

Clean notebooks

Cleaning up notebooks before a commit is also good practice. Cell outputs might contain personal data or irrelevant information that complicates review. Work with a DevOps engineer to apply some rules and pre-commit hooks. For example, you can use nb-clean.
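To make concrete what "cleaning" means, the sketch below strips outputs and execution counts from a notebook using only the standard library. It relies on the notebook JSON layout (a `cells` list in which code cells carry `outputs` and `execution_count`). It is an illustration of the idea, not how nb-clean itself is implemented, and the function names are made up for this example.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def clean_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` without outputs."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(strip_outputs(notebook), handle, indent=1)
        handle.write("\n")
```

In practice a maintained tool wired into a pre-commit hook is preferable, since it also handles metadata and edge cases that this sketch ignores.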

Executing all notebook cells in sequence should work

Broken notebooks are a common problem when software engineers help data scientists migrate their notebooks into a pipeline using a selected framework. This is especially true for new ML platforms (Azure ML v2) or for complex frameworks, like Kubeflow, that require a lot of code to implement a basic pipeline. For this reason, it's important that a software engineer can execute every experimentation notebook in order, including all cells, or that the notebook includes clear guidance on why a given cell is optional.
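A lightweight heuristic for spotting notebooks that were not run top to bottom is to check that the saved execution counts of the code cells are 1, 2, 3, and so on. The sketch below assumes the standard notebook JSON layout and a hypothetical function name. It has to run before outputs are stripped, and it complements rather than replaces re-executing the notebook in a clean kernel.

```python
from typing import List, Optional


def executed_in_order(notebook: dict) -> bool:
    """True if non-empty code cells carry execution counts 1, 2, 3, ... in order."""
    counts: List[Optional[int]] = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        # `source` may be a string or a list of strings; empty cells are skipped.
        if cell.get("cell_type") == "code" and "".join(cell.get("source", "")).strip()
    ]
    return counts == list(range(1, len(counts) + 1))
```

A notebook that fails this check was either run out of order or only partially, which is a useful prompt for a conversation before migration starts.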

Benefits of using selected ML platform

If you start using Azure ML or a similar technology during experimentation, the migration is likely to be smoother. For example, the speed benefits gained from using parallel compute in Azure ML may be worth the small amount of additional code required to configure it. Collaborating with software engineers can help you identify the most valuable features of the selected ML framework and use them from the beginning.

Use Exploratory Data Analysis (EDA) to explore and understand your data

Every machine learning project requires a deep understanding of the data to determine whether the data is representative of the problem. A systematic approach to understanding the data should be undertaken to ensure project success. This understanding typically develops during the Exploratory Data Analysis (EDA) phase. It is a complex part of an AI project in which data cleansing takes place and outliers are identified. In this phase, the suitability of the data is assessed to inform hypothesis generation and experimentation. EDA is typically a creative process used to find answers to questions about the data.

EDA relates to using AI to analyze and structure data, with the aim of better understanding the data, including:

  • Its distribution, outliers, and completeness
  • Its relevance to solving a specific business problem using ML
  • Whether it contains PII data, and whether redaction or obfuscation will be required
  • How it can be used to generate synthetic data if necessary

Also, as part of this phase, data suitability is assessed for hypothesis generation and experimentation.
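A first pass over distribution, outliers, and completeness can be sketched for a single numeric column as follows. The function name, the 1.5 × IQR outlier rule, and the sample values are assumptions chosen for illustration. Real EDA would typically use a dataframe library and look at many more properties.

```python
from statistics import quantiles
from typing import Dict, List, Optional


def profile_column(values: List[Optional[float]]) -> Dict[str, object]:
    """Summarize completeness, spread, and IQR-based outliers for one column."""
    present = [v for v in values if v is not None]
    q1, median, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "completeness": len(present) / len(values),  # share of non-missing values
        "min": min(present),
        "median": median,
        "max": max(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Illustrative column with one missing value and one suspicious reading.
profile = profile_column([10.0, 12.0, None, 11.0, 13.0, 12.0, 95.0, 11.0])
```

Here the profile reports that one of eight values is missing and flags 95.0 as a candidate outlier. Whether such a value is an error or a genuine rare event is a question for the data owner.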

The following image illustrates the various phases, their respective complexity and roles during a typical machine learning project:

Figure: Stages and roles

Hypothesis driven development and experiment tracking 🧪

Code written during EDA may not make it to production, but treating it as production code is a best practice. Doing so provides an audit trail and represents the investment made to determine the correct ML solution as part of a hypothesis-driven development approach.

It allows teams to not only reproduce the experiments but also to learn from past lessons, saving time and associated development costs.

Data versioning

In an ML project, it is vital to track the data that experiments are run on and that models are trained on. Tracking this relationship, or lineage, between the data and the ML solution and experiments supports repeatability, improves interpretability, and is an engineering best practice.

During iterative labeling, data ingestion, and experimentation, it is critical that experiments can be recreated. Experiments may need to be recreated for many reasons. Examples include validating findings between different data scientists, or regulatory reasons like bias audits of a model. These audits may happen years after a model is deployed. To replicate the creation of a given model, it is not enough to reuse the same hyperparameters; the same training data must be used as well. In some cases, this is where Feature Management Systems come in. In other cases, it may make more sense to implement data versioning in some other capacity (on its own, or with a Feature Management System).

ℹ️ Refer to the Data: Data Lineage section for more information nuanced to a Data Engineer/Governance role.

Basic implementation in Azure ML

A common pattern adopted in customer engagements follows these guidelines:

  • Data is stored in blob storage, organized by ingestion date
  • Raw data is treated as immutable
  • Azure ML Data assets are used as training data inputs
  • All Data assets are tagged with date ranges related to their source
  • Different versions of Data assets point to different date folders

With these guidelines in place, the guidance in the Azure ML Docs for Versioning and Tracking ML Datasets can be followed. The current docs reference the v1 SDK, but the concept translates to v2 as well.
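The guidelines above can be sketched as a small helper that describes each data asset version before it is registered. This is deliberately platform-neutral: `DataAssetSpec`, `asset_for_ingestion`, the asset name, and the folder layout are all hypothetical, and the call that actually registers the asset with Azure ML is left out.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class DataAssetSpec:
    """Plain description of one data asset version (hypothetical structure)."""
    name: str
    version: str
    path: str
    tags: Dict[str, str] = field(default_factory=dict)


def asset_for_ingestion(name: str, version: int, root: str, ingestion_date: date,
                        source_start: date, source_end: date) -> DataAssetSpec:
    """Describe a version that points at one immutable ingestion-date folder."""
    folder = f"{root.rstrip('/')}/{ingestion_date.isoformat()}/"
    tags = {
        "source_start": source_start.isoformat(),
        "source_end": source_end.isoformat(),
    }
    return DataAssetSpec(name=name, version=str(version), path=folder, tags=tags)


# Two versions of the same asset, each pointing at a different date folder.
v1 = asset_for_ingestion("sensor-readings", 1, "raw/sensors",
                         date(2024, 2, 1), date(2024, 1, 1), date(2024, 1, 31))
v2 = asset_for_ingestion("sensor-readings", 2, "raw/sensors",
                         date(2024, 3, 1), date(2024, 2, 1), date(2024, 2, 29))
```

Because raw folders are never modified, recording the asset name and version with a training run is enough to find the exact data again later.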

Some useful resources for data registration and versioning:

Resource | Description
Azure ML Python SDK v2 to train a symbol detection using P&ID synthetic dataset | This project provides a sample implementation of an AML workflow to train a symbol detection model. It includes a ‘data registration’ step that demonstrates registering a data asset for later use in AML, including stratified splitting techniques for consistent feature representation across the training and validation sets.

Third-party tools

Third-party data versioning tools, such as DVC, also provide full data versioning and tracking, at the cost of the overhead of adopting their tool set. They may work well for customers who are in a greenfield state and looking to build a robust MLOps pipeline, but they require more work to integrate with existing data pipelines.

Considerations for small datasets

Relatively small datasets (a few hundred MB or less) that change slowly are candidates for versioning within the project repository via Git Large File Storage (LFS). The commit of the repo at the point of model training then also specifies the version of the training data used. Advantages include simplicity of setup and cross-compatibility with different environments, such as local, Databricks, and Azure ML compute resources. Disadvantages include a large repository size and the need to record the repository commit at which training occurs.

Drift

Data evolves over time, and the key statistics of the data used to train an ML model might change relative to the real data used for prediction. This drift degrades the performance of a model over time. To detect it, periodically compare and evaluate both datasets (training and inference data). With enough historical data, this activity can be integrated into the Exploratory Data Analysis phase to understand expected drift characteristics.
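One way to compare the two datasets is the Population Stability Index (PSI), sketched below for a single numeric feature. PSI is a choice made for this example, not something prescribed here; other statistics or tests may suit a given project better. The bin count, the small floor applied to empty bins, and the toy data are likewise assumptions, and alert thresholds are project-specific.

```python
import math
from typing import List, Sequence


def bin_edges(reference: Sequence[float], bins: int = 5) -> List[float]:
    """Interior edges of equal-width bins spanning the reference data."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins
    return [low + i * width for i in range(1, bins)]


def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values per bin; a small floor avoids log(0) for empty bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # The bin index is the number of edges the value has reached.
        counts[sum(v >= e for e in edges)] += 1
    return [max(c / len(values), 1e-6) for c in counts]


def population_stability_index(reference: Sequence[float],
                               current: Sequence[float], bins: int = 5) -> float:
    """PSI = sum((cur - ref) * ln(cur / ref)) over bins fixed by the reference."""
    edges = bin_edges(reference, bins)
    ref = bin_shares(reference, edges)
    cur = bin_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Toy example: the same feature at training time and after a shift.
training = [float(i % 50) for i in range(500)]
same = population_stability_index(training, training)
shifted = population_stability_index(training, [v + 20.0 for v in training])
```

Fixing the bin edges from the training data is what makes the stored reference distribution reusable: later inference batches are always compared against the same bins.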

Experiment tracking

When tracking experiments, the goal is to not only be able to formulate a hypothesis and track the results of the experiment, but to also be able to fully reproduce the experiment if necessary, whether during this project or in a future related project.

To reproduce an experiment, track at least the following:

  • Code and scripts used for the experiment
  • Environment configuration files (Docker is a great way to reliably and easily reproduce an environment)
  • Data versions used for training and evaluation
  • Hyperparameters and other parameters
  • Evaluation metrics
  • Model weights
  • Visualizations and reports such as confusion matrices
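The items above can be gathered into a single run record, as in the sketch below. The field names and the JSON-file storage are assumptions made to keep the example self-contained. An experiment tracking service would normally store this information instead.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def build_run_record(hypothesis: str, code_version: str, data_version: str,
                     params: Dict[str, object], metrics: Dict[str, float],
                     artifacts: Dict[str, str]) -> Dict[str, object]:
    """Collect the minimum information needed to reproduce one experiment."""
    record: Dict[str, object] = {
        "hypothesis": hypothesis,
        "code_version": code_version,  # for example, a git commit hash
        "data_version": data_version,  # for example, a data asset name and version
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "params": params,
        "metrics": metrics,
        "artifacts": artifacts,  # paths to weights, plots, and reports
    }
    # Hash only the inputs, so reruns of the same configuration share an id.
    inputs = {key: record[key] for key in ("code_version", "data_version", "params")}
    payload = json.dumps(inputs, sort_keys=True)
    record["config_id"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    record["recorded_at"] = datetime.now(timezone.utc).isoformat()
    return record


def save_run_record(record: Dict[str, object], directory: str) -> Path:
    """Write the record as JSON and return the file path."""
    target = Path(directory) / f"run-{record['config_id']}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return target
```

Deriving the identifier from code version, data version, and parameters means two runs with the same inputs are recognizably the same experiment, even if their metrics differ slightly.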

Implementations

Here are some implementations that help illustrate how EDA is performed on real-world projects:

Setting up a Development Environment

Refer to Experimentation in Azure ML for guidance on conducting experiments in Azure ML and best practices to set up the development environment.

Working with unstructured data

The Data Discovery solution for unstructured data aims to quickly provide structured views on your text, images, and videos. It does so at scale, using Synapse and unsupervised ML techniques that exploit state-of-the-art deep learning models.

The goal is to quickly present this data to a business user or data owner through Power BI visualization and to facilitate discussion. The customer and team can then decide the next best action with the data, identify outliers, or generate a training data set for a supervised model.

Another goal is to help simplify and accelerate the complex Exploratory Data Analysis phase of the project by democratizing common data science functions. It also accelerates your project so that you can focus more on the business problem you are trying to solve.

Working with structured data

Feature Engineering

Data has to be formulated and consumed in an input format that is expected by the underlying model. These inputs are generated by transforming, engineering, and enriching the data, and are called features. Feature engineering is the process of using domain knowledge to supplement, cull, or create new features to aid the machine learning process, with the goal of increasing the underlying model's predictive power. ML models can also learn how to represent data. For example, a deep learning model can learn embeddings during training: a lower-dimensional representation of the data passing through the model.

For more information on embeddings, refer to Understanding embeddings in Azure OpenAI Service.
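As a small illustration of hand-crafted feature engineering, the sketch below turns one raw record into numeric model inputs using domain knowledge (time of day, weekends, and price per unit). The record layout and feature names are invented for this example.

```python
from datetime import datetime
from typing import Dict


def engineer_features(order: Dict[str, object]) -> Dict[str, float]:
    """Turn one raw order record into numeric model inputs."""
    placed = datetime.fromisoformat(str(order["placed_at"]))
    quantity = int(order["quantity"])
    total = float(order["total"])
    return {
        "hour_of_day": float(placed.hour),
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "is_weekend": 1.0 if placed.weekday() >= 5 else 0.0,
        # A derived ratio can carry more signal than either raw value alone.
        "unit_price": total / quantity if quantity else 0.0,
        "is_express": 1.0 if order["shipping"] == "express" else 0.0,
    }


raw = {"placed_at": "2024-03-09T14:30:00", "quantity": 4, "total": 50.0, "shipping": "express"}
features = engineer_features(raw)
```

Keeping the transformation in one pure function means the same code can be reused at training time and at inference time, which avoids a mismatch between the two.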

Responsible AI

We should always ensure that ML solutions follow Responsible AI best practices.

For more detail, refer to Responsible AI.

Synthetic data generation

Synthetic data serves two purposes: protecting sensitive data and providing more data in data-poor scenarios. Sensitive data is often necessary to develop ML solutions, but using it can put that data at risk of disclosure. In other scenarios, there is insufficient data to explore modeling approaches, and acquiring more data is cost- or time-prohibitive. In both instances, synthetic data can provide a safe and cost-effective resource for model training, evaluation, and testing.

For more detail, refer to Synthetic data concepts.
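To give a feel for the data-poor scenario, the sketch below fits a mean and standard deviation per numeric column and samples new rows from independent normal distributions. This is a deliberately naive approach chosen for brevity: it ignores correlations between columns and offers no formal privacy guarantee, so it should not be read as a recommended method for protecting sensitive data. All names and values are illustrative.

```python
import random
from statistics import mean, stdev
from typing import Dict, List, Tuple


def fit_column_stats(rows: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Estimate a mean and standard deviation for each numeric column."""
    columns = list(rows[0].keys())
    return {
        column: (mean([row[column] for row in rows]), stdev([row[column] for row in rows]))
        for column in columns
    }


def sample_synthetic(stats: Dict[str, Tuple[float, float]], n: int,
                     seed: int = 0) -> List[Dict[str, float]]:
    """Draw `n` rows, sampling each column independently from its normal fit."""
    rng = random.Random(seed)  # seeded so that experiments are reproducible
    return [
        {column: rng.gauss(mu, sigma) for column, (mu, sigma) in stats.items()}
        for _ in range(n)
    ]


# Tiny illustrative "real" dataset.
real = [
    {"age": 30.0, "income": 40.0},
    {"age": 40.0, "income": 60.0},
    {"age": 50.0, "income": 80.0},
]
stats = fit_column_stats(real)
synthetic = sample_synthetic(stats, n=2000, seed=42)
```

Seeding the generator matters for the same reason data versioning does: an experiment trained on synthetic data can only be reproduced if the synthetic data can be regenerated exactly.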

Other resources

  • The Data Science Toolkit is an open-source collection of proven ML and AI implementation accelerators. Accelerators enable the automation of commonly repeated development processes to allow data science practitioners to focus on delivering complex business value and spend less time on basic setup.