Fine-tune Hugging Face models for a single GPU

This article describes how to fine-tune a Hugging Face model with the Hugging Face transformers library on a single GPU. It also includes Databricks-specific recommendations for loading data from the lakehouse and logging models to MLflow, which enables you to use and govern your models on Azure Databricks.

The Hugging Face transformers library provides the Trainer utility and Auto Model classes that enable loading and fine-tuning Transformers models.

These tools are available for the following tasks with simple modifications:

  • Loading models to fine-tune.
  • Constructing the configuration for the Hugging Face Transformers Trainer utility.
  • Performing training on a single GPU.

See What are Hugging Face Transformers?

Requirements

  • A single-node cluster with one GPU on the driver.
  • The GPU version of Databricks Runtime 13.0 ML and above.
    • This example for fine-tuning requires the 🤗 Transformers, 🤗 Datasets, and 🤗 Evaluate packages which are included in Databricks Runtime 13.0 ML and above.
  • MLflow 2.3.
  • Data prepared and loaded for fine-tuning a model with transformers.

Tokenize a Hugging Face dataset

Hugging Face Transformers models expect tokenized input, rather than the text in the downloaded data. To ensure compatibility with the base model, use an AutoTokenizer loaded from the base model. Hugging Face datasets allows you to directly apply the tokenizer consistently to both the training and testing data.

For example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model)
def tokenize_function(examples):
    return tokenizer(examples["text"], padding=False, truncation=True)

train_test_tokenized = train_test_dataset.map(tokenize_function, batched=True)

Set up the training configuration

Hugging Face training configuration tools can be used to configure a Trainer. The Trainer classes require the user to provide:

  • Metrics
  • A base model
  • A training configuration

You can configure evaluation metrics in addition to the default loss metric that the Trainer computes. The following example demonstrates adding accuracy as a metric:

import numpy as np
import evaluate
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Use the Auto Model classes for NLP to load the appropriate model for your task.

For text classification, use AutoModelForSequenceClassification to load a base model for text classification. When creating the model, provide the number of classes and the label mappings created during dataset preparation.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label
        )

Next, create the training configuration. The TrainingArguments class allows you to specify the output directory, evaluation strategy, learning rate, and other parameters.

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir=training_output_dir, evaluation_strategy="epoch")

Using a data collator batches input in training and evaluation datasets. DataCollatorWithPadding gives good baseline performance for text classification.

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer)

With all of these parameters constructed, you can now create a Trainer.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_test_dataset["train"],
    eval_dataset=train_test_dataset["test"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

Train and log to MLflow

Hugging Face interfaces well with MLflow and automatically logs metrics during model training using the MLflowCallback. However, you must log the trained model yourself.

Wrap training in an MLflow run. This constructs a Transformers pipeline from the tokenizer and the trained model, and writes it to local disk. Finally, log the model to MLflow with mlflow.transformers.log_model.

from transformers import pipeline

with mlflow.start_run() as run:
  trainer.train()
  trainer.save_model(model_output_dir)
  pipe = pipeline("text-classification", model=AutoModelForSequenceClassification.from_pretrained(model_output_dir), batch_size=1, tokenizer=tokenizer)
  model_info = mlflow.transformers.log_model(
        transformers_model=pipe,
        artifact_path="classification",
        input_example="Hi there!",
    )

If you don’t need to create a pipeline, you can submit the components that are used in training into a dictionary:

model_info = mlflow.transformers.log_model(
  transformers_model={"model": trainer.model, "tokenizer": tokenizer},
  task="text-classification",
  artifact_path="text_classifier",
  input_example=["MLflow is great!", "MLflow on Databricks is awesome!"],
)

Load the model for inference

When your model is logged and ready, loading the model for inference is the same as loading the MLflow wrapped pre-trained model.

logged_model = "runs:/{run_id}/{model_artifact_path}".format(run_id=run.info.run_id, model_artifact_path=model_artifact_path)

# Load model as a Spark UDF. Override result_type if the model does not return double values.
loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, result_type='string')

test = test.select(test.text, test.label, loaded_model_udf(test.text).alias("prediction"))
display(test)

See Deploy models using Mosaic AI Model Serving for more information.

Troubleshoot common CUDA errors

This section describes common CUDA errors and guidance on how to resolve them.

OutOfMemoryError: CUDA out of memory

When training large models, a common error you may encounter is the CUDA out of memory error.

Example:

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 666.34 MiB already allocated; 17.75 MiB free; 720.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

Try the following recommendations to resolve this error:

  • Reduce the batch size for training. You can reduce the per_device_train_batch_size value in TrainingArguments.

  • Use lower precision training. You can set fp16=True in TrainingArguments.

  • Use gradient_accumulation_steps in TrainingArguments to effectively increase overall batch size.

  • Use 8-bit Adam optimizer.

  • Clean up the GPU memory before training. Sometimes, GPU memory may be occupied by some unused code.

    from numba import cuda
    device = cuda.get_current_device()
    device.reset()
    

CUDA kernel errors

When running the training, you may get CUDA kernel errors.

Example:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging, consider passing CUDA_LAUNCH_BLOCKING=1.

To troubleshoot:

  • Try running the code on CPU to see if the error is reproducible.

  • Another option is to get a better traceback by setting CUDA_LAUNCH_BLOCKING=1:

    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    

Notebook: Fine-tune text classification on a single GPU

To get started quickly with example code, this example notebook provides an end-to-end example for fine-tuning a model for text classification. The subsequent sections of this article go into more detail around using Hugging Face for fine-tuning on Azure Databricks.

Fine-tuning Hugging Face text classification models notebook

Get notebook

Additional resources

Learn more about Hugging Face on Azure Databricks.