Run tests with pytest using the Databricks extension for Visual Studio Code

10/15/2024

This article describes how to run tests using pytest with the Databricks extension for Visual Studio Code. See "What is the Databricks extension for Visual Studio Code?".

You can run pytest on local code that does not need a connection to a cluster in a remote Azure Databricks workspace. For example, you can use pytest to test functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see Get Started in the pytest documentation.

To run pytest on code in a remote Azure Databricks workspace, do the following in your Visual Studio Code project:

Step 1: Create the tests

Add a Python file with the following code, which contains the tests to run. This example assumes that the file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster's SparkSession (the entry point to Spark functionality on the cluster) available to the tests. It also contains a single test that checks whether the specified cell in the table contains the specified value. You can add your own tests to this file as needed.

```python
from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
  # Create a SparkSession (the entry point to Spark functionality) on
  # the cluster in the remote Databricks workspace. Unit tests do not
  # have access to this SparkSession by default.
  return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+------+------+
# |_c0 | carat | cut   | color | clarity | depth | table | price | x    | y    | z    |
# +----+-------+-------+-------+---------+-------+-------+-------+------+------+------+
# | 1  | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98 | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+------+------+
# ...
#
def test_spark(spark):
  spark.sql('USE default')
  data = spark.sql('SELECT * FROM diamonds')
  assert data.collect()[0][2] == 'Ideal'
```

Step 2: Create the pytest runner

Add a Python file with the following code, which instructs pytest to run the tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.

```python
import pytest
import os
import sys

# Run all tests in the connected directory in the remote Databricks workspace.
# By default, pytest searches for tests in all files with filenames matching
# "test_*.py" or "*_test.py". Within each of these files, pytest runs each
# function with a function name beginning with "test".

# Get the path to the directory for this file in the workspace.
dir_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the root directory.
os.chdir(dir_root)

# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True

# Now run pytest from the root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
```
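
As a rough illustration of which files the runner is likely to pick up, the following sketch lists the files under a directory whose names match pytest's default test-file patterns (`test_*.py` and `*_test.py`). The helper name `find_test_files` is hypothetical and is not part of pytest or the Databricks extension. Real pytest collection also honors configuration files and command-line options, so treat this only as an approximation.

```python
from fnmatch import fnmatch
from pathlib import Path

# pytest's default filename patterns for test modules, assuming the
# python_files option has not been overridden in a pytest config file.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def find_test_files(root: str) -> list[str]:
    """Return sorted, root-relative paths of files matching the default patterns.

    Hypothetical helper for illustration only; it does not invoke pytest.
    """
    root_path = Path(root)
    matches = []
    for path in root_path.rglob("*.py"):
        if any(fnmatch(path.name, pattern) for pattern in DEFAULT_PATTERNS):
            matches.append(path.relative_to(root_path).as_posix())
    return sorted(matches)
```

Calling `find_test_files(".")` from the project root in this example should include spark_test.py but not pytest_databricks.py, because the runner's filename matches neither pattern.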

Step 3: Create a custom run configuration

To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration to create your own custom run configuration, as follows:

On the main menu, click Run > Add configuration.
In the Command Palette, select Databricks.

Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.
Change the starter run configuration as follows, and then save the file:
- Change this run configuration's name from Run on Databricks to a unique display name for this configuration, in this example Unit Tests (on Databricks).
- Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.
- Change args from [] to the path in the project that contains the files with your tests, in this example ["."].
Your launch.json file should look similar to the following:
```
{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks",
      "request": "launch",
      "name": "Unit Tests (on Databricks)",
      "program": "${workspaceFolder}/pytest_databricks.py",
      "args": ["."],
      "env": {}
    }
  ]
}
```
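
If you would rather generate this configuration than edit it by hand, the following sketch builds an equivalent document as a Python dictionary and serializes it as JSON. The function name `make_launch_config` is hypothetical; the keys and values are copied from the launch.json example above. Because launch.json may contain comments, which Python's `json` module cannot parse, this sketch only produces output and does not attempt to read or merge an existing file.

```python
import json


def make_launch_config(name: str, program: str, args: list[str]) -> dict:
    """Build a launch.json document with a single Databricks run configuration.

    Hypothetical helper; the structure mirrors the example in this article.
    """
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "type": "databricks",
                "request": "launch",
                "name": name,
                "program": program,
                "args": args,
                "env": {},
            }
        ],
    }


config = make_launch_config(
    name="Unit Tests (on Databricks)",
    program="${workspaceFolder}/pytest_databricks.py",
    args=["."],
)
launch_json = json.dumps(config, indent=2)
print(launch_json)
```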

Step 4: Run the tests

First make sure that pytest is installed on the cluster. For example, with the cluster's settings page open in your Azure Databricks workspace, do the following:

On the Libraries tab, if pytest is visible, then pytest is already installed. If pytest is not visible, click Install new.
For Library Source, click PyPI.
For Package, enter pytest.
Click Install.
Wait until Status changes from Pending to Installed.
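
As an additional check, assuming you can run Python code on the cluster (for example, in a notebook cell attached to it), the following sketch reports whether a package can be imported in the current environment. The helpers `is_installed` and `installed_version` are hypothetical and use only the Python standard library. They check what the running interpreter can see, not how the library was installed.

```python
from importlib import metadata
from importlib.util import find_spec
from typing import Optional


def is_installed(package: str) -> bool:
    """Return True if the named top-level package can be located for import."""
    return find_spec(package) is not None


def installed_version(package: str) -> Optional[str]:
    """Return the installed distribution version, or None if it is not found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on pytest in whichever environment this code runs.
print("pytest importable:", is_installed("pytest"))
print("pytest version:", installed_version("pytest"))
```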

To run the tests, do the following from your Visual Studio Code project:

On the main menu, click View > Run.
In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.
Click the green arrow (Start Debugging) icon.

The pytest results display in the Debug Console (View > Debug Console on the main menu). For example, these results show that one test was found in the spark_test.py file, and the dot (.) means that this test passed. (A failing test would show an F instead.)

```
<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Workspace/path/to/directory ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/path/to/directory
collected 1 item

spark_test.py .                                                          [100%]

============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)
```
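
To make the progress characters concrete, the following sketch counts the outcomes on a single per-file progress line from pytest's default (non-verbose) output. The function name `summarize_progress_line` and the regular expression are assumptions made for illustration; pytest does not provide this helper, and its output layout can vary with options and plugins.

```python
import re

# Progress characters that pytest prints per test in its default output.
OUTCOMES = {
    ".": "passed",
    "F": "failed",
    "E": "error",
    "s": "skipped",
    "x": "xfailed",
    "X": "xpassed",
}

# Assumed line shape: "<file>.py <progress chars>   [NN%]"
PROGRESS_LINE = re.compile(
    r"^(?P<file>\S+\.py)\s+(?P<chars>[.FEsxX]+)\s*(?:\[\s*\d+%\])?\s*$"
)


def summarize_progress_line(line: str) -> dict:
    """Count outcomes on one progress line; return {} if the line does not match."""
    match = PROGRESS_LINE.match(line)
    if match is None:
        return {}
    counts = {}
    for char in match.group("chars"):
        outcome = OUTCOMES[char]
        counts[outcome] = counts.get(outcome, 0) + 1
    return {"file": match.group("file"), "counts": counts}
```

For the progress line in the output above, this returns one passed test for spark_test.py. Lines that are not progress lines, such as the "collected 1 item" line, yield an empty dictionary.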
