자습서: 로컬에서 dbt 모델 만들기, 실행 및 테스트

아티클
08/02/2024

이 자습서에서는 dbt 모델을 로컬로 만들고, 실행하고, 테스트하는 방법을 안내합니다. Azure Databricks 작업 작업으로 dbt 프로젝트를 실행할 수도 있습니다. 자세한 내용은 Azure Databricks 작업에서 dbt 변환 사용을 참조 하세요.

시작하기 전에

이 자습서를 수행하려면 먼저 Azure Databricks 작업 영역을 dbt Core에 연결해야 합니다. 자세한 내용은 dbt Core에 연결을 참조하세요.

1단계: 모델 만들기 및 실행

이 단계에서는 즐겨 찾는 텍스트 편집기를 사용하여 동일한 데이터베이스의 기존 데이터를 기반으로 데이터베이스에 새 뷰(기본값) 또는 새 테이블을 만드는 문인 모델을select 만듭니다. 이 절차에서는 샘플 데이터 세트의 샘플 diamonds 테이블을 기반으로 모델을 만듭니다.

다음 코드를 사용하여 이 테이블을 만듭니다.

DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")

프로젝트의 models 디렉터리에서 다음 SQL 문을 사용하여 diamonds_four_cs.sql 파일을 만듭니다. 이 문은 diamonds 테이블에서 각 다이아몬드의 캐럿, 컷, 색상 및 투명도 세부 정보만 선택합니다. config 블록은 이 문을 기반으로 데이터베이스에 테이블을 만들도록 dbt에 지시합니다.
```
{{ config(
  materialized='table',
  file_format='delta'
) }}
```
```
select carat, cut, color, clarity
from diamonds
```
팁

델타 파일 형식 및 증분 전략 사용과 같은 추가 config 옵션은 dbt 설명서의 Databricks 구성을 참조 merge 하세요.
프로젝트의 models 디렉터리에서 다음 SQL 문을 사용하여 diamonds_list_colors.sql 파일을 만듭니다. 이 명령문은 diamonds_four_cs 테이블의 colors 열에서 고유한 값을 선택하여 결과를 알파벳 순서로 처음부터 마지막까지 정렬합니다. config 블록이 없기 때문에 이 모델은 dbt에게 이 명령문을 기반으로 데이터베이스에 보기를 만들도록 지시합니다.
```
select distinct color
from {{ ref('diamonds_four_cs') }}
sort by color asc
```
프로젝트의 models 디렉터리에서 다음 SQL 문을 사용하여 diamonds_prices.sql 파일을 만듭니다. 이 명령문은 다이아몬드 가격을 색상별로 평균화하여 평균 가격을 기준으로 가장 높은 값에서 가장 낮은 값까지 결과를 정렬합니다. 이 모델은 dbt에 이 문을 기반으로 데이터베이스에 뷰를 만들도록 지시합니다.
```
select color, avg(price) as price
from diamonds
group by color
order by price desc
```

가상 환경이 활성화된 상태에서 dbt run 명령을 위의 세 파일의 경로와 함께 실행합니다. (profiles.yml 파일에 지정된 대로) default 데이터베이스에서 dbt는 diamonds_four_cs라는 테이블 하나와 diamonds_list_colors 및 diamonds_prices라는 2개의 뷰를 만듭니다. dbt는 관련 .sql 파일 이름으로부터 뷰 및 테이블 이름을 가져옵니다.

dbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sql

...
... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]
... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]
... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]
... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]
... | 3 of 3 START view model default.diamonds_prices...................... [RUN]
... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]
... |
... | Finished running 1 table model, 2 view models ...

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

다음 SQL 코드를 실행하여 새 보기에 대한 정보를 나열하고 테이블 및 보기에서 모든 행을 선택합니다.

클러스터에 연결하는 경우 SQL을 Notebook의 기본 언어로 지정하여 클러스터에 연결된 Notebook에서 이 SQL 코드를 실행할 수 있습니다. SQL 웨어하우스에 연결하는 경우 쿼리에서 이 SQL 코드를 실행할 수 있습니다.

SHOW views IN default;

+-----------+----------------------+-------------+
| namespace | viewName             | isTemporary |
+===========+======================+=============+
| default   | diamonds_list_colors | false       |
+-----------+----------------------+-------------+
| default   | diamonds_prices      | false       |
+-----------+----------------------+-------------+

SELECT * FROM diamonds_four_cs;

+-------+---------+-------+---------+
| carat | cut     | color | clarity |
+=======+=========+=======+=========+
| 0.23  | Ideal   | E     | SI2     |
+-------+---------+-------+---------+
| 0.21  | Premium | E     | SI1     |
+-------+---------+-------+---------+
...

SELECT * FROM diamonds_list_colors;

+-------+
| color |
+=======+
| D     |
+-------+
| E     |
+-------+
...

SELECT * FROM diamonds_prices;

+-------+---------+
| color | price   |
+=======+=========+
| J     | 5323.82 |
+-------+---------+
| I     | 5091.87 |
+-------+---------+
...

2단계: 더 복잡한 모델 만들기 및 실행

이 단계에서는 관련 데이터 테이블 집합에 대해 더 복잡한 모델을 만듭니다. 이 데이터 테이블에는 한 시즌 동안 6차례의 경기를 플레이하는 세 팀의 가상 스포츠 리그에 대한 정보가 포함되어 있습니다. 이 절차에서는 데이터 테이블을 만들고, 모델을 만들고, 모델을 실행합니다.

다음 SQL 코드를 실행하여 필요한 데이터 테이블을 만듭니다.

이 단계의 테이블과 뷰는 이 예에서 식별하는 데 도움이 되도록 zzz_로 시작합니다. 자체 테이블 및 뷰에서는 이 패턴을 따를 필요는 없습니다.

DROP TABLE IF EXISTS zzz_game_opponents;
DROP TABLE IF EXISTS zzz_game_scores;
DROP TABLE IF EXISTS zzz_games;
DROP TABLE IF EXISTS zzz_teams;

CREATE TABLE zzz_game_opponents (
game_id INT,
home_team_id INT,
visitor_team_id INT
) USING DELTA;

INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
INSERT INTO zzz_game_opponents VALUES (6, 3, 2);

-- Result:
-- +---------+--------------+-----------------+
-- | game_id | home_team_id | visitor_team_id |
-- +=========+==============+=================+
-- | 1       | 1            | 2               |
-- +---------+--------------+-----------------+
-- | 2       | 1            | 3               |
-- +---------+--------------+-----------------+
-- | 3       | 2            | 1               |
-- +---------+--------------+-----------------+
-- | 4       | 2            | 3               |
-- +---------+--------------+-----------------+
-- | 5       | 3            | 1               |
-- +---------+--------------+-----------------+
-- | 6       | 3            | 2               |
-- +---------+--------------+-----------------+

CREATE TABLE zzz_game_scores (
game_id INT,
home_team_score INT,
visitor_team_score INT
) USING DELTA;

INSERT INTO zzz_game_scores VALUES (1, 4, 2);
INSERT INTO zzz_game_scores VALUES (2, 0, 1);
INSERT INTO zzz_game_scores VALUES (3, 1, 2);
INSERT INTO zzz_game_scores VALUES (4, 3, 2);
INSERT INTO zzz_game_scores VALUES (5, 3, 0);
INSERT INTO zzz_game_scores VALUES (6, 3, 1);

-- Result:
-- +---------+-----------------+--------------------+
-- | game_id | home_team_score | visitor_team_score |
-- +=========+=================+====================+
-- | 1       | 4               | 2                  |
-- +---------+-----------------+--------------------+
-- | 2       | 0               | 1                  |
-- +---------+-----------------+--------------------+
-- | 3       | 1               | 2                  |
-- +---------+-----------------+--------------------+
-- | 4       | 3               | 2                  |
-- +---------+-----------------+--------------------+
-- | 5       | 3               | 0                  |
-- +---------+-----------------+--------------------+
-- | 6       | 3               | 1                  |
-- +---------+-----------------+--------------------+

CREATE TABLE zzz_games (
game_id INT,
game_date DATE
) USING DELTA;

INSERT INTO zzz_games VALUES (1, '2020-12-12');
INSERT INTO zzz_games VALUES (2, '2021-01-09');
INSERT INTO zzz_games VALUES (3, '2020-12-19');
INSERT INTO zzz_games VALUES (4, '2021-01-16');
INSERT INTO zzz_games VALUES (5, '2021-01-23');
INSERT INTO zzz_games VALUES (6, '2021-02-06');

-- Result:
-- +---------+------------+
-- | game_id | game_date  |
-- +=========+============+
-- | 1       | 2020-12-12 |
-- +---------+------------+
-- | 2       | 2021-01-09 |
-- +---------+------------+
-- | 3       | 2020-12-19 |
-- +---------+------------+
-- | 4       | 2021-01-16 |
-- +---------+------------+
-- | 5       | 2021-01-23 |
-- +---------+------------+
-- | 6       | 2021-02-06 |
-- +---------+------------+

CREATE TABLE zzz_teams (
team_id INT,
team_city VARCHAR(15)
) USING DELTA;

INSERT INTO zzz_teams VALUES (1, "San Francisco");
INSERT INTO zzz_teams VALUES (2, "Seattle");
INSERT INTO zzz_teams VALUES (3, "Amsterdam");

-- Result:
-- +---------+---------------+
-- | team_id | team_city     |
-- +=========+===============+
-- | 1       | San Francisco |
-- +---------+---------------+
-- | 2       | Seattle       |
-- +---------+---------------+
-- | 3       | Amsterdam     |
-- +---------+---------------+

프로젝트의 models 디렉터리에서 다음 SQL 문을 사용하여 zzz_game_details.sql 파일을 만듭니다. 이 문은 팀 이름 및 점수와 같은 각 게임의 세부 정보를 제공하는 테이블을 만듭니다. config 블록은 이 문을 기반으로 데이터베이스에 테이블을 만들도록 dbt에 지시합니다.

-- Create a table that provides full details for each game, including
-- the game ID, the home and visiting teams' city names and scores,
-- the game winner's city name, and the game date.

{{ config(
  materialized='table',
  file_format='delta'
) }}

-- Step 4 of 4: Replace the visitor team IDs with their city names.
select
  game_id,
  home,
  t.team_city as visitor,
  home_score,
  visitor_score,
  -- Step 3 of 4: Display the city name for each game's winner.
  case
    when
      home_score > visitor_score
        then
          home
    when
      visitor_score > home_score
        then
          t.team_city
  end as winner,
  game_date as date
from (
  -- Step 2 of 4: Replace the home team IDs with their actual city names.
  select
    game_id,
    t.team_city as home,
    home_score,
    visitor_team_id,
    visitor_score,
    game_date
  from (
    -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
    select
      g.game_id,
      go.home_team_id,
      gs.home_team_score as home_score,
      go.visitor_team_id,
      gs.visitor_team_score as visitor_score,
      g.game_date
    from
      zzz_games as g,
      zzz_game_opponents as go,
      zzz_game_scores as gs
    where
      g.game_id = go.game_id and
      g.game_id = gs.game_id
  ) as all_ids,
    zzz_teams as t
  where
    all_ids.home_team_id = t.team_id
) as visitor_ids,
  zzz_teams as t
where
  visitor_ids.visitor_team_id = t.team_id
order by game_date desc

프로젝트의 models 디렉터리에서 다음 SQL 문을 사용하여 zzz_win_loss_records.sql 파일을 만듭니다. 이 문은 시즌의 팀 승패 기록을 나열하는 보기를 만듭니다.

-- Create a view that summarizes the season's win and loss records by team.

-- Step 2 of 2: Calculate the number of wins and losses for each team.
select
  winner as team,
  count(winner) as wins,
  -- Each team played in 4 games.
  (4 - count(winner)) as losses
from (
  -- Step 1 of 2: Determine the winner and loser for each game.
  select
    game_id,
    winner,
    case
      when
        home = winner
          then
            visitor
      else
        home
    end as loser
  from {{ ref('zzz_game_details') }}
)
group by winner
order by wins desc

가상 환경이 활성화된 상태에서 dbt run 명령을 위의 두 파일의 경로와 함께 실행합니다. (profiles.yml 파일에 지정된 대로) default 데이터베이스에서 dbt는 zzz_game_details라는 테이블 하나와 zzz_win_loss_records라는 뷰 하나를 만듭니다. dbt는 관련 .sql 파일 이름으로부터 뷰 및 테이블 이름을 가져옵니다.

dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql

...
... | 1 of 2 START table model default.zzz_game_details.................... [RUN]
... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]
... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]
... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]
... |
... | Finished running 1 table model, 1 view model ...

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

다음 SQL 코드를 실행하여 새 뷰에 대한 정보를 나열하고 테이블 및 뷰의 모든 행을 선택합니다.

SHOW VIEWS FROM default LIKE 'zzz_win_loss_records';

+-----------+----------------------+-------------+
| namespace | viewName             | isTemporary |
+===========+======================+=============+
| default   | zzz_win_loss_records | false       |
+-----------+----------------------+-------------+

SELECT * FROM zzz_game_details;

+---------+---------------+---------------+------------+---------------+---------------+------------+
| game_id | home          | visitor       | home_score | visitor_score | winner        | date       |
+=========+===============+===============+============+===============+===============+============+
| 1       | San Francisco | Seattle       | 4          | 2             | San Francisco | 2020-12-12 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 2       | San Francisco | Amsterdam     | 0          | 1             | Amsterdam     | 2021-01-09 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 3       | Seattle       | San Francisco | 1          | 2             | San Francisco | 2020-12-19 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 4       | Seattle       | Amsterdam     | 3          | 2             | Seattle       | 2021-01-16 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 5       | Amsterdam     | San Francisco | 3          | 0             | Amsterdam     | 2021-01-23 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 6       | Amsterdam     | Seattle       | 3          | 1             | Amsterdam     | 2021-02-06 |
+---------+---------------+---------------+------------+---------------+---------------+------------+

SELECT * FROM zzz_win_loss_records;

+---------------+------+--------+
| team          | wins | losses |
+===============+======+========+
| Amsterdam     | 3    | 1      |
+---------------+------+--------+
| San Francisco | 2    | 2      |
+---------------+------+--------+
| Seattle       | 1    | 3      |
+---------------+------+--------+

3단계: 테스트 만들기 및 실행

이 단계에서는 모델에 대한 어설션인 테스트를 만듭니다. 이러한 테스트를 실행할 때 dbt는 프로젝트의 각 테스트의 통과/실패 여부를 알려줍니다.

두 가지 유형의 테스트가 있습니다. YAML에 적용된 스키마 테스트는 어설션을 전달하지 않는 레코드 수를 반환합니다. 이 숫자가 0이면 모든 레코드가 통과하므로 테스트가 통과합니다. 데이터 테스트는 통과하려면 0개의 레코드를 반환해야 하는 쿼리입니다.

프로젝트의 models 디렉터리에서 다음을 schema.yml 파일을 만듭니다. 이 파일에는 지정된 열에 고유한 값이 있는지, null이 아닌지, 지정된 값만 있는지 또는 조합이 있는지 여부를 결정하는 스키마 테스트가 포함되어 있습니다.

version: 2

models:
  - name: zzz_game_details
    columns:
      - name: game_id
        tests:
          - unique
          - not_null
      - name: home
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: visitor
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: home_score
        tests:
          - not_null
      - name: visitor_score
        tests:
          - not_null
      - name: winner
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: date
        tests:
          - not_null
  - name: zzz_win_loss_records
    columns:
      - name: team
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('zzz_game_details')
              field: home
      - name: wins
        tests:
          - not_null
      - name: losses
        tests:
          - not_null

프로젝트의 tests 디렉터리에서 다음 SQL 문을 사용하여 zzz_game_details_check_dates.sql 파일을 만듭니다. 이 파일에는 정규 시즌을 벗어나서 진행된 경기가 있는지 확인하는 데이터 테스트가 포함되어 있습니다.
```
-- This season's games happened between 2020-12-12 and 2021-02-06.
-- For this test to pass, this query must return no results.

select date
from {{ ref('zzz_game_details') }}
where date < '2020-12-12'
or date > '2021-02-06'
```
프로젝트의 tests 디렉터리에서 다음 SQL 문을 사용하여 zzz_game_details_check_scores.sql 파일을 만듭니다. 이 파일에는 점수가 부정적인지 또는 동점인 게임이 있는지 확인하기 위한 데이터 테스트가 포함되어 있습니다.
```
-- This sport allows no negative scores or tie games.
-- For this test to pass, this query must return no results.

select home_score, visitor_score
from {{ ref('zzz_game_details') }}
where home_score < 0
or visitor_score < 0
or home_score = visitor_score
```
프로젝트의 tests 디렉터리에서 다음 SQL 문을 사용하여 zzz_win_loss_records_check_records.sql 파일을 만듭니다. 이 파일에는 음수인 승패 기록이 있는 팀, 플레이한 경기보다 많은 승패 기록이 있는 팀, 허용된 것보다 많은 경기를 플레이한 팀이 있는지 확인하는 데이터 테스트가 포함되어 있습니다.
```
-- Each team participated in 4 games this season.
-- For this test to pass, this query must return no results.

select wins, losses
from {{ ref('zzz_win_loss_records') }}
where wins < 0 or wins > 4
or losses < 0 or losses > 4
or (wins + losses) > 4
```

가상 환경이 활성화되면 명령을 실행합니다 dbt test .

dbt test --models zzz_game_details zzz_win_loss_records

...
... | 1 of 19 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
... | 1 of 19 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]
...
... |
... | Finished running 19 tests ...

Completed successfully

Done. PASS=19 WARN=0 ERROR=0 SKIP=0 TOTAL=19

4단계: 정리

다음 SQL 코드를 실행하여 이 예제에 대해 만든 테이블과 뷰를 삭제할 수 있습니다.

DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;

DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;

문제 해결

Azure Databricks에서 dbt Core를 사용할 때 발생하는 일반적인 문제와 이를 해결하는 방법에 대한 자세한 내용은 dbt Labs 웹 사이트에서 도움말을 참조하세요.

다음 단계

dbt Core 프로젝트를 Azure Databricks 작업 작업으로 실행합니다. Azure Databricks 작업에서의 dbt 변환 사용을 참조하세요.

추가 리소스

dbt Labs 웹 사이트에서 다음 리소스를 살펴봅니다.

다음을 통해 공유

자습서: 로컬에서 dbt 모델 만들기, 실행 및 테스트

시작하기 전에

1단계: 모델 만들기 및 실행

2단계: 더 복잡한 모델 만들기 및 실행

3단계: 테스트 만들기 및 실행

4단계: 정리

문제 해결

다음 단계

추가 리소스

피드백

추가 리소스