教學課程:在本機建立、執行及測試 dbt 模型
本教學課程將逐步引導您瞭解如何在本機建立、執行及測試 dbt 模型。 您也可以以 Azure Databricks 作業工作的形式執行 dbt 專案。 如需詳細資訊,請參閱 在 Azure Databricks 作業中使用 dbt 轉換。
開始之前
若要遵循本教學課程,您必須先將 Azure Databricks 工作區連線到 dbt Core。 如需詳細資訊,請參閱連線至 dbt Core。
步驟 1:建立和執行模型
在此步驟中,您會使用您慣用的文本編輯器來建立 模型,這些模型是 select
根據相同資料庫中的現有數據,建立新檢視(預設值)或資料庫中的新數據表的語句。 此程序根據範例資料集中的範例 diamonds
資料表建立模型。
使用下列程式碼建立此資料表。
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
在項目的
models
目錄中,使用下列 SQL 語句建立名為diamonds_four_cs.sql
的檔案。 此陳述式僅從diamonds
資料表中選取每顆寶石的克拉數、切工、色彩和透明度詳細資料。config
區塊指示 dbt 根據此陳述式在資料庫中建立資料表。{{ config( materialized='table', file_format='delta' ) }}
select carat, cut, color, clarity from diamonds
提示
如需其他
config
選項,例如使用 Delta 檔格式和merge
累加策略,請參閱 dbt 檔中的 Databricks 組態 。在項目的
models
目錄中,使用下列 SQL 語句建立名為diamonds_list_colors.sql
的第二個檔案。 此陳述式會從colors
資料表中的diamonds_four_cs
資料行中選取唯一值,依字母順序從頭到尾對結果進行排序。 由於沒有config
區塊,此模型將指示 dbt 根據此陳述式在資料庫中建立檢視。select distinct color from {{ ref('diamonds_four_cs') }} sort by color asc
在項目的
models
目錄中,使用下列 SQL 語句建立名為diamonds_prices.sql
的第三個檔案。 此陳述式依色彩計算寶石的平均價格,並依平均價格從最高到最低對結果進行排序。 此模型將指示 dbt 根據此陳述式在資料庫中建立檢視。select color, avg(price) as price from diamonds group by color order by price desc
啟用虛擬環境后,使用上述三個檔案的路徑執行
dbt run
命令。 在default
資料庫中(如 檔案中所profiles.yml
指定),dbt 會建立一個名為的數據表diamonds_four_cs
,以及名為diamonds_list_colors
和diamonds_prices
的兩個檢視表。 dbt 從相關的.sql
檔案名稱中取得這些檢視和資料表名稱。dbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sql
... ... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN] ... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...] ... | 2 of 3 START view model default.diamonds_list_colors................. [RUN] ... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...] ... | 3 of 3 START view model default.diamonds_prices...................... [RUN] ... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...] ... | ... | Finished running 1 table model, 2 view models ... Completed successfully Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
執行下列 SQL 程式碼,以列出有關新檢視的資訊並選取資料表和檢視中的所有資料列。
如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。
SHOW views IN default;
+-----------+----------------------+-------------+ | namespace | viewName | isTemporary | +===========+======================+=============+ | default | diamonds_list_colors | false | +-----------+----------------------+-------------+ | default | diamonds_prices | false | +-----------+----------------------+-------------+
SELECT * FROM diamonds_four_cs;
+-------+---------+-------+---------+ | carat | cut | color | clarity | +=======+=========+=======+=========+ | 0.23 | Ideal | E | SI2 | +-------+---------+-------+---------+ | 0.21 | Premium | E | SI1 | +-------+---------+-------+---------+ ...
SELECT * FROM diamonds_list_colors;
+-------+ | color | +=======+ | D | +-------+ | E | +-------+ ...
SELECT * FROM diamonds_prices;
+-------+---------+ | color | price | +=======+=========+ | J | 5323.82 | +-------+---------+ | I | 5091.87 | +-------+---------+ ...
步驟 2:建立和執行更複雜的模型
在此步驟中,您將為一組相關的資料表建立更複雜的模型。 這些資料表包含有關虛擬體育聯賽的資訊,其中三支隊伍需要在一個賽季中完成六場比賽。 此程序將建立資料表、建立模型並執行模型。
執行下列 SQL 程式碼以建立必要的資料表。
如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。
本步驟中的資料表和檢視以
zzz_
開頭,有助於在本範例中加以識別。 對於自己的資料表和檢視,無需遵循此模式。DROP TABLE IF EXISTS zzz_game_opponents; DROP TABLE IF EXISTS zzz_game_scores; DROP TABLE IF EXISTS zzz_games; DROP TABLE IF EXISTS zzz_teams; CREATE TABLE zzz_game_opponents ( game_id INT, home_team_id INT, visitor_team_id INT ) USING DELTA; INSERT INTO zzz_game_opponents VALUES (1, 1, 2); INSERT INTO zzz_game_opponents VALUES (2, 1, 3); INSERT INTO zzz_game_opponents VALUES (3, 2, 1); INSERT INTO zzz_game_opponents VALUES (4, 2, 3); INSERT INTO zzz_game_opponents VALUES (5, 3, 1); INSERT INTO zzz_game_opponents VALUES (6, 3, 2); -- Result: -- +---------+--------------+-----------------+ -- | game_id | home_team_id | visitor_team_id | -- +=========+==============+=================+ -- | 1 | 1 | 2 | -- +---------+--------------+-----------------+ -- | 2 | 1 | 3 | -- +---------+--------------+-----------------+ -- | 3 | 2 | 1 | -- +---------+--------------+-----------------+ -- | 4 | 2 | 3 | -- +---------+--------------+-----------------+ -- | 5 | 3 | 1 | -- +---------+--------------+-----------------+ -- | 6 | 3 | 2 | -- +---------+--------------+-----------------+ CREATE TABLE zzz_game_scores ( game_id INT, home_team_score INT, visitor_team_score INT ) USING DELTA; INSERT INTO zzz_game_scores VALUES (1, 4, 2); INSERT INTO zzz_game_scores VALUES (2, 0, 1); INSERT INTO zzz_game_scores VALUES (3, 1, 2); INSERT INTO zzz_game_scores VALUES (4, 3, 2); INSERT INTO zzz_game_scores VALUES (5, 3, 0); INSERT INTO zzz_game_scores VALUES (6, 3, 1); -- Result: -- +---------+-----------------+--------------------+ -- | game_id | home_team_score | visitor_team_score | -- +=========+=================+====================+ -- | 1 | 4 | 2 | -- +---------+-----------------+--------------------+ -- | 2 | 0 | 1 | -- +---------+-----------------+--------------------+ -- | 3 | 1 | 2 | -- +---------+-----------------+--------------------+ -- | 4 | 3 | 2 | -- +---------+-----------------+--------------------+ -- | 5 | 3 | 0 | -- +---------+-----------------+--------------------+ -- | 6 | 3 | 1 | -- +---------+-----------------+--------------------+ CREATE TABLE zzz_games ( game_id INT, game_date DATE ) USING DELTA; INSERT INTO zzz_games VALUES (1, '2020-12-12'); INSERT INTO zzz_games VALUES (2, '2021-01-09'); INSERT INTO zzz_games VALUES (3, '2020-12-19'); INSERT INTO zzz_games VALUES (4, '2021-01-16'); INSERT INTO zzz_games VALUES (5, '2021-01-23'); INSERT INTO zzz_games VALUES (6, '2021-02-06'); -- Result: -- +---------+------------+ -- | game_id | game_date | -- +=========+============+ -- | 1 | 2020-12-12 | -- +---------+------------+ -- | 2 | 2021-01-09 | -- +---------+------------+ -- | 3 | 2020-12-19 | -- +---------+------------+ -- | 4 | 2021-01-16 | -- +---------+------------+ -- | 5 | 2021-01-23 | -- +---------+------------+ -- | 6 | 2021-02-06 | -- +---------+------------+ CREATE TABLE zzz_teams ( team_id INT, team_city VARCHAR(15) ) USING DELTA; INSERT INTO zzz_teams VALUES (1, "San Francisco"); INSERT INTO zzz_teams VALUES (2, "Seattle"); INSERT INTO zzz_teams VALUES (3, "Amsterdam"); -- Result: -- +---------+---------------+ -- | team_id | team_city | -- +=========+===============+ -- | 1 | San Francisco | -- +---------+---------------+ -- | 2 | Seattle | -- +---------+---------------+ -- | 3 | Amsterdam | -- +---------+---------------+
在項目的
models
目錄中,使用下列 SQL 語句建立名為zzz_game_details.sql
的檔案。 此陳述式會建立一個資料表,該資料表提供每場比賽的詳細資料,例如隊伍名稱和分數。config
區塊指示 dbt 根據此陳述式在資料庫中建立資料表。-- Create a table that provides full details for each game, including -- the game ID, the home and visiting teams' city names and scores, -- the game winner's city name, and the game date.
{{ config( materialized='table', file_format='delta' ) }}
-- Step 4 of 4: Replace the visitor team IDs with their city names. select game_id, home, t.team_city as visitor, home_score, visitor_score, -- Step 3 of 4: Display the city name for each game's winner. case when home_score > visitor_score then home when visitor_score > home_score then t.team_city end as winner, game_date as date from ( -- Step 2 of 4: Replace the home team IDs with their actual city names. select game_id, t.team_city as home, home_score, visitor_team_id, visitor_score, game_date from ( -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates). select g.game_id, go.home_team_id, gs.home_team_score as home_score, go.visitor_team_id, gs.visitor_team_score as visitor_score, g.game_date from zzz_games as g, zzz_game_opponents as go, zzz_game_scores as gs where g.game_id = go.game_id and g.game_id = gs.game_id ) as all_ids, zzz_teams as t where all_ids.home_team_id = t.team_id ) as visitor_ids, zzz_teams as t where visitor_ids.visitor_team_id = t.team_id order by game_date desc
在項目的
models
目錄中,使用下列 SQL 語句建立名為zzz_win_loss_records.sql
的檔案。 此陳述式會建立一個檢視,其中列出該賽季隊伍的勝負記錄。-- Create a view that summarizes the season's win and loss records by team. -- Step 2 of 2: Calculate the number of wins and losses for each team. select winner as team, count(winner) as wins, -- Each team played in 4 games. (4 - count(winner)) as losses from ( -- Step 1 of 2: Determine the winner and loser for each game. select game_id, winner, case when home = winner then visitor else home end as loser from {{ ref('zzz_game_details') }} ) group by winner order by wins desc
啟用虛擬環境之後,請使用上述兩個檔案的路徑執行
dbt run
命令。default
在資料庫中(如 檔案中所profiles.yml
指定),dbt 會建立一zzz_game_details
個名為的數據表和一個名為 的檢視zzz_win_loss_records
表。 dbt 從相關的.sql
檔案名稱中取得這些檢視和資料表名稱。dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql
... ... | 1 of 2 START table model default.zzz_game_details.................... [RUN] ... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...] ... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN] ... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...] ... | ... | Finished running 1 table model, 1 view model ... Completed successfully Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
執行下列 SQL 程式碼,以列出有關新檢視的資訊並選取資料表和檢視中的所有資料列。
如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。
SHOW VIEWS FROM default LIKE 'zzz_win_loss_records';
+-----------+----------------------+-------------+ | namespace | viewName | isTemporary | +===========+======================+=============+ | default | zzz_win_loss_records | false | +-----------+----------------------+-------------+
SELECT * FROM zzz_game_details;
+---------+---------------+---------------+------------+---------------+---------------+------------+ | game_id | home | visitor | home_score | visitor_score | winner | date | +=========+===============+===============+============+===============+===============+============+ | 1 | San Francisco | Seattle | 4 | 2 | San Francisco | 2020-12-12 | +---------+---------------+---------------+------------+---------------+---------------+------------+ | 2 | San Francisco | Amsterdam | 0 | 1 | Amsterdam | 2021-01-09 | +---------+---------------+---------------+------------+---------------+---------------+------------+ | 3 | Seattle | San Francisco | 1 | 2 | San Francisco | 2020-12-19 | +---------+---------------+---------------+------------+---------------+---------------+------------+ | 4 | Seattle | Amsterdam | 3 | 2 | Seattle | 2021-01-16 | +---------+---------------+---------------+------------+---------------+---------------+------------+ | 5 | Amsterdam | San Francisco | 3 | 0 | Amsterdam | 2021-01-23 | +---------+---------------+---------------+------------+---------------+---------------+------------+ | 6 | Amsterdam | Seattle | 3 | 1 | Amsterdam | 2021-02-06 | +---------+---------------+---------------+------------+---------------+---------------+------------+
SELECT * FROM zzz_win_loss_records;
+---------------+------+--------+ | team | wins | losses | +===============+======+========+ | Amsterdam | 3 | 1 | +---------------+------+--------+ | San Francisco | 2 | 2 | +---------------+------+--------+ | Seattle | 1 | 3 | +---------------+------+--------+
步驟 3:建立並執行測試
在此步驟中,您將建立測試,這些測試是您對模型所做的判斷提示。 執行這些測試時,dbt 會告訴您專案中的每個測試是通過還是失敗。
有兩種類型的測試。 YAML 中套用的架構測試會傳回未通過判斷提示的記錄數目。 此數字為零時,表示所有記錄都通過,因此測試也通過。 資料測試是必須傳回零個記錄才算通過的特定查詢。
在項目的
models
目錄中,使用下列內容建立名為schema.yml
的檔案。 此檔案包含結構描述測試,用於確定指定的資料行是否具有唯一值、不為 null、僅具有指定值或具有組合。version: 2 models: - name: zzz_game_details columns: - name: game_id tests: - unique - not_null - name: home tests: - not_null - accepted_values: values: ['Amsterdam', 'San Francisco', 'Seattle'] - name: visitor tests: - not_null - accepted_values: values: ['Amsterdam', 'San Francisco', 'Seattle'] - name: home_score tests: - not_null - name: visitor_score tests: - not_null - name: winner tests: - not_null - accepted_values: values: ['Amsterdam', 'San Francisco', 'Seattle'] - name: date tests: - not_null - name: zzz_win_loss_records columns: - name: team tests: - unique - not_null - relationships: to: ref('zzz_game_details') field: home - name: wins tests: - not_null - name: losses tests: - not_null
在項目的
tests
目錄中,使用下列 SQL 語句建立名為zzz_game_details_check_dates.sql
的檔案。 此檔案包含一個資料測試,用於確定常規賽季外是否進行了任何比賽。-- This season's games happened between 2020-12-12 and 2021-02-06. -- For this test to pass, this query must return no results. select date from {{ ref('zzz_game_details') }} where date < '2020-12-12' or date > '2021-02-06'
在項目的
tests
目錄中,使用下列 SQL 語句建立名為zzz_game_details_check_scores.sql
的檔案。 此檔案包含一個資料測試,用於確定是否有任何分數為負或任何比賽為平局。-- This sport allows no negative scores or tie games. -- For this test to pass, this query must return no results. select home_score, visitor_score from {{ ref('zzz_game_details') }} where home_score < 0 or visitor_score < 0 or home_score = visitor_score
在項目的
tests
目錄中,使用下列 SQL 語句建立名為zzz_win_loss_records_check_records.sql
的檔案。 此檔案包含一個資料測試,用於確定任何隊伍是否有負數的勝場或敗場記錄、是否有超過進行比賽數的勝負記錄,或者進行的比賽數是否多於允許的比賽數。-- Each team participated in 4 games this season. -- For this test to pass, this query must return no results. select wins, losses from {{ ref('zzz_win_loss_records') }} where wins < 0 or wins > 4 or losses < 0 or losses > 4 or (wins + losses) > 4
啟用虛擬環境後,執行
dbt test
命令。dbt test --models zzz_game_details zzz_win_loss_records
... ... | 1 of 19 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN] ... | 1 of 19 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...] ... ... | ... | Finished running 19 tests ... Completed successfully Done. PASS=19 WARN=0 ERROR=0 SKIP=0 TOTAL=19
步驟 4:清理
您可以透過執行下列 SQL 程式碼來刪除為此範例建立的資料表和檢視。
如果您要連線到叢集,您可以從連線到叢集的筆記本執行此 SQL 程式代碼,並將 SQL 指定為筆記本的預設語言。 如果要連線至 SQL 倉儲,您可以從查詢執行此 SQL 程式碼。
DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;
DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;
疑難排解
如需搭配 Azure Databricks 使用 dbt Core 以及如何解決常見問題的相關信息,請參閱 在 dbt Labs 網站上取得協助 。
下一步
以 Azure Databricks 作業工作的形式執行 dbt Core 專案。 請參閱在 Azure Databricks 作業中使用 dbt 轉換。
其他資源
探索 dbt Labs 網站上的下列資源: