Transform data with Apache Spark and query with SQL

In this guide, you will:

  • Upload data to OneLake with the OneLake file explorer.

  • Use a Fabric notebook to read data on OneLake and write back as a Delta table.

  • Analyze and transform data with Spark using a Fabric notebook.

  • Query one copy of data on OneLake with SQL.

Prerequisites

Before you begin, you must:

  • Download and install OneLake file explorer.

  • Create a workspace with a Lakehouse item.

  • Download the WideWorldImportersDW dataset. You can use Azure Storage Explorer to connect to https://fabrictutorialdata.blob.core.windows.net/sampledata/WideWorldImportersDW/csv/full/dimension_city and download the set of CSV files, or you can use your own CSV data and update the details as required.
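    If you prefer to script the download instead of using Azure Storage Explorer, the following sketch does the same thing with the azure-storage-blob Python package (pip install azure-storage-blob). It assumes the sample container allows anonymous read access, and the local folder name dimension_city is only a convention chosen to match the later steps.

    from pathlib import Path
    from azure.storage.blob import ContainerClient

    # Public sample-data container used by this tutorial; anonymous read access is assumed.
    container = ContainerClient(
        account_url="https://fabrictutorialdata.blob.core.windows.net",
        container_name="sampledata",
    )

    prefix = "WideWorldImportersDW/csv/full/dimension_city"
    out_dir = Path("dimension_city")
    out_dir.mkdir(exist_ok=True)

    # Download every blob under the dimension_city prefix into the local folder.
    for blob in container.list_blobs(name_starts_with=prefix):
        data = container.download_blob(blob.name).readall()
        (out_dir / blob.name.split("/")[-1]).write_bytes(data)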

Note

Always create or load Delta-Parquet data, or create shortcuts to it, directly under the Tables section of the lakehouse. Don't nest your tables in subfolders under the Tables section; the lakehouse doesn't recognize nested folders as tables and labels them as Unidentified.
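For example, when writing from Spark in a Fabric notebook, whether the lakehouse recognizes your output as a table comes down to the save path. The snippet below is a hypothetical illustration; the data, DataFrame, and table names are placeholders:

    # Tiny illustrative DataFrame; the data and names are made up.
    df = spark.createDataFrame([(1, "Seattle")], ["CityKey", "City"])

    # Recognized as a table: Delta data written directly under Tables/<table name>.
    df.write.mode("overwrite").format("delta").save("Tables/example_city")

    # Not recognized: nesting under a subfolder leaves the data labeled as Unidentified.
    # df.write.mode("overwrite").format("delta").save("Tables/staging/example_city")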

Upload, read, analyze, and query data

  1. In OneLake file explorer, navigate to your lakehouse, and under the /Files directory, create a subdirectory named dimension_city.

    Screenshot of new folder created in OneLake file explorer.

  2. Copy your sample CSV files to the OneLake directory /Files/dimension_city by using OneLake file explorer.

    Screenshot of copying files to OneLake in file explorer.
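    If you prefer to script this step, you can also copy against the local path that OneLake file explorer mounts in Windows File Explorer. This is only a sketch; the mount location, workspace name, and item name are placeholders that you need to replace with the path you see in OneLake file explorer.

    import glob
    import os
    import shutil

    # Placeholder paths: src is where you downloaded the CSV files,
    # dst is the lakehouse Files area as exposed by OneLake file explorer.
    src = r"<local folder with the downloaded CSV files>"
    dst = r"<OneLake mount>\<workspace name>\<item name>.Lakehouse\Files\dimension_city"

    os.makedirs(dst, exist_ok=True)
    for path in glob.glob(os.path.join(src, "*.csv")):
        shutil.copy(path, dst)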

  3. Navigate to your lakehouse in the Power BI service and view your files.

    Screenshot of viewing files in lakehouse in Fabric.

  4. Select Open notebook, then New notebook to create a notebook.

    Screenshot of creating new notebook in Fabric.

  5. Using the Fabric notebook, convert the CSV files to Delta format. The following code snippet reads data from the user-created directory /Files/dimension_city and converts it to a Delta table named dim_city.

    import os

    # Loop over the CSV files copied to the lakehouse Files area and append each one
    # to a single Delta table under the Tables section.
    for filename in os.listdir("/lakehouse/default/Files/<replace with your folder path>"):
        df = (spark.read.format("csv")
              .options(header="true", inferSchema="true", mode="DROPMALFORMED")
              .load("abfss://<replace with workspace name>@onelake.dfs.fabric.microsoft.com/<replace with item name>.Lakehouse/Files/<folder name>/" + filename))
        df.write.mode("append").format("delta").save("Tables/<name of delta table>")
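    As a quick sanity check, you can read the new Delta table back in the same notebook before moving on. This sketch assumes the table was saved as dim_city:

    # Read the Delta table back from the Tables section and inspect it.
    df_check = spark.read.format("delta").load("Tables/dim_city")
    df_check.printSchema()
    print(df_check.count())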
    
  6. To see your new table, refresh your view of the /Tables directory.

    Screenshot of viewing a table in a lakehouse in Fabric.

  7. Query your table with Spark SQL in the same Fabric notebook.

    %%sql
    SELECT * FROM <replace with item name>.dim_city LIMIT 10;
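    The same query can also be issued from PySpark in the notebook. A minimal equivalent, assuming the lakehouse containing dim_city is attached as the notebook's default:

    # Run the same query through the Spark session instead of the %%sql magic.
    spark.sql("SELECT * FROM <replace with item name>.dim_city LIMIT 10").show()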
    
  8. Modify the Delta table by adding a new column named newColumn with data type integer. Set the value to 9 for all records in this newly added column.

    %%sql
    
    ALTER TABLE <replace with item name>.dim_city ADD COLUMN newColumn int;
    
    UPDATE <replace with item name>.dim_city SET newColumn = 9;
    
    SELECT City, newColumn FROM <replace with item name>.dim_city LIMIT 10;
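    Because the table is stored in Delta format, each of these statements is committed as its own table version. If you want to confirm that from the notebook, you can inspect the Delta transaction history with a query like the following (assuming the same table name used above):

    # Show the table's Delta history: the original write, the ALTER, and the UPDATE
    # each appear as separate versions.
    spark.sql("DESCRIBE HISTORY <replace with item name>.dim_city").select("version", "operation").show()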
    
  9. You can also access any Delta table on OneLake through a SQL analytics endpoint. The SQL analytics endpoint references the same physical copy of the Delta table on OneLake and offers a T-SQL experience. Select the SQL analytics endpoint for your lakehouse, and then select New SQL query to query the table by using T-SQL.

    SELECT TOP (100) * FROM [<replace with item name>].[dbo].[dim_city];
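    If you want to run the same T-SQL from outside the Fabric portal, one option is an ODBC connection to the SQL analytics endpoint. The sketch below is illustrative only: the server value comes from the endpoint's SQL connection string (available in the endpoint's settings), the database name is your lakehouse item name, and it assumes the pyodbc package plus a recent Microsoft ODBC Driver for SQL Server are installed and that interactive Microsoft Entra sign-in is acceptable.

    import pyodbc

    # Placeholders: copy the server value from the SQL analytics endpoint's connection string,
    # and use your lakehouse item name as the database.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<replace with SQL analytics endpoint>;"
        "Database=<replace with item name>;"
        "Authentication=ActiveDirectoryInteractive;"
        "Encrypt=yes;"
    )

    cursor = conn.cursor()
    cursor.execute("SELECT TOP (10) * FROM [dbo].[dim_city];")
    for row in cursor.fetchall():
        print(row)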