Azure Synapse Spark client library for Java - version 1.0.0-beta.5
Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.
The Azure Synapse Analytics Spark client library enables programmatically managing Spark jobs.
Source code | API reference documentation | Product documentation | Samples
Getting started
Adding the package to your project
Add the following Maven dependency for the Azure Synapse Spark client library to your project's POM file.
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-analytics-synapse-spark</artifactId>
    <version>1.0.0-beta.4</version>
</dependency>
Prerequisites
- Java Development Kit (JDK) with version 8 or above
- Azure subscription.
- An existing Azure Synapse workspace. If you need to create an Azure Synapse workspace, you can use the Azure Portal or Azure CLI.
az synapse workspace create \
    --name <your-workspace-name> \
    --resource-group <your-resource-group-name> \
    --storage-account <your-storage-account-name> \
    --file-system <your-storage-file-system-name> \
    --sql-admin-login-user <your-sql-admin-user-name> \
    --sql-admin-login-password <your-sql-admin-user-password> \
    --location <your-workspace-location>
Authenticate the client
To interact with the Azure Synapse service, you need to create an instance of a Spark client class such as SparkBatchClient. To do so, you need a workspace endpoint and client secret credentials (client ID, client secret, tenant ID). The examples in this document instantiate the client with DefaultAzureCredential.
This getting started section authenticates with DefaultAzureCredential by providing client secret credentials, but you can find more ways to authenticate with azure-identity.
Create/Get credentials
To create or get client secret credentials, you can use the Azure Portal, Azure CLI, or Azure Cloud Shell.
The following Azure Cloud Shell snippet creates a service principal and configures its access to Azure resources:
az ad sp create-for-rbac -n <your-application-name> --skip-assignment
Output:
{
    "appId": "generated-app-ID",
    "displayName": "dummy-app-name",
    "name": "http://dummy-app-name",
    "password": "random-password",
    "tenant": "tenant-ID"
}
Create Spark client
Once you've populated the AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID environment variables and replaced the workspace name and Spark pool name placeholders with your own values, you can create Spark clients. For example, the following code creates a SparkBatchClient:
import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.analytics.synapse.spark.SparkBatchClient;
import com.azure.analytics.synapse.spark.SparkClientBuilder;
SparkBatchClient batchClient = new SparkClientBuilder()
    .endpoint("https://{YOUR_WORKSPACE_NAME}.dev.azuresynapse.net")
    .sparkPoolName("{SPARK_POOL_NAME}")
    .credential(new DefaultAzureCredentialBuilder().build())
    .buildSparkBatchClient();
NOTE: To use an asynchronous client, use SparkBatchAsyncClient instead of SparkBatchClient and call buildSparkBatchAsyncClient().
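Because the client relies on the three environment variables named above, it can help to fail fast when one of them is missing. The following is a minimal sketch that uses only the JDK. The class `SynapseEnvCheck` and its methods are hypothetical helpers for illustration and are not part of the SDK. The sketch also assumes the endpoint follows the `https://<workspace-name>.dev.azuresynapse.net` form used in the example above, and `my-workspace` is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Azure SDK.
class SynapseEnvCheck {
    // Environment variables this README asks you to populate.
    static final List<String> REQUIRED = Arrays.asList(
        "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID");

    // Returns the required variables that are absent or blank in the given environment.
    static List<String> missingVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Builds an endpoint in the form shown in the client builder example (assumed form).
    static String workspaceEndpoint(String workspaceName) {
        return String.format("https://%s.dev.azuresynapse.net", workspaceName);
    }

    public static void main(String[] args) {
        List<String> missing = missingVariables(System.getenv());
        if (!missing.isEmpty()) {
            System.err.println("Missing environment variables: " + missing);
            return;
        }
        // "my-workspace" is a placeholder workspace name.
        System.out.println("Endpoint: " + workspaceEndpoint("my-workspace"));
    }
}
```

Running a check like this before building the client turns a later authentication failure into an immediate, readable message.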
Key concepts
Spark batch Client
The Spark batch client performs the interactions with the Azure Synapse service for getting, setting, updating, deleting, and listing Spark batch jobs. Asynchronous (SparkBatchAsyncClient) and synchronous (SparkBatchClient) clients exist in the SDK allowing for the selection of a client based on an application's use case.
Examples
The azure-analytics-synapse-spark package supports synchronous and asynchronous APIs. The following sections cover some of the most common Azure Synapse Analytics Spark job tasks:
Sync API
The following sections provide several code snippets covering some of the most common Azure Synapse Spark service tasks, including:
Spark batch job examples
Create a Spark batch job
createSparkBatchJob creates a Spark batch job.
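// Placeholders: replace with your own storage account, file system, and job name.
String storageAccount = "<storage-account>";
String fileSystem = "<file-system>";
String name = "<job-name>";
String file = String.format("abfss://%s@%s.dfs.core.windows.net/samples/java/wordcount/wordcount.jar", fileSystem, storageAccount);
SparkBatchJobOptions options = new SparkBatchJobOptions()
    .setName(name)
    .setFile(file)
    .setClassName("WordCount")
    .setArguments(Arrays.asList(
        String.format("abfss://%s@%s.dfs.core.windows.net/samples/java/wordcount/shakespeare.txt", fileSystem, storageAccount),
        String.format("abfss://%s@%s.dfs.core.windows.net/samples/java/wordcount/result/", fileSystem, storageAccount)
    ))
    .setDriverMemory("28g")
    .setDriverCores(4)
    .setExecutorMemory("28g")
    .setExecutorCores(4)
    .setExecutorCount(2);
SparkBatchJob jobCreated = batchClient.createSparkBatchJob(options);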
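The job file and its arguments all use the same `abfss://<file-system>@<storage-account>.dfs.core.windows.net/<path>` pattern, so the repeated String.format calls can be factored into one helper. The following is a small sketch that uses only the JDK. `AbfssPaths` and `abfss` are hypothetical names introduced here for illustration and are not part of the SDK.

```java
// Hypothetical helper for illustration; not part of the Azure SDK.
class AbfssPaths {
    // Builds abfss://<fileSystem>@<storageAccount>.dfs.core.windows.net/<path>.
    // A leading slash on the path is dropped so the result has exactly one separator.
    static String abfss(String fileSystem, String storageAccount, String path) {
        String relative = path.startsWith("/") ? path.substring(1) : path;
        return String.format("abfss://%s@%s.dfs.core.windows.net/%s",
            fileSystem, storageAccount, relative);
    }

    public static void main(String[] args) {
        // Placeholders: replace with your own file system and storage account.
        String fileSystem = "<file-system>";
        String storageAccount = "<storage-account>";
        System.out.println(abfss(fileSystem, storageAccount, "samples/java/wordcount/wordcount.jar"));
        System.out.println(abfss(fileSystem, storageAccount, "samples/java/wordcount/shakespeare.txt"));
        System.out.println(abfss(fileSystem, storageAccount, "samples/java/wordcount/result/"));
    }
}
```

With such a helper, the values passed to setFile and setArguments stay consistent if the storage account or file system changes.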
List Spark batch jobs
getSparkBatchJobs enumerates the Spark batch jobs in the Synapse workspace.
SparkBatchJobCollection jobs = batchClient.getSparkBatchJobs();
for (SparkBatchJob job : jobs.getSessions()) {
    System.out.println(job.getName());
}
Cancel a Spark batch job
cancelSparkBatchJob cancels a Spark batch job by the given job ID.
batchClient.cancelSparkBatchJob(jobId);
Async API
The following sections provide several code snippets covering some of the most common asynchronous Azure Synapse Spark service tasks, including:
- Create a Spark job asynchronously
- List Spark batch jobs asynchronously
- Cancel a Spark batch job asynchronously
Note: Add System.in.read() or Thread.sleep() after the function calls in the main class/thread to allow async functions/operations to execute and finish before the main application/thread exits.
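As an alternative to sleeping for a fixed time, the main thread can wait on a latch that the callback releases. The following is a sketch of that idea using only the JDK. `CompletableFuture` is used here purely as a stand-in for an asynchronous SDK call, and `AwaitAsyncSketch`, `fakeAsyncCall`, and `runAndWait` are hypothetical names that do not exist in the SDK.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; CompletableFuture stands in for an asynchronous SDK call.
class AwaitAsyncSketch {
    // Stand-in for an async operation that completes on another thread.
    static CompletableFuture<String> fakeAsyncCall() {
        return CompletableFuture.supplyAsync(() -> "done");
    }

    // Blocks the caller until the callback has run or the timeout elapses.
    static String runAndWait(long timeoutSeconds) {
        CountDownLatch latch = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();
        fakeAsyncCall().whenComplete((value, error) -> {
            result.set(error == null ? value : "failed: " + error.getMessage());
            latch.countDown(); // release the waiting thread
        });
        try {
            if (!latch.await(timeoutSeconds, TimeUnit.SECONDS)) {
                return "timed out";
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return "interrupted";
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(5));
    }
}
```

The latch returns as soon as the callback fires, so the program neither exits early nor waits longer than needed. The timeout guards against a callback that never arrives.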
Create a Spark job asynchronously
createSparkBatchJob creates a Spark batch job.
String storageAccount = "<storage-account>";
String fileSystem = "<file-system>";
String name = "<job-name>";
String file = String.format("abfss://%s@%s.dfs.core.windows.net/samples/java/wordcount/wordcount.jar", fileSystem, storageAccount);
SparkBatchJobOptions options = new SparkBatchJobOptions()
    .setName(name)
    .setFile(file)
    .setClassName("WordCount")
    .setArguments(Arrays.asList(
        String.format("abfss://%s@%s.dfs.core.windows.net/samples/java/wordcount/shakespeare.txt", fileSystem, storageAccount),
        String.format("abfss://%s@%s.dfs.core.windows.net/samples/java/wordcount/result/", fileSystem, storageAccount)
    ))
    .setDriverMemory("28g")
    .setDriverCores(4)
    .setExecutorMemory("28g")
    .setExecutorCores(4)
    .setExecutorCount(2);
// The job ID is an integer, so format it with %d (using %f would throw at runtime).
batchClient.createSparkBatchJob(options).subscribe(job -> System.out.printf("Job ID: %d%n", job.getId()));
List Spark batch jobs asynchronously
getSparkBatchJobs enumerates the Spark batch jobs in the Synapse workspace.
batchClient.getSparkBatchJobs().subscribe(jobs -> {
    for (SparkBatchJob job : jobs.getSessions()) {
        System.out.println(job.getName());
    }
});
Cancel a Spark batch job asynchronously
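cancelSparkBatchJob cancels a Spark batch job by the given job ID.
// The async call is lazy: the cancel request is only sent once the result is subscribed to.
batchClient.cancelSparkBatchJob(jobId).subscribe();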
Troubleshooting
Default HTTP client
All client libraries by default use the Netty HTTP client. Adding the above dependency will automatically configure the client library to use the Netty HTTP client. Configuring or changing the HTTP client is detailed in the HTTP clients wiki.
Default SSL library
All client libraries, by default, use the Tomcat-native Boring SSL library to enable native-level performance for SSL operations. The Boring SSL library is an Uber JAR containing native libraries for Linux / macOS / Windows, and provides better performance compared to the default SSL implementation within the JDK. For more information, including how to reduce the dependency size, refer to the performance tuning section of the wiki.
Next steps
Several Synapse Java SDK samples are available to you in the SDK's GitHub repository. These samples provide example code for additional scenarios commonly encountered while working with Azure Synapse Analytics.
Additional documentation
For more extensive documentation on Azure Synapse Analytics, see the API reference documentation.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.