Azure File Data Lake client library for Java - version 12.22.0
Azure Data Lake Storage is Microsoft's optimized storage solution for big data analytics workloads. A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access.
Source code | API reference documentation | REST API documentation | Product documentation | Samples
Getting started
Prerequisites
- Java Development Kit (JDK) with version 8 or above
- Here are details about Java 8 client compatibility with Azure Certificate Authority.
- Azure Subscription
- Create Storage Account
Include the package
Include the BOM file
Include the azure-sdk-bom in your project to take a dependency on the GA version of the library. In the following snippet, replace the {bom_version_to_target} placeholder with the version number. To learn more about the BOM, see the AZURE SDK BOM README.
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-sdk-bom</artifactId>
<version>{bom_version_to_target}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
Then include the direct dependency in the dependencies section without the version tag.
<dependencies>
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-storage-file-datalake</artifactId>
</dependency>
</dependencies>
Include direct dependency
To take a dependency on a particular version of the library that is not present in the BOM, add the direct dependency to your project as follows.
<dependency>
<groupId>com.azure</groupId>
<artifactId>azure-storage-file-datalake</artifactId>
<version>12.22.0</version>
</dependency>
Create a Storage Account
To create a Storage Account you can use the Azure Portal or Azure CLI. Note: To use data lake, your account must have hierarchical namespace enabled.
# Install the "storage-preview" extension
az extension add --name storage-preview
# Create the storage account
az storage account create -n my-storage-account-name -g my-resource-group --sku Standard_LRS --kind StorageV2 --hierarchical-namespace true
Your storage account URL, subsequently identified as <your-storage-account-url>, is formatted as follows:
http(s)://<storage-account-name>.dfs.core.windows.net
Authenticate the client
To interact with the Storage service, you need to create an instance of the Service Client class. Doing so requires the Account SAS (shared access signature) string of the Storage Account. Learn more at SAS Token.
Get credentials
SAS Token
a. Use the Azure CLI snippet below to get the SAS token from the Storage Account.
az storage blob generate-sas \
--account-name {Storage Account name} \
--container-name {container name} \
--name {blob name} \
--permissions {permissions to grant} \
--expiry {datetime to expire the SAS token} \
--services {storage services the SAS allows} \
--resource-types {resource types the SAS allows}
Example:
CONNECTION_STRING=<connection-string>
az storage blob generate-sas \
--account-name MyStorageAccount \
--container-name MyContainer \
--name MyBlob \
--permissions racdw \
--expiry 2020-06-15
b. Alternatively, get the Account SAS Token from the Azure Portal.
- Go to your Storage Account
- Select Shared access signature from the menu on the left
- Click on Generate SAS and connection string (after setup)
Shared Key Credential
a. Use Account name and Account key. Account name is your Storage Account name.
- Go to your Storage Account
- Select Access keys from the menu on the left
- Under key1/key2 copy the contents of the Key field
or
b. Use the connection string.
- Go to your Storage Account
- Select Access keys from the menu on the left
- Under key1/key2 copy the contents of the Connection string field
Key concepts
DataLake Storage Gen2 was designed to:
- Service multiple petabytes of information while sustaining hundreds of gigabits of throughput
- Allow you to easily manage massive amounts of data
Key Features of DataLake Storage Gen2 include:
- Hadoop compatible access
- A superset of POSIX permissions
- Cost effective in terms of low-cost storage capacity and transactions
- Optimized driver for big data analytics
A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access.
In the past, cloud-based analytics had to compromise in areas of performance, management, and security. Data Lake Storage Gen2 addresses each of these aspects in the following ways:
- Performance is optimized because you do not need to copy or transform data as a prerequisite for analysis. The hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.
- Management is easier because you can organize and manipulate files through directories and subdirectories.
- Security is enforceable because you can define POSIX permissions on directories or individual files.
- Cost effectiveness is made possible as Data Lake Storage Gen2 is built on top of the low-cost Azure Blob storage. The additional features further lower the total cost of ownership for running big data analytics on Azure.
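The POSIX permissions mentioned above are commonly written either symbolically (for example rwxr-x---) or as octal digits (for example 750). The following is a minimal sketch of that conversion using only the JDK. The class PosixPermissionSketch and its method are hypothetical names introduced for illustration and are not part of this client library. The sketch assumes a plain nine-character string and ignores special bits such as the sticky bit.

```java
/** Hypothetical illustration class; not part of the Azure SDK. */
final class PosixPermissionSketch {
    private PosixPermissionSketch() { }

    /** Converts a nine-character symbolic string such as "rwxr-x---" to octal digits such as "750". */
    static String toOctal(String symbolic) {
        if (symbolic == null || symbolic.length() != 9) {
            throw new IllegalArgumentException("Expected nine characters, for example rwxr-x---");
        }
        StringBuilder octal = new StringBuilder();
        // Three groups of three characters: owner, owning group, other.
        for (int group = 0; group < 3; group++) {
            String triplet = symbolic.substring(group * 3, group * 3 + 3);
            int value = 0;
            if (triplet.charAt(0) == 'r') {
                value += 4; // read
            }
            if (triplet.charAt(1) == 'w') {
                value += 2; // write
            }
            if (triplet.charAt(2) == 'x') {
                value += 1; // execute
            }
            octal.append(value);
        }
        return octal.toString();
    }

    public static void main(String[] args) {
        System.out.println(toOctal("rwxr-x---"));
        System.out.println(toOctal("rw-r--r--"));
    }
}
```

Here the owner gets full access, the owning group gets read and execute, and everyone else gets nothing when the string is rwxr-x---.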
Data Lake Storage Gen2 offers two types of resources:
- The _filesystem used via 'DataLakeFileSystemClient'
- The _path used via 'DataLakeFileClient' or 'DataLakeDirectoryClient'
ADLS Gen2 | Blob |
---|---|
Filesystem | Container |
Path (File or Directory) | Blob |
Note: This client library does not support hierarchical namespace (HNS) disabled storage accounts.
URL format
Paths are addressable using the following URL format. For example, the following URL addresses a file:
https://${myaccount}.dfs.core.windows.net/${myfilesystem}/${myfile}
Resource URI Syntax
For the storage account, the base URI for datalake operations includes the name of the account only:
https://${myaccount}.dfs.core.windows.net
For a file system, the base URI includes the name of the account and the name of the file system:
https://${myaccount}.dfs.core.windows.net/${myfilesystem}
For a file/directory, the base URI includes the name of the account, the name of the file system and the name of the path:
https://${myaccount}.dfs.core.windows.net/${myfilesystem}/${mypath}
Note that the above URIs may not hold for more advanced scenarios such as custom domain names.
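The three URI shapes above can be assembled with plain string handling. The following is a minimal sketch using only the JDK. The class DataLakeUriSketch and its methods are hypothetical helpers introduced for illustration and are not part of this client library. It assumes the default dfs.core.windows.net domain (so it does not cover custom domain names) and that all names are already URL-safe.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not part of the Azure SDK. */
final class DataLakeUriSketch {
    private DataLakeUriSketch() { }

    /** Base URI for account-level operations. */
    static URI account(String accountName) {
        return URI.create("https://" + accountName + ".dfs.core.windows.net");
    }

    /** URI of a file system: account URI plus the file system name. */
    static URI fileSystem(String accountName, String fileSystemName) {
        return URI.create(account(accountName) + "/" + fileSystemName);
    }

    /** URI of a file or directory; the path may be nested, such as "mydir/myfile". */
    static URI path(String accountName, String fileSystemName, String pathName) {
        // Drop a leading slash so the result never contains "//" after the file system name.
        String relative = pathName.startsWith("/") ? pathName.substring(1) : pathName;
        return URI.create(fileSystem(accountName, fileSystemName) + "/" + relative);
    }

    public static void main(String[] args) {
        System.out.println(account("myaccount"));
        System.out.println(fileSystem("myaccount", "myfilesystem"));
        System.out.println(path("myaccount", "myfilesystem", "mydir/myfile"));
    }
}
```

The resulting strings can be passed to the endpoint(...) method of the client builders shown in the Examples section.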
Examples
The following sections provide several code snippets covering some of the most common Azure Data Lake Storage tasks, including:
- Create a DataLakeServiceClient
- Create a DataLakeFileSystemClient
- Create a DataLakeFileClient
- Create a DataLakeDirectoryClient
- Create a file system
- Enumerate paths
- Rename a file
- Rename a directory
- Get file properties
- Get directory properties
- Authenticate with Azure Identity
Create a DataLakeServiceClient
Create a DataLakeServiceClient using the sasToken generated above.
DataLakeServiceClient dataLakeServiceClient = new DataLakeServiceClientBuilder()
.endpoint("<your-storage-account-url>")
.sasToken("<your-sasToken>")
.buildClient();
or
// Only one "?" is needed here. If the SAS token already starts with "?", remove that leading "?".
DataLakeServiceClient dataLakeServiceClient = new DataLakeServiceClientBuilder()
.endpoint("<your-storage-account-url>" + "?" + "<your-sasToken>")
.buildClient();
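The single "?" rule from the comment above can be enforced in code rather than by hand. The following is a minimal sketch using only the JDK. The class SasEndpointSketch and its methods are hypothetical helpers introduced for illustration and are not part of this client library. It assumes the endpoint itself carries no query string.

```java
/** Hypothetical helper for illustration; not part of the Azure SDK. */
final class SasEndpointSketch {
    private SasEndpointSketch() { }

    /** Removes one leading "?" from the SAS token, if present. */
    static String stripLeadingQuestionMark(String sasToken) {
        return sasToken.startsWith("?") ? sasToken.substring(1) : sasToken;
    }

    /** Joins endpoint and SAS token with exactly one "?" between them. */
    static String withSas(String endpoint, String sasToken) {
        return endpoint + "?" + stripLeadingQuestionMark(sasToken);
    }

    public static void main(String[] args) {
        // Placeholder values; a real SAS token contains more parameters.
        String endpoint = "https://myaccount.dfs.core.windows.net";
        System.out.println(withSas(endpoint, "?sv=placeholder&sig=placeholder"));
        System.out.println(withSas(endpoint, "sv=placeholder&sig=placeholder"));
    }
}
```

Both calls in main produce the same URL, which could then be passed to endpoint(...) on a client builder.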
Create a DataLakeFileSystemClient
Create a DataLakeFileSystemClient using a DataLakeServiceClient.
DataLakeFileSystemClient dataLakeFileSystemClient = dataLakeServiceClient.getFileSystemClient("myfilesystem");
or
Create a DataLakeFileSystemClient from the builder using the sasToken generated above.
DataLakeFileSystemClient dataLakeFileSystemClient = new DataLakeFileSystemClientBuilder()
.endpoint("<your-storage-account-url>")
.sasToken("<your-sasToken>")
.fileSystemName("myfilesystem")
.buildClient();
or
// Only one "?" is needed here. If the SAS token already starts with "?", remove that leading "?".
DataLakeFileSystemClient dataLakeFileSystemClient = new DataLakeFileSystemClientBuilder()
.endpoint("<your-storage-account-url>" + "/" + "myfilesystem" + "?" + "<your-sasToken>")
.buildClient();
Create a DataLakeFileClient
Create a DataLakeFileClient using a DataLakeFileSystemClient.
DataLakeFileClient fileClient = dataLakeFileSystemClient.getFileClient("myfile");
or
Create a DataLakeFileClient from the builder using the sasToken generated above.
DataLakeFileClient fileClient = new DataLakePathClientBuilder()
.endpoint("<your-storage-account-url>")
.sasToken("<your-sasToken>")
.fileSystemName("myfilesystem")
.pathName("myfile")
.buildFileClient();
or
// Only one "?" is needed here. If the SAS token already starts with "?", remove that leading "?".
DataLakeFileClient fileClient = new DataLakePathClientBuilder()
.endpoint("<your-storage-account-url>" + "/" + "myfilesystem" + "/" + "myfile" + "?" + "<your-sasToken>")
.buildFileClient();
Create a DataLakeDirectoryClient
Get a DataLakeDirectoryClient using a DataLakeFileSystemClient.
DataLakeDirectoryClient directoryClient = dataLakeFileSystemClient.getDirectoryClient("mydir");
or
Create a DataLakeDirectoryClient from the builder using the sasToken generated above.
DataLakeDirectoryClient directoryClient = new DataLakePathClientBuilder()
.endpoint("<your-storage-account-url>")
.sasToken("<your-sasToken>")
.fileSystemName("myfilesystem")
.pathName("mydir")
.buildDirectoryClient();
or
// Only one "?" is needed here. If the SAS token already starts with "?", remove that leading "?".
DataLakeDirectoryClient directoryClient = new DataLakePathClientBuilder()
.endpoint("<your-storage-account-url>" + "/" + "myfilesystem" + "/" + "mydir" + "?" + "<your-sasToken>")
.buildDirectoryClient();
Create a file system
Create a file system using a DataLakeServiceClient.
dataLakeServiceClient.createFileSystem("myfilesystem");
or
Create a file system using a DataLakeFileSystemClient.
dataLakeFileSystemClient.create();
Enumerate paths
Enumerate all paths using a DataLakeFileSystemClient.
for (PathItem pathItem : dataLakeFileSystemClient.listPaths()) {
System.out.println("This is the path name: " + pathItem.getName());
}
Rename a file
Rename a file using a DataLakeFileClient.
// This operation requires authenticating with Azure Identity and the "Storage Blob Data Contributor" role assignment.
DataLakeFileClient fileClient = dataLakeFileSystemClient.getFileClient("myfile");
fileClient.create();
fileClient.rename("new-file-system-name", "new-file-name");
Rename a directory
Rename a directory using a DataLakeDirectoryClient.
// This operation requires authenticating with Azure Identity and the "Storage Blob Data Contributor" role assignment.
DataLakeDirectoryClient directoryClient = dataLakeFileSystemClient.getDirectoryClient("mydir");
directoryClient.create();
directoryClient.rename("new-file-system-name", "new-directory-name");
Get file properties
Get properties from a file using a DataLakeFileClient.
DataLakeFileClient fileClient = dataLakeFileSystemClient.getFileClient("myfile");
fileClient.create();
PathProperties properties = fileClient.getProperties();
Get directory properties
Get properties from a directory using a DataLakeDirectoryClient.
DataLakeDirectoryClient directoryClient = dataLakeFileSystemClient.getDirectoryClient("mydir");
directoryClient.create();
PathProperties properties = directoryClient.getProperties();
Authenticate with Azure Identity
The Azure Identity library provides Azure Active Directory support for authenticating with Azure Storage.
DataLakeServiceClient storageClient = new DataLakeServiceClientBuilder()
.endpoint("<your-storage-account-url>")
.credential(new DefaultAzureCredentialBuilder().build())
.buildClient();
Troubleshooting
When interacting with data lake using this Java client library, errors returned by the service correspond to the same HTTP status codes returned for REST API requests. For example, if you try to retrieve a file system or path that doesn't exist in your Storage Account, a 404 error is returned, indicating Not Found.
Default HTTP Client
All client libraries by default use the Netty HTTP client. Adding the above dependency will automatically configure the client library to use the Netty HTTP client. Configuring or changing the HTTP client is detailed in the HTTP clients wiki.
Default SSL library
All client libraries, by default, use the Tomcat-native Boring SSL library to enable native-level performance for SSL operations. The Boring SSL library is an uber jar containing native libraries for Linux / macOS / Windows, and provides better performance compared to the default SSL implementation within the JDK. For more information, including how to reduce the dependency size, refer to the performance tuning section of the wiki.
Next steps
Several Data Lake Storage Java SDK samples are available to you in the SDK's GitHub repository.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.