Hi, over the last few days we have been experiencing some peculiarities when reading files from the connected storage of Synapse Analytics. These occurred while using PySpark in Notebooks.
Our initial issue arose with the pandas method read_excel. We used it to read an Excel sheet with the following code, which worked on Monday 2022-02-21.
import pandas as pd
adslpath = 'abfss://STORAGEACCOUNT.dfs.core.windows.net/CONTAINER/AssetManagement/Bronze/Raw/NOJV/FILE.xlsx'
pdf = pd.read_excel(adslpath, sheet_name='Page1', nrows=60, usecols='A:T')
This stopped working on Wednesday (2022-02-23), giving the error:
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://STORAGEACCOUNT@MetContainer .dfs.core.windows.net/AssetManagement/Bronze/Raw/NOJV/FILE.xlsx'
We found we could still read the same file through an HTTPS URL, e.g.:
pdf = pd.read_excel(
r'https://CONTAINER.dfs.core.windows.net/dafdlfsd01/AssetManagement/Bronze/Raw/NOJV//FILE.xlsx?sv=2020-08-04&ss=bfqt&srt=sco&sp=rwdlacupx&se=2024-12-01T17:44:05Z&st=2022-02-24T09:44:05Z&spr=https&sig=NnT1Qv8uXMkgh3RQKBOvJ%2Bch%2BVtI6GF7r4ZSdTMEOg8%3D', sheet_name='Page1', nrows=60,usecols='A:T')
Note the use of a Shared Access Signature (SAS), which adds complexity because it will need to be refreshed periodically.
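To illustrate the refresh concern, here is a minimal sketch of how the expiry could be read back from a SAS URL. It assumes the `se` query parameter uses the full timestamp form seen in our URL above; the function names and the placeholder URL are made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs


def sas_expiry(url: str) -> datetime:
    """Return the expiry time carried in the `se` query parameter of a SAS URL.

    Assumption: `se` is in the form YYYY-MM-DDTHH:MM:SSZ (UTC), as in our URL.
    """
    query = parse_qs(urlsplit(url).query)
    expiry_text = query["se"][0]
    parsed = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def days_until_expiry(url: str, now: datetime) -> float:
    """Number of days (possibly fractional or negative) until the SAS expires."""
    return (sas_expiry(url) - now).total_seconds() / 86400


# Hypothetical URL with a placeholder signature, for illustration only.
example_url = (
    "https://STORAGEACCOUNT.dfs.core.windows.net/CONTAINER/FILE.xlsx"
    "?sv=2020-08-04&se=2024-12-01T17:44:05Z&sp=r&sig=PLACEHOLDER"
)
```

Calling `days_until_expiry(example_url, datetime.now(timezone.utc))` in a scheduled notebook would let us flag a token before it lapses, but it is still something we would rather not have to maintain.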
Content that we had an approach that would work, we continued our development, which finds the files we want to read by looping over the results of mssparkutils.fs.ls("Your directory path"). As of yesterday (2022-02-24) we started experiencing issues with this method. It turned out that it had stopped accepting URLs but would still accept abfss links.
from notebookutils import mssparkutils
adslpath = 'abfss://CONTAINER.dfs.core.windows.net/STORAGEACCOUNT/AssetManagement/Bronze/Raw/NOJV//'
# abfsspath = [file for file in mssparkutils.fs.ls(adslpath)]
# folder = [file for file in mssparkutils.fs.ls('https://CONTAINER.dfs.core.windows.net/STORAGEACCOUNT/AssetManagement/Bronze/Raw/NOJV//')]
folder = [file for file in mssparkutils.fs.ls('https://CONTAINER.blob.core.windows.net/STORAGEACCOUNT/AssetManagement/Bronze/Raw/NOJV//')]
As of this morning, every one of these variants of the call fails.
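In case the anonymised paths above are confusing, this is a small sketch of the two path forms as we understand them from the ADLS Gen2 documentation: `abfss://<container>@<account>.dfs.core.windows.net/<path>` and `https://<account>.dfs.core.windows.net/<container>/<path>`. The helper names are our own and the values are placeholders. It only builds the strings so that the same container, account and folder can be tried in each form.

```python
def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an abfss URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    clean_path = path.strip("/")  # drop leading/trailing slashes such as "NOJV//"
    return f"abfss://{container}@{account}.dfs.core.windows.net/{clean_path}"


def https_url(container: str, account: str, path: str, endpoint: str = "dfs") -> str:
    """Build the HTTPS form; endpoint is "dfs" or "blob"."""
    clean_path = path.strip("/")
    return f"https://{account}.{endpoint}.core.windows.net/{container}/{clean_path}"


# Placeholder values matching the anonymised names used in this post.
folder_path = "AssetManagement/Bronze/Raw/NOJV//"
candidates = [
    abfss_uri("CONTAINER", "STORAGEACCOUNT", folder_path),
    https_url("CONTAINER", "STORAGEACCOUNT", folder_path),
    https_url("CONTAINER", "STORAGEACCOUNT", folder_path, endpoint="blob"),
]
```

Each entry in `candidates` can then be passed to mssparkutils.fs.ls in turn to see which forms are still accepted.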
We have also noticed changes to how the file explorer is displayed within Synapse, though only for some users.
This has been replaced by this:
The big change here seems to be the introduction of the URI and the loss of the ABFSS Path and URL. This suggests a shift from the older approach to a new one.
We have an implementation in use with a range of clients which may now be in the process of breaking, and we have not seen any indication that these changes were coming. The loss of functionality in Microsoft Spark Utilities is particularly troubling. I have a number of questions.
- Is there an update we have missed, and is there guidance on how to proceed?
- Will the URI work with managed identity, or will we need to use SAS to access files?
- Will pandas and notebookutils be updated to new versions, are there actions we will need to take, or are they irrevocably broken?
Many thanks.
Stephen Connell.