Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken abfs:// on version 2022.7.0 #1002

Closed
toby-coleman opened this issue Jul 28, 2022 · 6 comments · Fixed by #1003
Closed

Broken abfs:// on version 2022.7.0 #1002

toby-coleman opened this issue Jul 28, 2022 · 6 comments · Fixed by #1003

Comments

@toby-coleman
Copy link
Contributor

I'm running the following code to access an Azure blob storage container:

import adlfs
fs = adlfs.AzureBlobFileSystem()
fs.ls("abfs://my-container-name")

This works perfectly with fsspec==2022.5.0 and adlfs==2022.7.0. However with fsspec==2022.7.0 and adlfs==2022.7.0 I get FileNotFoundError arising from azure.core.exceptions.ResourceNotFoundError: The specified container does not exist.. It does work, however, if I run:

fs.ls("az://my-container-name")

Expectation: abfs://... syntax should be supported on Python environments containing fsspec==2022.7.0.

Environment:

  • Platform: Ubuntu Linux
  • Python: 3.9
  • Credentials provided using AZURE_STORAGE_CONNECTION_STRING environment variable.

Working Python environment (from pip freeze):

adal==1.2.7
adlfs==2022.7.0
aiohttp==3.8.1
aiosignal==1.2.0
async-timeout==4.0.2
attrs==21.4.0
azure-core==1.24.2
azure-datalake-store==0.0.52
azure-identity==1.10.0
azure-storage-blob==12.13.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
cffi==1.15.1
charset-normalizer==2.1.0
cryptography==37.0.4
frozenlist==1.3.0
fsspec==2022.5.0
idna==3.3
isodate==0.6.1
msal==1.18.0
msal-extensions==1.0.0
msrest==0.7.1
multidict==6.0.2
oauthlib==3.2.0
portalocker==2.5.1
pycparser==2.21
PyJWT==2.4.0
python-dateutil==2.8.2
requests==2.28.1
requests-oauthlib==1.3.1
six==1.16.0
treelite==2.0.0
treelite-runtime==2.0.0
typing_extensions==4.3.0
urllib3==1.26.11
yarl==1.7.2

Broken Python environment:

adal==1.2.7
adlfs==2022.7.0
aiohttp==3.8.1
aiosignal==1.2.0
async-timeout==4.0.2
attrs==21.4.0
azure-core==1.24.2
azure-datalake-store==0.0.52
azure-identity==1.10.0
azure-storage-blob==12.13.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
cffi==1.15.1
charset-normalizer==2.1.0
cryptography==37.0.4
frozenlist==1.3.0
fsspec==2022.7.0
idna==3.3
isodate==0.6.1
msal==1.18.0
msal-extensions==1.0.0
msrest==0.7.1
multidict==6.0.2
oauthlib==3.2.0
portalocker==2.5.1
pycparser==2.21
PyJWT==2.4.0
python-dateutil==2.8.2
requests==2.28.1
requests-oauthlib==1.3.1
six==1.16.0
treelite==2.0.0
treelite-runtime==2.0.0
typing_extensions==4.3.0
urllib3==1.26.11
yarl==1.7.2
@martindurant
Copy link
Member

Would you mind debugging to see if #988 or #980 is responsible?

@toby-coleman
Copy link
Contributor Author

I think #988 is the culprit - I can get it to work again by removing "abfs" from this line in fsspec/utils.py.

@martindurant
Copy link
Member

@hayesgb : gcs and s3fs tests are run as part of fsspec's test suite. Would you like to create a CI run for adlfs? It seems like we should have caught this.

@toby-coleman , would you like to make a PR reverting 988?

@antonymilne
Copy link

antonymilne commented Aug 1, 2022

@martindurant @toby-coleman @hayesgb Please would it be possible to explain a bit more what has happened here? I don't know much about the inner workings of fsspec or adlfs, but I think the problem here seems to be not that #988 was wrong but that it was not release in coordination with required changes on the adlfs side.

Context: @SajidAlamQB and I work on Kedro, which relies heavily on fsspec for handling datasets. Two years ago fsspec's handling of abfs was raised as a possible bug (#256; fsspec/adlfs#45). From reading those (see fsspec/adlfs#45 (comment)), it seems that adding absf and adl was indeed the correct thing to do, but needed to be done in coordination with a change to adlfs.

It seems that instead of this change happening, we on Kedro instead rolled our own version of fsspec.utils.infer_storage_options which is the same as fsspec's version but includes absf and adl in the list of CLOUD_PROTOCOLS. Since then we have received requests from users to extend this list further (abfss and gdrive). These changes have all worked well for our users (I'm not sure why given apparently it should have required a change to adlfs also?), but as per kedro-org/kedro#1632 it is a bit annoying to maintain the list of CLOUD_PROTOCOLS on our side.

Hence, if at all possible, we'd really like to go back to using fsspec's infer_storage_options. Is there any way of coordinating changes with the other libraries so that we can make the changes that we'd like to CLOUD_PROTOCOLS? I understand this might not be easy to achieve, but if it is possible then that would be much appreciated! 🙏 Otherwise we will need to continue maintaining our own version of infer_storage_options.

@martindurant
Copy link
Member

The short story is: we made the change, and things broke, so we reverted!
You are right, that coordination with adlfs and any changes that might have been required there would have prevented the problem. I don't think anyone knew what changes those would be, though. Perhaps (from hazy memory) azure treats the "host" as the account owner, and it's not just a bucket/key.

Separately, it would indeed be good to run adlfs's test suite as one of the runs here, as we already do for s3fs/gcsfs.

@pedro-ricardo
Copy link

Hello @martindurant,
Could the cause of this weird behavior be that fsspec/adlfs implements it's own _strip_protocol method in AzureBlobFileSystem class?
Link to code

I suspect that AzureBlobFileSystem counts on receiving ops = infer_storage_options(path) as not having the "host" on joined on "path" key, as they do : ops["path"] = ops["host"] + ops["path"] some lines later.

But once adding those "adl", "abfs", "abfss" to the protocol list in infer_storage_options, you already do this join internaly here.

if protocol in ("s3", "s3a", "gcs", "gs", "adl", "abfs", "abfss", "gdrive"):
    options["path"] = options["host"] + options["path"]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants