Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove CLOUD_PROTOCOLS - Attempt #2 #1877

Closed
wants to merge 3 commits into from
Closed

Conversation

noklam
Copy link
Contributor

@noklam noklam commented Sep 26, 2022

Signed-off-by: Nok Chan [email protected]

Description

Related: #1714, #1632

The original issue of replacing infer_storage_option
https://github.com/quantumblacklabs/private-kedro/pull/545

Quote from the original issue

We need to split filepath without using infer_storage_options() from fsspec since it returns different results for different protocols that doesn't satisfy our needs (we are already making exceptions for HTTP protocol, we should avoid piling on exceptions e.g. for Azure).

I believe there is no good reason for this CLOUD_PROTOCOLS exists, the only thing we need to defend against is hdfs, which is a special case, the cloud protocols are the normal cases.

Development notes

  • Remove CLOUD_PROTOCOLS

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Let's see if it fails the tests...

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Nok Chan <[email protected]>
@noklam
Copy link
Contributor Author

noklam commented Sep 26, 2022

Maybe after all hdfs is the weird breed instead of the CLOUD_PROTOCOLS? All we need is just protocol and path and I don't think we need the entire infer_storage_option at all. Hopefully the tests will agree with me 🙏🏼

i.e. "hdfs://namenode:8020/file.txt", which has two colons and causes some parsing issues.

@noklam noklam changed the title [TEST] - remove cloud protocol Remove CLOUD_PROTOCOLS - Attempt #2 Sep 26, 2022
@noklam noklam marked this pull request as ready for review September 26, 2022 17:32
@noklam noklam requested a review from idanov as a code owner September 26, 2022 17:32
@noklam
Copy link
Contributor Author

noklam commented Sep 26, 2022

@datajoely @SajidAlamQB @AntonyMilneQB may have some comments on this? I don't have the full context of this issue but basically the description of this PR is what I found after looking into JIRA

@noklam
Copy link
Contributor Author

noklam commented Sep 26, 2022

Please ignore the print, I obviously have to do some clean up for CLOUD_PROTOCOLS too.

@antonymilne
Copy link
Contributor

I'm very unsure about this, but that's just because I don't understand why (or even if) it currently works. Maybe our assumption that it currently works actually doesn't hold? The tests passing is definitely a good sign, but really some manual testing is probably needed here. If we can resolve this once and for all then I would LOVE it 😅 ❤️ So thank you for trying!

Ideally we should remove our infer_storage_options entirely. If that's not possible, then removing our custom CLOUD_PROTOCOLS like you do here in theory sounds like a good step towards that, but it's now more divergent from what fsspec have and I don't know which is correct.

For reference:

CLOUD_PROTOCOLS = ["s3", "s3a", "gcs", "gs"]  # fsspec 
CLOUD_PROTOCOLS = ["s3", "gcs", "gs", "adl", "abfs"]  # kedro now
CLOUD_PROTOCOLS = ["hdfs"]  # your proposal

(fsspec CLOUD_PROTOCOLS comes from https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L75. Apart from that I think our infer_storage_options is functionally identical to theirs so comparing CLOUD_PROTOCOLS is the only thing we need to do?)

So my questions would be:

  1. If our tests pass without s3, s3a, gcs, gs in CLOUD_PROTOCOLS, as they have already in fsspec, then which is correct (i.e. actually works in practice)? Our tests or fsspec?
  2. Related: why are things (apparently) working for us with adl and abfs while when we tried adding it to fsspec it didn't work? Apparently it needed updates to ablfs to work properly... But then why does it work for us? Does it actually work for us in practice or is it just that the tests pass?
  3. If hdfs should go in CLOUD_PROTOCOLS, should that in theory be done on the fsspec side too (if they can coordinate it, which I know probably they won't easily)?
  4. Why do things currently work for us without hdfs? Why do things work for fsspec without hdfs?

Basically I'm very keen for us to get to the bottom of this, but I have limited faith in our tests here unless we can really understand the above.

@noklam noklam marked this pull request as draft September 27, 2022 10:07
Signed-off-by: Nok Chan <[email protected]>
@noklam noklam closed this Nov 7, 2022
@merelcht merelcht deleted the feat/remove_cloud_protocol branch November 24, 2022 18:36
@datajoely
Copy link
Contributor

how come we closed this? It's still causing issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants