Add (optional) hash validation of objects #22

Open
3 of 6 tasks
dhermes opened this issue Jul 31, 2017 · 3 comments
Labels
api: storage Issues related to the googleapis/google-resumable-media-python API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@dhermes
Contributor

dhermes commented Jul 31, 2017

/cc @jonparrott

@dhermes dhermes added enhancement help wanted We'd love to have community involvement on this issue. labels Jul 31, 2017
@mfschwartz

mfschwartz commented Oct 4, 2017

I'd like to add basic support for checksum validation for downloads, and would appreciate your thoughts about the proposed approach:

  1. The initial support would only cover downloads, not uploads.
  2. The initial support would only cover non-composite objects. Composite objects have no MD5 checksum and must be validated using CRC32C instead, which performs very badly unless the user has a compiled crcmod. We went through quite a bit of trouble with this problem in gsutil (the command-line tool for GCS, of which I was the author/owner for the first 3 major versions). Basically, a compiled crcmod is not included with Python 2 on most OS distros, and following the installation instructions turned out to be hard for some users.
  3. I propose defining an optional callback function that can be passed to Download._write_to_stream and is called with every chunk as it is read (and written to the stream). Blob.download_to_file would then pass in a callback that accumulates the MD5. The checksumming belongs there because that is the code that knows about the file being written (the Download class just has a stream), so it is the place that can resume writes. (Note: when we resume writes we'll have to re-read the already downloaded part of the file to rebuild the MD5 state. That will make resumed downloads slower than they are today, similar to when gsutil prints "Catching up MD5 for gs://my-bucket/my-object...")
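A rough sketch of how item 3 might fit together, using only the standard library. The function names here (`write_to_stream`, `catch_up_md5`, `download_to_file`) and their signatures are stand-ins invented for illustration, not the actual methods in the library. The one assumption carried over from GCS is that the expected MD5 arrives base64-encoded.

```python
import base64
import hashlib
import io


def write_to_stream(chunks, stream, callback=None):
    """Hypothetical stand-in for Download._write_to_stream.

    Writes each chunk to the stream and, if given, hands the same
    chunk to the optional callback.
    """
    for chunk in chunks:
        stream.write(chunk)
        if callback is not None:
            callback(chunk)


def catch_up_md5(stream, md5, block_size=8192):
    """Re-read what is already in the stream to rebuild the MD5 state.

    This is the extra cost of a resumed download. It leaves the stream
    positioned at the end, ready for the remaining chunks.
    """
    stream.seek(0)
    while True:
        block = stream.read(block_size)
        if not block:
            break
        md5.update(block)


def download_to_file(chunks, stream, expected_md5_b64):
    """Hypothetical caller: owns the file, so it owns the checksum."""
    md5 = hashlib.md5()
    catch_up_md5(stream, md5)  # no-op for a fresh, empty stream
    write_to_stream(chunks, stream, callback=md5.update)
    actual = base64.b64encode(md5.digest()).decode("ascii")
    if actual != expected_md5_b64:
        raise ValueError(
            "MD5 mismatch: expected %s, got %s" % (expected_md5_b64, actual)
        )
    return actual


if __name__ == "__main__":
    expected = base64.b64encode(hashlib.md5(b"hello world").digest()).decode("ascii")
    # Simulate a resumed download: the first 6 bytes are already on disk.
    partial = io.BytesIO(b"hello ")
    download_to_file([b"wor", b"ld"], partial, expected)
    print(partial.getvalue())
```

Passing `md5.update` directly as the callback keeps the download code unaware of hashing; it only knows it should call something per chunk.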

@mfschwartz

A first cut implementation that does checksumming for non-chunked, non-composite downloads is being reviewed at:
#31
and
googleapis/google-cloud-python#4133

The next big things that need to be done:

  • Figure out why we heard reports of cases of truncated downloads (which previously resulted in corrupt downloaded data left silently in place, and now will result in an exception)
  • Add support for validating CRC32C for composite objects. Note that this is somewhat painful in Python 2 because most OS distros of Python 2 don't include a compiled crcmod. Running without a compiled crcmod is very slow, and requiring one makes it harder to get google-cloud-python working well. Unlike what happened with gsutil, for library users it's probably workable to require a compiled crcmod. We may want to do this at a new major version, though, since it would potentially look like a big performance regression to users who don't install the compiled crcmod.
  • Add support for checksumming uploads.
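To make the CRC32C cost concrete, here is a minimal pure-Python, table-driven CRC32C (Castagnoli polynomial). It is an illustration only, not crcmod's API and not what the library does; the names `crc32c` and `crc32c_b64` are made up for this sketch. The per-byte Python loop is the part that a compiled extension replaces, which is where the speed difference comes from. The base64 helper assumes the GCS convention of encoding the checksum as 4 big-endian bytes.

```python
import base64
import struct

# Reflected form of the Castagnoli polynomial used by CRC32C.
_POLY = 0x82F63B78


def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data, crc=0):
    """Return the CRC32C of data, optionally continuing from a prior value.

    Pass the previous return value as `crc` to checksum a stream
    chunk by chunk.
    """
    crc ^= 0xFFFFFFFF
    for byte in bytearray(data):  # one table lookup per byte, in Python
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_b64(data):
    """Encode the checksum as base64 of 4 big-endian bytes."""
    return base64.b64encode(struct.pack(">I", crc32c(data))).decode("ascii")


if __name__ == "__main__":
    # 0xE3069283 is the standard CRC32C check value for b"123456789".
    print(hex(crc32c(b"123456789")))
    # Chunked use, as a download callback would see the data.
    running = 0
    for chunk in (b"12345", b"6789"):
        running = crc32c(chunk, running)
    print(running == crc32c(b"123456789"))
```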

@JustinBeckwith JustinBeckwith added 🚨 This issue needs some love. triage me I really want to be triaged. labels Dec 8, 2018
@tseaver tseaver added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed 🚨 This issue needs some love. enhancement help wanted We'd love to have community involvement on this issue. triage me I really want to be triaged. labels Jan 16, 2019
@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/google-resumable-media-python API. label Mar 4, 2021
@SimonBohnenQC
Contributor

SimonBohnenQC commented Sep 30, 2022

@dhermes @mfschwartz Would it make sense to check that the response code is not 206 before validating checksums? This fake-gcs-server includes checksums even for partial downloads, which seems to be within specification (judging by the orange box here). When those checksums are verified by this library, an error is thrown as the whole object is obviously not available to the client.
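A sketch of what that check could look like, assuming the checksums arrive in an `x-goog-hash` response header of the form `crc32c=...,md5=...`. The helper names (`parse_goog_hash`, `expected_md5`) are hypothetical and the real code in this library is structured differently; the point is only that a 206 response short-circuits validation.

```python
PARTIAL_CONTENT = 206


def parse_goog_hash(header_value):
    """Split 'crc32c=AAAA==,md5=BBBB==' into a dict keyed by algorithm."""
    hashes = {}
    for part in header_value.split(","):
        # Partition on the first '=' only, since base64 padding also uses '='.
        name, sep, value = part.strip().partition("=")
        if sep:
            hashes[name] = value
    return hashes


def expected_md5(status_code, headers):
    """Return the MD5 to validate against, or None to skip validation.

    A 206 response carries only a byte range, so a whole-object
    checksum in the headers cannot match the bytes received.
    """
    if status_code == PARTIAL_CONTENT:
        return None
    return parse_goog_hash(headers.get("x-goog-hash", "")).get("md5")


if __name__ == "__main__":
    headers = {"x-goog-hash": "crc32c=n03x6A==,md5=XUFAKrxLKna5cZ2REBfFkg=="}
    print(expected_md5(200, headers))
    print(expected_md5(206, headers))
```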


5 participants