
HTTPFilesystem has a race condition on data size between the open and read calls, if content changes at server between the 2 calls #1541

Open
masariello opened this issue Mar 8, 2024 · 5 comments

Comments

@masariello

The following script reproduces the issue.

The script spins up an HTTP server that makes the JSON content one character longer every second.

The client part then hits the URL with a 1s sleep between the open and read calls. The JSON parsing immediately fails because the terminating } gets chopped off.

For good measure, the same test URL is also hit with a requests.get call, which does not exhibit the problem.

import json
from time import sleep
import datetime as dt
import requests
from threading import Thread
from urllib.parse import urlparse
import fsspec
from fsspec.implementations.dirfs import DirFileSystem

dummy_base_url = 'http://localhost:8080/'
dummy_uri = 'foo/bar'
dummy_url = dummy_base_url + dummy_uri

protocol = urlparse(dummy_url).scheme
dummy_fs = fsspec.filesystem(protocol)
dummy_dirfs = DirFileSystem(path=dummy_base_url, fs=dummy_fs)


def start_dummy_http_server():
    """Spin up a dummy HTTP Server to test the observed issue
    The returned response from a GET request is in JSON format
    The returned message is of varying size; with the size changing every second

    :return:
    """
    from http.server import HTTPServer, BaseHTTPRequestHandler

    class Serv(BaseHTTPRequestHandler):
        def do_GET(self):
            t = dt.datetime.now()
            s = t.second
            t_str = t.strftime('%Y-%m-%d %H:%M:%S')
            response = {'time': t_str, 'extra': 'x'*s}  # YYYY-MM-DD HH:MM:SS xxxxxxxx (x repeated s times)
            response = json.dumps(response)
            self.send_response(200)
            self.send_header('Content-Length', str(len(response)))
            self.end_headers()
            self.wfile.write(bytes(response, 'utf-8'))

    httpd = HTTPServer(('localhost', 8080), Serv)
    httpd.serve_forever()

print("starting http server")
thread = Thread(target=start_dummy_http_server, daemon=True)
thread.start()
print("http server started")

count = 0
while count < 500:
    count += 1
    t = dt.datetime.now()
    old_r = requests.get(dummy_url).content  # this works fine
    with dummy_dirfs.open(dummy_uri) as f:
        open_size = f.size
        sleep(1)
        # the message on the server has changed between open and read
        # the server returns the new message, but truncates it to size of original message
        data = f.read()
    new_r = requests.get(dummy_url).content  # this works fine
    old_requests_size = len(old_r)
    new_requests_size = len(new_r)
    read_size = len(data)
    print(f"#{count} - t = {t}")
    print(f"open size = {open_size}")
    print(f"read size = {read_size}")
    print(f"orig requests size = {old_requests_size}")
    print(f"new requests size = {new_requests_size}")
    print(data)
    j = json.loads(data)  # this will fail because the JSON is no longer valid
    num_keys = len(j.keys())
    print("success")
    sleep(1)


thread.join()
print("script ended")
masariello added a commit to masariello/filesystem_spec that referenced this issue Mar 8, 2024
…etween the open and read class, if content changes at server between the 2 class
@martindurant
Member

I am not convinced that this should be an issue. When the server changes state during our operations, ideally we should raise an exception rather than silently change our state. The size of a file is needed to be able to do any random access, so just saying None (in the linked PR) will limit us to streaming the response via HTTPStreamFile rather than HTTPFile. That might be OK for JSON, which must be parsed as a stream anyway, but no good for many other formats.

In general, I agree that getting the size could be deferred at least until the first read and that the first read could also contain some data. I think all of the HTTP-based backends would support that, but it would take some reworking.
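One possible shape for that rework (a rough sketch using requests, not fsspec's actual API; `LazyHTTPFile` is an invented name): issue the GET only on the first read(), and take the size from that same response's Content-Length header, so no separate HEAD round-trip is needed:

```python
import requests


class LazyHTTPFile:
    """Sketch of deferring the size lookup: no request is made at open time;
    the first read() issues a single streaming GET and records the size from
    that same response's Content-Length header."""

    def __init__(self, url):
        self.url = url
        self._resp = None
        self.size = None  # unknown until the first read

    def _ensure_open(self):
        if self._resp is None:
            self._resp = requests.get(self.url, stream=True)
            self._resp.raise_for_status()
            cl = self._resp.headers.get("Content-Length")
            self.size = int(cl) if cl is not None else None

    def read(self, n=-1):
        self._ensure_open()
        # urllib3's read() treats None as "read to end"
        return self._resp.raw.read(None if n < 0 else n)
```

Because the first read and the size come from the same response, the two can never disagree; random access (seek) would still require separate Range requests, which reintroduces the consistency question.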

@masariello
Author

When the server changes state during our operations, ideally we should raise an exception rather than silently change our state

I can see this would be consistent with the concept of a file on disk changing during a read, but one cannot help noticing that in the case of an HTTP "file", a requests.get would be able to download the full consistent state of the object.

I agree that getting the size could be deferred at least until the first read

Yes, that would at least remove one round-trip to the server and reduce the probability of the race condition hitting the client.

@martindurant
Member

I agree that getting the size could be deferred at least until the first read

Yes, that would at least remove one round-trip to the server and reduce the probability of the race condition hitting the client.

If you are interested, I would love to see an implementation. Otherwise, keep pinging me until I do something about it.

Any workflow like

f = open()
f.seek()  # not with whence=2
f.read()  # read-to-end

should not need two requests. The server ought to tell us the total byte size at that point too, but I'm not sure this is guaranteed.

'Content-Range': 'bytes 3-30/31'
                             ^
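The total after the slash in that Content-Range value is exactly the number a client would need. A small sketch of parsing it (the helper name is made up, not fsspec API; per RFC 7233 the total may also be reported as "*", i.e. unknown):

```python
import re


def total_size_from_content_range(header):
    """Parse the total size from a Content-Range header,
    e.g. 'bytes 3-30/31' -> 31.
    Returns None when the server reports an unknown total ('bytes 3-30/*')."""
    m = re.match(r"bytes\s+(?:\d+-\d+|\*)/(\d+|\*)", header)
    if m is None or m.group(1) == "*":
        return None
    return int(m.group(1))


print(total_size_from_content_range("bytes 3-30/31"))  # -> 31
print(total_size_from_content_range("bytes 3-30/*"))   # -> None
```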

@martindurant
Member

Unless you feel like trying to implement my suggestion, do keep pinging me until I have time to do something about it myself.

@masariello
Author

One way to avoid this race condition is to call fs.open(path, block_size=0). This returns a non-seekable HTTPStreamFile, which will probably fit many use-cases. It fits ours, so we will go with that.
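For anyone landing here, the workaround can be checked end-to-end against the same kind of growing-content server as in the reproduction above (the port 8081 and the URL path are arbitrary choices for this sketch):

```python
import datetime as dt
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread
from time import sleep

import fsspec


# Same dummy server as the reproduction script: the JSON body grows by one
# character every second.
class Serv(BaseHTTPRequestHandler):
    def do_GET(self):
        t = dt.datetime.now()
        body = json.dumps({"time": t.strftime("%Y-%m-%d %H:%M:%S"),
                           "extra": "x" * t.second})
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

    def log_message(self, *args):  # keep the output quiet
        pass


httpd = HTTPServer(("localhost", 8081), Serv)
Thread(target=httpd.serve_forever, daemon=True).start()

fs = fsspec.filesystem("http")
# block_size=0 yields a non-seekable HTTPStreamFile: the whole body comes
# from a single streaming GET, so a change on the server between open() and
# read() cannot truncate the response.
with fs.open("http://localhost:8081/foo/bar", block_size=0) as f:
    sleep(1.5)  # server content changes while the file is open
    data = f.read()

print(json.loads(data))  # parses successfully, unlike the default mode
httpd.shutdown()
```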
