Down::ChunkedIO#pos returns wildly incorrect values #73
Could you write a self-contained example that reproduces the problem? Does it happen only when used with CSV?
Found the issue & can create a PR with test cases, but there's a bit of discussion about how to handle it, since we now have two overlapping issues.

**Detail of problem**

As far as I can see, the bug I've encountered here is specific to `Down::ChunkedIO#gets`. We know now from #74 that the entire Ruby CSV system tries to be fully encoding-aware, even down to looking for line ending characters in the correct encoding.

The CSV parser reads only 1K in its initial line ending scan, it transpires. It is trying to determine the CSV file's line ending type automatically, reading in 1K chunks - and of course, usually finding an end-of-line well inside that amount, so only needing to read the first 1K. During that scan it is passing a separator down to `gets`, and it is in the separator handling inside `gets` that we go wrong.
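To make the failure mode concrete, here's a deliberately simplified toy - not Down's actual source, just a sketch of the same accounting pattern - where leftover bytes after the separator go back into a buffer but are never subtracted from the running position:

```ruby
# Toy model of the bug (not Down's real implementation). Leftover bytes
# after the separator are pushed back into @buffer, but @position is
# never reduced, so those bytes are counted again on the next #gets.
class LeakyPositionIO
  attr_reader :position

  def initialize(data, chunk_size: 1024)
    @chunks   = data.scan(/.{1,#{chunk_size}}/m).to_enum
    @buffer   = nil
    @position = 0
  end

  def gets(separator = "\n")
    data = @buffer || @chunks.next
    @buffer = nil
    @position += data.bytesize           # re-counts any buffered leftover
    line, extra = data.split(separator, 2)
    line << separator if data.include?(separator)
    @buffer = extra if extra             # missing: @position -= extra.bytesize
    line
  end
end

io = LeakyPositionIO.new("one\ntwo\nthree\nfour\n") # 19 bytes in total
4.times { io.gets; puts io.position }               # 19, 34, 45, 50 - not 4, 8, 14, 19
```

Each `gets` re-counts the buffered remainder, so the reported position races ahead of the bytes actually handed out - exactly the "far beyond the end of the file" symptom.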
At first glance, I appear to have fixed the position offset problem with:

```ruby
line, extra = data.split(separator, 2)
line << separator if data.include?(separator)
data.clear # deallocate data

if extra
  @position -= extra.bytesize # <--- Fix "position"
  # ...rest of the implementation as-is...
```
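For what it's worth, applying the same one-line subtraction to the toy `LeakyPositionIO` sketched above brings its positions back in line with the bytes actually consumed:

```ruby
# Same toy #gets with the one-line correction applied: un-count the
# leftover bytes that get pushed back into the buffer.
def gets(separator = "\n")
  data = @buffer || @chunks.next
  @buffer = nil
  @position += data.bytesize
  line, extra = data.split(separator, 2)
  line << separator if data.include?(separator)
  if extra
    @position -= extra.bytesize # the fix
    @buffer = extra
  end
  line
end

# io = LeakyPositionIO.new("one\ntwo\nthree\nfour\n")
# 4.times { io.gets; puts io.position } # now 4, 8, 14, 19 - ends at file size
```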
That said, I realise there may be deeper subtleties of the implementation that I've overlooked. In a test case with CSV parsing, the offset previously climbed into the hundreds of thousands; with the fix above, the offset progresses nicely and eventually precisely matches the input file size once the CSV system has read all the data.

**Question arising**

While I can do a PR with that one-line fix and some tests, the trouble here is that the use of `gets` is also at the heart of #74, so we have something that overlaps; this issue and #74 are closely related. Would you prefer me to:
We are using Shrine for handling large CSV uploads that are then stream-processed. We have a progress meter for this which works off the underlying IO object's `#pos` values. For local files, this works perfectly. Once we went into our Staging environment with S3 as the storage engine, using Down under the hood, it all broke. It seems that after the first 1K of data, `Down::ChunkedIO#pos` starts returning values much, much higher than they should be - far beyond the end of the file.

For a particular test file of only 3669 bytes, comprising around 55 CSV rows plus a header, the size reported by the IO object was consistently correct. However, inside the CSV row iterator, the results of `#pos` were:

The start offset is 0. The 1024 offset was presumed to be a chunk size from the CSV processor, but if I tried to rewind to zero and read 1024 bytes, I actually got a very strange 1057 bytes, perfectly aligned to a row end, instead. In any event, the reported position then sits at 1024 for a while and, once the CSV parsing seems to have gone past that first "chunk" - be it 1024 or 1057 bytes - the positions reported become, as you can see, very wrong.

The above was generated with no rewinding or other shenanigans; in pseudocode we have:
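A minimal sketch of that loop, assuming a local file streamed through `Down::ChunkedIO` in 1K chunks (the file name and wiring here are illustrative, not our actual Shrine/S3 setup):

```ruby
# Illustrative reconstruction of the progress-meter loop: feed a file
# through Down::ChunkedIO in 1 KB chunks and read io.pos after each row.
require "csv"
require "down/chunked_io"

file = File.open("test.csv", "rb") # hypothetical input file
io = Down::ChunkedIO.new(
  chunks:   Enumerator.new { |y| y << file.read(1024) until file.eof? },
  size:     file.size,
  on_close: -> { file.close }
)

CSV.new(io, headers: true).each do |row|
  # progress meter: bytes consumed so far out of the total size
  puts "row #{row.to_h.inspect}: pos=#{io.pos} / size=#{io.size}"
end
```

Run against an unpatched Down, the printed `pos` values overshoot `size` just as described above.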