Add option `my_range` to method iterRecords #308

lguez · 2024-10-11T18:22:56Z

Using iterRecords with a range option should be faster than calling record within a loop, since we avoid the multiple calls to seek.

JamesParrott · 2024-10-11T19:45:51Z

Hi Lionel,
Thanks for the PR.

Reducing the multiple calls to seek would be great, but I can't find where .iterRecords triggers file.seek calls. Isn't the main overhead from iterRecords in recordContents = recStruct.unpack(f.read(recStruct.size))? So while it appears to be harmless and passes the existing test suite, I don't understand how the code change enables a speed increase, any better than only iterating over .iterRecords for the desired number of records. Hitherto I thought seek calls were mainly used in PyShp to improve speed already, by leveraging the list of shortcuts in the shapefile format itself, in the .shx file.

More importantly, as far as I understand it, supporting calls to iterRecords with my_range=[3,3,3] would allow the user to break _Record.oid on their instances:

    @property
    def oid(self):
        """The index position of the record in the original shapefile"""

To go into more detail, as I said above, I cannot see where the range indices i from xrange are used by any calls to .seek. It looks to me like the only thing i is used for is as oid in self.__record(oid=i,... and thence is not used in the method until its final line: _Record(recLookup, record, oid). In that class, the only thing it appears to me to be used for is: self.__oid = oid in __init__. So I do think this change would allow a user that passed in my_range = [1,1,1] to break something else in PyShp that expects the _Record.oids to be sequential.

Even if the .oids are only metadata, they preserve the correspondence between the records and shapes from their ordering. If the user wishes to change their values, Python will not stop them from doing so. But I'm inclined against adding code that makes creation of a confusing order of Records in a shapefile easier.

Maybe I've misunderstood the code. Nonetheless, to accept this change, I absolutely need a couple of things, and would very much like to see a couple of other things in support:

Must have:

proof this change does not allow the user to break their setup, produce junk, or corrupt the data read from the .dbf etc.
new tests added showing the expected impact of list(Reader.iterRecords(my_range = [1,1,1])
Highly desirable:
Performance benchmarks for the claimed speed improvement.
Example(s) of how to use the new feature.

If it is accepted, it would then be good to add documentation for the new feature.

Best wishes,

James.

JamesParrott · 2024-10-11T22:01:47Z

I think I've understood the intent now - to avoid repeated calls to Reader.record. That method does reset via .seek(0) each call (perhaps this is unneccessary). On SDD and I would've thought HDD too, seek should be O(1). But it occurred to me that if huge shapefiles were being read from tape, multiple calls to Reader.record could be slower than iterating through all the records to the highest (and the shapefile .dbf might be too large to load entirely into RAM).

I couldn't find a more_itertools function, but there are other ways in Python to iterate through, and yield only items at specified indices, e.g. a custom generator.

If no such alternative is acceptable, the idea could be investigated. But I'd prefer a separate method, that yields groups of records at specified indices, to changing iterRecords. Or I'd require a provev-to-be-safe re-implementation of iterRecords.

Note: as I understand it, Reader.__record does not look up record number oid. It's more like a next(iterator) call, and its implementation is tightly coupled to its calls. It just assumes whatever oid it is called with, is correct.

lguez · 2024-10-15T06:08:59Z

Thank you. Yes, the point was to avoid repeated calls to Reader.record. But you are right: I misunderstood the way Reader.__record works. I will propose another method.

Add option my_range to method iterRecords

e41b03c

Using iterRecords with a range option should be faster than calling record within a loop, since we avoid the multiple calls to seek.

JamesParrott closed this Oct 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option `my_range` to method iterRecords #308

Add option `my_range` to method iterRecords #308

lguez commented Oct 11, 2024

JamesParrott commented Oct 11, 2024 •

edited

Loading

JamesParrott commented Oct 11, 2024

lguez commented Oct 15, 2024

Add option my_range to method iterRecords #308

Add option my_range to method iterRecords #308

Conversation

lguez commented Oct 11, 2024

JamesParrott commented Oct 11, 2024 • edited Loading

JamesParrott commented Oct 11, 2024

lguez commented Oct 15, 2024

Add option `my_range` to method iterRecords #308

Add option `my_range` to method iterRecords #308

JamesParrott commented Oct 11, 2024 •

edited

Loading