Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option my_range to method iterRecords #308

Closed
wants to merge 1 commit into from

Conversation

lguez
Copy link
Contributor

@lguez lguez commented Oct 11, 2024

Using iterRecords with a range option should be faster than calling record within a loop, since we avoid the multiple calls to seek.

Using iterRecords with a range option should be faster than calling
record within a loop, since we avoid the multiple calls to seek.
@JamesParrott
Copy link
Collaborator

JamesParrott commented Oct 11, 2024

Hi Lionel,
Thanks for the PR.

Reducing the multiple calls to seek would be great, but I can't find where .iterRecords triggers file.seek calls. Isn't the main overhead from iterRecords in recordContents = recStruct.unpack(f.read(recStruct.size))? So while it appears to be harmless and passes the existing test suite, I don't understand how the code change enables a speed increase, any better than only iterating over .iterRecords for the desired number of records. Hitherto I thought seek calls were mainly used in PyShp to improve speed already, by leveraging the list of shortcuts in the shapefile format itself, in the .shx file.

More importantly, as far as I understand it, supporting calls to iterRecords with my_range=[3,3,3] would allow the user to break _Record.oid on their instances:

    @property
    def oid(self):
        """The index position of the record in the original shapefile"""

To go into more detail, as I said above, I cannot see where the range indices i from xrange are used by any calls to .seek. It looks to me like the only thing i is used for is as oid in self.__record(oid=i,... and thence is not used in the method until its final line: _Record(recLookup, record, oid). In that class, the only thing it appears to me to be used for is: self.__oid = oid in __init__. So I do think this change would allow a user that passed in my_range = [1,1,1] to break something else in PyShp that expects the _Record.oids to be sequential.

Even if the .oids are only metadata, they preserve the correspondence between the records and shapes from their ordering. If the user wishes to change their values, Python will not stop them from doing so. But I'm inclined against adding code that makes creation of a confusing order of Records in a shapefile easier.

Maybe I've misunderstood the code. Nonetheless, to accept this change, I absolutely need a couple of things, and would very much like to see a couple of other things in support:

Must have:

  • proof this change does not allow the user to break their setup, produce junk, or corrupt the data read from the .dbf etc.
  • new tests added showing the expected impact of list(Reader.iterRecords(my_range = [1,1,1])
    Highly desirable:
  • Performance benchmarks for the claimed speed improvement.
  • Example(s) of how to use the new feature.

If it is accepted, it would then be good to add documentation for the new feature.

Best wishes,

James.

@JamesParrott
Copy link
Collaborator

I think I've understood the intent now - to avoid repeated calls to Reader.record. That method does reset via .seek(0) each call (perhaps this is unneccessary). On SDD and I would've thought HDD too, seek should be O(1). But it occurred to me that if huge shapefiles were being read from tape, multiple calls to Reader.record could be slower than iterating through all the records to the highest (and the shapefile .dbf might be too large to load entirely into RAM).

I couldn't find a more_itertools function, but there are other ways in Python to iterate through, and yield only items at specified indices, e.g. a custom generator.

If no such alternative is acceptable, the idea could be investigated. But I'd prefer a separate method, that yields groups of records at specified indices, to changing iterRecords. Or I'd require a provev-to-be-safe re-implementation of iterRecords.

Note: as I understand it, Reader.__record does not look up record number oid. It's more like a next(iterator) call, and its implementation is tightly coupled to its calls. It just assumes whatever oid it is called with, is correct.

@lguez
Copy link
Contributor Author

lguez commented Oct 15, 2024

Thank you. Yes, the point was to avoid repeated calls to Reader.record. But you are right: I misunderstood the way Reader.__record works. I will propose another method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants