
Incremental reading/writing of TestSuites #186

Closed
goodmami opened this issue Oct 5, 2018 · 2 comments

goodmami commented Oct 5, 2018

Creating a separate issue from #150.

The TestSuite class allows for in-memory testsuites (see #58) but, compared to the deprecated ItsdbProfile class, it removed the ability to read and write testsuites incrementally, or without reading all tables. Filtering raises some issues, since tables may need to be joined or a cache of ids built, but this could perhaps be handled by relying on the Relations and not decoding every column (e.g., just popping off the *-id columns from each line to build a key cache could be faster than reading all tables normally).
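
A minimal sketch of this key-cache idea, assuming a plain list of field names in place of the Relations object and example key fields; the function name and signature are illustrative, not PyDelphin's API:

    def build_key_cache(path, fields, key_fields=('i-id', 'parse-id')):
        """Map each key field to the set of values seen in the table."""
        # positions of the key columns according to the table's schema
        key_positions = [(name, fields.index(name))
                         for name in key_fields if name in fields]
        cache = {name: set() for name, _ in key_positions}
        with open(path, encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('@')  # '@' is the tsdb field separator
                for name, i in key_positions:
                    cache[name].add(cols[i])  # other columns are never decoded
        return cache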

@goodmami

Just putting down my thoughts...

It is too much work to try to maintain a single data structure that simultaneously works with in-memory data and on-disk data and keeps the two in sync. I don't want to deal with caching, buffering, and checking modification times of files. So there are other options:

  • Option 1: separate classes for in-memory and on-disk tables

    For example, VirtualTable and PersistentTable, or something, where the first acts like the current Table and the latter is just an interface to the data on disk (e.g., it supports iteration but not random access).

  • Option 2: only support on-disk tables; in-memory tables are simple lists

    The current Table class doesn't really offer much convenience over plainer alternatives for, e.g., converting simple tuples or dicts into Record objects or selecting columns across the rows, so I don't think there's a big loss in not modeling lists of records as tables. The main thing is that this will no longer work:

    >>> ts = TestSuite(...)
    >>> last_item = ts['item'][-1]

    Well, it could work, but such random access would not be efficient. However, if we could also skip decoding the rows we don't care about, we could get to that last item much faster the first time it is requested (see the sketch after this list).
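
For illustration, here is a rough sketch of an iteration-only, file-backed table along the lines of Option 1's PersistentTable, which would also allow the "skip the decoding" access described in Option 2. The class, its constructor arguments, and the example field names are hypothetical, not PyDelphin's actual API:

    class PersistentTable:
        """File-backed table that supports iteration but not random access."""
        def __init__(self, path, fields):
            self.path = path
            self.fields = fields  # column names from the relations schema

        def __iter__(self):
            with open(self.path, encoding='utf-8') as f:
                for line in f:
                    # a row is decoded only when the caller consumes it
                    yield dict(zip(self.fields, line.rstrip('\n').split('@')))

    # Getting the last item then becomes a single pass, but a cheap one,
    # because rows are only split, never turned into full Record objects:
    last = None
    for row in PersistentTable('item', ['i-id', 'i-input', 'i-wf']):
        last = row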

@goodmami

Ok, I've settled on a solution (for now... it may be changed if #204 goes through), and it does not really break the API unless people were using less common list methods on Tables, like .count(), .index(), etc. It is more complicated than I'd like, but less than it could be.

  • Table is no longer a subclass of list, though it implements __getitem__, __setitem__, __len__, append, and extend with the same behavior as list objects (including slices).
  • Data are stored internally as unprocessed rows split into columns as tuples; a Record object is created upon retrieval.
  • Tables are either "attached" or "detached": an attached table is bound to a file and does not store records in memory unless they've been modified or added (i.e., only the changes), and detached tables are entirely in memory.
  • On loading an attached table, the file is opened and traversed only to get a line (i.e., record) count, and the internal list is filled with None.
  • On iteration or get-access, rows that are None are retrieved from the file based on the record number (their position in the internal list). This is more efficient for iteration than for random access, but because nothing is parsed until the target item is retrieved, it is fairly quick. (A sketch of this lazy retrieval follows this list.)
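
A minimal sketch, under assumed names, of the attached-table behavior described in the bullets above (placeholder None slots sized from a line count, rows parsed only on access, and only changes kept in memory); it is illustrative, not the actual implementation:

    class AttachedTable:
        """A file-bound table whose internal list holds only the changes."""
        def __init__(self, path):
            self.path = path
            with open(path, encoding='utf-8') as f:
                count = sum(1 for _ in f)   # traverse once just to count records
            self._records = [None] * count  # None means "unchanged; still on disk"

        def __len__(self):
            return len(self._records)

        def __getitem__(self, index):
            if index < 0:  # support negative indices like a list
                index += len(self._records)
            row = self._records[index]
            if row is None:  # unchanged row: read it back from the file
                with open(self.path, encoding='utf-8') as f:
                    for i, line in enumerate(f):
                        if i == index:  # locate by record number
                            row = tuple(line.rstrip('\n').split('@'))
                            break
            return row  # the real Table would wrap this in a Record on retrieval

        def __setitem__(self, index, row):
            self._records[index] = tuple(row)  # modified rows stay in memory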

I still have a few things to do before this is done:

  • make joins work incrementally
  • make write work incrementally
  • ensure table resizing (namely shrinking or deletion) works
  • add unit tests for new functionality (existing tests pass already!)
  • add documentation

goodmami added a commit that referenced this issue Mar 28, 2019
This modifies how tables are written and fixes some nasty bugs in
order to get streaming (incremental) processing of TestSuites. The
worst bug was when attempting to overwrite a table using the table's
rows, because the previous work on incremental reading meant that the
file would be overwritten as it was being read. Now temporary files
are written to first, then copied (or appended) to the destination.

Currently this only works for Python 3. Python 2 has good ol'
UnicodeDecodeError when trying to make strings out of ACE output. I'm
not certain that this wasn't already a problem, though.

Addresses #186
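
The temporary-file strategy described in this commit might look roughly like the following sketch; the function name and the use of tempfile/shutil are assumptions for illustration, not the actual code:

    import os
    import shutil
    import tempfile

    def write_table(rows, destination, append=False):
        # write everything to a temporary file first, so that `rows` may
        # stream lazily from `destination` without the file being
        # clobbered while it is still being read
        fd, tmp_path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
                for row in rows:
                    tmp.write('@'.join(row) + '\n')
            # only now touch the destination, copying or appending the result
            mode = 'a' if append else 'w'
            with open(tmp_path, encoding='utf-8') as src, \
                 open(destination, mode, encoding='utf-8') as dst:
                shutil.copyfileobj(src, dst)
        finally:
            os.remove(tmp_path)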