
Incremental reading/writing of TestSuites #186

Closed
goodmami opened this issue Oct 5, 2018 · 2 comments

goodmami commented Oct 5, 2018

Creating a separate issue from #150.

The TestSuite class allows for in-memory testsuites (see #58) but, compared to the deprecated ItsdbProfile class, it removed the ability to read and write testsuites incrementally, or without reading all tables. Filtering raises some issues, since tables may need to be joined or a cache of ids built, but this could perhaps be handled by relying on the Relations and not decoding every column (e.g., just popping off the *-id columns from each line to build a key cache could be faster than reading all tables normally).
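
A minimal sketch of this key-cache idea, assuming a plain list of field names in place of the Relations object and example key fields; the function name and signature are illustrative, not PyDelphin's API:

    def build_key_cache(path, fields, key_fields=('i-id', 'parse-id')):
        """Map each key field to the set of values seen in the table."""
        # positions of the key columns according to the table's schema
        key_positions = [(name, fields.index(name))
                         for name in key_fields if name in fields]
        cache = {name: set() for name, _ in key_positions}
        with open(path, encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('@')  # '@' is the tsdb field separator
                for name, i in key_positions:
                    cache[name].add(cols[i])  # other columns are never decoded
        return cache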

@goodmami

Just putting down my thoughts...

It is too much work to try to maintain a single data structure that simultaneously works with in-memory data and on-disk data and keeps the two in sync. I don't want to deal with caching, buffering, and checking modification times of files. So there are other options:

  • Option 1: separate classes for in-memory and on-disk tables

    For example, VirtualTable and PersistentTable, or something, where the first acts like the current Table and the latter is just an interface to the data on disk (e.g., it supports iteration but not random access).

  • Option 2: only support on-disk tables; in-memory tables are simple lists

    The current Table class doesn't really offer much convenience over plainer alternatives for, e.g., converting simple tuples or dicts into Record objects or selecting columns across the rows, so I don't think there's a big loss in not modeling lists of records as tables. The main thing is that this will no longer work:

    >>> ts = TestSuite(...)
    >>> last_item = ts['item'][-1]

    Well, it could work, but such random access would not be efficient. However, if we could also skip decoding the rows we don't care about, we could get to that last item much faster the first time it is requested (see the sketch after this list).
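
For illustration, here is a rough sketch of an iteration-only, file-backed table along the lines of Option 1's PersistentTable, which would also allow the "skip the decoding" access described in Option 2. The class, its constructor arguments, and the example field names are hypothetical, not PyDelphin's actual API:

    class PersistentTable:
        """File-backed table that supports iteration but not random access."""
        def __init__(self, path, fields):
            self.path = path
            self.fields = fields  # column names from the relations schema

        def __iter__(self):
            with open(self.path, encoding='utf-8') as f:
                for line in f:
                    # a row is decoded only when the caller consumes it
                    yield dict(zip(self.fields, line.rstrip('\n').split('@')))

    # Getting the last item then becomes a single pass, but a cheap one,
    # because rows are only split, never turned into full Record objects:
    last = None
    for row in PersistentTable('item', ['i-id', 'i-input', 'i-wf']):
        last = row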

@goodmami

Ok, I've settled on a solution (for now... it may be changed if #204 goes through), and it does not really break the API unless people were using less common list methods on Tables, like .count(), .index(), etc. It is more complicated than I'd like, but less than it could be.

  • Table is no longer a subclass of list, though it implements __getitem__, __setitem__, __len__, append, and extend with the same behavior as list objects (including slices).
  • Data are stored internally as unprocessed rows split into columns as tuples; a Record object is created upon retrieval.
  • Tables are either "attached" or "detached": an attached table is bound to a file and does not store records in memory unless they've been modified or added (i.e., only the changes), and detached tables are entirely in memory.
  • On loading an attached table, the file is opened and traversed only to get a line (i.e., record) count, and the internal list is filled with None.
  • On iteration or get-access, rows that are None are retrieved from the file based on the record number (their position in the internal list). This is more efficient for iteration than for random access, but because nothing is parsed until the target item is retrieved, it is fairly quick. (A sketch of this lazy retrieval follows this list.)
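
A minimal sketch, under assumed names, of the attached-table behavior described in the bullets above (placeholder None slots sized from a line count, rows parsed only on access, and only changes kept in memory); it is illustrative, not the actual implementation:

    class AttachedTable:
        """A file-bound table whose internal list holds only the changes."""
        def __init__(self, path):
            self.path = path
            with open(path, encoding='utf-8') as f:
                count = sum(1 for _ in f)   # traverse once just to count records
            self._records = [None] * count  # None means "unchanged; still on disk"

        def __len__(self):
            return len(self._records)

        def __getitem__(self, index):
            if index < 0:  # support negative indices like a list
                index += len(self._records)
            row = self._records[index]
            if row is None:  # unchanged row: read it back from the file
                with open(self.path, encoding='utf-8') as f:
                    for i, line in enumerate(f):
                        if i == index:  # locate by record number
                            row = tuple(line.rstrip('\n').split('@'))
                            break
            return row  # the real Table would wrap this in a Record on retrieval

        def __setitem__(self, index, row):
            self._records[index] = tuple(row)  # modified rows stay in memory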

I still have a few things to do before this is done:

  • make joins work incrementally
  • make write work incrementally
  • ensure table resizing (namely shrinking or deletion) works
  • add unit tests for new functionality (existing tests pass already!)
  • add documentation

goodmami added a commit that referenced this issue Mar 28, 2019
This modifies how tables are written and fixes some nasty bugs in
order to get streaming (incremental) processing of TestSuites. The
worst bug was when attempting to overwrite a table using the table's
rows, because the previous work on incremental reading meant that the
file would be overwritten as it was being read. Now temporary files
are written to first, then copied (or appended) to the destination.

Currently this only works for Python 3. Python 2 has good ol'
UnicodeDecodeError when trying to make strings out of ACE output. I'm
not certain that this wasn't already a problem, though.

Addresses #186
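
The temporary-file strategy described in this commit might look roughly like the following sketch; the function name and the use of tempfile/shutil are assumptions for illustration, not the actual code:

    import os
    import shutil
    import tempfile

    def write_table(rows, destination, append=False):
        # write everything to a temporary file first, so that `rows` may
        # stream lazily from `destination` without the file being
        # clobbered while it is still being read
        fd, tmp_path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
                for row in rows:
                    tmp.write('@'.join(row) + '\n')
            # only now touch the destination, copying or appending the result
            mode = 'a' if append else 'w'
            with open(tmp_path, encoding='utf-8') as src, \
                 open(destination, mode, encoding='utf-8') as dst:
                shutil.copyfileobj(src, dst)
        finally:
            os.remove(tmp_path)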