
Removal of globals from qc tests #195

Merged: 46 commits into master on Dec 8, 2017
Conversation

bkatiemills (Member)

Summary

  • Solves Parallelization #192 by persisting information in supplementary postgres tables instead of module globals
  • Restores parallelization at the profile level (no longer need to launch multiple AutoQC.py processes running on subsets of the data)
  • Removes reliance on bashrc for setting up the Docker container, which smooths out problems running on Travis.

Discussion

  • The 1000-profile test set returned the same summary results after these changes as before.
  • Execution times on several different AWS instances, as a function of parallelization, are as follows:

100 profiles

m4.2xlarge, 8 CPU / 32 GiB mem

| Multiplicity | Execution [s] |
|--------------|---------------|
| 1            | 1546          |
| 2            | 781           |
| 4            | 417           |
| 6            | 379           |
| 8            | 365           |

c4.4xlarge, 16 CPU / 30 GiB mem

| Multiplicity | Execution [s] |
|--------------|---------------|
| 1            | 1449          |
| 2            | 708           |
| 4            | 371           |
| 8            | 194           |
| 16           | 183           |

c4.8xlarge, 36 CPU / 60 GiB mem

| Multiplicity | Execution [s] |
|--------------|---------------|
| 1            | 1411          |
| 2            | 684           |
| 4            | 370           |
| 8            | 197           |
| 16           | 110           |
| 32           | 100           |

1000 profiles

c4.8xlarge, 36 CPU / 60 GiB mem

| Multiplicity | Execution [s] |
|--------------|---------------|
| 8            | 1541          |
| 16           | 791           |
| 32           | 694           |

In all cases, execution time drops roughly linearly with the number of processes until the number of CPUs in the instance begins to saturate. Note also that doubling the memory at the same processor count (going from c4.4xlarge to c4.8xlarge) left execution times unchanged for small numbers of processes, suggesting that this behavior is not a memory limitation.
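The profile-level parallelization described above can be sketched with Python's multiprocessing pool. This is a minimal illustration, not the actual AutoQC code: `process_row` here is a placeholder for the real worker that runs every QC test on one profile and writes results to the database.

```python
# Minimal sketch of profile-level parallelization; process_row is a
# placeholder for the real AutoQC worker.
from multiprocessing import Pool

def process_row(uid):
    # Stand-in for running the QC tests on profile `uid` and writing
    # the results back to the database.
    return uid * uid

def run_parallel(uids, processes=4):
    # Each profile is independent of the others, so the work farms out
    # cleanly across a pool of worker processes.
    with Pool(processes=processes) as pool:
        return pool.map(process_row, uids)

if __name__ == '__main__':
    print(run_parallel(range(8)))
```

Because each profile is processed independently, the speedup scales with process count until the instance's CPUs saturate, matching the benchmark tables above.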

AutoQC.py Outdated
```diff
@@ -58,12 +51,15 @@ def process_row(uid):
 # run tests
 for itest, test in enumerate(testNames):
-    result = run(test, [profile], parameterStore)
-    query = "UPDATE " + sys.argv[1] + " SET " + test.lower() + " = " + str(result[0][0]) + " WHERE uid = " + str(profile.uid()) + ";"
+    result = run(test, [profile], parameterStore)[0]
```
Contributor:
I may be misunderstanding the changes, but does this store only the QC result for the first level of the profile? Should it be `result = np.any(run(test, [profile], parameterStore))`?

Member Author:

I think this is correct as written, though it is pretty confusing on second look! As I read it, `result = run(test, [profile], parameterStore)[0][0]` would be the first level in the first profile, and `result = run(test, [profile], parameterStore)[0]` is the QC result for every level in the first profile - which is the only one we consider, since `run` is given a list of exactly one profile to run on.
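The indexing under discussion can be illustrated with a toy return value; `fake_results` here stands in for what `run(test, [profile], parameterStore)` returns when given a list of exactly one profile (a list of per-profile results, each a list of per-level flags).

```python
# Toy illustration of the indexing discussed above; fake_results stands
# in for the return value of run() on a single-profile list.
fake_results = [[True, False, True]]  # one profile, three levels

per_level = fake_results[0]       # QC flags for every level of the profile
first_level = fake_results[0][0]  # flag for the first level only

assert per_level == [True, False, True]
assert first_level is True
```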

AutoQC.py Outdated
```diff
-    query = "UPDATE " + sys.argv[1] + " SET " + test.lower() + " = " + str(result[0][0]) + " WHERE uid = " + str(profile.uid()) + ";"
+    result = run(test, [profile], parameterStore)[0]
+    result = pickle.dumps(result, -1)
+    query = "UPDATE " + sys.argv[1] + " SET " + test.lower() + " = " + str(psycopg2.Binary(result)) + " WHERE uid = " + str(profile.uid()) + ";"
```
Contributor:

Is it worth creating a single query containing all the QC results for a profile before sending it to the database? With sqlite this makes things a lot faster, as it reduces the time spent writing to the database. PostgreSQL may be different, though?

Member Author:

Normally I would agree, but are there no cases where one test consumes the result of a previous test by looking it up from the db? I can't quickly find an example of this, but I remember something like this being relevant - I'll look into it more carefully this weekend.

Member Author:

On closer inspection, the information pulled across tests is all auxiliary data from other tables, not the final QC result - so sure, I'll change this presently.
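A batched write along the lines suggested above might look as follows. This is only a sketch: it uses sqlite3 for a self-contained illustration (the PR at this point targets postgres), and the table and column names are hypothetical. Parameterized queries also avoid building SQL by string concatenation, as the quoted diff does.

```python
# Sketch of accumulating all QC results for one profile and writing them
# in a single UPDATE; sqlite3 and the table/column names are illustrative.
import pickle
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE qc (uid INTEGER PRIMARY KEY, test_a BLOB, test_b BLOB)')
conn.execute('INSERT INTO qc (uid) VALUES (1)')

# Collected per-level QC results for one profile, keyed by test name.
results = {'test_a': [True, False], 'test_b': [False, False]}

# One UPDATE per profile instead of one per test; '?' placeholders keep
# the pickled blobs out of the SQL string itself.
sets = ', '.join('%s = ?' % name for name in results)
blobs = [pickle.dumps(r, -1) for r in results.values()]
conn.execute('UPDATE qc SET %s WHERE uid = ?' % sets, blobs + [1])
conn.commit()

row = conn.execute('SELECT test_a FROM qc WHERE uid = 1').fetchone()
print(pickle.loads(row[0]))
```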

s-good commented Mar 3, 2017

Hi @BillMills, one thing I noticed about these changes is that the QC routines now have database interactions with them, which means that it is more difficult for people to take them and use them in other things. I wonder if it is possible to somehow encapsulate this in a separate function so that they can be used outside of a database setup?

@bkatiemills (Member Author):

@s-good - so, just to make sure I understand what you're after: you'd like the main routine for each QC test not to refer explicitly to any database, but to call helper functions instead, the idea being that those helpers encapsulate our database interactions and can easily be replaced by third parties who want to consume our tests somewhere else.

Sure - that's just a whole bunch of trivial wrapper functions, we can definitely do that. I'll have a go in the next week.
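Such a wrapper layer might be sketched as below. The helper names (`set_parameter`, `get_parameter`) and the dict backend are hypothetical stand-ins, not the actual AutoQC helpers; the point is only that QC routines call the helpers and never touch storage directly.

```python
# Sketch of encapsulating storage behind helper functions so QC routines
# never touch the database directly; names and the dict backend are
# hypothetical placeholders.
_store = {}

def set_parameter(name, value):
    # A third party could swap this body for their own backend
    # (postgres, sqlite, files, ...) without touching the QC tests.
    _store[name] = value

def get_parameter(name, default=None):
    return _store.get(name, default)

# A QC test then only ever calls the helpers:
set_parameter('EN_spike_threshold', 0.25)
print(get_parameter('EN_spike_threshold'))
```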


s-good commented Mar 8, 2017

That's right, if possible so that it will run without a database automatically if one isn't present so that a third party can just plug the code into their system. Of course, the first priority is to get it working well for our own application - I've been finding a few problems with the EN_track_check (see #196 ) which I think these modifications might help with.

@bkatiemills (Member Author):

Hi @s-good,

Sorry it's been a while; just wanted to update you on my thinking here. I'm working on a hybrid postgres / sqlite3 solution that I think will be the best of both worlds. Basically, we'll use sqlite3 (which ships with Python and requires no extra infrastructure setup, unlike postgres) for parameter stores (i.e., everything other than the main results table), as well as for everything in the unit test suite. That way, it will be possible to pick up the qc tests and their associated unit tests and use them in another project, per your request.

Meanwhile, the AutoQC infrastructure that consumes the qc tests will continue to use postgres for its main database table. I'm worried that sqlite won't perform well for the main table under high concurrency and large datasets - i.e., full production runs. But it would be interesting to see whether experiment actually bears this out; if things go fine with a full sqlite backend, then we could use that exclusively and simplify the whole operation - we'll see if it's feasible.
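A minimal sqlite3-backed parameter store along these lines could be sketched as follows; the table name, schema, and function names are illustrative, not the actual AutoQC implementation.

```python
# Sketch of an sqlite3-backed parameter store as described above; sqlite3
# ships with Python, so no extra infrastructure is needed. The table name
# and schema are illustrative.
import pickle
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE parameters (name TEXT PRIMARY KEY, value BLOB)')

def put(name, value):
    # Pickle arbitrary Python objects into a BLOB column.
    conn.execute('INSERT OR REPLACE INTO parameters VALUES (?, ?)',
                 (name, pickle.dumps(value, -1)))

def get(name):
    row = conn.execute('SELECT value FROM parameters WHERE name = ?',
                       (name,)).fetchone()
    return pickle.loads(row[0]) if row else None

put('climatology_grid', [1.5, 2.5])
print(get('climatology_grid'))
```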


s-good commented Mar 21, 2017

Hi @BillMills, this sounds like a really nice solution! Thanks for working on this.

@bkatiemills (Member Author):

Alright - these last changes migrate to exclusive use of sqlite3. In addition to the unit tests, I see no change in results over my usual 1000-profile integration test case.

This is a huge architectural change. I strongly recommend performing an independent test for accuracy and speed before merging - but given the success of those tests, I think this will dramatically clean up and simplify AutoQC's architecture.


s-good commented Jun 16, 2017

Sorry to have been slow to reply to this. I've been having a go with the new version. I think it looks really good. I've run it on a test set of 10000 profiles using 4 processors and it ran quickly. A couple of issues came up, which probably only need small code tweaks if anything, so I've not checked yet if the results are exactly the same as the postgres version but will aim to do that very soon.

  • I had a few problems with the interpretation of the truth QC flags in summarize results. At first everything was being flagged, which I think is because the data were strings, and non-empty strings always convert to True as booleans. Converting to int first worked for some data but not others (I don't know why). I might be wrong, but it looks like the full profile of QC results is being saved for the tests as a 'BLOB' type. I was wondering if we should do the same for the truth results? It could have advantages in the long run for analysing the data.
  • summarize_results fails if QC results have not been stored, since the code tries to unpickle nonexistent data. I got around this by using a try/except in unpack_qc so that it returns False if unpacking the data fails.
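The workaround described in the second bullet can be sketched as below, assuming `unpack_qc` simply unpickles a stored blob; the fallback of `False` mirrors the suggestion above.

```python
# Sketch of a defensive unpack_qc as described above: unpickle a stored
# QC blob, falling back to False if the data is missing or unreadable.
import pickle

def unpack_qc(blob):
    try:
        return pickle.loads(blob)
    except Exception:
        # No stored result (e.g. None) or corrupt data: treat as unflagged.
        return False

print(unpack_qc(pickle.dumps([True, False], -1)))
print(unpack_qc(None))
```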

This was referenced Aug 13, 2017
@bkatiemills (Member Author):

after #210, the sqlite and postgres implementations of AutoQC perform identically in a single process on @s-good's 10k testing profiles. I'm currently investigating behavior in a multi-process environment, and if everything checks out there, we can go ahead and merge here.

@bkatiemills (Member Author):

after the fixes and upgrades in #211, this successfully runs on our 10k test sample in parallel, and IMO is good to merge after @s-good signs off.


s-good commented Dec 3, 2017

This looks like a really impressive set of changes! Thanks for tackling this! I'm very happy for this to be merged. A couple of queries are below.

My only question is with the removal of the try/excepts in the CoTeDe QC test routines. I can't remember why I set it up like this, but I wonder if CoTeDe sometimes returns exceptions. What will happen in that situation now?

We could also consider including lines 52-57 from AutoQC.py (https://github.com/IQuOD/AutoQC/blob/master/AutoQC.py#L52) in build-db.py, to skip writing profiles to the database if there are no usable data, as is already done in a slightly different way at https://github.com/IQuOD/AutoQC/blob/master/build-db.py#L61. That avoids having entries in the database that don't have QC results attached to them.
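The suggested filter at database-build time might look like the sketch below; `has_usable_data` and the temperature/depth lists are placeholders for the real checks against the profile data, not the actual AutoQC code.

```python
# Sketch of skipping profiles with no usable data before writing them to
# the database, as suggested above; has_usable_data is a hypothetical
# stand-in for the real check.
import math

def has_usable_data(temperatures, depths):
    # Usable only if at least one level has both a temperature and a depth.
    return any(not (math.isnan(t) or math.isnan(z))
               for t, z in zip(temperatures, depths))

nan = float('nan')
print(has_usable_data([10.0, nan], [5.0, 10.0]))  # one complete level
print(has_usable_data([nan, nan], [5.0, 10.0]))   # no usable levels
```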


bkatiemills commented Dec 7, 2017

@s-good, regarding the try/excepts around the CoTeDe tests: you're right, those tests raise a lot of exceptions, and we put the try/excepts in so we wouldn't error out in those cases. The trouble was that this suppressed error messages we really should be debugging (someday); now, those same errors are caught by the try/except at the test-running step in AutoQC.py:

AutoQC/AutoQC.py, lines 40 to 45 at 6f75133:

```python
try:
    result = run(test, [profile], parameterStore)[0]
except:
    print test, 'exception', sys.exc_info()
    result = np.zeros(1, dtype=bool)
```

This will log those errors to the per-profile logs, making them available for analysis and remedy.

I agree with your suggestion for skipping over the profiles that have no usable data at the db writing step - that should have been included from the beginning, I'll add that in, run one more regression and unit test run to make absolutely certain everything is correct, then merge.

@bkatiemills (Member Author):

Alright, @s-good, the modifications you requested are in place and everything still checks out, so I'm going to go ahead and merge. Thanks for your feedback on this long PR - I think this is a substantial step forward in design, with a number of bugs caught along the way.

@bkatiemills bkatiemills merged commit bf9603d into master Dec 8, 2017
@bkatiemills bkatiemills deleted the no-global branch February 8, 2018 01:24