Removal of globals from qc tests #195
Conversation
…le name, added usage instructions.
AutoQC.py
Outdated
```diff
@@ -58,12 +51,15 @@ def process_row(uid):

 # run tests
 for itest, test in enumerate(testNames):
-    result = run(test, [profile], parameterStore)
-    query = "UPDATE " + sys.argv[1] + " SET " + test.lower() + " = " + str(result[0][0]) + " WHERE uid = " + str(profile.uid()) + ";"
+    result = run(test, [profile], parameterStore)[0]
```
I may be misunderstanding the changes, but does this store only the QC result for the first level of the profile? Should it be `result = np.any(run(test, [profile], parameterStore))`?
I think this is correct as written, though it is pretty confusing on second look! As I read it, `result = run(test, [profile], parameterStore)[0][0]` would be the first level in the first profile, while `result = run(test, [profile], parameterStore)[0]` is the QC result for every level in the first profile - which is the only one we consider, since `run` is getting a list of exactly one profile to run on.
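To make the indexing convention discussed above concrete, here is a minimal sketch. It assumes (hypothetically - these names and shapes are not taken from the AutoQC source) that `run` returns one entry per profile, each a list of per-level QC flags:

```python
# Hypothetical stand-in for AutoQC's run(): assume it returns one entry per
# profile, each entry being a list of per-level QC flags.
def run(test, profiles, parameter_store=None):
    # flag level 2 of every profile, pass the rest (illustrative data only)
    return [[level == 2 for level in range(4)] for _ in profiles]

results = run("dummy_test", ["profile_A"])

first_level_only = results[0][0]  # QC flag for level 0 of the first profile
whole_profile = results[0]        # QC flags for every level of the first profile

print(first_level_only)  # False
print(whole_profile)     # [False, False, True, False]
```

So with a single-profile list, `[0]` keeps the full per-level result while `[0][0]` would silently drop everything below the first level.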
AutoQC.py
Outdated
```diff
-    query = "UPDATE " + sys.argv[1] + " SET " + test.lower() + " = " + str(result[0][0]) + " WHERE uid = " + str(profile.uid()) + ";"
     result = run(test, [profile], parameterStore)[0]
+    result = pickle.dumps(result, -1)
+    query = "UPDATE " + sys.argv[1] + " SET " + test.lower() + " = " + str(psycopg2.Binary(result)) + " WHERE uid = " + str(profile.uid()) + ";"
```
Is it worth changing this to build a single query containing all the QC results for a profile before sending it to the database? With sqlite this makes things a lot faster, as it reduces the time spent writing to the database. PostgreSQL may be different though?
Normally I would agree, but are there no cases where one test consumes the result of a previous test by looking it up from the db? I can't quickly find an example of this, but I remember something like this being relevant - I'll look into it more carefully this weekend.
On closer inspection, the information pulled across tests is all auxiliary data from other tables, not the final QC result - so sure, I'll change this presently.
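The batching idea agreed on above could look something like the following sketch. The table layout, test names, and data are hypothetical, and it uses sqlite3 with parameterized queries (which also sidesteps the string-concatenated SQL in the diff):

```python
import pickle
import sqlite3

# Hypothetical sketch: write all QC results for one profile in a single
# parameterized UPDATE, instead of one round trip to the database per test.
test_names = ["test_a", "test_b"]  # assumed test list, not from AutoQC
results = {"test_a": [True, False], "test_b": [False, False]}  # assumed per-level flags

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qc (uid INTEGER PRIMARY KEY, test_a BLOB, test_b BLOB)")
conn.execute("INSERT INTO qc (uid) VALUES (?)", (42,))

# one UPDATE covering every test column, with pickled results as parameters
columns = ", ".join(name + " = ?" for name in test_names)
params = [pickle.dumps(results[name], -1) for name in test_names] + [42]
conn.execute("UPDATE qc SET " + columns + " WHERE uid = ?", params)
conn.commit()

row = conn.execute("SELECT test_a FROM qc WHERE uid = 42").fetchone()
print(pickle.loads(row[0]))  # [True, False]
```

One statement per profile means one write lock acquisition per profile rather than one per test, which is where the sqlite speedup mentioned above comes from.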
Hi @BillMills, one thing I noticed about these changes is that the QC routines now have database interactions embedded in them, which makes it more difficult for people to take them and use them in other projects. I wonder if it is possible to somehow encapsulate this in a separate function so that they can be used outside of a database setup?
@s-good - so, just to make sure I understand what you're after, you'd like the main [...]? Sure - that's just a whole bunch of trivial wrapper functions, we can definitely do that. I'll have a go in the next week.
That's right - ideally it would run automatically without a database if one isn't present, so that a third party can just plug the code into their system. Of course, the first priority is to get it working well for our own application - I've been finding a few problems with the EN_track_check (see #196) which I think these modifications might help with.
Hi @s-good, sorry it's been a while - just wanted to update you on my thinking here. I'm working on a hybrid postgres / sqlite3 solution that I think will be the best of both worlds. Basically, we'll use sqlite3 (which ships with python and requires no extra infrastructure setup, unlike postgres) for parameter stores (ie, everything other than the main results table), as well as for everything in the unit test suite. That way, it'll be possible to pick the qc tests and their associated unit tests up and use them in another project, per your request. Meanwhile, the AutoQC infrastructure that consumes the qc tests will continue to use postgres for its main database table.

I'm worried that sqlite isn't going to perform well for the main table in cases with high concurrency and large datasets - ie, full production runs. But it would be interesting to see if experiments actually bear this out; if things go fine with a full sqlite backend, then we could just use that exclusively and simplify the whole operation - if it's feasible, we'll see.
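A sqlite3-backed parameter store of the kind described above could be sketched roughly like this. The class name, schema, and parameter names are all hypothetical, not taken from AutoQC:

```python
import sqlite3

# Hypothetical sketch of a sqlite3-backed parameter store: auxiliary data
# lives in a local file database that ships with the project, so QC tests
# can read parameters without any postgres setup.
class ParameterStore:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS parameters (key TEXT PRIMARY KEY, value REAL)"
        )

    def put(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO parameters (key, value) VALUES (?, ?)",
            (key, value),
        )
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute(
            "SELECT value FROM parameters WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

store = ParameterStore()
store.put("climatology_tolerance", 2.5)  # illustrative parameter name
print(store.get("climatology_tolerance"))  # 2.5
```

Because `sqlite3` is in the standard library, a project vendoring the QC tests needs no database server at all - which is exactly the portability argument made above.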
Hi @BillMills, this sounds like a really nice solution! Thanks for working on this.
Alright - these last changes migrate to exclusive use of sqlite3. In addition to the unit tests, I see no change in results over my usual 1000-profile integration test case. This is a huge architectural change. I strongly recommend performing an independent test for accuracy and speed before merging - but given the success of those tests, I think this will dramatically clean up and simplify AutoQC's architecture.
Sorry to have been slow to reply to this. I've been having a go with the new version and I think it looks really good. I've run it on a test set of 10000 profiles using 4 processors and it ran quickly. A couple of issues came up, which probably only need small code tweaks, if anything, so I've not checked yet whether the results are exactly the same as the postgres version, but will aim to do that very soon.
Update EN_std_lev_bkg_and_buddy_check.py
correct and protect course correction angle calculation
addressed parallelization issues with sqlite db backing
This looks like a really impressive set of changes! Thanks for tackling this! I'm very happy for this to be merged. A couple of queries are below.

My main question is with the removal of the try/excepts in the CoTeDe QC test routines. I can't remember why I set it up like this, but I wonder if CoTeDe sometimes returns exceptions - what will happen in that situation now?

We also could consider including lines 52-57 from AutoQC.py (https://github.com/IQuOD/AutoQC/blob/master/AutoQC.py#L52) in build-db.py to skip writing profiles to the database if there are no usable data, as is done already in a slightly different way at https://github.com/IQuOD/AutoQC/blob/master/build-db.py#L61. It avoids having entries in the database that don't have QC results attached to them.
@s-good, regarding the try / excepts around the CoTeDe tests, you're right - those tests raise a lot of exceptions, and we put these try / excepts in to avoid erroring out in those cases. The trouble was, that was suppressing error messages that we really should be debugging (someday); now, those same errors are caught by the try/except at the test running step (AutoQC.py, lines 40 to 45 in 6f75133).
I agree with your suggestion for skipping over profiles that have no usable data at the db writing step - that should have been included from the beginning. I'll add that in, run one more regression and unit test pass to make absolutely certain everything is correct, then merge.
Alright, @s-good - the modifications you requested are in place and everything still checks out, so I'm going to go ahead and merge. Thanks for your feedback on this long PR; I think this is a substantial step forward in design, and a number of bugs were caught along the way.
Summary
`bashrc` for setting up the Docker container: smooths out problems running on Travis.

Discussion
Benchmark runs (plots of execution time vs. number of processes):

100 profiles:
- m4.2xlarge (8 CPU / 32 GiB mem)
- c4.4xlarge (16 CPU / 30 GiB mem)
- c4.8xlarge (36 CPU / 60 GiB mem)

1000 profiles:
- c4.8xlarge (36 CPU / 60 GiB mem)
In all cases, execution time drops linearly with the number of processes until the number of CPUs in the instance begins to saturate. Note also that doubling the amount of memory while using the same number of processors (going from c4.4xlarge to c4.8xlarge) didn't change execution times for small numbers of CPUs, suggesting that this behavior is not a memory limitation.
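On the sqlite concurrency concern raised earlier in the thread, two standard settings help when multiple worker processes share one results database: WAL journaling (readers no longer block the writer) and a busy timeout (a writer waits for the lock instead of failing immediately with "database is locked"). A minimal sketch, with a throwaway database path and table invented for illustration:

```python
import os
import sqlite3
import tempfile

# Hypothetical sketch of sqlite settings for concurrent workers.
# WAL mode requires an on-disk database, so use a temporary file here.
path = os.path.join(tempfile.mkdtemp(), "results.db")

conn = sqlite3.connect(path, timeout=30.0)  # wait up to 30 s on a locked db
mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
print(mode)  # wal

conn.execute("CREATE TABLE IF NOT EXISTS qc (uid INTEGER PRIMARY KEY, flag INTEGER)")
conn.execute("INSERT OR REPLACE INTO qc VALUES (?, ?)", (1, 0))
conn.commit()
print(conn.execute("SELECT flag FROM qc WHERE uid = 1").fetchone()[0])  # 0
```

Even with WAL, sqlite still serializes writers, so the observed linear scaling suggests the workload here is compute-bound rather than write-bound.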