Machine learning strategy (was 'Combinatorics intelligence') #60
After looking into machine learning strategies in greater depth, this paper by Sadohara suggests that a support vector machine (SVM) might be the best choice for this problem. Sadohara's paper examines learning boolean functions (which is exactly what we have - whether a profile passed or failed each individual test are our boolean inputs, and whether the profile should ultimately pass or fail is the boolean output) by examining all possible conjunctions of boolean inputs (which is exactly what we were naively doing in #38, albeit in a much cruder fashion). Sadohara presents two SVM implementations that:
Furthermore, scikit-learn supports SVM with custom kernels, meaning this ought not to be very difficult to implement.
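As a rough illustration of how a custom boolean-conjunction kernel could be wired into scikit-learn, here is a minimal sketch. The kernel form `K(x, y) = 2^<x, y> - 1` follows the monotone-DNF idea from Sadohara's paper (it counts the non-empty conjunctions of inputs that are true in both vectors); the toy data and names are made up, not the actual quota dataset.

```python
import numpy as np
from sklearn.svm import SVC

def dnf_kernel(X, Y):
    # Monotone-DNF-style kernel for boolean vectors:
    # K(x, y) = 2^<x, y> - 1 counts the non-empty conjunctions
    # of inputs that are true (fail flags set) in both x and y.
    return np.power(2.0, X @ Y.T) - 1.0

# Toy data: rows are profiles, columns are pass(0)/fail(1) flags per test.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
y = np.array([1, 0, 1, 0])  # final accept/reject decision

clf = SVC(kernel=dnf_kernel).fit(X, y)
preds = clf.predict(X)
```

SVC accepts any callable that maps two sample matrices to a Gram matrix, so swapping in a different conjunction kernel later is a one-line change.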
cc @s-good @BecCowley @BoyerWOD

#101 presents a detailed report on a first pass at using machine learning techniques to perform final classification. Executive summary:
These results are preliminary and much more remains to be done to optimize them; comments and suggestions very welcome.
One thing that comes to mind that will produce systematic biases in any final decision strategy is the nature of the flagged profiles in the dataset trained on; for example, the machine learning techniques explored above offer only a small improvement over just using [...]. Put another way, absolutely none of the quota fails the [...]. Is there a way to know exactly why a given profile in the quota dataset has been marked as bad?
Hi Bill, the information about why a profile has been rejected is available, but probably not in the ASCII file that you have. There are some other data available that we can run checks on to try to uncover biases in the training dataset. It also might be useful to split up the Quota dataset and see whether different parts give similar answers. We should include all these things in the discussions at the upcoming workshop.
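One simple way to check whether different parts of the quota dataset give similar answers is a k-fold split: train on each fold's complement, score on the held-out part, and compare the per-fold scores. The sketch below uses synthetic pass/fail flags as a stand-in for the real data; everything here (the data, the linear kernel, the fold count) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in for the quota dataset: rows are profiles,
# columns are boolean per-test results, y is the reference decision.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
y = np.logical_or(X[:, 0], X[:, 1]).astype(int)

# Score each held-out fold separately; widely varying fold scores
# would hint at heterogeneity (or biased flagging) in the dataset.
scores = []
for train, test in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel="linear").fit(X[train], y[train])
    scores.append(clf.score(X[test], y[test]))
```

If the rejection-reason metadata becomes available, the same loop could split by rejection reason instead of at random, which would probe the bias question more directly.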
The brute-force combinatorics examinations implemented in #38 are acceptable for small numbers of tests, but compute time grows exponentially with the number of tests: there are 2^n - 1 non-empty conjunctions of n tests to examine. We need a stronger strategy.
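To make the blow-up concrete, this sketch counts the non-empty AND-combinations of n tests by enumerating them the way a brute-force search would (the helper name is mine, not from #38):

```python
from itertools import combinations

def n_conjunctions(n_tests):
    # Count the non-empty AND-combinations of n tests by
    # enumerating every subset, as a brute-force search must.
    return sum(1 for k in range(1, n_tests + 1)
                 for _ in combinations(range(n_tests), k))

for n in (5, 10, 20):
    print(n, n_conjunctions(n))  # 2^n - 1: 31, 1023, 1048575
```

At 20 tests the search space already exceeds a million conjunctions, and each added test doubles it.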
The numbers reported in #59 make the individual tests seem much too permissive on their own; one idea could be to look for tests that flag disjoint sets of profiles and OR them all together.
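The disjoint-sets idea could be prototyped with a simple greedy pass over the per-test flag sets; everything below (function name, the example flag sets, the largest-first ordering) is a hypothetical sketch, not an implemented strategy:

```python
def pick_disjoint_tests(flag_sets):
    """Greedily select tests whose flagged-profile sets are pairwise
    disjoint; OR-ing the chosen tests then flags their union."""
    chosen, covered = [], set()
    for name, flagged in sorted(flag_sets.items(),
                                key=lambda kv: -len(kv[1])):
        if covered.isdisjoint(flagged):
            chosen.append(name)
            covered |= flagged
    return chosen, covered

# Hypothetical per-test flag sets (profile IDs each test flags).
flags = {"spike": {1, 5}, "gradient": {2, 3},
         "range": {1, 2}, "climatology": {7}}
chosen, covered = pick_disjoint_tests(flags)
```

Here "range" is skipped because its flags overlap tests already chosen, so the OR of the chosen tests never double-counts a profile. A greedy pass is not guaranteed optimal, but it is cheap enough to run as a baseline against the SVM results.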