Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machine learning strategy (was 'Combinatorics intelligence') #60

Open
bkatiemills opened this issue May 22, 2015 · 4 comments
Open

Machine learning strategy (was 'Combinatorics intelligence') #60

bkatiemills opened this issue May 22, 2015 · 4 comments

Comments

@bkatiemills
Copy link
Member

The brute-force combinatorics examinations implemented in #38 are acceptable for small numbers of tests, but compute time will diverge very badly as the number of tests grows. We need a stronger strategy.

The numbers reported in #59 make the individual tests seem much too permissive on their own; one idea could be to look for tests that flag disjoint sets of profiles, and OR them all together.

@bkatiemills
Copy link
Member Author

After looking into machine learning strategies in greater depth, this paper by Sadohara suggests that a support vector machine (SVM) might be the best choice for this problem. Sadohara's paper examines learning boolean functions (which is exactly what we have - whether a profile passed or failed each individual test are our boolean inputs, and whether the profile should ultimately pass or fail is the boolean output) by examining all possible conjunctions of boolean inputs (which is exactly what we were naively doing in #38, albeit in a much cruder fashion). Sadohara presents two SVM implementations that:

  • produce a lower misclassification rate than other leading techniques for the same amount of training data.
  • remain computationally efficient even when considering a large number of inputs (QC tests).

Furthermore, scikit-learn supports SVM with custom kernels, meaning this ought to be not very difficult to implement.

@bkatiemills bkatiemills changed the title Combinatorics intelligence Machine learning strategy (was 'Combinatorics intelligence') Sep 8, 2015
@bkatiemills
Copy link
Member Author

cc @s-good @BecCowley @BoyerWOD

#101 presents a detailed report on a first pass at using machine learning techniques to perform final classification. Executive summary:

  • the most effective techniques correctly flag about 55% of profiles in the full quota dataset that ought to be flagged.
  • corresponding false positive rates on datasets that should not be flagged are around 10%
  • execution time for the full quota dataset on a low-end machine are under a minute.

These results are preliminary and much more remains to be done to optimize them; comments and suggestions very welcome.

@bkatiemills
Copy link
Member Author

One thing that comes to mind that will produce systematic biases in any final decision strategy is the nature of flagged profiles in the dataset trained on; for example, the machine learning techniques explored above offer only a small improvement over just using EN_background; if the quota sample overrepresents profiles flagged by one test (like EN_background), this will bias any strategy.

Put another way, absolutely none of quota fails the impossible_location test; we therefore can't learn anything about its performance from this dataset.

Is there a way to know exactly why a given profile in the quota dataset has been marked as bad?

@s-good
Copy link
Contributor

s-good commented Nov 4, 2015

Hi Bill, the information about why a profile has been rejected is available but probably not in the ASCII file that you have. There are some other data available that we can run on to try to uncover biases from the training dataset. It also might be useful to split up the Quota dataset and see if different parts give similar answers. We should include all these things in the discussions at the upcoming workshop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants