Machine learning strategy (was 'Combinatorics intelligence') #60
After looking into machine learning strategies in greater depth, this paper by Sadohara suggests that a support vector machine (SVM) might be the best choice for this problem. Sadohara's paper examines learning boolean functions (which is exactly what we have - whether a profile passed or failed each individual test are our boolean inputs, and whether the profile should ultimately pass or fail is the boolean output) by examining all possible conjunctions of boolean inputs (which is exactly what we were naively doing in #38, albeit in a much cruder fashion). Sadohara presents two SVM implementations that:
Furthermore, scikit-learn supports SVM with custom kernels, meaning this ought not to be very difficult to implement.
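As a rough illustration of how a custom boolean-conjunction kernel could be wired into scikit-learn, here is a minimal sketch. The kernel form `K(x, y) = 2^<x, y> - 1` follows the monotone-DNF idea from Sadohara's paper (it counts the non-empty conjunctions of inputs that are true in both vectors); the toy data and names are made up, not the actual quota dataset.

```python
import numpy as np
from sklearn.svm import SVC

def dnf_kernel(X, Y):
    # Monotone-DNF-style kernel for boolean vectors:
    # K(x, y) = 2^<x, y> - 1 counts the non-empty conjunctions
    # of inputs that are true (fail flags set) in both x and y.
    return np.power(2.0, X @ Y.T) - 1.0

# Toy data: rows are profiles, columns are pass(0)/fail(1) flags per test.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
y = np.array([1, 0, 1, 0])  # final accept/reject decision

clf = SVC(kernel=dnf_kernel).fit(X, y)
preds = clf.predict(X)
```

SVC accepts any callable that maps two sample matrices to a Gram matrix, so swapping in a different conjunction kernel later is a one-line change.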
cc @s-good @BecCowley @BoyerWOD

#101 presents a detailed report on a first pass at using machine learning techniques to perform final classification. Executive summary:
These results are preliminary and much more remains to be done to optimize them; comments and suggestions very welcome.
One thing that comes to mind that will produce systematic biases in any final decision strategy is the nature of the flagged profiles in the dataset trained on; for example, the machine learning techniques explored above offer only a small improvement over just using [...]. Put another way, absolutely none of the quota fails the [...]. Is there a way to know exactly why a given profile in the quota dataset has been marked as bad?
Hi Bill, the information about why a profile has been rejected is available, but probably not in the ASCII file that you have. There are some other data available that we can run checks on to try to uncover biases in the training dataset. It also might be useful to split up the Quota dataset and see whether different parts give similar answers. We should include all these things in the discussions at the upcoming workshop.
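One simple way to check whether different parts of the quota dataset give similar answers is a k-fold split: train on each fold's complement, score on the held-out part, and compare the per-fold scores. The sketch below uses synthetic pass/fail flags as a stand-in for the real data; everything here (the data, the linear kernel, the fold count) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in for the quota dataset: rows are profiles,
# columns are boolean per-test results, y is the reference decision.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
y = np.logical_or(X[:, 0], X[:, 1]).astype(int)

# Score each held-out fold separately; widely varying fold scores
# would hint at heterogeneity (or biased flagging) in the dataset.
scores = []
for train, test in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel="linear").fit(X[train], y[train])
    scores.append(clf.score(X[test], y[test]))
```

If the rejection-reason metadata becomes available, the same loop could split by rejection reason instead of at random, which would probe the bias question more directly.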
The brute-force combinatorics examinations implemented in #38 are acceptable for small numbers of tests, but compute time grows exponentially with the number of tests: there are 2^n - 1 non-empty conjunctions of n tests to examine. We need a stronger strategy.
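To make the blow-up concrete, this sketch counts the non-empty AND-combinations of n tests by enumerating them the way a brute-force search would (the helper name is mine, not from #38):

```python
from itertools import combinations

def n_conjunctions(n_tests):
    # Count the non-empty AND-combinations of n tests by
    # enumerating every subset, as a brute-force search must.
    return sum(1 for k in range(1, n_tests + 1)
                 for _ in combinations(range(n_tests), k))

for n in (5, 10, 20):
    print(n, n_conjunctions(n))  # 2^n - 1: 31, 1023, 1048575
```

At 20 tests the search space already exceeds a million conjunctions, and each added test doubles it.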
The numbers reported in #59 make the individual tests seem much too permissive on their own; one idea could be to look for tests that flag disjoint sets of profiles and OR them all together.
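The disjoint-sets idea could be prototyped with a simple greedy pass over the per-test flag sets; everything below (function name, the example flag sets, the largest-first ordering) is a hypothetical sketch, not an implemented strategy:

```python
def pick_disjoint_tests(flag_sets):
    """Greedily select tests whose flagged-profile sets are pairwise
    disjoint; OR-ing the chosen tests then flags their union."""
    chosen, covered = [], set()
    for name, flagged in sorted(flag_sets.items(),
                                key=lambda kv: -len(kv[1])):
        if covered.isdisjoint(flagged):
            chosen.append(name)
            covered |= flagged
    return chosen, covered

# Hypothetical per-test flag sets (profile IDs each test flags).
flags = {"spike": {1, 5}, "gradient": {2, 3},
         "range": {1, 2}, "climatology": {7}}
chosen, covered = pick_disjoint_tests(flags)
```

Here "range" is skipped because its flags overlap tests already chosen, so the OR of the chosen tests never double-counts a profile. A greedy pass is not guaranteed optimal, but it is cheap enough to run as a baseline against the SVM results.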