
When we run an analysis, what do we want to get back? #3

Open
cgreene opened this issue Jul 19, 2016 · 11 comments
@cgreene
Member

cgreene commented Jul 19, 2016

We need to design our results JSON so that we can later visualize the most important results via the results viewer from the UI team.

@autokad

autokad commented Jul 19, 2016

F1 Score

@autokad

autokad commented Jul 19, 2016

Confusion Matrix

@autokad

autokad commented Jul 19, 2016

Y Hat

@cgreene
Member Author

cgreene commented Jul 19, 2016

prediction scores

@cgreene cgreene added the task label Jul 20, 2016
@yl565
Contributor

yl565 commented Jul 20, 2016

Feature ranking and a list of selected features. For GLM: F-stat/t-stat and p-values of predictors, plus model goodness of fit.

@dhimmel
Member

dhimmel commented Jul 27, 2016

We should probably save the sklearn estimators representing any transformations and the classifier. The sklearn doc recommends pickle for estimator persistence. Pickle is a binary serialization format in Python. @dcgoss, @awm33, and others -- can we store binary files in our database?
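As a sketch, the pickle round-trip for estimator persistence could look like the following (the dict here is a stand-in for a fitted sklearn estimator, since the actual model object isn't specified in this thread):

```python
import pickle

# Stand-in for a fitted sklearn estimator (the real object would be,
# e.g., a trained classifier or transformer pipeline).
estimator = {'coef_': [0.5, -1.2], 'intercept_': 0.1}

# Serialize the object to bytes with pickle (protocol 4 is efficient
# and handles large objects).
blob = pickle.dumps(estimator, protocol=4)

# Later: restore the object from the stored bytes.
restored = pickle.loads(blob)
assert restored == estimator
```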

@dcgoss
Member

dcgoss commented Jul 28, 2016

@dhimmel
Member

dhimmel commented Jul 28, 2016

Python object serialization to base64 encoded text

@dcgoss cool. I think the following solution will work:

```python
import base64
import pickle

payload = ['a', 'list', 2, 'encode']
byte_pickle = pickle.dumps(payload, protocol=4)
base64_text = base64.b64encode(byte_pickle).decode()
# Save `base64_text` using a text field in the database
byte_pickle = base64.b64decode(base64_text.encode())
pickle.loads(byte_pickle)
```

FYI, `base64_text`, which would be saved in the database, is `gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu`.

@awm33
Member

awm33 commented Jul 28, 2016

@dhimmel base64 text is usually fine for small sizes. It can also be stored as text in JSON fields. How big are the binaries? Is `gANdcQAoWAEAAABhcQFYBAAAAGxpc3RxAksCWAYAAABlbmNvZGVxA2Uu` a typical example?

@dhimmel
Member

dhimmel commented Jul 28, 2016

I ran the pickle --> base64 --> text conversion on `best_clf` from the example notebook. The resulting string had 219,788 characters. I assume different types of classifiers will produce different sizes.

If I add an extra compression step, so the entire pipeline becomes:

```python
import base64
import pickle
import zlib

byte_pickle = pickle.dumps(best_clf, protocol=4)
byte_pickle = zlib.compress(byte_pickle)
base64_text = base64.b64encode(byte_pickle).decode()
```

Then base64_text is only 11,468 characters. @awm33, is that okay?
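For completeness, a sketch of the matching decode path, reversing each step (base64 decode, then decompress, then unpickle); the payload here is an illustrative stand-in for `best_clf`:

```python
import base64
import pickle
import zlib

# Encode: pickle -> compress -> base64 text (as in the comment above),
# using a stand-in payload rather than a real fitted classifier.
payload = {'model': 'stand-in', 'coef': [0.1, 0.2]}
base64_text = base64.b64encode(
    zlib.compress(pickle.dumps(payload, protocol=4))
).decode()

# Decode: base64 text -> decompress -> unpickle.
restored = pickle.loads(zlib.decompress(base64.b64decode(base64_text.encode())))
assert restored == payload
```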

@awm33
Member

awm33 commented Jul 28, 2016

@dhimmel Compressing is a good move. If we think this would go into the tens of megabytes or more, we may want to consider using blob storage such as S3 or GCS. Postgres can handle gigabytes of text, but it's not great for performance.
