Take into account the non completness of the data for category prediction training #47
My personal thought so far: I see three paths:
(2) might be the easiest to implement without losing too much data (two rounds of training, not on the same data) and might be efficient for training a multi-layer network. (3) might require finding a specific TensorFlow component or writing the implementation ourselves. However, if someone does some bibliography / investigation to find a proper lib, that would be great :-)
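Option (2) could be sketched as a shared trunk trained twice: first on the plentiful coarse labels, then with a fresh head on the items that do have deep labels. This is only an illustration under assumed sizes (category counts, layer widths, random data); it is not the actual project model:

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes: 20 coarse categories, 100 deep (agribalyse-like) categories.
N_COARSE, N_FINE, N_FEATURES = 20, 100, 64

# Shared trunk: its weights are learned in round 1 and reused in round 2.
trunk = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
])

# Round 1: trunk + coarse head, trained on the full coarse-labelled dataset.
coarse_model = tf.keras.Sequential(
    [trunk, tf.keras.layers.Dense(N_COARSE, activation="softmax")])
coarse_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
x_coarse = np.random.rand(256, N_FEATURES).astype("float32")
y_coarse = np.random.randint(0, N_COARSE, size=256)
coarse_model.fit(x_coarse, y_coarse, epochs=1, verbose=0)

# Round 2: new deep head on the same trunk, trained only on deeply-labelled items.
fine_model = tf.keras.Sequential(
    [trunk, tf.keras.layers.Dense(N_FINE, activation="softmax")])
fine_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
x_fine = np.random.rand(64, N_FEATURES).astype("float32")
y_fine = np.random.randint(0, N_FINE, size=64)
fine_model.fit(x_fine, y_fine, epochs=1, verbose=0)
```

The two datasets are disjoint in labels but the trunk carries what was learned on the coarse round into the deep round.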
Just to stress one aspect: we will get good accuracy at training time, because the model will learn the higher classes, but this won't help us categorize as we wanted. To measure that, we can look at the number of agribalyse categories predicted (but remember we only have them on a few items).
I began something in the ag-cat-loss branch. We have a clear rule for compatible categories and I am able to add them as features. I now have to use that in the loss. From an interesting exchange in the data for good channel #needs_data-science, Gabriel Olympie wrote (with `tf.gather` corrected to `tf.gather_nd`, since `tf.where` on a 2-D tensor returns 2-D coordinates):

```python
import tensorflow as tf
import numpy as np

## Assume the label is 0 if false, 1 if true, 2 if unknown.
## Predictions are sigmoided values between 0 and 1.
## Tensors have shape (batch_size, number of nodes in the DAG).
loss_object = tf.keras.losses.BinaryCrossentropy()

def masked_loss(true, pred):
    # Keep only the entries whose label is known (i.e. not 2).
    valid_indices = tf.where(tf.math.logical_not(tf.math.equal(true, 2)))
    true = tf.gather_nd(true, valid_indices)
    pred = tf.gather_nd(pred, valid_indices)
    return loss_object(tf.cast(true, pred.dtype), pred)

true = np.random.randint(0, 3, size=(32, 10))
pred = np.random.uniform(0, 1, size=(32, 10))
loss = masked_loss(true, pred)
```
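The same masking can be written with `tf.boolean_mask`, which avoids the index bookkeeping entirely; this is just an equivalent sketch of that snippet's idea, with a tiny hand-made example so the result can be checked by eye:

```python
import numpy as np
import tensorflow as tf

# Same convention as above: 0 = false, 1 = true, 2 = unknown.
def masked_loss_boolean(true, pred):
    mask = tf.not_equal(true, 2)               # keep only labelled nodes
    true_valid = tf.boolean_mask(true, mask)   # flattens to a 1-D tensor
    pred_valid = tf.boolean_mask(pred, mask)
    return tf.keras.losses.binary_crossentropy(
        tf.cast(true_valid, pred_valid.dtype), pred_valid)

true = np.array([[1, 2, 0], [2, 1, 1]])
pred = np.array([[0.9, 0.5, 0.1], [0.5, 0.8, 0.7]])
# Kept pairs: (1, 0.9), (0, 0.1), (1, 0.8), (1, 0.7); the 0.5s are masked out.
loss = masked_loss_boolean(true, pred)
```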
Brushing my teeth this morning, I thought about what I see as an important issue:
A lot of our data does not carry all the classes it should have.
Say I have a food which should be categorized as 13% red wine, but the category in the dataset is only beverages.
As we train the model, this is a problem: if the model guesses "red wine", we will tell it this is wrong (while it is actually right).
Unfortunately for us, "agribalyse" categories are quite deep categories…
How do we tackle this issue?
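One way to make this concrete is to expand a partial label into per-node 0/1/2 labels over the taxonomy, matching the convention used in the masked-loss snippet: ancestors of the known category (and the category itself) are true, deeper descendants are unknown, and other branches are false. The parent map below is a hypothetical toy slice, not the real taxonomy:

```python
# Hypothetical toy parent map (None marks a root).
PARENT = {
    "red wine": "wine",
    "wine": "beverages",
    "beer": "beverages",
    "beverages": None,
}

def is_descendant(node, ancestor):
    """True if `ancestor` lies on the path from `node` up to the root."""
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT[node]
    return False

def node_labels(known_category, nodes):
    """0 = false, 1 = true, 2 = unknown, per the masked-loss convention."""
    # The known category and all of its ancestors are certainly true.
    positives = set()
    node = known_category
    while node is not None:
        positives.add(node)
        node = PARENT[node]
    labels = {}
    for n in nodes:
        if n in positives:
            labels[n] = 1
        elif is_descendant(n, known_category):
            labels[n] = 2  # deeper than what we know: could still be true
        else:
            labels[n] = 0  # a different branch: safe to call false
    return labels
```

With the red-wine example: an item labelled only "beverages" gets "red wine" marked unknown rather than false, so the masked loss never punishes the model for guessing it.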