Take into account the non completness of the data for category prediction training #47

Open
alexgarel opened this issue May 3, 2022 · 3 comments

Comments

@alexgarel
Member

alexgarel commented May 3, 2022

While brushing my teeth this morning, I thought about what I see as an important issue:
We have a lot of data which does not have all the classes it should have.
Say I have a food which should be categorized as 13% red wine, but its category in the dataset is only beverages.

This is a problem at training time: if the model guesses "red wine", we will tell it that this is wrong (while it is actually right).
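
To make the problem concrete, here is a minimal sketch with made-up numbers. The two-category label vector and the predicted probabilities are assumptions for illustration, not values from the dataset. It compares the binary cross-entropy the model receives with the incomplete dataset labels against what it would receive with complete labels.

```python
import math

def binary_cross_entropy(label, prob):
    """Binary cross-entropy for a single sigmoid output."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Hypothetical product: really a red wine, but only tagged "beverages".
categories = ["beverages", "red wine"]
dataset_labels = [1, 0]    # "red wine" is missing, so it is encoded as false
complete_labels = [1, 1]   # what a complete annotation would say
prediction = [0.95, 0.90]  # the model correctly guesses "red wine"

loss_with_dataset = sum(
    binary_cross_entropy(l, p) for l, p in zip(dataset_labels, prediction)
)
loss_with_complete = sum(
    binary_cross_entropy(l, p) for l, p in zip(complete_labels, prediction)
)
print(loss_with_dataset, loss_with_complete)
```

With the incomplete labels, the correct "red wine" guess dominates the loss, so the gradient pushes the model away from the right answer.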

Unfortunately for us, "agribalyse" categories are quite deep categories…

How do we tackle this issue?

@alexgarel
Member Author

My personal thoughts so far: I see three paths:

  1. only train on data with deep-enough categories (and maybe in this case we should take everything we can from the database, not a random selection)
  2. pre-train on higher-level categories only (more data), then fine-tune on deep categories (with less data)
  3. integrate the lack of knowledge into the model and training method. That is, if we only know the high-level categories of an example, do not train on finer categories for this example
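
As a rough illustration of path (1), here is a sketch of filtering products by category depth. The `PARENT` map, the product list and the `MIN_DEPTH` threshold are all hypothetical, and the real taxonomy is a DAG where a category can have several parents, so treat this as a simplification.

```python
# Hypothetical child -> parent map for a tiny slice of the taxonomy.
PARENT = {
    "beverages": None,
    "alcoholic beverages": "beverages",
    "wines": "alcoholic beverages",
    "red wines": "wines",
}

def depth(category):
    """Number of ancestors above a category (root categories have depth 0)."""
    d = 0
    while PARENT[category] is not None:
        category = PARENT[category]
        d += 1
    return d

def deep_enough(product_categories, min_depth):
    """True if at least one category of the product reaches min_depth."""
    return any(depth(c) >= min_depth for c in product_categories)

# Hypothetical products with the categories found in the dataset.
products = {
    "wine A": ["beverages", "alcoholic beverages", "wines", "red wines"],
    "wine B": ["beverages"],
}
MIN_DEPTH = 3  # assumed threshold for "deep enough"
training_set = [
    name for name, cats in products.items() if deep_enough(cats, MIN_DEPTH)
]
print(training_set)
```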

(2) might be the easiest to implement without losing too much data (two rounds of training on different data) and might be an efficient way to train a multi-layer network
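
A sketch of how the data for the two rounds of (2) could be prepared. The examples, the `TOP_LEVEL` mapping and the `MIN_DEPTH` threshold are hypothetical. Round one keeps every example but coarsens its label, round two keeps only the examples with a deep category.

```python
# Hypothetical examples: (product name, deepest known category, its depth).
examples = [
    ("wine A", "red wines", 3),
    ("wine B", "beverages", 0),
    ("juice C", "fruit juices", 2),
]

# Hypothetical mapping from each category to its top-level ancestor.
TOP_LEVEL = {
    "red wines": "beverages",
    "fruit juices": "beverages",
    "beverages": "beverages",
}
MIN_DEPTH = 2  # assumed threshold for "deep" categories

# Round 1 (pre-training): all examples, labels coarsened to the top level.
pretrain_set = [(name, TOP_LEVEL[cat]) for name, cat, _ in examples]

# Round 2 (fine-tuning): only examples with a deep category, fine labels.
finetune_set = [(name, cat) for name, cat, d in examples if d >= MIN_DEPTH]

print(len(pretrain_set), len(finetune_set))
```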

(3) might require finding a specific TensorFlow component or writing the implementation ourselves.

However, if someone could do some literature review / investigation to find a suitable library, that would be great :-)

@alexgarel
Member Author

Just to stress one aspect: we will get good accuracy at training time, because the model will learn the higher-level classes, but this won't help us categorize the way we want.

To measure that, we can look at the number of agribalyse categories predicted (but remember we only have them on a few items)
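
One possible way to compute such a measure, as a sketch only: the `AGRIBALYSE` set and the product data are hypothetical. It counts how many products get at least one agribalyse-level prediction, and computes recall on the few products whose agribalyse category is known.

```python
# Hypothetical set of agribalyse-level (deep) categories.
AGRIBALYSE = {"red wines", "fruit juices"}

def agribalyse_stats(predicted, known):
    """predicted / known: dicts mapping product name -> set of categories."""
    # Products that get at least one agribalyse-level prediction.
    with_prediction = sum(1 for cats in predicted.values() if cats & AGRIBALYSE)
    # Restrict to the few products whose agribalyse category is known.
    labelled = {
        p: cats & AGRIBALYSE for p, cats in known.items() if cats & AGRIBALYSE
    }
    # A hit means every known agribalyse category of the product is predicted.
    hits = sum(
        1 for p, cats in labelled.items() if cats <= predicted.get(p, set())
    )
    recall = hits / len(labelled) if labelled else float("nan")
    return with_prediction, recall

predicted = {
    "wine A": {"beverages", "red wines"},
    "wine B": {"beverages"},
    "juice C": {"beverages"},
}
known = {
    "wine A": {"beverages", "red wines"},
    "wine B": {"beverages"},
    "juice C": {"beverages", "fruit juices"},
}
stats = agribalyse_stats(predicted, known)
print(stats)
```

Here the model would score well on "beverages" for every product, yet only one of the two products with a known agribalyse category is recovered.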

@teolemon changed the title from "Take into account the non completness of the data" to "Take into account the non completness of the data for category prediction training" on May 3, 2022
@alexgarel
Member Author

I started something in the ag-cat-loss branch.

We have a clear rule for compatible categories and I am able to add them as features.

I now have to use that in the loss.

From this interesting exchange in the data for good channel #needs_data-science, Gabriel Olympie wrote:

import numpy as np
import tensorflow as tf

# Labels are 0 if false, 1 if true, 2 if unknown.
# Predictions are sigmoid outputs between 0 and 1.
# Both tensors have shape (batch_size, number of nodes in the DAG).
loss_object = tf.keras.losses.BinaryCrossentropy()

def masked_loss(true, pred):
    pred = tf.cast(pred, tf.float32)
    # Keep only the entries whose label is known (0 or 1).
    known = tf.not_equal(true, 2)
    # boolean_mask returns the kept entries as a flat 1-D tensor.
    true = tf.cast(tf.boolean_mask(true, known), tf.float32)
    pred = tf.boolean_mask(pred, known)
    # Trailing axis: each known entry counts as one sample in the mean.
    return loss_object(true[:, tf.newaxis], pred[:, tf.newaxis])

# Random example: 32 products, 10 category nodes.
true = np.random.randint(0, 3, size=(32, 10))
pred = np.random.uniform(0, 1, size=(32, 10))
loss = masked_loss(true, pred)
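
The snippet above assumes labels already encoded as 0 / 1 / 2. Here is a sketch of how such labels might be derived from the taxonomy. The thread does not spell out the compatibility rule, so the rule used here is an assumption: known categories and their ancestors are true, descendants of the deepest known categories are unknown, everything else is false. The `PARENTS` map is hypothetical.

```python
# Hypothetical child -> parents map (a DAG: several parents are allowed).
PARENTS = {
    "beverages": [],
    "alcoholic beverages": ["beverages"],
    "wines": ["alcoholic beverages"],
    "red wines": ["wines"],
    "sodas": ["beverages"],
}
NODES = list(PARENTS)

def ancestors(category):
    """All categories above `category` in the DAG."""
    found = set()
    stack = list(PARENTS[category])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS[node])
    return found

def encode(known_categories):
    """One label per node: 1 = true, 2 = unknown, 0 = false (assumed rule)."""
    positives = set(known_categories)
    for cat in known_categories:
        positives |= ancestors(cat)
    # Deepest known categories: positives with no positive descendant.
    deepest = {c for c in positives if not any(c in ancestors(p) for p in positives)}
    labels = []
    for node in NODES:
        if node in positives:
            labels.append(1)
        elif ancestors(node) & deepest:
            labels.append(2)  # finer than what we know: do not train on it
        else:
            labels.append(0)
    return labels

print(encode(["beverages"]))
print(encode(["wines"]))
```

With this rule a product only tagged "beverages" never receives a negative signal for "red wines", while a product tagged "wines" still counts as a negative example for "sodas".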
