Take into account the non completness of the data for category prediction training #47

alexgarel · 2022-05-03T08:35:01Z

Brushing my teeth this morning, I thought about what I see as an important issue:
We have a lot of data which does not have all the classes it should have.
Say I have a food which should be categorized as "13% red wine", but the category in the dataset is only "beverages".

This is a problem when we train the model: if the model guesses "red wine", we will tell it this is wrong (while it is actually right).
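To make the penalty concrete, here is a small worked example. The probabilities and the two-category setup are made up for illustration; it only shows how a binary cross-entropy term behaves when a missing category is treated as false versus when it is skipped.

```python
import math

def binary_cross_entropy(target, prob):
    """Binary cross-entropy for a single label (natural log)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

# Hypothetical product: it really is a red wine, but the dataset only says "beverages".
# The model is right and confident about both categories.
predicted = {"beverages": 0.95, "red wine": 0.90}

# Naive targets: every category missing from the dataset is treated as false.
naive_targets = {"beverages": 1, "red wine": 0}
naive_loss = sum(
    binary_cross_entropy(naive_targets[c], p) for c, p in predicted.items()
)

# Targets that admit "red wine" is unknown: that category is simply skipped.
known_targets = {"beverages": 1}
masked_loss = sum(
    binary_cross_entropy(t, predicted[c]) for c, t in known_targets.items()
)

print(f"naive loss:  {naive_loss:.3f}")
print(f"masked loss: {masked_loss:.3f}")
```

Almost all of the naive loss comes from the "red wine" term, that is, from the model being punished for a correct guess.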

Unfortunately for us, "agribalyse" categories are quite deep categories…

How do we tackle this issue?

alexgarel · 2022-05-03T08:41:32Z

My personal thoughts so far. I see three paths:

(1) only train on data with deep-enough categories (and maybe in this case we should take everything we can from the database, not a random selection)
(2) pre-train on higher-level categories only (more data), then fine-tune on deep categories (with less data)
(3) integrate the lack of knowledge into the model and training method. That is, if we only know about high-level categories for an example, do not train, for this example, on finer categories

(2) might be the easiest to implement without losing too much data (two rounds of training, on different data) and might be an efficient way to train a multi-layer network.
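As a rough sketch of the data preparation for two rounds of training: the toy taxonomy, the depth thresholds and the `split_for_two_rounds` helper below are all hypothetical, and the real taxonomy is a DAG rather than a simple tree. The first round keeps every example but only its high-level categories; the second round keeps only the examples that have at least one deep category.

```python
# Hypothetical toy taxonomy: child -> parent (None marks a root).
PARENT = {
    "beverages": None,
    "alcoholic beverages": "beverages",
    "wines": "alcoholic beverages",
    "red wines": "wines",
}

def depth(category):
    """Number of ancestors above a category (roots have depth 0)."""
    d = 0
    while PARENT[category] is not None:
        category = PARENT[category]
        d += 1
    return d

def split_for_two_rounds(examples, max_coarse_depth=1, min_deep_depth=3):
    """Return (pretrain, finetune) lists of (text, categories) pairs."""
    pretrain, finetune = [], []
    for text, categories in examples:
        # Round 1: keep only the high-level categories of every example.
        coarse = [c for c in categories if depth(c) <= max_coarse_depth]
        if coarse:
            pretrain.append((text, coarse))
        # Round 2: keep the example, with all its categories, if it goes deep enough.
        if any(depth(c) >= min_deep_depth for c in categories):
            finetune.append((text, categories))
    return pretrain, finetune

examples = [
    ("Bordeaux 2018", ["beverages", "alcoholic beverages", "wines", "red wines"]),
    ("Some drink", ["beverages"]),
]
pretrain, finetune = split_for_two_rounds(examples)
print(len(pretrain), "examples for pre-training,", len(finetune), "for fine-tuning")
```

The second-round selection is also what path (1) would train on if used alone.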

(3) might require finding a specific TensorFlow component or writing the implementation ourselves.

However, if someone could do some bibliography / investigation to find a proper lib, that would be great :-)

alexgarel · 2022-05-03T12:50:07Z

Just to stress one aspect: we will get a good accuracy at training time, because the model will learn the higher-level classes, but this won't help us categorize as we wanted.

To measure that, we can look at the number of agribalyse categories predicted (but remember we have them on only a few items).
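One possible way to turn this into numbers, as a sketch only: the two helper functions and the category names below are hypothetical, and predictions are assumed to be already thresholded into sets of category names. The first metric needs no ground truth; the second only uses the few items where an agribalyse category is known.

```python
def agribalyse_coverage(predicted, agribalyse_categories):
    """Share of items for which at least one agribalyse category is predicted."""
    hits = sum(1 for cats in predicted if cats & agribalyse_categories)
    return hits / len(predicted) if predicted else 0.0

def agribalyse_recall(predicted, truth, agribalyse_categories):
    """Recall of agribalyse categories, counted only where one is known."""
    found = total = 0
    for pred_cats, true_cats in zip(predicted, truth):
        expected = true_cats & agribalyse_categories
        total += len(expected)
        found += len(expected & pred_cats)
    # NaN signals that no item had a known agribalyse category.
    return found / total if total else float("nan")

# Hypothetical data: one set of category names per item.
agribalyse = {"red wines", "orange juices"}
predicted = [{"beverages", "red wines"}, {"beverages"}, {"beverages", "orange juices"}]
truth = [{"beverages", "red wines"}, {"beverages"}, {"beverages"}]

coverage = agribalyse_coverage(predicted, agribalyse)
recall = agribalyse_recall(predicted, truth, agribalyse)
print(f"coverage: {coverage:.2f}, recall on known items: {recall:.2f}")
```

A model that only learned the higher classes would score well on plain accuracy while both of these numbers stay close to zero.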

alexgarel · 2022-06-07T13:10:27Z

I began something in the ag-cat-loss branch.

We have a clear rule for compatible categories and I am able to add them as features.
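The rule itself is not spelled out here, so the following is only a guess at what such a rule could look like, not the code from the branch. It assumes that a category is "compatible" with an example when it is a descendant of one of the deepest categories known for that example, and it encodes the result with the 0 / 1 / 2 (false / true / unknown) convention. The taxonomy and function names are hypothetical.

```python
FALSE, TRUE, UNKNOWN = 0, 1, 2

# Hypothetical taxonomy as a DAG: child -> list of parents.
PARENTS = {
    "beverages": [],
    "wines": ["beverages"],
    "red wines": ["wines"],
    "white wines": ["wines"],
    "snacks": [],
}

def ancestors(category):
    """All categories reachable by following parent links."""
    seen, stack = set(), list(PARENTS[category])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen

def tri_state_labels(known_categories):
    """Map every category to TRUE, FALSE or UNKNOWN for one example."""
    known = set(known_categories)
    for category in list(known):
        known |= ancestors(category)  # a known category implies its ancestors
    # Deepest known categories: those that are not an ancestor of another known one.
    implied = set()
    for category in known:
        implied |= ancestors(category)
    deepest = known - implied
    labels = {}
    for category in PARENTS:
        if category in known:
            labels[category] = TRUE
        elif ancestors(category) & deepest:
            labels[category] = UNKNOWN  # could be true, the data does not say
        else:
            labels[category] = FALSE
    return labels

print(tri_state_labels(["beverages"]))
```

With only "beverages" known, every kind of wine becomes unknown while "snacks" stays false, which is the behaviour wanted for the red wine example at the top of this issue.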

I now have to use that in the loss.

From this interesting exchange in the Data For Good channel #needs_data-science, Gabriel Olympie wrote:

import numpy as np
import tensorflow as tf

# Labels: 0 if the label is false, 1 if true, 2 if unknown.
# Predictions are sigmoid outputs between 0 and 1.
# Both tensors have shape (batch_size, number of nodes in the DAG).
loss_object = tf.keras.losses.BinaryCrossentropy()

def masked_loss(true, pred):
    # Keep only the entries whose label is known (not 2).
    # tf.boolean_mask returns the selected entries as a flat 1-D tensor.
    known = tf.not_equal(true, 2)
    true = tf.cast(tf.boolean_mask(true, known), tf.float32)
    pred = tf.cast(tf.boolean_mask(pred, known), tf.float32)
    return loss_object(true, pred)

true = np.random.randint(0, 3, size=(32, 10))
pred = np.random.uniform(0, 1, size=(32, 10))
loss = masked_loss(true, pred)
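For checking a masked loss like the one above by hand, a NumPy reference version can be useful. This is a sketch, not part of the original exchange: the function name, the `eps` clipping value and the choice to return 0.0 when every label in the batch is unknown are assumptions. Keras applies its own clipping, so the two results should be close rather than bit-identical.

```python
import numpy as np

UNKNOWN = 2  # same convention as above: 0 = false, 1 = true, 2 = unknown

def masked_bce_numpy(true, pred, eps=1e-7):
    """Mean binary cross-entropy over the entries whose label is known."""
    true = np.asarray(true)
    # Clip predictions so that log(0) never happens.
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    known = true != UNKNOWN
    if not known.any():
        # Assumption: a batch with no known label contributes no loss.
        return 0.0
    t = true[known].astype(np.float64)
    p = pred[known]
    return float(np.mean(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))))

true = np.array([[1, 0, 2], [2, 2, 1]])
pred = np.array([[0.9, 0.2, 0.5], [0.3, 0.7, 0.6]])
loss = masked_bce_numpy(true, pred)

# Changing predictions at unknown positions must not change the loss.
pred_changed = pred.copy()
pred_changed[true == UNKNOWN] = 0.01
loss_changed = masked_bce_numpy(true, pred_changed)
print(loss, loss_changed)
```

The second call is a cheap sanity check that the mask really removes the unknown entries from the computation.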

teolemon changed the title ~~Take into account the non completness of the data~~ Take into account the non completness of the data for category prediction training May 3, 2022
