ingredients_text is truncated in mongodb dump #7244

alexgarel · 2022-08-24T12:28:51Z

Describe the bug

Some items in the jsonl and bson exports of the mongodb have the "ingredients_text" field truncated, whereas the API returns it in full, as does mongodb itself.

Those are not new items.

To Reproduce

On mongo:

 db.products.find({"_id": "1340951640901"}, {"ingredients_text":1})
{ "_id" : "1340951640901", "ingredients_text" : "Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid." }

whereas in the json:

{"_id":"01340951640901", ..., "ingredients_text":"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and"

Additional examples:

00637480006835
00016571950293
00016571910303
07203671232

Expected behavior

"ingredients_text" should be complete.

alexgarel · 2022-08-24T12:33:03Z

I spent time investigating this morning but found no clue searching online.

I read the mongodump and mongoexport docs looking for a config parameter or similar… nothing.

The only limit I can find is on document size, which must be under 16 MB, and:

> Object.bsonsize(db.products.find({"_id": "1340951640901"}))
78782

Our object is about 78 kB, so we are far from this limit.

Another limit applies to indexed fields (1024 bytes), but it should no longer exist in our version of mongodb, and our strings are truncated well before that (they are ASCII, so their UTF-8 encoding is no longer than their character count).
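The ASCII remark can be checked quickly: for pure-ASCII text, the UTF-8 byte length equals the character count, so a byte-based limit of 1024 would not explain a cut after a few hundred characters. A minimal sketch in Python; the helper name `utf8_overhead` and the sample strings are illustrative and not taken from the database:

```python
def utf8_overhead(text: str) -> int:
    """Return how many bytes UTF-8 needs beyond one byte per character."""
    return len(text.encode("utf-8")) - len(text)


ascii_sample = "vegetable oil (soybean and/or canola)"
accented_sample = "huile végétale"  # each 'é' takes 2 bytes in UTF-8

print(utf8_overhead(ascii_sample))     # 0: ASCII is one byte per character
print(utf8_overhead(accented_sample))  # 2: two accented characters
```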

alexgarel · 2022-08-24T13:26:09Z

I ran a test retrieving 1340951640901 on a test mongo instance where @CharlesNepote imported data from the dump some months ago… the string is not truncated!

alexgarel · 2022-08-24T13:26:19Z

I am continuing the investigation to find differences.

Here are the values I get from each source, listed in this order under each code below:

bson: decoded with bsondump products.bson | grep -A 3 -e '\(1340951640901\|00637480006835\|00016571950293\|00016571910303\|07203671232\)' > truncated.json
jsonl: zcat ../../openfoodfacts-products.jsonl.gz | grep -A 3 -e '\(1340951640901\|00637480006835\|00016571950293\|00016571910303\|07203671232\)' > truncated-json.json
in mongosh: db.products.find({"_id": "xxxxx"}, {"ingredients_text":1})
search api: https://world.openfoodfacts.org/api/v2/search/?codes_tags=xxxxxxx&fields=ingredients_text,last_modified_t,last_editor,code
direct access: https://world.openfoodfacts.org/api/v2/product/xxxxxx&fields=ingredients_text

1340951640901
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and"
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and"
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid."
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid."
"Aged cayenne pepper, distilled vinegar, margarine [soybean oil, hydrogenated soybean oil, soy lecithin (soy), artificial butter flavor (diacetyl-free), colored with beta carotene], water, salt, garlic,* contains less than 2% of: vegetable oil (soybean and/or canola), paprika, xanthan gum, propylene glycol alginate, citric acid."

00637480006835
"Water, dairy protein blend (milk protein concentrate,calcium ca seinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrste, sodium hexametaphosphate, nat"
"Water, dairy protein blend (milk protein concentrate,calcium ca seinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrste, sodium hexametaphosphate, nat"
"Water, dairy protein blend (milk protein concentrate,calcium ca seinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrste, sodium hexametaphosphate, nat"
"Water, dairy protein blend (milk protein concentrate, calcium caseinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrate, sodium hexametaphosphate, natural and artificial flavors, calcium phosphate, salt, acesulfame potassium, carrageenan, soy lecithin, sucralose. vitamin mineral blend: sodium ascorbate (vitamin c), zinc gluconate, dl-apha-tocopheryl acetate (vitamin e), niacinamide (vitamin b3), manganese gluconate, d-calcium pantothenate (vitamin b5), pyridoxine hydrochloride (vitamin b6), thiamin hydrochloride (vitamin b1), riboflavin (vitamin b2), chromium chloride, folic acid (vitamin b9), biotin (vitamin b7), potassium iodide, sodium molybdate, sodium selenite, phylloquinone (vitamin k1), cyanocobalamin (vitamin b12), cholecalciferol (vitamin d3)."
"Water, dairy protein blend (milk protein concentrate, calcium caseinate, whey protein concentrate), sunflower oil, pasteurized cream, isolated soy protein, cellulose gel, cellulose gum, magnesium phosphate, potassium citrate, sodium hexametaphosphate, natural and artificial flavors, calcium phosphate, salt, acesulfame potassium, carrageenan, soy lecithin, sucralose. vitamin mineral blend: sodium ascorbate (vitamin c), zinc gluconate, dl-apha-tocopheryl acetate (vitamin e), niacinamide (vitamin b3), manganese gluconate, d-calcium pantothenate (vitamin b5), pyridoxine hydrochloride (vitamin b6), thiamin hydrochloride (vitamin b1), riboflavin (vitamin b2), chromium chloride, folic acid (vitamin b9), biotin (vitamin b7), potassium iodide, sodium molybdate, sodium selenite, phylloquinone (vitamin k1), cyanocobalamin (vitamin b12), cholecalciferol (vitamin d3)."


00016571950293
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow"
"Carbonated water, citric acid, natural flavors, lemon juice concentrate, potassium benzoate (to ensure freshness), fruit and vegetable juice (for color), sucralose, beta carotene (for color), green tea extract, ester gum, calcium disodium edta (to protect flavor), biotin, niacinamide (vitamin b3), calcium pantothenate (vitamin b5), vitamin a, vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6)."

00016571910303
"Carbonated mountain spring water, natural flavors, blackberry juice concentrate, malic acid, potassium benzoate (to ensure freshness), sucralose, green tea extract, red #40, biotin 1% trit. (maltodextrin), niacinamide (b3), d-calcium pantothenate (b5), vi"
"Carbonated mountain spring water, natural flavors, blackberry juice concentrate, malic acid, potassium benzoate (to ensure freshness), sucralose, green tea extract, red #40, biotin 1% trit. (maltodextrin), niacinamide (b3), d-calcium pantothenate (b5), vi"
"Carbonated mountain spring water, natural flavors, blackberry juice concentrate, malic acid, potassium benzoate (to ensure freshness), sucralose, green tea extract, red #40, biotin 1% trit. (maltodextrin), niacinamide (b3), d-calcium pantothenate (b5), vi"
"carbonated water, natural flavors, malic acid, vegetable juice (for color), blackberry juice concentrate, potassium benzoate (to ensure freshness), sucralose, gum arabic, green tea extract, biotin, niacinamide (vitamin b3), vitamin a, calcium pantothenate (vitamin b5), vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6),"
"carbonated water, natural flavors, malic acid, vegetable juice (for color), blackberry juice concentrate, potassium benzoate (to ensure freshness), sucralose, gum arabic, green tea extract, biotin, niacinamide (vitamin b3), vitamin a, calcium pantothenate (vitamin b5), vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6),"

07203671232
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"
"Sea salt, spices and herbs, natural bleu cheese flavor (maltodextrin, whey solids, natural bleu cheese), dehydrated garlic, dehydrated onion, whey powder, cane sugar, ground mustard, worcestershire sauce powder ([distilled vinegar, molasses, corn syrup, s"

My conclusions so far:

07203671232 is OK: it is the primary data itself that is truncated.
There is no consistent pattern across the dump, mongosh, the search API and the product API! (strange)
When a string is truncated, it is always truncated at 255 characters (this does not seem coincidental…)
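The 255-character observation suggests a simple heuristic for spotting affected products when comparing sources. The sketch below is only an illustration: `find_truncated` and the source names are hypothetical, and the sample data is synthetic rather than copied from the exports.

```python
from typing import Dict, List

SUSPECT_LENGTH = 255  # cut-off observed in this thread


def find_truncated(values: Dict[str, str], limit: int = SUSPECT_LENGTH) -> List[str]:
    """Return the names of sources whose value looks cut at `limit` characters.

    A value is flagged when it is exactly `limit` characters long and is a
    strict prefix of a longer value coming from another source.
    """
    flagged = []
    for name, text in values.items():
        if len(text) != limit:
            continue
        if any(other != text and other.startswith(text) for other in values.values()):
            flagged.append(name)
    return flagged


# Synthetic data standing in for one product seen through three sources.
full = "x" * 300
sample = {"bson": full[:255], "jsonl": full[:255], "mongosh": full}
print(find_truncated(sample))  # ['bson', 'jsonl']
```

A product like 07203671232, where every source holds the same 255-character string, would not be flagged by this comparison, since there is no longer value to compare against.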

alexgarel · 2022-08-24T13:39:07Z

For 0016571950293, it seems that the rev1 data is what is in the index.
This comes from an old USDA import.

For 00016571910303, I do not see a revision with these ingredients, but in the json / index we have the same ingredients as 0016571950293, strangely.

stephanegigandet · 2022-08-24T13:44:54Z

The first USDA dump import had ingredients truncated at 255 characters; that was a bug in the USDA data.

alexgarel · 2022-08-24T17:00:40Z

Hmm, for 00016571950293 and 00016571910303, I think I got it: as they are 14 digits long, our API removes the leading 0.

If you look carefully at https://world.openfoodfacts.org/api/v2/product/00016571950293&fields=code,ingredients_text,codes_tags, you see the leading 0 is removed in the returned object.

But it seems we still have old 14-digit entries in the mongodb, and they are not up to date…

On staging:

> db.products.find({_id: "00016571950293"}, {ingredients_text:1})
{ "_id" : "00016571950293", "ingredients_text" : "Carbonated mountain spring water, natural flavors, citric acid, apple juice concentrate, lemon juice concentrate, potassium benzoate (to ensure freshness), sucralose, ester gum, green tea extract, calcium disodium edta (to protect flavor), red #40, yellow" }
> db.products.find({_id: "0016571950293"}, {ingredients_text:1})
{ "_id" : "0016571950293", "ingredients_text" : "Carbonated water, citric acid, natural flavors, lemon juice concentrate, potassium benzoate (to ensure freshness), fruit and vegetable juice (for color), sucralose, beta carotene (for color), green tea extract, ester gum, calcium disodium edta (to protect flavor), biotin, niacinamide (vitamin b3), calcium pantothenate (vitamin b5), vitamin a, vitamin b12, vitamin d3, pyridoxine hydrochloride (vitamin b6)." }

There seem to be only 147 of them. See https://world.openfoodfacts.org/api/v2/search?codes_tags=0xxxxxxxxxxxxx&fields=code
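The leading-0 behaviour described above can be modelled in a few lines. This is a sketch under assumptions: `drop_leading_zero` is a hypothetical helper that reproduces only the single rule observed in this thread, not the project's actual code normalization.

```python
def drop_leading_zero(code: str) -> str:
    """Apply the rule observed above: a 14-digit code starting with 0 loses that 0.

    Hypothetical stand-in: it models only this one behaviour and is not the
    project's own normalization routine.
    """
    if len(code) == 14 and code.isdigit() and code.startswith("0"):
        return code[1:]
    return code


print(drop_leading_zero("00016571950293"))  # 0016571950293
print(drop_leading_zero("0016571950293"))   # 0016571950293 (13 digits, unchanged)
```

Both mongodb entries shown on staging therefore resolve to the same code through the API, which would explain why the 14-digit entry is never refreshed.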

alexgarel · 2022-08-24T17:03:42Z

Note that the product exists on the filesystem (here on staging):

/mnt/podata/products$ ls 000/165/719/10303
1.sto  changes.sto  product.sto

alexgarel · 2022-08-24T17:06:40Z

@stephanegigandet should we remove those products and their references in the mongodb?

Should Kristina handle those 14-digit codes and remove the first digit if it's a 0?

That would leave us only with the first problematic case.

stephanegigandet · 2022-08-25T07:30:50Z

The ideal solution would be to move them to the code without the leading 0 (if no product exists under that code yet), or delete them (if a product already exists there).

It would be good to also dump all codes from MongoDB and run them through normalize_code() to see if we have other instances of products that are now stored differently.
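The move-or-delete decision and the audit of all codes could be prototyped as follows. This is a rough sketch, not the actual cleanup script: `plan_cleanup` is a hypothetical name, `drop_leading_zero` is a stand-in for normalize_code() that models only the leading-0 rule, and the id list mixes the pair seen on staging with one made-up code.

```python
from typing import Callable, Dict, Iterable


def plan_cleanup(ids: Iterable[str], normalize: Callable[[str], str]) -> Dict[str, str]:
    """Map each non-normalized id to 'move' or 'delete'.

    'move'   : no product exists yet under the normalized code.
    'delete' : a product already exists under the normalized code.
    """
    known = set(ids)
    plan = {}
    for code in sorted(known):
        target = normalize(code)
        if target == code:
            continue  # already stored under its normalized code
        plan[code] = "delete" if target in known else "move"
    return plan


def drop_leading_zero(code: str) -> str:
    # Hypothetical stand-in for normalize_code(): leading-0 rule only.
    return code[1:] if len(code) == 14 and code.startswith("0") else code


made_up = "0" * 11 + "123"  # invented 14-digit code with no 13-digit twin
ids = ["00016571950293", "0016571950293", made_up]
print(plan_cleanup(ids, drop_leading_zero))
# {'00000000000123': 'move', '00016571950293': 'delete'}
```

In a real run, the ids would come from a dump of all `_id` values in MongoDB, and the normalizer would be the project's own function rather than this stand-in.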

alexgarel · 2022-08-25T08:08:10Z

I opened #7249 and #7248

github-actions · 2022-11-24T00:09:25Z

This issue is stale because it has been open 90 days with no activity.

alexgarel added the 🐛 bug label Aug 24, 2022

CharlesNepote added the MongoDB and Data export labels Aug 24, 2022

This was referenced Aug 25, 2022

Write a script that fix mongodb items out of sync #7248 (Open)

Remove or move entries with non normalized codes in mongodb #7249 (Open)

github-actions bot added the ⏰ Stale label Nov 24, 2022

teolemon removed the 🐛 bug label Oct 18, 2024
