ingredients_text is truncated in mongodb dump #7244
Comments
I spent some time investigating this morning but found no clue searching online. I read the mongodump and mongoexport docs looking for a relevant config parameter: nothing. The only limit I can find is on document size, which must stay under 16 MB, and:
our object is 78 kB, so we are far from this limit. Another limit applies to indexed fields (1024 bytes), but it is normally removed in our version of MongoDB, and our strings are truncated well before that point (they are ASCII, so their UTF-8 encoding is no longer than their character count).
I ran a test retrieving 1340951640901 from a test mongo instance where @CharlesNepote imported data from the dump some months ago… the string is not truncated!
I am continuing the investigation to find the differences. Here is the value I get using:
My conclusions so far:
For 0016571950293, it seems it is the rev1 data that is in the index. For 00016571910303, I do not see a version with these ingredients, but in the JSON / index we have the same ingredients as for 0016571950293, strangely.
The first USDA dump import had ingredients truncated at 255 characters; that was a bug in the USDA data.
Hmm, for 00016571950293 and 00016571910303, I think I've got it: as they have 14 digits, our API removes the first 0. If you look carefully at https://world.openfoodfacts.org/api/v2/product/00016571950293&fields=code,ingredients_text,codes_tags you can see that the leading 0 is removed in the returned object. But it seems we have old 14-digit references in MongoDB, which are not up to date… On staging:
There seem to be only 147 of them. See https://world.openfoodfacts.org/api/v2/search?codes_tags=0xxxxxxxxxxxxx&fields=code
Note that the product exists on the filesystem (here on staging):
@stephanegigandet should we remove those products and their references in MongoDB? Or should Kristina handle those 14-digit codes herself and remove the first digit when it is a 0? That would leave us with only the first problematic case.
The ideal solution would be to move them to the code without the leading 0 (if that product does not exist yet), or delete them (if that product already exists). It would also be good to dump all codes from MongoDB and run them through normalize_code() to see if we have other instances of products that are now stored under a different code.
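The move-or-delete check described above could be sketched roughly as follows. This is only an illustration in Python: `normalize_code_sketch` is a hypothetical, simplified stand-in that applies just the one rule mentioned in this thread (drop the first 0 of a 14-digit code), not the real normalize_code(), which may apply other rules. `find_stale_codes` and `plan_actions` are likewise made-up names, and the list of codes is assumed to have been dumped from MongoDB beforehand.

```python
def normalize_code_sketch(code: str) -> str:
    """Hypothetical stand-in for normalize_code().

    Assumption: only the rule discussed in this thread is applied,
    i.e. a 14-digit code starting with 0 loses its first 0.
    """
    digits = code.strip()
    if len(digits) == 14 and digits.startswith("0"):
        return digits[1:]
    return digits


def find_stale_codes(codes):
    """Return (stored, normalized) pairs for codes that normalization changes."""
    return [(c, n) for c in codes if (n := normalize_code_sketch(c)) != c]


def plan_actions(codes):
    """Decide, per stale code, whether to move or delete it.

    "delete" when the normalized code is already stored, "move" otherwise.
    """
    existing = set(codes)
    return {
        stored: ("delete" if normalized in existing else "move")
        for stored, normalized in find_stale_codes(codes)
    }


if __name__ == "__main__":
    # Codes taken from this issue; the first two collide after normalization.
    codes = ["00016571950293", "0016571950293", "00016571910303", "07203671232"]
    for stored, action in plan_actions(codes).items():
        print(stored, "->", normalize_code_sketch(stored), action)
```

Running the real normalize_code() over the full dump would be the authoritative check; the sketch only shows the shape of the comparison.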
This issue is stale because it has been open 90 days with no activity.
Describe the bug
As reported by Kristina on Slack.
Some items in the JSONL and BSON exports of the MongoDB database have a truncated "ingredients_text" field, whereas both the API and MongoDB itself return it in full.
Those are not new items.
To Reproduce
On mongo:
whereas in the json:
Additional examples:
00637480006835
00016571950293
00016571910303
07203671232
Expected behavior
"ingredients_text" should be complete.
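To find other affected products, one could scan the JSONL export and compare each "ingredients_text" with the value obtained from the API or from MongoDB. The sketch below is only an illustration: `find_truncated` is a hypothetical helper, and `reference` is assumed to be a dict mapping product code to the full text, fetched separately. A record is flagged when its exported text differs from the reference but is a prefix of it.

```python
import json


def find_truncated(jsonl_lines, reference, field="ingredients_text"):
    """Return (code, exported_length, full_length) for truncated fields.

    jsonl_lines: iterable of JSON strings, one product per line.
    reference:   dict mapping code -> full text (assumed fetched elsewhere).
    """
    truncated = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        code = doc.get("code")
        full = reference.get(code)
        exported = doc.get(field) or ""  # treat a missing or null field as empty
        # Truncated means: not equal, yet the reference starts with the export.
        if full is not None and exported != full and full.startswith(exported):
            truncated.append((code, len(exported), len(full)))
    return truncated
```

With a real export, `jsonl_lines` can simply be the open file object, so the dump is read one line at a time rather than loaded into memory.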