Mid dot is not processed #8373

Closed
duhow opened this issue Apr 27, 2023 · 4 comments · Fixed by #8690
Labels
🐛 bug This is a bug, not a feature request. 🧽 Data quality https://wiki.openfoodfacts.org/Quality 🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis 🥗 Ingredients

Comments

@duhow
Contributor

duhow commented Apr 27, 2023

Describe the bug

The Catalan name of additive E460, cel·lulosa, is split into two words: cel and lulosa.

To Reproduce

https://es.openfoodfacts.org/producto/8422410814206/queso-rallado-mozzarella-bonpreu

Expected behavior

cel·lulosa is accepted as E460.

Screenshots

image

Additional context

The word is already registered in the additives taxonomy:

ca:E460, Cel·lulosa, cellulosa

Part of

@duhow duhow added the 🐛 bug This is a bug, not a feature request. label Apr 27, 2023
@benbenben2
Collaborator

The character is U+00B7 (\N{U+00B7}, https://unicodeplus.com/U+00B7), which is listed as a separator in Ingredients.pm:

my $middle_dot
	= qr/(?:\N{U+00B7}|\N{U+2022}|\N{U+2023}|\N{U+25E6}|\N{U+2043}|\N{U+204C}|\N{U+204D}|\N{U+2219}|\N{U+22C5})/i;
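
To illustrate why this matters, here is a minimal sketch that reuses the $middle_dot pattern above. The sample text and the plain split are assumptions made for the example; the real parser in Ingredients.pm does far more than a split.

```perl
use strict;
use warnings;

# Pattern copied from Ingredients.pm as quoted above.
my $middle_dot
    = qr/(?:\N{U+00B7}|\N{U+2022}|\N{U+2023}|\N{U+25E6}|\N{U+2043}|\N{U+204C}|\N{U+204D}|\N{U+2219}|\N{U+22C5})/i;

# Hypothetical ingredient list containing the Catalan word "cel·lulosa".
my $text = "mozzarella, cel\N{U+00B7}lulosa";

# Treating every middle dot as a separator, like a comma, cuts the word in two.
my @parts = split /\s*(?:,|$middle_dot)\s*/, $text;

print join(' | ', @parts), "\n";    # prints "mozzarella | cel | lulosa"
```

Neither "cel" nor "lulosa" can then match the taxonomy entry "Cel·lulosa", so E460 is not recognized.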

Other taxonomy entries contain this middle dot too, and not only Catalan ones:

taxonomies/additives.txt:ca:E310, Gal·lat de propil
taxonomies/additives.txt:ca:E311, Gal·lat d'octil
taxonomies/additives.txt:ca:E312, Gal·lat de dodecil
taxonomies/additives.txt:ca:E426, Hemicel·lulosa de soja
taxonomies/additives.txt:ca:E460, Cel·lulosa, cellulosa
taxonomies/additives.txt:ca:E461, Metilcel·lulosa, metilcellulosa
taxonomies/additives.txt:ca:E462, Etilcel·lulosa
taxonomies/additives.txt:ca:E463, Hidroxipropilcel·lulosa
taxonomies/additives.txt:ca:E464, Hidroxipropilmetilcel·lulosa
taxonomies/additives.txt:ca:E465, Etilmetilcel·lulosa
taxonomies/additives.txt:ca:E469, Carboximetilcel·lulosa hidrolitzada enzimàticament
taxonomies/additives.txt:fi:E519, Kuparisulfaatti, Kupari(II)sulfaatti, CuSO4·5H2O, Kuparivihtrilli, CuSO4, Kuparisulfaattia, Kupari(II)sulfaattia, Kuparivihtrilliä
taxonomies/additives.txt:it:E519, Solfato rameico, CuSO4·5H2O, Solfato di rame, CuSO4
taxonomies/additives.txt:it:E522, Solfato di alluminio e potassio, Allume di potassio, KAl(SO4)2·12H2O, Allume potassico, Solfato d'alluminio e potassio, Allume di rocca, Solfato doppio di alluminio e potassio dodecaidrato, Solfato di alluminio e potassio dodecaidrato
taxonomies/additives.txt:it:E523, Solfato di alluminio e ammonio, Allume di ammonio, Solfato di alluminio e ammonio dodecaidrato, NH4Al(SO4)2·12H2O
taxonomies/additives.txt:it:E536, Ferrocianuro di potassio, K4(Fe(CN)6)·3H2O
taxonomies/additives.txt:it:E539, tiosolfato di sodio, Iposolfito di sodio, Na2S2O3·5H2O
taxonomies/additives.txt:ca:E999, Extracte de quil·laia

taxonomies/categories.txt:ca:Llets hipoal·lergèniques

taxonomies/ingredients.txt:zh:格拉娜·帕達諾芝士
taxonomies/ingredients.txt:zh:布勒·德·奧福格芝士
taxonomies/ingredients.txt:zh:布勒·德·布瑞塞芝士
taxonomies/ingredients.txt:zh:布勒·德·吉克斯芝士

@CharlesNepote CharlesNepote added 🥗 Ingredients 🧽 Data quality https://wiki.openfoodfacts.org/Quality 🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis labels Apr 30, 2023
@benbenben2
Collaborator

benbenben2 commented May 25, 2023

Many ingredients_text values seem to contain this middle dot (see examples on Mirabelle), either:

  • because of bad text extraction that the contributor did not correct afterwards, or
  • because the ingredients are separated by middle dots.

Suggestion:

  1. In the parsing separators, replace "•" (the middle dot, "my $middle_dot" in Ingredients.pm) with " • " (space, middle dot, space).
  2. Create an alert (warning?) when the text contains "[not-a-space]•[not-a-space]" AND no ingredient ([ingredients][i]["text"] in the product_ref) contains a middle dot.
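
The alert in the second suggestion could be sketched roughly as follows. The helper name has_suspicious_middle_dot is hypothetical and not part of Product Opener. The sketch also assumes the raw text lives in $product_ref->{ingredients_text} and the parsed ingredients in $product_ref->{ingredients}[i]{text}.

```perl
use strict;
use warnings;

# Pattern copied from Ingredients.pm.
my $middle_dot
    = qr/(?:\N{U+00B7}|\N{U+2022}|\N{U+2023}|\N{U+25E6}|\N{U+2043}|\N{U+204C}|\N{U+204D}|\N{U+2219}|\N{U+22C5})/i;

# Hypothetical helper: returns 1 when the product should be flagged for review.
sub has_suspicious_middle_dot {
    my ($product_ref) = @_;

    my $text = $product_ref->{ingredients_text} // '';

    # Condition 1: a middle dot with a non-space character on both sides.
    return 0 unless $text =~ /\S$middle_dot\S/;

    # Condition 2: no parsed ingredient kept a middle dot in its text.
    foreach my $ingredient_ref (@{$product_ref->{ingredients} // []}) {
        return 0 if ($ingredient_ref->{text} // '') =~ $middle_dot;
    }

    return 1;
}

# Hypothetical products: the word was split in the first, kept whole in the second.
my $split_product_ref = {
    ingredients_text => "mozzarella, cel\N{U+00B7}lulosa",
    ingredients      => [{text => "mozzarella"}, {text => "cel"}, {text => "lulosa"}],
};
my $intact_product_ref = {
    ingredients_text => "mozzarella, cel\N{U+00B7}lulosa",
    ingredients      => [{text => "mozzarella"}, {text => "cel\N{U+00B7}lulosa"}],
};

my $split_flag  = has_suspicious_middle_dot($split_product_ref);
my $intact_flag = has_suspicious_middle_dot($intact_product_ref);

print "split: $split_flag, intact: $intact_flag\n";    # prints "split: 1, intact: 0"
```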

This relates to data quality for ingredients: the new alert would detect products whose ingredients should be reviewed because a bad text extraction was never corrected.

What do you guys think? @stephanegigandet @alexgarel @CharlesNepote @teolemon

@stephanegigandet
Contributor

  • Replace "•" (middle dot, "my $middle_dot" in Ingredients.pm) parsing symbols, by " • " (space-middle dot-space)

@benbenben2 Good idea. I think we could replace:

my $separators_except_comma = qr/(;|:|$middle_dot|\[|\{|\(|\N{U+FF08}|( $dashes ))|(\/|\N{U+FF0F})/i

by:

my $separators_except_comma = qr/(;|:| $middle_dot |\[|\{|\(|\N{U+FF08}|( $dashes ))|(\/|\N{U+FF0F})/i

(spaces around $middle_dot)
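
The effect of requiring spaces can be sketched with a simplified separator. Both $separator and split_ingredients are hypothetical names, and the separator is a reduced stand-in for $separators_except_comma, not the real pattern.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Pattern copied from Ingredients.pm.
my $middle_dot
    = qr/(?:\N{U+00B7}|\N{U+2022}|\N{U+2023}|\N{U+25E6}|\N{U+2043}|\N{U+204C}|\N{U+204D}|\N{U+2219}|\N{U+22C5})/i;

# Hypothetical simplified separator: a comma, or a middle dot with a space on each side.
my $separator = qr/\s*,\s*| $middle_dot /;

# Hypothetical helper: split a text on the separator and drop empty pieces.
sub split_ingredients {
    my ($text) = @_;
    return grep { length } split /$separator/, $text;
}

# A middle dot inside a word is no longer a separator.
my @inside_word = split_ingredients("mozzarella, cel\N{U+00B7}lulosa");

# A middle dot surrounded by spaces still separates ingredients.
my @spaced = split_ingredients("sucre \N{U+00B7} farina \N{U+00B7} sal");

# A middle dot used as a separator without spaces is not split any more.
my @unspaced = split_ingredients("sucre\N{U+00B7}farina");

print scalar(@inside_word), " / ", scalar(@spaced), " / ", scalar(@unspaced), "\n";    # prints "2 / 3 / 1"
```

The last case shows the trade-off: lists that use unspaced middle dots as separators stay unsplit, which is what the proposed alert is meant to catch.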
