Ingredients parsing issue - Croatian - "E500 i E503" becomes "E500i", "E503" #7927

Closed
stephanegigandet opened this issue Jan 2, 2023 · 3 comments · Fixed by #8905
Labels
🧪 additives 🇭🇷 Croatia https://hr.openfoodfacts.org/ 🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis ⏰ Stale This issue hasn't seen activity in a while. You can try documenting more to unblock it.

Comments

@stephanegigandet
Contributor

stephanegigandet commented Jan 2, 2023

From @benbenben2 #7925 (comment)

I noticed that when an ingredient list contains additives written as "E500 i E503", the parsed list of ingredients becomes "E500i", "E503" in the details of the ingredients analysis (https://hr.openfoodfacts.org/product/3850102522866/moto-kakao-mlijeko-kra%C5%A1).

This happens even though the variable "my %and =" in the Ingredients.pm file is populated with "i" for Croatian, and "i" is listed in the stopwords of the ingredients.txt taxonomy file.

Any ideas on how to tackle this?


@stephanegigandet stephanegigandet added the 🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis label Jan 2, 2023
@teolemon teolemon added the 🇭🇷 Croatia https://hr.openfoodfacts.org/ label Feb 11, 2023
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity.

@github-actions github-actions bot added the ⏰ Stale This issue hasn't seen activity in a while. You can try documenting more to unblock it. label May 17, 2023
@CharlesNepote
Member

It seems to affect ~40 products.

@benbenben2
Collaborator

benbenben2 commented Aug 23, 2023

  1. The source of the issue is that the Roman numeral for 1 ("i"), which can appear as an additive variant, conflicts with the Croatian word for "and" (also "i").

Roman numerals are defined in Ingredients.pm:
my $roman_numerals = "i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xii|xiv|xv";

The word "and" is defined per language in Ingredients.pm, around line 386:
my %and = (

  2. The exact part of the code that replaces "tvari za rahljenje: E 500 i E 503, " with "tvari za rahljenje: e500i e503, " is:
    $text =~ s/(\b)e( |-|\.)?$additivesregexp(\b|\s|,|\.|;|\/|-|\\|\)|\]|$)/replace_additive($3,$6,$9) . $12/ieg;

This regular expression captures the "i" (meaning "and") as the additive's variant and passes it to the replace_additive() subroutine.
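
The capture described above can be sketched in Python. This is only an illustration: the pattern below is a simplified, hypothetical stand-in for $additivesregexp (the real Perl pattern is not reproduced in this thread), and normalize() merely mimics what replace_additive() is described as doing. Only the Roman numeral alternation is copied from Ingredients.pm.

```python
import re

# Copied verbatim from the thread (Ingredients.pm); everything else is assumed.
ROMAN_NUMERALS = "i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xii|xiv|xv"

# Hypothetical stand-in for $additivesregexp: a 3-4 digit number followed by
# an optional Roman numeral variant that may be separated by a space.
ADDITIVE = re.compile(
    r"\be[ .-]?(\d{3,4})(?:[ .-]?(" + ROMAN_NUMERALS + r"))?\b",
    re.IGNORECASE,
)


def normalize(text: str) -> str:
    """Rewrite each additive as e<number><variant> (a rough replace_additive())."""
    def repl(match):
        number, variant = match.group(1), match.group(2) or ""
        return "e" + number + variant.lower()

    return ADDITIVE.sub(repl, text)


# The Croatian "i" (and) is swallowed as the variant of E 500:
print(repr(normalize("tvari za rahljenje: E 500 i E 503, ")))
# -> 'tvari za rahljenje: e500i e503, '
```

Because the variant is optional and may follow a space, the regex engine prefers the longer match "E 500 i" over "E 500", which is why the conjunction disappears into the additive code.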

  3. Wrong solution:
    a) pass "$and" as an extra parameter to the replace_additive() subroutine.
    b) keep the variant only when it is not equal to $and ("i").

-> This solution is bad because it drops the variant of any genuine additive such as e123i. For example, "tvari za rahljenje: E 500i, E 503," becomes "tvari za rahljenje (e500), e503", which is missing the "i".
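
A rough Python sketch of why the rejected approach loses information. As before, the pattern is a simplified, hypothetical stand-in for $additivesregexp, and normalize_dropping_and() is an invented name for illustration, not a function from Ingredients.pm.

```python
import re

# Copied verbatim from the thread (Ingredients.pm); everything else is assumed.
ROMAN_NUMERALS = "i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xii|xiv|xv"

# Hypothetical, simplified stand-in for $additivesregexp.
ADDITIVE = re.compile(
    r"\be[ .-]?(\d{3,4})(?:[ .-]?(" + ROMAN_NUMERALS + r"))?\b",
    re.IGNORECASE,
)


def normalize_dropping_and(text: str, and_word: str) -> str:
    """The rejected idea: discard any variant equal to the language's 'and'."""
    def repl(match):
        variant = (match.group(2) or "").lower()
        if variant == and_word:
            variant = ""  # assumed to be the conjunction, so it is dropped
        return "e" + match.group(1) + variant

    return ADDITIVE.sub(repl, text)


# The conjunction case now looks better...
print(normalize_dropping_and("E 500 i E 503", "i"))   # -> e500 e503
# ...but a genuine "i" variant is lost as well.
print(normalize_dropping_and("E 500i, E 503", "i"))   # -> e500, e503
```

At the point where the variant is inspected, "E 500 i" and "E 500i" are indistinguishable, so no rule based only on the captured variant can tell the conjunction from a real variant.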

  4. Solution to try:
    I am now thinking of moving
    # E100 et E120 -> E100, E120
    $text =~ s/\be($additivesregexp)$and/e$1, /ig;
    $text =~ s/${and}e($additivesregexp)/, e$1/ig;
    before
    $text =~ s/(\b)e( |-|\.)?$additivesregexp(\b|\s|,|\.|;|\/|-|\\|\)|\]|$)/replace_additive($3,$6,$9) . $12/ieg;
    I am a bit afraid that changing the order like this could introduce bugs for other languages and other ingredient lists. Any advice is welcome @stephanegigandet, @alexgarel
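
The reordering idea can be sketched in Python as follows. This is not the Perl code from Ingredients.pm: the additive pattern is a simplified, hypothetical stand-in for $additivesregexp, the two Perl substitutions are collapsed into one substitution with a lookahead, and unlike the quoted Perl lines it tolerates a separator after the "E" so that it can run on text that has not been normalized yet. The names split_on_and(), normalize() and parse() are invented for the example.

```python
import re

# Copied verbatim from the thread (Ingredients.pm); everything else is assumed.
ROMAN_NUMERALS = "i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xii|xiv|xv"

# Hypothetical, simplified stand-ins for $additivesregexp.
ADDITIVE_NC = r"e[ .-]?\d{3,4}(?:[ .-]?(?:" + ROMAN_NUMERALS + r"))?"
ADDITIVE = re.compile(
    r"\be[ .-]?(\d{3,4})(?:[ .-]?(" + ROMAN_NUMERALS + r"))?\b",
    re.IGNORECASE,
)


def split_on_and(text: str, and_word: str) -> str:
    """Step moved first: '<additive><and><additive>' -> '<additive>, <additive>'."""
    pattern = re.compile(
        r"\b(" + ADDITIVE_NC + r")" + re.escape(and_word)
        + r"(?=" + ADDITIVE_NC + r"\b)",
        re.IGNORECASE,
    )
    return pattern.sub(r"\1, ", text)


def normalize(text: str) -> str:
    """Step now run second: rewrite each additive as e<number><variant>."""
    return ADDITIVE.sub(
        lambda m: "e" + m.group(1) + (m.group(2) or "").lower(), text
    )


def parse(text: str, and_word: str = " i ") -> str:
    return normalize(split_on_and(text, and_word))


print(parse("E 500 i E 503"))    # -> e500, e503
print(parse("E 500i, E 503"))    # -> e500i, e503
print(parse("E 500i i E 503"))   # -> e500i, e503
```

In this sketch the first step works because the regex engine backtracks: it first tries to read " i" as a variant, fails to find the "and" word after it, and then retries with the bare number. Whether the real $additivesregexp and the other languages behave the same way would still need to be checked against the existing tests.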

  5. @CharlesNepote, I am also wondering whether other languages have the same issue.
    If you look at the variable my %and = ( around line 386 in Ingredients.pm, it contains:

ca => " i ",
cs => " a ",
gl => " e ",
hr => " i ",
it => " e ",
oc => " e ",
pl => " i ",
pt => " e ",
sk => " a ",
uk => " i ",

All of these letters can also appear in additive codes.
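
A small Python sketch to list which of these languages could collide with an additive suffix. The "and" words and the Roman numeral list are copied from this thread as displayed; the set of letter suffixes is an assumption made for illustration (E160a to E160f exist, but the exact letters accepted by $additivesregexp are not shown here).

```python
# "and" words as listed in this thread (subset of %and in Ingredients.pm).
AND_WORDS = {
    "ca": " i ", "cs": " a ", "gl": " e ", "hr": " i ", "it": " e ",
    "oc": " e ", "pl": " i ", "pt": " e ", "sk": " a ", "uk": " i ",
}

# Copied verbatim from the thread (Ingredients.pm).
ROMAN_NUMERALS = "i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xii|xiv|xv".split("|")

# Assumption for illustration only: single-letter suffixes as in E160a..E160f.
LETTER_SUFFIXES = set("abcdef")


def conflicting_languages() -> dict:
    """Map each language code to (and-word, kind of suffix it collides with)."""
    conflicts = {}
    for lang, word in AND_WORDS.items():
        token = word.strip()
        if token in ROMAN_NUMERALS:
            conflicts[lang] = (token, "roman numeral")
        elif token in LETTER_SUFFIXES:
            conflicts[lang] = (token, "letter suffix")
    return conflicts


for lang, (token, kind) in sorted(conflicting_languages().items()):
    print(f"{lang}: '{token}' can be read as a {kind}")
```

Under these assumptions, ca, hr, pl and uk collide with the Roman numeral "i", while cs, sk ("a") and gl, it, oc, pt ("e") collide with letter suffixes.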
