taxonomy: Added unknown Croatian ingredients to the taxonomy (part 8) #9227

Merged
merged 4 commits into from
Nov 20, 2023
48 changes: 32 additions & 16 deletions lib/ProductOpener/Ingredients.pm
@@ -773,6 +773,12 @@ my %min_regexp = (
hr => "min|min\.|mini|minimum",
);

my %max_regexp = (
en => "max|max\.|maximum",
fr => "max|max\.|maxi|maximum",
hr => "max|max\.|maxi|maximum",
);

# Words that can be ignored after a percent
# e.g. 50% du poids total, 30% of the total weight
# groups need to be non-capturing: prefixed with (?:
@@ -1548,15 +1554,6 @@ sub parse_processing_from_ingredient ($ingredients_lc, $ingredient) {
ingredient_recognized => $ingredient_recognized
}
) if $log->is_debug();
$log->debug(
"processing - return",
{
processings => \@processings,
ingredient => $ingredient,
ingredient_id => $ingredient_id,
ingredient_recognized => $ingredient_recognized
}
) if $log->is_debug();

return (\@processings, $ingredient, $ingredient_id, $ingredient_recognized);
}
@@ -1796,15 +1793,17 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref) {

my $min_regexp = $min_regexp{$ingredients_lc} || '';

my $max_regexp = $max_regexp{$ingredients_lc} || '';

my $ignore_strings_after_percent = $ignore_strings_after_percent{$ingredients_lc} || '';

# Regular expression to find percent or quantities
# $percent_or_quantity_regexp has 2 capturing groups: one for the number, and one for the % sign or the unit
my $percent_or_quantity_regexp = '(?:' . "(?:$prepared_with )" . ' )?' # optional produced with
. '(?:<|' . $min_regexp . '|\s|\.|:)*' # optional minimum, and separators
. '(?:>|' . $max_regexp . '|<|' . $min_regexp . '|\s|\.|:)*' # optional maximum, minimum, and separators
. '(\d+(?:(?:\,|\.)\d+)?)\s*' # number, possibly with a dot or comma
. '(\%|g|gr|mg|kg|ml|cl|dl|l)\s*' # % or unit
. '(?:' . $min_regexp . '|' # optional minimum
. '(?:' . $min_regexp . '|' . $max_regexp . '|' # optional minimum, optional maximum
. $ignore_strings_after_percent . '|\s|\)|\]|\}|\*)*'; # strings that can be ignored

my $per = $per{$ingredients_lc} || ' per ';
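
The hunk above extends the percent/quantity pattern so that a maximum marker ("max", "max.", "maxi", "maximum") is skipped in the same way as a minimum marker. The following is a minimal standalone sketch of that idea, not the project's code: `extract_quantity` is a hypothetical helper, the marker lists are simplified Croatian-only stand-ins, and the `$prepared_with` and `$ignore_strings_after_percent` parts are left out.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the per-language tables (Croatian only).
# Single quotes keep the backslash, so "\." matches a literal dot only.
my $min_regexp = 'minimum|mini|min\.|min';
my $max_regexp = 'maximum|maxi|max\.|max';

# Simplified pattern: optional max/min markers and separators,
# then a number (capture 1) and a "%" sign or unit (capture 2),
# then optional trailing markers and closing brackets.
my $percent_or_quantity_regexp
    = '(?:>|<|' . $max_regexp . '|' . $min_regexp . '|\s|\.|:)*'
    . '(\d+(?:[,.]\d+)?)\s*'
    . '(%|mg|kg|ml|cl|dl|gr|g|l)\s*'
    . '(?:' . $min_regexp . '|' . $max_regexp . '|\s|\)|\]|\}|\*)*';

# Returns (number, unit) when the whole text is a quantity, else an empty list.
sub extract_quantity {
    my ($text) = @_;
    if ($text =~ /^(?:$percent_or_quantity_regexp)$/i) {
        my ($number, $unit) = ($1, $2);
        $number =~ tr/,/./;    # normalize a decimal comma to a dot
        return ($number, $unit);
    }
    return;
}
```

In this sketch, `extract_quantity("max. 5%")` yields the pair `5` and `%`, and `extract_quantity("3,5 % min")` yields `3.5` and `%`; text without a quantity yields nothing.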
@@ -2500,13 +2499,16 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref) {

'hr' => [
'^u tragovima$', # in traces
'čokolada sadrži biljne masnoće uz kakaov maslac'
, # Chocolate contains vegetable fats along with cocoa butter
'označene podebljano', # marked in bold
'savjet kod alergije', # allergy advice
'u čokoladi kakaovi dijelovi'
, # Cocoa parts in chocolate 48%. Usually at the end of the ingredients list. Chocolate can contain many sub-ingredients (cacao, milk, sugar, etc.)
'u promjenjivim omjerima|u promjenjivim udjelima|u promijenljivom udjelu'
, # in variable proportions
'uključujući žitarice koje sadrže gluten', # including grains containing gluten
'za alergene', # for allergens
'u promjenjivim udjelima' # in variable proportions
],

'it' => ['^in proporzion[ei] variabil[ei]$',],
@@ -4015,7 +4017,16 @@ sub normalize_a_of_b ($lc, $a, $b, $of_bool, $alternate_names_ref = undef) {
my $a_of_b;

if (($lc eq "en") or ($lc eq "hr")) {
$a_of_b = $b . " " . $a;
# if $b starts with "with" (example: "mlijeko (s 1.0% mliječne masti)"), $b should be added after $a,
# i.e. "with ..." is appended to the end of the preceding ingredient
my %with = (hr => '(s | sa )',);
my $with = $with{$lc} || " will not match ";
if ($b =~ /^$with/i) {
$a_of_b = $a . " " . $b;
}
else {
$a_of_b = $b . " " . $a;
}
}
elsif ($lc eq "es") {
$a_of_b = $a . " de " . $b;
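
The new branch above changes the word order for Croatian when the type starts with "s " or "sa " ("with"): the type is appended after the category instead of being placed in front of it. Below is a minimal sketch of that ordering rule only. `join_category_and_type` is a hypothetical helper, not the project's `normalize_a_of_b`, and the pattern `(?:s|sa) ` is an assumption. The diff's own pattern `(s | sa )` has a leading space in its second alternative, so as written that alternative can only match when `$b` itself begins with a space.

```perl
use strict;
use warnings;

# Hypothetical table of "with" prefixes; only Croatian is filled in.
my %with = (hr => '(?:s|sa) ');

# Joins a category and a type for languages that normally put the type first.
# When the type starts with "with", it is appended after the category instead.
sub join_category_and_type {
    my ($lc, $category, $type) = @_;
    my $with = $with{$lc};
    if (defined $with and $type =~ /^$with/i) {
        return $category . " " . $type;
    }
    return $type . " " . $category;
}
```

With this sketch, the category "mlijeko" and the type "s 1.0% mliječne masti" give "mlijeko s 1.0% mliječne masti", while "slad" and "ječmeni" still give "ječmeni slad".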
@@ -5257,6 +5268,11 @@ my %ingredients_categories_and_types = (
categories => ["slad",],
types => ["ječmeni", "pšenični",]
},
# milk
{
categories => ["mlijeko",],
types => ["s 1.0% mliječne masti",]
},
],

pl => [
@@ -5394,18 +5410,18 @@ sub develop_ingredients_categories_and_types ($ingredients_lc, $text) {
or ($ingredients_lc eq "ru")
or ($ingredients_lc eq "pl"))
{
# vegetable oil (palm, sunflower and olive)
# vegetable oil (palm, sunflower and olive) -> palm vegetable oil, sunflower vegetable oil, olive vegetable oil
$text
=~ s/($category_regexp)(?::|\(|\[| | $of )+((($type_regexp)($symbols_regexp|\s)*( |\/| \/ | - |,|, |$and|$of|$and_of|$and_or)+)+($type_regexp)($symbols_regexp|\s)*)\b(\s?(\)|\]))?/normalize_enumeration($ingredients_lc,$1,$2,$of_bool, $categories_and_types_ref->{alternate_names})/ieg;

# vegetable oil (palm)
# vegetable oil (palm) -> palm vegetable oil
$text
=~ s/($category_regexp)\s?(?:\(|\[)\s?($type_regexp)\b(\s?(\)|\]))/normalize_enumeration($ingredients_lc,$1,$2,$of_bool,$categories_and_types_ref->{alternate_names})/ieg;
# vegetable oil: palm
$text
=~ s/($category_regexp)\s?(?::)\s?($type_regexp)(?=$separators|.|$)/normalize_enumeration($ingredients_lc,$1,$2,$of_bool,$categories_and_types_ref->{alternate_names})/ieg;

# ječmeni i pšenični slad (barley and wheat malt)
# ječmeni i pšenični slad (barley and wheat malt) -> ječmeni slad, pšenični slad
$text
=~ s/((?:(?:$type_regexp)(?: |\/| \/ | - |,|, |$and|$of|$and_of|$and_or)+)+(?:$type_regexp))\s*($category_regexp)/normalize_enumeration($ingredients_lc,$2,$1,$of_bool,$categories_and_types_ref->{alternate_names})/ieg;
}
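
The updated comments above describe the intended rewrite: a category followed by a list of types becomes one "type category" entry per type. The following is a simplified sketch of that expansion for languages that put the type first. `expand_enumeration` and the `%and` table are hypothetical and much narrower than the project's `normalize_enumeration`; the sketch handles only commas and a single conjunction word.

```perl
use strict;
use warnings;

# Hypothetical conjunction table for the two languages used in the examples.
my %and = (en => 'and', hr => 'i');

# Splits a list of types on commas and the conjunction word,
# then attaches the category to each type.
sub expand_enumeration {
    my ($lc, $category, $types_text) = @_;
    my $and = $and{$lc} // 'and';
    my @types = grep { length } split /\s*(?:,|\b\Q$and\E\b)\s*/i, $types_text;
    return join(", ", map { "$_ $category" } @types);
}
```

For example, `expand_enumeration("en", "vegetable oil", "palm, sunflower and olive")` gives "palm vegetable oil, sunflower vegetable oil, olive vegetable oil", and `expand_enumeration("hr", "slad", "ječmeni i pšenični")` gives "ječmeni slad, pšenični slad".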
6 changes: 3 additions & 3 deletions taxonomies/additives.txt
@@ -6706,7 +6706,7 @@ es:E300, Ácido ascórbico, Ácido l-ascórbico, Ácido L-ascórbico
et:E300, Askorbiinhape
fi:E300, Askorbiinihappo, L-askorbiinihappo, Askorbiinihappoa, L-askorbiinihappoa, c-vitamiini
fr:E300, Acide ascorbique, Acide L-ascorbique, Acide ascorbique (L-), Acide L(+)-ascorbique, Ascorbate, vitamine c
hr:E300, askorbinska kiselina, l-askorbinska kiselina, askrobinska kiselina
hr:E300, askorbinska kiselina, l-askorbinska kiselina, askrobinska kiselina, askorbinska
hu:E300, Aszkorbinsav, l-aszkorbinsav
it:E300, Acido ascorbico, acido l-ascorbico, Ascorbato
lt:E300, Askorbo rūgštis, l-askorbo rūgštis, Askorbinas
@@ -12060,7 +12060,7 @@ es:E433, Monooleato de sorbitán polioxietilenado, Polioxietilen sorbitan monool
et:E433, Polüoksüetüleen sorbitaanmonooleaat, Polüsorbaat 80
fi:E433, Polyoksyetyleenisorbitaanimono-oleaatti, Polysorbaatti 80, Polyoksyetyleenisorbitaanimono-oleaattia
fr:E433, Monooléate de polyoxyéthylène de sorbitane, polysorbate 80, Polyoxyethylene sorbitan monooleate (polysorbate 80)
hr:E433
hr:E433, polioksietilen sorbitan monooleat
hu:E433, Polioxietilén-szorbitan-monooleát, Poliszorbát 80, Polioxietilén(20)-szorbitán-oleát
it:E433, Monoleato di poliossietilene sorbitano, Polisorbato 80
lt:E433, Polioksietileno sorbitano monooleatas, Polisorbatas 80
@@ -14289,7 +14289,7 @@ es:E471, mono- y diglicéridos de ácidos grasos, Monoglicéridos y diglicérido
et:E471, Rasvhapete mono- ja diglütseriidid, Glütserüülmonostearaat, glütserüülmonopalmitaat, glütserüülmonooleaat, monosteariin, monopalmitiin
fi:E471, Rasvahappojen mono- ja diglyseridit, Glyseryylimonostearaatti, Glyseryylimonopalmitaatti, Glyseryylimono-oleaatti, Monosteariini, Monopalmitiini, Mono-oleiini
fr:E471, Mono- et diglycérides d'acides gras, Mono- et diglycérides d'acides gras alimentaires, Monoglycérides et diglycérides d'acides gras, Mono et diglycérides d'acides gras, Monostéarate de glycérine, Monopalmitate de glycérine, Monooléate de glycérine, Monostéarine, monopalmitine, monooléine, Mono and diglycerides of fatty acids, glyceryl monostearate, glyceryl distearate , Monostéarine
hr:E471, mono- i digliceridi masnih kiselina, mono - i digliceridi masnih kiselina, emulgator mono - i digliceridi masnih kiselina, emuglator e471, emulgator mono i digliceridi masnih kiselina, monogliceridi i digliceridi masnih kiselina e471
hr:E471, mono- i digliceridi masnih kiselina, mono - i digliceridi masnih kiselina, emulgator mono - i digliceridi masnih kiselina, emuglator e471, emulgator mono i digliceridi masnih kiselina, monogliceridi i digliceridi masnih kiselina e471, mono- i diglicerida masnih kiselina, mono - i diglicerida masnih kiselina
hu:E471, Zsírsavak mono- és digliceridjei, Gliceril-monosztearát, Gliceril-monopalmitát, Gliceril-monooleát, Monosztearin, monopalmitin, monoolein
it:E471, Mono- e digliceridi degli acidi grassi, Monostearato di glicerile, monopalmitato di glicerile, monooleato di glicerile, monostearina, monopalmitina, monooleina, mono- e digliceridi degli acidi grassi alimentari
lt:E471, Riebalų rūgščių mono- ir digliceridai, Glicerilmonostearatas, glicerilmonopalmitatas, glicerilmonooleatas, monostearinas, monopalmitinas, monooleinas