Improve rezeptwelt.de recipe parsing #1295

Open
wants to merge 1 commit into
base: main


@wummel wummel commented Oct 16, 2024

This change improves the parser of recipes at rezeptwelt.de:

  • detect ingredient groups
  • support HTML layout for newer recipes, especially for instruction parsing
  • add prep time
  • add equipment entries

@@ -9,19 +25,69 @@ def host(cls):
         return "rezeptwelt.de"

     def site_name(self):
-        raise StaticValueException(return_value="Rezeptwelt")
+        return "Thermomix Rezeptwelt"
Collaborator

Suggested change
-        return "Thermomix Rezeptwelt"
+        raise StaticValueException(return_value="Thermomix Rezeptwelt")

I admit this is a slightly unusual pattern; we use it so that the library's interface can indicate whether a value was retrieved from the source HTML or is a static/constant value returned by the code.
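
To make the pattern concrete, here is a minimal, self-contained sketch of how a caller could tell static values apart from scraped ones. Everything in it is a stand-in written for illustration: the `StaticValueException` class below only mimics the `return_value=` keyword seen in the diff, while `StaticValueWarning`, `call_with_static_notice`, `RezeptweltSketch` and the recipe title are hypothetical names and data, not the library's actual implementation.

```python
import warnings


class StaticValueException(Exception):
    """Stand-in for the library's exception: carries a hard-coded return value."""

    def __init__(self, *, return_value):
        super().__init__(f"static value: {return_value!r}")
        self.return_value = return_value


class StaticValueWarning(UserWarning):
    """Hypothetical warning category telling callers the value is a constant."""


def call_with_static_notice(method):
    """Call a scraper method; unwrap static values and flag them with a warning."""
    try:
        return method()
    except StaticValueException as exc:
        warnings.warn(f"{method.__name__} returned a static value", StaticValueWarning)
        return exc.return_value


class RezeptweltSketch:
    def site_name(self):
        # Constant defined in code, not parsed from the page.
        raise StaticValueException(return_value="Thermomix Rezeptwelt")

    def title(self):
        # Placeholder standing in for a value parsed from the HTML.
        return "Apfelkuchen"


scraper = RezeptweltSketch()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    site = call_with_static_notice(scraper.site_name)
    title = call_with_static_notice(scraper.title)

print(site, title, len(caught))
```

In this sketch both calls return a plain string, but only `site_name` triggers a warning, which is the signal that the value did not come from the page.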

Comment on lines +106 to +111
+    def prep_time(self):
+        tag = self.soup.find(itemprop="performTime", content=nonempty)
+        return get_minutes(tag['content']) if tag else None
+
+    def equipment(self):
+        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)]
Collaborator

BeautifulSoup (bs4 / self.soup) accepts a boolean True value as an attribute filter, which matches any tag that has the attribute (note that, unlike nonempty, this also matches an empty content=""), so I think we can simplify these methods slightly:

Suggested change
-    def prep_time(self):
-        tag = self.soup.find(itemprop="performTime", content=nonempty)
-        return get_minutes(tag['content']) if tag else None
-    def equipment(self):
-        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)]
+    def prep_time(self):
+        tag = self.soup.find(itemprop="performTime", content=True)
+        return get_minutes(tag['content']) if tag else None
+    def equipment(self):
+        return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=True)]
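
A small sketch comparing the two filters may help when deciding between them. It assumes `bs4` is installed and uses the stdlib `html.parser` backend; the HTML snippet and the device name are made up, and the `nonempty` function here is only a guess at what the PR's helper does (its definition is not shown in this thread).

```python
from bs4 import BeautifulSoup

# Made-up markup: one filled attribute, one empty attribute, one missing attribute.
html = """
<meta itemprop="tool" content="Thermomix TM6">
<meta itemprop="tool" content="">
<meta itemprop="tool">
"""
soup = BeautifulSoup(html, "html.parser")


def nonempty(value):
    """Hypothetical stand-in for the PR's helper: accept only non-blank values."""
    return bool(value and value.strip())


# content=True keeps every tag that carries a content attribute at all.
with_true = [t["content"] for t in soup.find_all("meta", itemprop="tool", content=True)]

# A callable filter is applied to the attribute value, so blanks can be rejected.
with_callable = [t["content"] for t in soup.find_all("meta", itemprop="tool", content=nonempty)]

print(with_true)
print(with_callable)
```

With `content=True` the empty `content=""` entry is still returned, whereas the callable filter drops it. If the site never emits empty `content` attributes the simplification is harmless; otherwise `equipment()` could include an empty string and `get_minutes` could receive one.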

Comment on lines +31 to +35
+        tag = self.soup.find("div", itemprop="author")
+        if tag:
+            return normalize_string(tag.get_text())
+        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
+        return normalize_string(tag.get_text())
Collaborator

Some observations here:

  • Retrieving the value from an itemprop="author" attribute is essentially schema.org metadata retrieval; we already have a helper method for that, so let's re-use it here.
  • The information in the viewRecipeAuthor element seems more specific than the schema metadata, which is sometimes generic. So let's prefer viewRecipeAuthor when it is present.

What this leads me to when adapting the code locally is:

Suggested change
-        tag = self.soup.find("div", itemprop="author")
-        if tag:
-            return normalize_string(tag.get_text())
-        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
-        return normalize_string(tag.get_text())
+        name_from_schema = self.schema.author()
+        name_from_hyperlink = None
+        tag = self.soup.find("span", {"id": "viewRecipeAuthor"})
+        if tag:
+            name_from_hyperlink = tag.get_text()
+        return normalize_string(name_from_hyperlink or name_from_schema)

Note: the word von ("by") in some of the test data seems redundant, so we can remove it (these changes affect those values).
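
The fallback order in the suggestion can be tried in isolation with a rough sketch like the one below. The `normalize_string` here is a simplified stand-in (it only collapses whitespace) and `pick_author` is a hypothetical name; the author strings are made up.

```python
def normalize_string(text):
    """Simplified stand-in for the library helper: collapse runs of whitespace."""
    return " ".join(text.split())


def pick_author(name_from_hyperlink, name_from_schema):
    """Prefer the page-specific author; fall back to the schema.org value."""
    # `or` skips both None (element missing) and "" (element present but empty).
    return normalize_string(name_from_hyperlink or name_from_schema)


specific = pick_author("  Thermomix-Fan \n", "Rezeptwelt")
missing = pick_author(None, "Rezeptwelt")
empty = pick_author("", "Rezeptwelt")

print(specific, missing, empty)
```

One caveat of this shape: if both sources are missing, `normalize_string` receives `None`. Whether the real helper tolerates that is not stated in this thread, so a guard may be worth considering.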
