Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ottolenghi scraper #1209

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

ottolenghi scraper #1209

wants to merge 6 commits into from

Conversation

jacksgreen
Copy link
Contributor

No description provided.

@jknndy
Copy link
Collaborator

jknndy commented Aug 12, 2024

Hi @jacksgreen , thanks for the contribution! Just a few things to fix up here

The tests failing require a few fixes

  • - Our test jsons now follow a specific order for the expected keys, which is tested through tests\library\test_json_order.py . There is a helper script to auto-reorder the keys correctly which can be run via python .\scripts\reorder_json_keys.py
  • - You'll need to add a readme entry under the Scrapers available for: header (in alphabetical order). This can be tested via tests\library\test_readme.py

Improvements:

There is additional information available on this page that would be great to include in the output! Some of which may require a custom implementation.

  • - Category: this is available via recipe schema, you'll just need to add a test json entry. "category": "Desserts",
  • - Ingredient_groups: this site, even this specific test recipe, include ingredient_groups which functionality could be added for. This would require an additional test case showing not-grouped ingredients being added.
  • - yields, cook_time & prep_time all of this information is available on the page but will require custom implementations. *cook_time is displayed as 5o on this recipe but this appears to be an error, other recipes list it with a 0.

@jacksgreen
Copy link
Contributor Author

Hi @jacksgreen , thanks for the contribution! Just a few things to fix up here

The tests failing require a few fixes

  • Our test jsons now follow a specific order for the expected keys, which is tested through tests\library\test_json_order.py . There is a helper script to auto-reorder the keys correctly which can be run via python .\scripts\reorder_json_keys.py
  • You'll need to add a readme entry under the Scrapers available for: header (in alphabetical order). This can be tested via tests\library\test_readme.py

Improvements:

There is additional information available on this page that would be great to include in the output! Some of which may require a custom implementation.

  • Category: this is available via recipe schema, you'll just need to add a test json entry. "category": "Desserts",
  • Ingredient_groups: this site, even this specific test recipe, include ingredient_groups which functionality could be added for. This would require an additional test case showing not-grouped ingredients being added.
  • yields, cook_time & prep_time all of this information is available on the page but will require custom implementations. *cook_time is displayed as 5o on this recipe but this appears to be an error, other recipes list it with a 0.

@jknndy Thanks for the feedback, I've updated everything in the pr!

@jknndy
Copy link
Collaborator

jknndy commented Aug 13, 2024

Just one last thing and then it should be good to go!

For sites that have ingredient_groups support we like to have two test cases (one with groupings & one without) included. Usually with the naming convention...
sitename_1.json & testhtml for the case without groupings example page
sitename_2.json & testhmtl for case with. (the existing test data)

@jayaddison
Copy link
Collaborator

A note for anyone who gets a bit confused by this, as I did briefly: we recently added support for books.ottolenghi.co.uk -- but the recipes listed within that website are published distinctly from the recipes on ottolenghi.co.uk. I think the existing scraper-selection code (that uses the SCRAPERS dictionary) should handle this without a problem.

@jacksgreen
Copy link
Contributor Author

Just one last thing and then it should be good to go!

For sites that have ingredient_groups support we like to have two test cases (one with groupings & one without) included. Usually with the naming convention... sitename_1.json & testhtml for the case without groupings example page sitename_2.json & testhmtl for case with. (the existing test data)

Hey @jknndy,
I've added the second test case, let me know if it's all good to merge!

@@ -0,0 +1,69 @@
from ._abstract import AbstractScraper
from ._grouping_utils import group_ingredients

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from ._utils import get_minutes, get_yields

Comment on lines +50 to +69
def yields(self):
return (
self.soup.find("div", class_="c-recipe-header__timings")
.find("span")
.get_text(strip=True)
)

def prep_time(self):
return (
self.soup.find("div", class_="c-recipe-header__timings")
.find_all("span")[1]
.get_text(strip=True)
)

def cook_time(self):
return (
self.soup.find("div", class_="c-recipe-header__timings")
.find_all("span")[2]
.get_text(strip=True)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def yields(self):
return (
self.soup.find("div", class_="c-recipe-header__timings")
.find("span")
.get_text(strip=True)
)
def prep_time(self):
return (
self.soup.find("div", class_="c-recipe-header__timings")
.find_all("span")[1]
.get_text(strip=True)
)
def cook_time(self):
return (
self.soup.find("div", class_="c-recipe-header__timings")
.find_all("span")[2]
.get_text(strip=True)
)
def _extract_timing_elements(self, prefix):
timings = self.soup.find("div", class_="c-recipe-header__timings").find_all("span")
for timing in timings:
if prefix.lower() in timing.get_text().lower():
return timing.get_text()
def yields(self):
yield_text = self._extract_timing_elements("serves")
return get_yields(yield_text)
def prep_time(self):
prep_text = self._extract_timing_elements("prep")
return get_minutes(prep_text)
def cook_time(self):
cook_text = self._extract_timing_elements("cook")
if cook_text and '5o' in cook_text:
cook_text = cook_text.replace('5o', '50')
return get_minutes(cook_text)

What do you think about refactoring this to use a shared helper and using the starting text as the matching parameter instead of position?

Also I implemented two existing utils get_minutes and get_yields to normalize the fields outputs. this will require some changes to the test JSONs

Copy link
Collaborator

@jknndy jknndy Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in cook_time I added coverage for an error on the recipe page where 50 is displayed as 5o

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants