Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for quitoque.fr #1163

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

romainchassaigne
Copy link

Resolved the issue #725

  • Added quitoque.fr scrapper using graphql API.
  • Added tests using legacy tests because it uses an API.

Please let me know if anything is wrong as this is my first pull request to this project.

@romainchassaigne romainchassaigne changed the title feat: add scrapper for quitoque #725 feat: add scrapper for quitoque Jul 2, 2024
Copy link
Collaborator

@jayaddison jayaddison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @romainchassaigne - thank you for developing this!

I notice that the size of the JSON returned from the API is fairly large - more than 10 megabytes. Also, most of the information we'd like to extract from the recipe does appear already in the HTML, without making that second (large) network request.

Perhaps we could retrieve the information from the HTML? Or otherwise, at least reduce the size of the network request and test data?

@romainchassaigne
Copy link
Author

Hi @jayaddison ! Thank you for your reply.

I'll try to reduce the graphql request. Maybe it could be possible to only get the 2 serving (as this is the only information I want) recipe.

@romainchassaigne
Copy link
Author

Hi @jayaddison !

I've successfuly reduced the query from 10MB to approx 302KB. I've refined the graphql query only with the informations I really want. Unfortunnatly I have tried to remove "extensions" from the query, but it seems to be related to the server.

Please let me know if anything needs to be corrected.

@jayaddison jayaddison changed the title feat: add scrapper for quitoque Add support for quitoque.fr Jul 15, 2024
@romainchassaigne
Copy link
Author

Quitoque.fr have changed the way to retrieve the recipes. I need to change everything to scrap the html instead. GraphQL API is not available anymore.

@jayaddison
Copy link
Collaborator

Drats - sorry to hear that, @romainchassaigne. Would you like to hold this pull request open for a while in case that API returns? Thanks for making the suggested adjustments along the way so far (despite it seemingly not working out!).

@romainchassaigne
Copy link
Author

Drats - sorry to hear that, @romainchassaigne. Would you like to hold this pull request open for a while in case that API returns? Thanks for making the suggested adjustments along the way so far (despite it seemingly not working out!).

No I think I'm good to rework using the HTML scrapper, it seems to be simpler than later.

@romainchassaigne
Copy link
Author

romainchassaigne commented Jul 18, 2024

Hi @jayaddison, I'm running into a weird error while running the tests.

======================================================================
FAIL: tests/test_data/quitoque.fr/quitoque.json (tests.RecipeTestCase.tests/test_data/quitoque.fr/quitoque.json) [canonical_url]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/romain.chassaigne/Documents/Perso/recipe-scrapers/recipe-scrapers/tests/__init__.py", line 92, in test_func
    self.assertEqual(
AssertionError: 'https://quitoque.fr/products/cuisse-de-ca[48 chars]3720' != 'quitoque.fr'
- https://quitoque.fr/products/cuisse-de-canard-au-romarin-et-pommes-de-terre-sarladaises-13720
+ quitoque.fr
 : The actual value for .canonical_url() did not match the expected value.

----------------------------------------------------------------------
Ran 1 test in 0.046s

FAILED (failures=1)

This is (partially) my quitoque.json file :

{
  "author": "QuiToque",
  "canonical_url": "https://quitoque.fr/products/cuisse-de-canard-au-romarin-et-pommes-de-terre-sarladaises-13720",
  "site_name": null,
  "host": "quitoque.fr",
  "language": "fr",
  "title": "Cuisse de canard au romarin et pommes de terre sarladaises"
}

It seems that cannonical URL is waiting for host instead of canonical_url. Do you have any clues ? Maybe I should open an issue instead ?

EDIT:
I got it, it is simply because testhtml (and HTML pages) does not have a link with canonical relationship. What shoud I do between thos 2 solutions :

  • Set the canonical_url as quitoque.fr
  • Add manually something like this into the quitoque.testhtml file
<link href="https://www.quitoque.fr/products/cuisse-de-canard-au-romarin-et-pommes-de-terre-sarladaises-13720" rel="canonical">

EDIT2:

I found another way implemented in #1131

@jayaddison
Copy link
Collaborator

Hi again @romainchassaigne!

It seems that cannonical URL is waiting for host instead of canonical_url. Do you have any clues ? Maybe I should open an issue instead ?

EDIT: I got it, it is simply because testhtml (and HTML pages) does not have a link with canonical relationship.

Yep, you figured it out - that's exactly it 👍

If I remember correctly, the library uses the input URL as a fallback for this if it can't be retrieved from the page; but during the test suite, there is no web URL.

I found another way implemented in #1131

Perfect, good find. If this becomes more common across websites, we could move this into a utility function for re-use by others.

I'll review the HTML-based retrieval approach in this pull request soon (either later today or tomorrow) - thanks for implementing it!


def keywords(self):
product_tags = self.soup.find(id="product-tags").find_all(class_="badge")
keywords = ",".join(self._get_text(tag) for tag in product_tags)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure about this -- your thoughts are welcome -- but I think it might be safer to filter out None values here and in similar string-join cases.

I would suggest:

Suggested change
keywords = ",".join(self._get_text(tag) for tag in product_tags)
keywords = ",".join(normalize_string(tag.text) for tag in product_tags if tag and tag.text)

@jayaddison
Copy link
Collaborator

Apologies for the delay @romainchassaigne - the pull request generally looks good to me; thank you for accommodating my code review suggestions. I have one more optional change suggestion, and this should be ready for merge soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants