-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for quitoque.fr #1163
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @romainchassaigne - thank you for developing this!
I notice that the size of the JSON returned from the API is fairly large - more than 10 megabytes. Also, most of the information we'd like to extract from the recipe does appear already in the HTML, without making that second (large) network request.
Perhaps we could retrieve the information from the HTML? Or otherwise, at least reduce the size of the network request and test data?
Hi @jayaddison ! Thank you for your reply. I'll try to reduce the graphql request. Maybe it could be possible to only get the 2 serving (as this is the only information I want) recipe. |
Hi @jayaddison ! I've successfuly reduced the query from 10MB to approx 302KB. I've refined the graphql query only with the informations I really want. Unfortunnatly I have tried to remove "extensions" from the query, but it seems to be related to the server. Please let me know if anything needs to be corrected. |
Co-authored-by: James Addison <[email protected]>
Quitoque.fr have changed the way to retrieve the recipes. I need to change everything to scrap the html instead. GraphQL API is not available anymore. |
Drats - sorry to hear that, @romainchassaigne. Would you like to hold this pull request open for a while in case that API returns? Thanks for making the suggested adjustments along the way so far (despite it seemingly not working out!). |
No I think I'm good to rework using the HTML scrapper, it seems to be simpler than later. |
Hi @jayaddison, I'm running into a weird error while running the tests. ======================================================================
FAIL: tests/test_data/quitoque.fr/quitoque.json (tests.RecipeTestCase.tests/test_data/quitoque.fr/quitoque.json) [canonical_url]
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/romain.chassaigne/Documents/Perso/recipe-scrapers/recipe-scrapers/tests/__init__.py", line 92, in test_func
self.assertEqual(
AssertionError: 'https://quitoque.fr/products/cuisse-de-ca[48 chars]3720' != 'quitoque.fr'
- https://quitoque.fr/products/cuisse-de-canard-au-romarin-et-pommes-de-terre-sarladaises-13720
+ quitoque.fr
: The actual value for .canonical_url() did not match the expected value.
----------------------------------------------------------------------
Ran 1 test in 0.046s
FAILED (failures=1) This is (partially) my {
"author": "QuiToque",
"canonical_url": "https://quitoque.fr/products/cuisse-de-canard-au-romarin-et-pommes-de-terre-sarladaises-13720",
"site_name": null,
"host": "quitoque.fr",
"language": "fr",
"title": "Cuisse de canard au romarin et pommes de terre sarladaises"
} It seems that cannonical URL is waiting for EDIT:
<link href="https://www.quitoque.fr/products/cuisse-de-canard-au-romarin-et-pommes-de-terre-sarladaises-13720" rel="canonical"> EDIT2: I found another way implemented in #1131 |
Hi again @romainchassaigne!
Yep, you figured it out - that's exactly it 👍 If I remember correctly, the library uses the input URL as a fallback for this if it can't be retrieved from the page; but during the test suite, there is no web URL.
Perfect, good find. If this becomes more common across websites, we could move this into a utility function for re-use by others. I'll review the HTML-based retrieval approach in this pull request soon (either later today or tomorrow) - thanks for implementing it! |
Co-authored-by: James Addison <[email protected]>
|
||
def keywords(self): | ||
product_tags = self.soup.find(id="product-tags").find_all(class_="badge") | ||
keywords = ",".join(self._get_text(tag) for tag in product_tags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure about this -- your thoughts are welcome -- but I think it might be safer to filter out None
values here and in similar string-join cases.
I would suggest:
keywords = ",".join(self._get_text(tag) for tag in product_tags) | |
keywords = ",".join(normalize_string(tag.text) for tag in product_tags if tag and tag.text) |
Apologies for the delay @romainchassaigne - the pull request generally looks good to me; thank you for accommodating my code review suggestions. I have one more optional change suggestion, and this should be ready for merge soon. |
Resolved the issue #725
Please let me know if anything is wrong as this is my first pull request to this project.