Scrapegraph returns relative path URLs instead of absolute path **Possible Bug?** #544
Comments
I think that in your case the HTML content returned from the fetch node doesn't contain any links, and the LLM is making up links that resemble the ones in the prompt. Did you try disabling headless mode in the graph config? |
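One way to test this hypothesis is to look at the raw HTML yourself and see whether it contains any links at all. The snippet below is only a rough sketch using the Python standard library; `LinkCollector` and `extract_links` are hypothetical helper names, not part of Scrapegraph, and how you obtain the fetched HTML is left to you.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href or with an empty one.
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


sample = '<p>Intro</p><a href="/contact">Contact</a><a>no href</a>'
print(extract_links(sample))
```

If this returns an empty list for the page in question, the links in the final answer cannot have come from the page content.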
@ekinsenler Tried this with Is this anything to do with |
|
@ekinsenler Hmm... I'm curious - Is this not a functional example yet? There seems to be a |
I don't think |
You're right I had the same issue. I suppose there's not much I can do now other than wait for the developers to push |
I am waiting on a confirmation if devs are already working on the fix of this issue or else I am going to create a pull request. |
@sandeepchittilla @ekinsenler I can confirm that the DeepScraper is broken, and your issues are caused by a problem on our side - I'll give you a better explanation in a few days, and then see what can be done to fix the problem. |
Let's get back to this. Basically, a "deep scraper" with crawling capabilities has been on the roadmap for a while. We had a contributor make a system to implement it, but it was very heavy and slow. It didn't check for loops, it used a Around that time, part of the team started working on a proper deep scraper, with a more modular design that fit better within the existing framework. See #260 for more information. The work was left undone, due to shifting priorities. The The examples for the Thanks for taking interest in this library, and in this topic in particular. We'll leave this issue open for now. |
In my Pydantic model, I add a description which says to use the absolute URLs and it works.
|
@f-aguzzi thank you for taking the time to respond and explain the context! Understood! @datashaman thanks for the response. Just to be clear, you request a response in your Pydantic class format (probably JSON?) and send this as part of the prompt, where you specify the description for the field, yes? Edit: @datashaman additionally, may I ask which model you are calling? |
@sandeepchittilla correct, the field description hints to the LLM that it should use absolute URLs in the response. |
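For readers following along, the workaround described above might look roughly like the sketch below. The class name `ContactPages`, the field name `urls`, and the wording of the description are all assumptions made for illustration; the original model was not posted in this thread, and how the schema is handed to the graph depends on the library version you use.

```python
from typing import List

from pydantic import BaseModel, Field


class ContactPages(BaseModel):
    # Hypothetical schema: the description text is the hint the LLM sees.
    urls: List[str] = Field(
        description=(
            "Absolute URLs (including scheme and domain) of pages that "
            "contain the company's contact details. Never return relative paths."
        )
    )


# The same model can be used to validate the shape of the returned data.
pages = ContactPages(urls=["https://www.some-actual-website.com/contact"])
print(pages.urls)
```

Note that the description is only a hint: it makes absolute URLs more likely, but it does not guarantee them.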
Describe the bug
When using gpt4o as the LLM and scraping a webpage to return a list of links, sometimes the paths returned are:
The behaviour was consistent until 3 days ago i.e. it always returned full paths on a large dataset as well. Since then, I had to uninstall Scrapegraph and reinstall the library and that's when this issue started popping up.
Expected behavior
For example, asking to scrape a website www.some-actual-website.com and return a list of webpages that contain information about the contact details of the company used to consistently return a JSON like:
However, now I get either:
OR
I'm curious , shouldn't the list of URLs being parsed/scraped be a straightforward output? Is the final output always produced by the LLM?
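Whatever the answer, relative paths in the output can be normalized after the fact without involving the LLM. The following is a minimal sketch using only the standard library; `to_absolute` is a hypothetical helper, and the base URL is the placeholder site from the example above with an assumed `https` scheme and `/about/` path.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(base_url, links):
    """Resolve each link against base_url and keep only http(s) results.

    Links that are already absolute pass through unchanged.
    """
    absolute = []
    for link in links:
        resolved = urljoin(base_url, link)
        # Drop things like mailto: or javascript: links.
        if urlparse(resolved).scheme in ("http", "https"):
            absolute.append(resolved)
    return absolute


base = "https://www.some-actual-website.com/about/"
links = ["/contact", "team.html", "https://www.some-actual-website.com/imprint"]
print(to_absolute(base, links))
```

This only repairs links that really are relative paths on the scraped page. It cannot detect links the LLM invented, so checking the results against the links actually present in the HTML is still worthwhile.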
Desktop (please complete the following information):