
Scrapegraph returns relative path URLs instead of absolute path **Possible Bug?** #544

Closed
sandeepchittilla opened this issue Aug 13, 2024 · 12 comments

Comments

@sandeepchittilla
Contributor

sandeepchittilla commented Aug 13, 2024

Describe the bug
When using gpt-4o as the LLM and scraping a webpage to return a list of links, the paths returned are sometimes:

  • relative paths, or
  • full paths with an incorrect prefix/domain, usually "http://example.com"

The behaviour was consistent until three days ago, i.e. it always returned full paths, even on a large dataset. Since then I have had to uninstall and reinstall the Scrapegraph library, and that's when this issue started popping up.

Expected behavior
For example: asking to scrape a website www.some-actual-website.com and return a list of webpages containing the company's contact details used to consistently return a JSON like:

{"list_of_urls": "['www.some-actual-website.com/about','www.some-actual-website.com/contact-us']"}

However, now I get either :

{"list_of_urls": "['https://example.com/about', 'https://example.com/contact-us']"}

OR

{"list_of_urls": "['/about','/contact-us']"}

I'm curious: shouldn't the list of URLs being parsed/scraped be a straightforward output? Is the final output always produced by the LLM? For now I'm working around it with the normalization step sketched below.
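This is just a sketch: it assumes the base URL of the scraped site is known, and it can't fix the hallucinated "example.com" case, only relative paths.

from urllib.parse import urljoin

BASE_URL = "https://www.some-actual-website.com"  # the site being scraped (placeholder)

def to_absolute(urls):
    # urljoin resolves relative paths like "/about" against the base
    # and leaves already-absolute URLs untouched.
    return [urljoin(BASE_URL, u) for u in urls]

print(to_absolute(["/about", "https://www.some-actual-website.com/contact-us"]))
# ['https://www.some-actual-website.com/about', 'https://www.some-actual-website.com/contact-us']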

Desktop (please complete the following information):

  • Ubuntu 22.04
  • Chromium Browser with Playwright
@sandeepchittilla sandeepchittilla changed the title Scrapegraph returns relative path URLs instead of absolute path **Possible Issue** Scrapegraph returns relative path URLs instead of absolute path **Possible Bug?** Aug 13, 2024
@ekinsenler
Contributor

I think in your case the HTML content returned from the fetch node doesn't contain any links, and the LLM is making up random links similar to the prompted ones. Did you try disabling headless mode in the graph config?
graph_config = {
    ...
    "verbose": True,
    "headless": False,
}
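For reference, here is roughly how I'd wire it into the graph (an untested sketch; the API key and model name are placeholders, and the config key format may differ between versions):

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "sk-...",  # placeholder
        "model": "gpt-4o",    # placeholder model name
    },
    "verbose": True,
    "headless": False,  # launches a visible browser window
}

smart_scraper = SmartScraperGraph(
    prompt="Return the list of pages with the company's contact details",
    source="https://www.some-actual-website.com",
    config=graph_config,
)
print(smart_scraper.run())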

@sandeepchittilla
Contributor Author

@ekinsenler Tried this with "headless": False and Xvfb, and it still yields the same results.

Does this have anything to do with SmartScraperGraph scraping only the source URL, and should I instead use DeepScraperGraph, which can go further? Or am I completely off? 🤔

@ekinsenler
Contributor

SmartScraperGraph doesn't search for URLs inside the HTML content, as far as I know. For that purpose I am using DeepScraperGraph, but you need to implement a filter at some level inside the search_link_node to prevent unrelated links from being fetched; a rough sketch follows below.
I am also working on a similar project. DeepScraperGraph doesn't have a parameter to control the max_depth, so scraping doesn't terminate even with filters such as excluding URLs outside the domain or filtering out different language versions of the website.
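The kind of filter I mean looks roughly like this (a standalone sketch, not the actual search_link_node code; the domain and language prefixes are placeholders):

from urllib.parse import urlparse

ALLOWED_DOMAIN = "some-actual-website.com"    # placeholder target domain
EXCLUDED_PREFIXES = ("/fr/", "/de/", "/es/")  # placeholder language paths

def keep_link(url: str) -> bool:
    parsed = urlparse(url)
    # Drop links that leave the target domain.
    if parsed.netloc and not parsed.netloc.endswith(ALLOWED_DOMAIN):
        return False
    # Drop alternate-language versions of the site.
    return not parsed.path.startswith(EXCLUDED_PREFIXES)

candidate_links = [
    "https://some-actual-website.com/contact-us",
    "https://other-site.com/about",
    "https://some-actual-website.com/fr/contact-us",
]
print([u for u in candidate_links if keep_link(u)])
# ['https://some-actual-website.com/contact-us']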

@sandeepchittilla
Contributor Author

@ekinsenler Hmm, I'm curious: is this not a functional example yet? There seems to be a max_depth parameter; however, I am facing errors using this scraper: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/2333b513aafae3c358225a8f82f6c01964c0514e/examples/openai/deep_scraper_openai.py

@ekinsenler
Contributor

I don't think max_depth is functional yet. DeepScraperGraph is also buggy, as it requires you to use an embedder_model that doesn't seem to be functioning inside the code. I fixed the error locally and ran the graph, but it gets lost inside the URL tree, potentially looping between the same URLs.

@sandeepchittilla
Contributor Author

You're right, I had the same issue. I suppose there's not much I can do now other than wait for the developers to push DeepScraperGraph to a release? I'd be really interested if you get this working in your local branch :)

@ekinsenler
Contributor

I am waiting for confirmation on whether the devs are already working on a fix for this issue; otherwise I am going to create a pull request.

@f-aguzzi
Member

@sandeepchittilla @ekinsenler I can confirm that the DeepScraper is broken and that your issues are caused by a problem on our side. I'll give you a better explanation in a few days, and then we'll see what can be done to fix the problem.

@f-aguzzi
Member

Let's get back to this.

Basically, a "deep scraper" with crawling capabilities has been on the roadmap for a while. A contributor built a system to implement it, but it was very heavy and slow: it didn't check for loops, it used a SmartScraperGraph instance on every page, and, more importantly, it introduced a signal-based approach that was hard to parallelize and required significant modifications to the existing graph engine. That design was therefore rejected.

Around that time, part of the team started working on a proper deep scraper, with a more modular design that fit better within the existing framework. See #260 for more information. The work was left unfinished due to shifting priorities. The SearchLinkNode, for example, was intended to be a piece of the DeepScraperGraph pipeline.

The examples for the DeepScraperGraph are still based on deprecated code that relies on a RAG-based approach, which was removed around a month ago. We're currently focusing on cleaning up and refactoring our baseline, so most dead code will soon be removed, along with the broken examples. Hopefully we'll get back to developing the deep scraper, as it would be a killer feature for the library. It's also hard to implement, though, and there's already a lot of work on our plates.

Thanks for taking interest in this library, and in this topic in particular. We'll leave this issue open for now.

@datashaman

In my Pydantic model, I add a description that says to use absolute URLs, and it works.

from pydantic import BaseModel, Field

class Content(BaseModel):
    url: str = Field(description="The absolute URL of the content")
    title: str
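I then pass the model as the graph's schema, roughly like this (a sketch; the prompt, source, and config values are placeholders):

from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="List the pages that contain the company's contact details",
    source="https://www.some-actual-website.com",
    config={"llm": {"api_key": "sk-...", "model": "gpt-4o-mini"}},  # placeholder config
    schema=Content,
)
result = graph.run()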

@sandeepchittilla
Contributor Author

sandeepchittilla commented Aug 21, 2024

@f-aguzzi thank you for taking the time to respond and explain the context! Understood!

@datashaman thanks for the response. Just to be clear: you request a response in your Pydantic class format (presumably JSON?) and send this as part of the prompt, where you specify the description for the field, yes?

edit: @datashaman additionally, may I ask which model you are calling?

@datashaman

@sandeepchittilla correct, the field description hints to the LLM that it should use absolute URLs in the response.
This was using the smart graph (not deep) with gpt-4o-mini as the LLM.
