
Scrapegraph returns relative path URLs instead of absolute path **Possible Bug?** #544

Closed
sandeepchittilla opened this issue Aug 13, 2024 · 12 comments

Comments

@sandeepchittilla
Contributor

sandeepchittilla commented Aug 13, 2024

Describe the bug
When using gpt-4o as the LLM and scraping a webpage to return a list of links, the paths returned are sometimes:

  • relative paths, or
  • full paths with an incorrect prefix/domain, usually "http://example.com"

The behaviour was consistent until three days ago, i.e. it always returned full paths, even on a large dataset. Since then I have had to uninstall and reinstall the Scrapegraph library, and that's when this issue started popping up.

Expected behavior
For example: asking to scrape a website www.some-actual-website.com and return a list of webpages containing the company's contact details used to consistently return a JSON like:

{"list_of_urls": "['www.some-actual-website.com/about','www.some-actual-website.com/contact-us']"}

However, now I get either :

{"list_of_urls": "['https://example.com/about', 'https://example.com/contact-us']"}

OR

{"list_of_urls": "['/about','/contact-us']"}

I'm curious: shouldn't the list of URLs being parsed/scraped be a straightforward output? Is the final output always produced by the LLM? For now I'm working around it with the normalization step sketched below.
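This is just a sketch: it assumes the base URL of the scraped site is known, and it can't fix the hallucinated "example.com" case, only relative paths.

from urllib.parse import urljoin

BASE_URL = "https://www.some-actual-website.com"  # the site being scraped (placeholder)

def to_absolute(urls):
    # urljoin resolves relative paths like "/about" against the base
    # and leaves already-absolute URLs untouched.
    return [urljoin(BASE_URL, u) for u in urls]

print(to_absolute(["/about", "https://www.some-actual-website.com/contact-us"]))
# ['https://www.some-actual-website.com/about', 'https://www.some-actual-website.com/contact-us']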

Desktop (please complete the following information):

  • Ubuntu 22.04
  • Chromium Browser with Playwright
@sandeepchittilla sandeepchittilla changed the title Scrapegraph returns relative path URLs instead of absolute path **Possible Issue** Scrapegraph returns relative path URLs instead of absolute path **Possible Bug?** Aug 13, 2024
@ekinsenler
Contributor

I think in your case the HTML content returned from the fetch node doesn't contain any links, and the LLM is making up random links similar to the prompted ones. Did you try disabling headless mode in the graph config?
graph_config = {
    ...
    "verbose": True,
    "headless": False,
}
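For reference, here is roughly how I'd wire it into the graph (an untested sketch; the API key and model name are placeholders, and the config key format may differ between versions):

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "sk-...",  # placeholder
        "model": "gpt-4o",    # placeholder model name
    },
    "verbose": True,
    "headless": False,  # launches a visible browser window
}

smart_scraper = SmartScraperGraph(
    prompt="Return the list of pages with the company's contact details",
    source="https://www.some-actual-website.com",
    config=graph_config,
)
print(smart_scraper.run())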

@sandeepchittilla
Contributor Author

@ekinsenler Tried this with "headless": False and Xvfb, and it still yields the same results.

Does this have anything to do with SmartScraperGraph scraping only the source URL, and should I instead use DeepScraperGraph, which can go further? Or am I completely off? 🤔

@ekinsenler
Contributor

SmartScraperGraph doesn't search for URLs inside the HTML content, as far as I know. For that purpose I am using DeepScraperGraph, but you need to implement a filter at some level inside the search_link_node to prevent unrelated links from being fetched; a rough sketch follows below.
I am also working on a similar project. DeepScraperGraph doesn't have a parameter to control the max_depth, so scraping doesn't terminate even with filters such as excluding URLs outside the domain or filtering out different language versions of the website.
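The kind of filter I mean looks roughly like this (a standalone sketch, not the actual search_link_node code; the domain and language prefixes are placeholders):

from urllib.parse import urlparse

ALLOWED_DOMAIN = "some-actual-website.com"    # placeholder target domain
EXCLUDED_PREFIXES = ("/fr/", "/de/", "/es/")  # placeholder language paths

def keep_link(url: str) -> bool:
    parsed = urlparse(url)
    # Drop links that leave the target domain.
    if parsed.netloc and not parsed.netloc.endswith(ALLOWED_DOMAIN):
        return False
    # Drop alternate-language versions of the site.
    return not parsed.path.startswith(EXCLUDED_PREFIXES)

candidate_links = [
    "https://some-actual-website.com/contact-us",
    "https://other-site.com/about",
    "https://some-actual-website.com/fr/contact-us",
]
print([u for u in candidate_links if keep_link(u)])
# ['https://some-actual-website.com/contact-us']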

@sandeepchittilla
Contributor Author

@ekinsenler Hmm, I'm curious: is this not a functional example yet? There seems to be a max_depth parameter; however, I am facing errors using this scraper: https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/2333b513aafae3c358225a8f82f6c01964c0514e/examples/openai/deep_scraper_openai.py

@ekinsenler
Contributor

I don't think max_depth is functional yet. DeepScraperGraph is also buggy, as it requires you to use an embedder_model that doesn't seem to be functioning inside the code. I fixed the error locally and ran the graph, but it gets lost inside the URL tree, potentially looping between the same URLs.

@sandeepchittilla
Contributor Author

You're right, I had the same issue. I suppose there's not much I can do now other than wait for the developers to push DeepScraperGraph to a release? I'd be really interested if you get this working in your local branch :)

@ekinsenler
Contributor

I am waiting for confirmation on whether the devs are already working on a fix for this issue; otherwise I am going to create a pull request.

@f-aguzzi
Member

@sandeepchittilla @ekinsenler I can confirm that the DeepScraper is broken and that your issues are caused by a problem on our side. I'll give you a better explanation in a few days, and then we'll see what can be done to fix the problem.

@f-aguzzi
Member

Let's get back to this.

Basically, a "deep scraper" with crawling capabilities has been on the roadmap for a while. A contributor built a system to implement it, but it was very heavy and slow: it didn't check for loops, it used a SmartScraperGraph instance on every page, and, more importantly, it introduced a signal-based approach that was hard to parallelize and required significant modifications to the existing graph engine. That design was therefore rejected.

Around that time, part of the team started working on a proper deep scraper, with a more modular design that fit better within the existing framework. See #260 for more information. The work was left unfinished due to shifting priorities. The SearchLinkNode, for example, was intended to be a piece of the DeepScraperGraph pipeline.

The examples for the DeepScraperGraph are still based on deprecated code that relies on a RAG-based approach, which was removed around a month ago. We're currently focusing on cleaning up and refactoring our baseline, so most dead code will soon be removed, along with the broken examples. Hopefully we'll get back to developing the deep scraper, as it would be a killer feature for the library. It's also hard to implement, though, and there's already a lot of work on our plates.

Thanks for taking interest in this library, and in this topic in particular. We'll leave this issue open for now.

@datashaman

In my Pydantic model, I add a description that says to use absolute URLs, and it works.

from pydantic import BaseModel, Field

class Content(BaseModel):
    url: str = Field(description="The absolute URL of the content")
    title: str
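I then pass the model as the graph's schema, roughly like this (a sketch; the prompt, source, and config values are placeholders):

from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="List the pages that contain the company's contact details",
    source="https://www.some-actual-website.com",
    config={"llm": {"api_key": "sk-...", "model": "gpt-4o-mini"}},  # placeholder config
    schema=Content,
)
result = graph.run()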

@sandeepchittilla
Contributor Author

sandeepchittilla commented Aug 21, 2024

@f-aguzzi thank you for taking the time to respond and explain the context! Understood!

@datashaman thanks for the response. Just to be clear: you request a response in your Pydantic class format (presumably JSON?) and send this as part of the prompt, where you specify the description for the field, yes?

edit: @datashaman additionally, may I ask which model you are calling?

@datashaman

@sandeepchittilla correct, the field description hints to the LLM that it should use absolute URLs in the response.
This was using the smart graph (not deep) with gpt-4o-mini as the LLM.
