v1.15.0: OmniScraperGraph not working: Error parsing input keys for ImageToText #580

LorenzoPaleari · 2024-08-24T13:14:08Z

Describe the bug
OmniScraperGraph throws an error. Tested with the minimal example from GitHub:
omni_scraper_openai.py

To Reproduce

mkdir test
cd test
python3 -m venv venv
source venv/bin/activate

pip install scrapegraphai \
    "scrapegraphai[burr]" \
    "scrapegraphai[more-browser-options]" \
    "scrapegraphai[other-language-models]" \
    langchain_google_vertexai --no-cache     # to have a clean environment
# It does not start without all of these libraries. This is potentially a bug in itself.
playwright install

# Using the provided example for openai found on GitHub
# Set up openai key in .env
python3 omni_scraper_openai.py

Output

--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---
--- Executing Parse Node ---
--- Executing ImageToText Node ---
Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/base_node.py", line 112, in get_input_keys
    input_keys = self._parse_input_keys(state, self.input)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/base_node.py", line 236, in _parse_input_keys
    raise ValueError("No state keys matched the expression.")
ValueError: No state keys matched the expression.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/test.py", line 42, in <module>
    result = omni_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/omni_scraper_graph.py", line 124, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard
    raise e
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/image_to_text_node.py", line 54, in execute
    input_keys = self.get_input_keys(state)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/base_node.py", line 116, in get_input_keys
    raise ValueError(f"Error parsing input keys for {self.node_name}: {str(e)}")
ValueError: Error parsing input keys for ImageToText: No state keys matched the expression.

Adding burr arguments to graph config:

"burr_kwargs": {
        "project_name": "test-scraper",
        "app_instance_id":"1234",
    }

Starting action: Fetch
--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---

********************************************************************************
-------------------------------------------------------------------
Oh no an error! Need help with Burr?
Join our discord and ask for help! https://discord.gg/4FxBMyzW5n
-------------------------------------------------------------------
> Action: `Fetch` encountered an error!<
> State (at time of action):
{'__SEQUENCE_ID': 0,
 'url': 'https://perinim.github.io/projects/',
 'user_prompt': "'List me all the projects with their titles and im..."}
> Inputs (at time of action):
{}
********************************************************************************
Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 561, in _step
    new_state = _run_reducer(next_action, self._state, result, next_action.name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 199, in _run_reducer
    _validate_reducer_writes(reducer, new_state, name)
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 174, in _validate_reducer_writes
    raise ValueError(
ValueError: State is missing write keys after running: Fetch. Missing keys are: {'link_urls', 'img_urls'}. Has writes: ['doc', 'link_urls', 'img_urls']
Finishing action: Fetch
Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/test.py", line 46, in <module>
    result = omni_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/omni_scraper_graph.py", line 124, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 260, in execute
    result = bridge.execute(initial_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/integrations/burr_bridge.py", line 215, in execute
    last_action, result, final_state = self.burr_app.run(
                                       ^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/telemetry.py", line 273, in wrapped_fn
    return call_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 893, in run
    next(gen)
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 838, in iterate
    prior_action, result, state = self.step(inputs=inputs)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 515, in step
    out = self._step(inputs=inputs, _run_hooks=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 568, in _step
    raise e
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 561, in _step
    new_state = _run_reducer(next_action, self._state, result, next_action.name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 199, in _run_reducer
    _validate_reducer_writes(reducer, new_state, name)
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 174, in _validate_reducer_writes
    raise ValueError(
ValueError: State is missing write keys after running: Fetch. Missing keys are: {'link_urls', 'img_urls'}. Has writes: ['doc', 'link_urls', 'img_urls']

Desktop

OS: macOS (Intel)
ScrapeGraphAI version: 1.13.3 / 1.14.0 / 1.14.1 / 1.15.0 (tested on all)
Python: 3.11.9 (not 3.12, see "Package google-crc32c does not support Python 12" #568)


skrawcz · 2024-08-28T19:54:30Z

CC @elijahbenizzy

VinciGit00 · 2024-08-28T20:00:14Z

@skrawcz will you do it?

elijahbenizzy · 2024-08-28T21:07:39Z

Yes, we will take a look shortly. Thanks for bringing this up!

elijahbenizzy · 2024-08-29T20:26:12Z

Hey! I thought this was a Burr error, but it looks like it is an error in the workflow itself: the problem is that not all the state items are written.

I'm not sure I'm the best person to debug this (I don't have full context), but I did open the PR linked below to make the error easier to debug and read. In the Burr case it's fairly clear what's happening; with this fix, the non-Burr case displays the same information.

Thoughts on how to proceed? Did something change recently?

https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/611/files
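
For reference, the kind of check that surfaces "missing write keys" can be sketched roughly as below. This is a minimal illustration only, not Burr's or ScrapeGraphAI's actual code: find_missing_writes and validate_writes are hypothetical names, and the prompt text is a placeholder.

```python
def find_missing_writes(declared_writes, new_state):
    """Return the declared output keys that are absent from the state."""
    return {key for key in declared_writes if key not in new_state}


def validate_writes(node_name, declared_writes, new_state):
    """Raise if a node did not write every key it declared as output."""
    missing = find_missing_writes(declared_writes, new_state)
    if missing:
        raise ValueError(
            f"State is missing write keys after running: {node_name}. "
            f"Missing keys are: {sorted(missing)}. "
            f"Has writes: {list(declared_writes)}"
        )


# State as it might look after a Fetch step that only produced "doc".
state_after_fetch = {
    "url": "https://perinim.github.io/projects/",
    "user_prompt": "example prompt",  # placeholder, not the real prompt
    "doc": ["<html>...</html>"],
}
declared = ["doc", "link_urls", "img_urls"]

try:
    validate_writes("Fetch", declared, state_after_fetch)
except ValueError as err:
    print(err)
```

Running the same check in the non-Burr path is one way to get the names of the missing keys into the error message.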

LorenzoPaleari · 2024-08-29T22:49:54Z

Hi, I do not know whether this is helpful.

The error I encountered occurs in the "standard" ScrapeGraphAI mode, without Burr. When I first saw the error I tried to work out whether the problem was on my side, so I tried Burr to see if it would tell me more. Unfortunately, Burr also raised an error.

It seemed to me that Burr was raising its error because the underlying structure of OmniScraper was not working properly. I included both logs in my issue simply because the Burr error felt much more detailed than the generic one thrown by ScrapeGraphAI, and I thought that might help in locating the bug.

elijahbenizzy · 2024-08-29T23:42:39Z

Got it! Yeah I think that Burr was slightly more helpful (it showed the keys that were missing), but I added that back to the core library too :)

LorenzoPaleari · 2024-08-30T15:22:16Z

I dug down and found the issue.

Looking at omni_scraper_graph.py (line 59), it is easy to see how the graph gets created:

def _create_graph(self) -> BaseGraph:
        """
        Creates the graph of nodes representing the workflow for web scraping.

        Returns:
            BaseGraph: A graph instance representing the web scraping workflow.
        """
        fetch_node = FetchNode(
            input="url | local_dir",
            output=["doc", "link_urls", "img_urls"],
            node_config={
                "loader_kwargs": self.config.get("loader_kwargs", {}),
            }
        )
        parse_node = ParseNode(
            input="doc",
            output=["parsed_doc"],
            node_config={
                "chunk_size": self.model_token
            }
        )
        image_to_text_node = ImageToTextNode(
            input="img_urls",
            output=["img_desc"],
            node_config={
                "llm_model": OpenAIImageToText(self.config["llm"]),
                "max_images": self.max_images
            }
        )

        generate_answer_omni_node = GenerateAnswerOmniNode(
            input="user_prompt & (relevant_chunks | parsed_doc | doc) & img_desc",
            output=["answer"],
            node_config={
                "llm_model": self.llm_model,
                "additional_info": self.config.get("additional_info"),
                "schema": self.schema
            }
        )

        return BaseGraph(
            nodes=[
                fetch_node,
                parse_node,
                image_to_text_node,
                generate_answer_omni_node,
            ],
            edges=[
                (fetch_node, parse_node),
                (parse_node, image_to_text_node),
                (image_to_text_node, generate_answer_omni_node)
            ],
            entry_point=fetch_node,
            graph_name=self.__class__.__name__
        )

Possible First Issue (Minor)
I think the Fetch node is missing some configuration that it needs to work properly. Proposed changes:

fetch_node = FetchNode(
    input="url | local_dir",
    output=["doc", "link_urls", "img_urls"],
    node_config={
        "llm_model": self.llm_model,
        "force": self.config.get("force", False),
        "cut": self.config.get("cut", True),
        "loader_kwargs": self.config.get("loader_kwargs", {}),
        "browser_base": self.config.get("browser_base")
    }
)

MAIN ISSUE
Starting from FetchNode and passing through ParseNode and ImageToTextNode, OmniScraperGraph should gather all the information needed to generate an answer.

Judging by the input/output parameters alone, it all makes sense in theory: FetchNode outputs doc, link_urls and img_urls, and ParseNode and ImageToTextNode then use this output to produce better data.

The issue is in FetchNode: it actually outputs only doc and nothing else. That causes ImageToTextNode to fail, since the state does not contain the requested img_urls. It also causes Burr to fail, since the node "promises" doc, link_urls and img_urls but generates just doc.
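
To make the failure concrete, here is a small self-contained sketch that walks the four nodes in order and reports the first one whose input expression cannot be satisfied. It is a simplified model, not the library's code: expression_matches and first_failing_node are hypothetical helpers, the expressions are treated as plain boolean conditions on "is this key in the state", and the prompt is a placeholder.

```python
import re


def expression_matches(expression, state):
    """Evaluate an input expression such as "a & (b | c)" against state keys.

    Each key name becomes True if it is present in the state, else False;
    "&" and "|" then act as the usual boolean operators.
    """
    as_booleans = re.sub(
        r"[A-Za-z_]\w*",
        lambda match: str(match.group(0) in state),
        expression,
    )
    # Only True/False, "&", "|", parentheses and spaces remain at this point.
    return bool(eval(as_booleans, {"__builtins__": {}}))


# (name, input expression, keys the node actually writes)
NODES = [
    # Fetch declares doc, link_urls and img_urls but writes only doc.
    ("Fetch", "url | local_dir", ["doc"]),
    ("Parse", "doc", ["parsed_doc"]),
    ("ImageToText", "img_urls", ["img_desc"]),
    (
        "GenerateAnswerOmni",
        "user_prompt & (relevant_chunks | parsed_doc | doc) & img_desc",
        ["answer"],
    ),
]


def first_failing_node(nodes, initial_state):
    """Walk the nodes in order; return the first whose input is unsatisfied."""
    state = dict(initial_state)
    for name, expression, written_keys in nodes:
        if not expression_matches(expression, state):
            return name
        for key in written_keys:
            state[key] = f"<{key}>"
    return None


initial = {"url": "https://perinim.github.io/projects/", "user_prompt": "example prompt"}
print(first_failing_node(NODES, initial))  # prints ImageToText
```

If the Fetch entry is changed to write all three declared keys, the same walk completes without a failing node.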

Proposed Solution
It should be fairly simple to add a small URL-parsing step. helpers/default_filters already contains a list of image extensions that can be used to tell image URLs apart from link URLs.

I do not know whether you want this feature implemented in ParseNode or in FetchNode, so I have not pushed a fix yet. With a very simple patch that I put together for testing, which adds the small function described here, it should start working again.
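
A minimal sketch of what such a function could look like, using only the standard library. Everything here is an assumption for illustration: IMAGE_EXTENSIONS stands in for the list in helpers/default_filters (whose exact contents are not shown in this issue), and split_urls is a hypothetical name, not the patch itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed stand-in for the extension list mentioned above.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


class _UrlCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def split_urls(html, base_url):
    """Return (link_urls, img_urls) found in html, resolved against base_url."""
    collector = _UrlCollector()
    collector.feed(html)
    link_urls, img_urls = [], []
    for raw in collector.urls:
        absolute = urljoin(base_url, raw)
        path = urlparse(absolute).path.lower()
        # Classify by file extension; skip duplicates within each list.
        target = img_urls if path.endswith(IMAGE_EXTENSIONS) else link_urls
        if absolute not in target:
            target.append(absolute)
    return link_urls, img_urls


html = '<a href="/about">About</a><img src="img/logo.PNG"><a href="photo.jpg">Photo</a>'
links, images = split_urls(html, "https://example.com/projects/")
print(links)
print(images)
```

Classifying by extension is cheap but approximate: image URLs without a file extension would land in the link list.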

VinciGit00 · 2024-08-30T15:24:50Z

Yes please, can you fix it and open a pull request?

LorenzoPaleari · 2024-08-31T04:07:16Z

Should I add the function on FetchNode or on ParseNode?

VinciGit00 · 2024-08-31T06:33:06Z

Parse

VinciGit00 · 2024-09-01T10:24:42Z

Hi @LorenzoPaleari can you update and say if everything is ok?

LorenzoPaleari · 2024-09-03T11:20:14Z

I pushed the changes and it should work, although beta5 is broken; I'm opening an issue for that.

VinciGit00 · 2024-09-10T07:46:30Z

The beta is stable now, can you try again?

LorenzoPaleari · 2024-09-12T20:09:17Z

Sorry, I've been busy the last few days.

v1.18.3 - Does not have the fixes I made
v1.19.0-beta1 - Has the changes I made, WORKING
v1.19.0-beta2+ - NOT WORKING - The changes I made were removed while fixing another bug with URLs.

ParseNode needs to be able to extract the link URLs and image URLs from the document and update the state with these values.
With my changes, creating the parse node with the flag parse_url = True activated link scraping, and the state was updated correctly, allowing the ImageToText node to execute flawlessly.

I put the behaviour behind a flag because I did not know whether this function would be used anywhere other than OmniScraperGraph; it can always be used when creating a CustomGraph.

In 1.19.0-beta2 my fix was removed. The code I added did not change the normal execution flow of ParseNode in any way: it just called a couple of functions to extract the links, and then document parsing proceeded as before. The only difference was that, before parsing, I had extracted the links to update the state with.
My code was removed while resolving #637 in merge #648, probably because my URL extractor looked related to the problem of URLs not being extracted. They are actually two distinct features that should work well together.

My change only updates the state so that the image links and URL links are passed correctly to every node called later. The issue, by contrast, was about GPT (or any LLM) not being able to extract URLs from the parsed document itself, which I did not change.
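
As an illustration of why the flag is harmless, here is a toy model of the behaviour described above. ToyParseNode is a hypothetical stand-in, not the library's ParseNode: the regex extraction and fixed-size chunking are simplifications, and only the parse_url flag name comes from my changes.

```python
import re


class ToyParseNode:
    """Illustrative stand-in for a parse node; not the library's ParseNode."""

    def __init__(self, parse_url=False, chunk_size=20):
        self.parse_url = parse_url
        self.chunk_size = chunk_size

    def execute(self, state):
        doc = state["doc"]
        if self.parse_url:
            # Extra step: record URLs in the state before parsing starts.
            urls = re.findall(r'(?:href|src)="([^"]+)"', doc)
            image_suffixes = (".png", ".jpg", ".jpeg", ".gif")
            state["img_urls"] = [u for u in urls if u.lower().endswith(image_suffixes)]
            state["link_urls"] = [u for u in urls if u not in state["img_urls"]]
        # Unchanged step: split the document into fixed-size chunks.
        state["parsed_doc"] = [
            doc[i:i + self.chunk_size] for i in range(0, len(doc), self.chunk_size)
        ]
        return state


doc = '<a href="https://example.com/a">A</a><img src="https://example.com/b.png">'
without_flag = ToyParseNode(parse_url=False).execute({"doc": doc})
with_flag = ToyParseNode(parse_url=True).execute({"doc": doc})

# The parsed output is identical; the flag only adds two state keys.
print(without_flag["parsed_doc"] == with_flag["parsed_doc"])
print(sorted(set(with_flag) - set(without_flag)))
```

With the flag off the state gains only parsed_doc, so existing graphs are unaffected; with it on, a later node that reads img_urls finds the key it needs.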

VinciGit00 · 2024-09-12T21:29:54Z

Hi @LorenzoPaleari, sorry for what I've done.
Can you re-upload your lines of code?

LorenzoPaleari · 2024-09-13T00:20:39Z

@VinciGit00
No problem at all, just wanted to provide more info on why that part of code is harmless but useful for OmniScraper.

Pull Request
#662

VinciGit00 · 2024-09-13T07:06:47Z

Hi, please update to the new version.

LorenzoPaleari changed the title ~~OmniScraperGraph not working: Error parsing input keys for ImageToText~~ v1.14.0/1: OmniScraperGraph not working: Error parsing input keys for ImageToText Aug 26, 2024

LorenzoPaleari changed the title ~~v1.14.0/1: OmniScraperGraph not working: Error parsing input keys for ImageToText~~ v1.15.0: OmniScraperGraph not working: Error parsing input keys for ImageToText Aug 26, 2024

VinciGit00 assigned skrawcz Aug 28, 2024

VinciGit00 assigned elijahbenizzy Aug 28, 2024

elijahbenizzy mentioned this issue Aug 29, 2024 in "Burr update" #570 (Open)

VinciGit00 closed this as completed Sep 13, 2024
