v1.15.0: OmniScraperGraph not working: Error parsing input keys for ImageToText #580

Closed
LorenzoPaleari opened this issue Aug 24, 2024 · 17 comments
@LorenzoPaleari
Contributor

LorenzoPaleari commented Aug 24, 2024

Describe the bug
OmniScraperGraph throws error. Tested on minimal example on GitHub.
omni_scraper_openai.py

To Reproduce

mkdir test
cd test
python3 -m venv venv
source venv/bin/activate
pip install scrapegraphai \
    "scrapegraphai[burr]" \
    "scrapegraphai[more-browser-options]" \
    "scrapegraphai[other-language-models]" \
    langchain_google_vertexai --no-cache-dir     # to have a clean environment
# It does not start without all of these libraries. This is potentially a bug in itself.
playwright install

# Using the provided example for openai found on GitHub
# Set up openai key in .env
python3 omni_scraper_openai.py

Output

--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---
--- Executing Parse Node ---
--- Executing ImageToText Node ---
Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/base_node.py", line 112, in get_input_keys
    input_keys = self._parse_input_keys(state, self.input)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/base_node.py", line 236, in _parse_input_keys
    raise ValueError("No state keys matched the expression.")
ValueError: No state keys matched the expression.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/test.py", line 42, in <module>
    result = omni_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/omni_scraper_graph.py", line 124, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 263, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 185, in _execute_standard
    raise e
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 169, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/image_to_text_node.py", line 54, in execute
    input_keys = self.get_input_keys(state)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/nodes/base_node.py", line 116, in get_input_keys
    raise ValueError(f"Error parsing input keys for {self.node_name}: {str(e)}")
ValueError: Error parsing input keys for ImageToText: No state keys matched the expression.

Adding burr arguments to graph config:

"burr_kwargs": {
        "project_name": "test-scraper",
        "app_instance_id":"1234",
    }
Starting action: Fetch
--- Executing Fetch Node ---
--- (Fetching HTML from: https://perinim.github.io/projects/) ---

********************************************************************************
-------------------------------------------------------------------
Oh no an error! Need help with Burr?
Join our discord and ask for help! https://discord.gg/4FxBMyzW5n
-------------------------------------------------------------------
> Action: `Fetch` encountered an error!<
> State (at time of action):
{'__SEQUENCE_ID': 0,
 'url': 'https://perinim.github.io/projects/',
 'user_prompt': "'List me all the projects with their titles and im..."}
> Inputs (at time of action):
{}
********************************************************************************
Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 561, in _step
    new_state = _run_reducer(next_action, self._state, result, next_action.name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 199, in _run_reducer
    _validate_reducer_writes(reducer, new_state, name)
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 174, in _validate_reducer_writes
    raise ValueError(
ValueError: State is missing write keys after running: Fetch. Missing keys are: {'link_urls', 'img_urls'}. Has writes: ['doc', 'link_urls', 'img_urls']
Finishing action: Fetch
Traceback (most recent call last):
  File "/Users/lollo/Desktop/test/test.py", line 46, in <module>
    result = omni_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/omni_scraper_graph.py", line 124, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/graphs/base_graph.py", line 260, in execute
    result = bridge.execute(initial_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/scrapegraphai/integrations/burr_bridge.py", line 215, in execute
    last_action, result, final_state = self.burr_app.run(
                                       ^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/telemetry.py", line 273, in wrapped_fn
    return call_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 893, in run
    next(gen)
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 838, in iterate
    prior_action, result, state = self.step(inputs=inputs)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 515, in step
    out = self._step(inputs=inputs, _run_hooks=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 568, in _step
    raise e
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 561, in _step
    new_state = _run_reducer(next_action, self._state, result, next_action.name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 199, in _run_reducer
    _validate_reducer_writes(reducer, new_state, name)
  File "/Users/lollo/Desktop/test/venv/lib/python3.11/site-packages/burr/core/application.py", line 174, in _validate_reducer_writes
    raise ValueError(
ValueError: State is missing write keys after running: Fetch. Missing keys are: {'link_urls', 'img_urls'}. Has writes: ['doc', 'link_urls', 'img_urls']

Desktop

@LorenzoPaleari LorenzoPaleari changed the title OmniScraperGraph not working: Error parsing input keys for ImageToText v1.14.0/1: OmniScraperGraph not working: Error parsing input keys for ImageToText Aug 26, 2024
@LorenzoPaleari LorenzoPaleari changed the title v1.14.0/1: OmniScraperGraph not working: Error parsing input keys for ImageToText v1.15.0: OmniScraperGraph not working: Error parsing input keys for ImageToText Aug 26, 2024
@skrawcz
Contributor

skrawcz commented Aug 28, 2024

CC @elijahbenizzy

@VinciGit00
Collaborator

@skrawcz will you do it?

@elijahbenizzy
Contributor

> @skrawcz will you do it?

Yes, we will take a look shortly. Thanks for bringing up!

@elijahbenizzy
Contributor

Hey! I thought this was a Burr error, but it looks like this is an error with the workflow. It looks like the problem is that not all the state items are written.

So, I'm not sure I'm the best to debug this (I don't have full context), but I did create this to make it easier to debug/read. In the burr case, it's fairly clear what's happening, the non-burr case (with this fix) displays the same information.

Thoughts on how to proceed? Did something change recently?

https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/611/files

@LorenzoPaleari
Contributor Author

Hi, I do not know if this is helpful or not.

The error I encountered occurs in the "standard" ScrapeGraphAI mode, without Burr. When I first saw the error, I tried to work out whether the problem was on my side, so I tried Burr to see if I could figure something out. Unfortunately, Burr also raised an error.

It seemed to me that Burr was raising an error because the underlying structure of OmniScraper was not working properly. I included both logs in the issue simply because the Burr error felt much more detailed than the generic error thrown by ScrapeGraphAI, and I thought it could help in figuring out where the bug is.

@elijahbenizzy
Contributor

> Hi, I do not know if this is helpful or not.
>
> The error I encountered occurs in the "standard" ScrapeGraphAI mode, without Burr. When I first saw the error, I tried to work out whether the problem was on my side, so I tried Burr to see if I could figure something out. Unfortunately, Burr also raised an error.
>
> It seemed to me that Burr was raising an error because the underlying structure of OmniScraper was not working properly. I included both logs in the issue simply because the Burr error felt much more detailed than the generic error thrown by ScrapeGraphAI, and I thought it could help in figuring out where the bug is.

Got it! Yeah I think that Burr was slightly more helpful (it showed the keys that were missing), but I added that back to the core library too :)

@LorenzoPaleari
Contributor Author

I dug in and found the issue.

Looking at omni_scraper_graph.py#L59, it is easy to see how the BaseGraph gets created:

def _create_graph(self) -> BaseGraph:
        """
        Creates the graph of nodes representing the workflow for web scraping.

        Returns:
            BaseGraph: A graph instance representing the web scraping workflow.
        """
        fetch_node = FetchNode(
            input="url | local_dir",
            output=["doc", "link_urls", "img_urls"],
            node_config={
                "loader_kwargs": self.config.get("loader_kwargs", {}),
            }
        )
        parse_node = ParseNode(
            input="doc",
            output=["parsed_doc"],
            node_config={
                "chunk_size": self.model_token
            }
        )
        image_to_text_node = ImageToTextNode(
            input="img_urls",
            output=["img_desc"],
            node_config={
                "llm_model": OpenAIImageToText(self.config["llm"]),
                "max_images": self.max_images
            }
        )

        generate_answer_omni_node = GenerateAnswerOmniNode(
            input="user_prompt & (relevant_chunks | parsed_doc | doc) & img_desc",
            output=["answer"],
            node_config={
                "llm_model": self.llm_model,
                "additional_info": self.config.get("additional_info"),
                "schema": self.schema
            }
        )

        return BaseGraph(
            nodes=[
                fetch_node,
                parse_node,
                image_to_text_node,
                generate_answer_omni_node,
            ],
            edges=[
                (fetch_node, parse_node),
                (parse_node, image_to_text_node),
                (image_to_text_node, generate_answer_omni_node)
            ],
            entry_point=fetch_node,
            graph_name=self.__class__.__name__
        )

Possible First Issue (Minor)
I think the Fetch node is missing some configurations to work properly, proposed changes:

fetch_node = FetchNode(
            input="url | local_dir",
            output=["doc", "link_urls", "img_urls"],
            node_config={
                "llm_model": self.llm_model,
                "force": self.config.get("force", False),
                "cut": self.config.get("cut", True),
                "loader_kwargs": self.config.get("loader_kwargs", {}),
                "browser_base": self.config.get("browser_base")
            }
        )

MAIN ISSUE
OmniScraperGraph should gather all the information needed to generate an answer by running FetchNode, then ParseNode, then ImageToTextNode.

Judging only by the input/output parameters, it all makes sense in theory. FetchNode declares doc, link_urls and img_urls as outputs, and ParseNode and ImageToTextNode then consume them to produce richer data.

The issue is in FetchNode: it actually writes only doc to the state and nothing else. That causes ImageToTextNode to fail, since the state does not contain the img_urls key it requests. It also causes Burr to fail, since the node "promises" doc, link_urls and img_urls but produces only doc.
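To make the mismatch concrete, here is a minimal sketch of the kind of check Burr performs after a node runs. It is not ScrapeGraphAI or Burr code: the helper name `missing_outputs` and the sample state are hypothetical, and only the key names come from the graph definition and logs above.

```python
def missing_outputs(declared_outputs, state):
    """Return the declared output keys that the node never wrote to the state."""
    return [key for key in declared_outputs if key not in state]


# Outputs that FetchNode declares in OmniScraperGraph (from the snippet above).
declared = ["doc", "link_urls", "img_urls"]

# Hypothetical state after FetchNode ran: only "doc" was actually written.
state_after_fetch = {
    "url": "https://perinim.github.io/projects/",
    "user_prompt": "List me all the projects with their titles",
    "doc": "<html></html>",
}

missing = missing_outputs(declared, state_after_fetch)
print(missing)
```

Any key reported as missing here is one that a later node, such as ImageToTextNode with input "img_urls", will fail to find.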

Proposed Solution
It should be fairly simple to add a small URL-parsing step. helpers/default_filters already contains a list of image extensions that can be used to distinguish image URLs from link URLs.

I do not know whether you want this feature implemented in ParseNode or in FetchNode, so I didn't push a fix. However, with a very simple patch that I put together for testing, adding the small function described above made everything work again.
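A rough sketch of the extension-based split described above, using only the standard library. The `IMAGE_EXTENSIONS` tuple and the `split_urls` name are assumptions for illustration; the real list lives in helpers/default_filters and may differ.

```python
from urllib.parse import urlparse

# Hypothetical extension list, standing in for the one in helpers/default_filters.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def split_urls(urls):
    """Split URLs into (link_urls, img_urls) based on the path's file extension."""
    link_urls, img_urls = [], []
    for url in urls:
        # Look only at the path so query strings such as "?v=2" are ignored.
        path = urlparse(url).path.lower()
        if path.endswith(IMAGE_EXTENSIONS):
            img_urls.append(url)
        else:
            link_urls.append(url)
    return link_urls, img_urls


links, images = split_urls([
    "https://example.com/projects/",
    "https://example.com/assets/cover.PNG?v=2",
    "https://example.com/about.html",
])
print(links)
print(images)
```

Matching on the parsed path rather than the raw string avoids misclassifying image URLs that carry query parameters.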

@VinciGit00
Collaborator

yes please can you fix it and make a pull request?

@LorenzoPaleari
Contributor Author

Should I add the function on FetchNode or on ParseNode?

@VinciGit00
Collaborator

Parse

@VinciGit00
Collaborator

Hi @LorenzoPaleari can you update and say if everything is ok?

@LorenzoPaleari
Contributor Author

I pushed the changes and it should work. However, beta5 is broken, so I'm opening an issue.

@VinciGit00
Collaborator

now beta is stable, can you try again?

@LorenzoPaleari
Contributor Author

LorenzoPaleari commented Sep 12, 2024

Sorry, I have been busy the last few days.

  • v1.18.3 - Does not have the fixes I made
  • v1.19.0-beta1 - Has the changes I made, WORKING
  • v1.19.0-beta2+ - NOT WORKING - The changes I made were removed while trying to fix another bug with URLs.

ParseNode needs to be able to extract url_links and images_links from the document and update the state with these values.
With my changes, creating the parse node with the flag parse_url = True activated link scraping and updated the state correctly, allowing the ImageToText node to execute flawlessly.

I put the behavior behind a flag because I didn't know whether this function would be used anywhere other than OmniScraperGraph; it can always be enabled when building a custom graph.

In 1.19.0-beta2 my fix was removed. The code I added did not change the normal execution flow of ParseNode in any way: it only called a couple of functions to extract links, and then document parsing proceeded as before. The only difference was that the links were extracted before parsing so the state could be updated with them.
My code was removed in merge #648 while resolving #637, probably because the problem of not being able to extract URLs looked similar to my URL extractor. They are actually two distinct features that should work well together.

My change only updates the state so that images_links and url_links are correctly passed to all the nodes that run later. The issue, by contrast, was about GPT (or any LLM) not being able to extract URLs from the parsed document itself, which I didn't change.
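The flag-gated behavior described above could look roughly like the following. This is a standalone sketch, not the code from the pull request: `parse_step`, `parse_urls` and `_UrlCollector` are hypothetical names, the state keys `link_urls` and `img_urls` are borrowed from the graph definition earlier in the thread, and the "parsing" itself is reduced to a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _UrlCollector(HTMLParser):
    """Collect absolute URLs from <a href> and <img src> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.link_urls = []
        self.img_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.link_urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.img_urls.append(urljoin(self.base_url, attrs["src"]))


def parse_step(state, parse_urls=False):
    """Return a new state; link extraction runs only when parse_urls is True."""
    new_state = dict(state)
    if parse_urls:
        collector = _UrlCollector(state.get("url", ""))
        collector.feed(state["doc"])
        new_state["link_urls"] = collector.link_urls
        new_state["img_urls"] = collector.img_urls
    # Placeholder for the real chunking: the normal parsing flow is unchanged.
    new_state["parsed_doc"] = [state["doc"]]
    return new_state


state = {
    "url": "https://example.com/projects/",
    "doc": '<a href="/about">About</a><img src="img/logo.png">',
}
out = parse_step(state, parse_urls=True)
print(out["link_urls"], out["img_urls"])
```

With the flag off, the step writes only `parsed_doc`, which mirrors the point that the extra extraction does not alter the existing flow.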

@VinciGit00
Collaborator

Hi @LorenzoPaleari, sorry for what I've done.
Can you re upload your lines of code?

@LorenzoPaleari
Contributor Author

LorenzoPaleari commented Sep 13, 2024

@VinciGit00
No problem at all, just wanted to provide more info on why that part of code is harmless but useful for OmniScraper.

Pull Request
#662

@VinciGit00
Collaborator

Hi, please update to the new version.
