TypeError: 'NoneType' object is not subscriptable in __generate_result method #145

Open
christian-unlockai opened this issue Jun 1, 2024 · 2 comments

Description

Hi,

I encountered an issue with the deutschland package while trying to fetch financial reports using the Bundesanzeiger class. When running my tests, I received a TypeError indicating that a 'NoneType' object is not subscriptable. This occurs in the __generate_result method when trying to access the captcha_wrapper div.

Error Details

TypeError: 'NoneType' object is not subscriptable

Steps to Reproduce

  1. Initialize the Bundesanzeiger class.
  2. Call the get_reports method with a valid search term.
  3. Observe the error in the __generate_result method.

Code Snippet

Here is the relevant part of the code where the error occurs:

def __generate_result(self, content: str):
        """iterate through all results and try to fetch single reports"""
        result = {}
        for element in self.__find_all_entries_on_page(content):
            get_element_response = self.__get_response(element.content_url)

            if self.__is_captcha_needed(get_element_response.text):
                soup = BeautifulSoup(get_element_response.text, "html.parser")
                captcha_image_src = soup.find("div", {"class": "captcha_wrapper"}).find(
                    "img"
                )["src"]
                img_response = self.__get_response(captcha_image_src)
                captcha_result = self.captcha_callback(img_response.content)
                captcha_endpoint_url = soup.find_all("form")[1]["action"]
                get_element_response = self.session.post(
                    captcha_endpoint_url,
                    data={"solution": captcha_result, "confirm-button": "OK"},
                )

            content_soup = BeautifulSoup(get_element_response.text, "html.parser")
            content_element = content_soup.find(
                "div", {"class": "publication_container"}
            )

            if not content_element:
                continue

            element.report = content_element.text
            element.raw_report = content_element.prettify()

            result[element.to_hash()] = element.to_dict()

        return result
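
A note on the failure mode, as a hedged sketch rather than a confirmed diagnosis: in the chained lookup above, a missing captcha_wrapper div would make soup.find(...) return None, and the following .find("img") call would then raise AttributeError, not TypeError. The reported TypeError is therefore more consistent with the div being present but containing no img element. The StubTag class below is a hypothetical stand-in for a BeautifulSoup tag, used only so the example runs without network access or the bs4 dependency; it is not part of the deutschland package.

```python
class StubTag:
    """Hypothetical stand-in for a bs4 tag: find() returns a preset child or None."""

    def __init__(self, child=None):
        self._child = child

    def find(self, *args, **kwargs):
        return self._child


def chained_lookup_error(soup) -> str:
    """Run the same chained lookup as __generate_result and report what it raises."""
    try:
        soup.find("div", {"class": "captcha_wrapper"}).find("img")["src"]
    except (AttributeError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return "no error"


# Case 1: the page has no captcha_wrapper div, so the first find() returns None.
no_wrapper = StubTag(child=None)

# Case 2: the div exists, but there is no <img> inside it.
wrapper_without_img = StubTag(child=StubTag(child=None))

print(chained_lookup_error(no_wrapper))           # an AttributeError
print(chained_lookup_error(wrapper_without_img))  # a TypeError, as in this report
```

Only the second case produces the exception type reported here. Confirming this against the live site would still require inspecting the HTML returned for the failing entry.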

Additional Information

  • Python version: 3.10.13
  • deutschland package version: latest
  • OS: macOS

Logs

2024-06-01 14:17:21 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchen2?4-1.-search~table~panel-rows-2-search~table~row~panel-publication~link HTTP/1.1" 302 0 (connectionpool.py:549)
2024-06-01 14:17:21 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?7 HTTP/1.1" 200 None (connectionpool.py:549)
2024-06-01 14:17:21 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?7--captcha~panel-captcha_form-captcha_image&antiCache=1717244241383 HTTP/1.1" 200 None (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "POST /pub/de/suchergebnis?7-1.-captcha~panel-captcha_form HTTP/1.1" 302 0 (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?9 HTTP/1.1" 200 None (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchen2?4-1.-search~table~panel-rows-3-search~table~row~panel-publication~link HTTP/1.1" 302 0 (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?10 HTTP/1.1" 200 None (connectionpool.py:549)

Please let me know if further information is needed.

Thank you!

Christian

davidrzs commented Jun 1, 2024

Can confirm the issue.

wirthual (Member) commented Jun 3, 2024

Hi,

Thank you for the detailed description. From the error, I would assume that either the captcha was removed or the site structure changed. If it is the former, we can simply take out the captcha check.

I removed the following section and I was able to retrieve a result.

if self.__is_captcha_needed(get_element_response.text):
    soup = BeautifulSoup(get_element_response.text, "html.parser")
    captcha_image_src = soup.find("div", {"class": "captcha_wrapper"}).find(
        "img"
    )["src"]
    img_response = self.__get_response(captcha_image_src)
    captcha_result = self.captcha_callback(img_response.content)
    captcha_endpoint_url = soup.find_all("form")[1]["action"]
    get_element_response = self.session.post(
        captcha_endpoint_url,
        data={"solution": captcha_result, "confirm-button": "OK"},
    )

This was the code I ran:

from deutschland.bundesanzeiger import Bundesanzeiger
ba = Bundesanzeiger()
# search term
data = ba.get_reports("Deutsche Bahn AG")
# returns a dictionary with all reports found as fulltext reports
print(data.keys())

With results: dict_keys(['4442fe462193acf9a4bf741516a00dfa'])

The question is whether this works in all cases, or whether the captcha still appears with a changed structure. In the latter case, we would need to adapt the captcha detection.
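
One possible way to adapt the detection, sketched under assumptions: check each step of the lookup and return None when the expected markup is missing, so the caller can skip the captcha branch instead of crashing. The helper name extract_captcha_image_src is hypothetical and not part of the deutschland package; it relies only on find() and get(), both of which BeautifulSoup tags provide.

```python
from typing import Any, Optional


def extract_captcha_image_src(soup: Any) -> Optional[str]:
    """Return the captcha image URL, or None if the expected markup is missing.

    `soup` is assumed to be a BeautifulSoup object (or anything with a
    compatible find() method).
    """
    wrapper = soup.find("div", {"class": "captcha_wrapper"})
    if wrapper is None:
        # No captcha wrapper on the page at all.
        return None

    img = wrapper.find("img")
    if img is None:
        # Wrapper present but no image inside it.
        return None

    # get() returns None instead of raising if the src attribute is absent.
    return img.get("src")
```

In __generate_result, the captcha branch could then run only when this helper returns a value other than None. Whether that matches the site's current behavior would still need to be verified against real responses.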
