🏗 Build Spider: Fort Worth Boards & Commissions #9

Open · cruznunez wants to merge 3 commits into main

Conversation

@cruznunez commented Oct 8, 2024

What's this PR do?

This PR adds a scraper for Fort Worth Boards & Commissions. The scraper parses an API endpoint for upcoming meetings.

Why are we doing this?

Scraper requested from spreadsheet.

Steps to manually test

  1. Ensure the project is installed:
pipenv sync --dev
  2. Activate the virtual env and enter the pipenv shell:
pipenv shell
  3. Run the spider:
scrapy crawl fortx_Fort_Worth_Boards -O test_output.csv
  4. Monitor the output and ensure no errors are raised.
  5. Inspect test_output.csv to ensure the data looks valid.
  6. Ensure all tests pass:
pytest

Are there any smells or added technical debt to note?

This scraper uses multiple API endpoints. When the page first loads, a request is made to the index endpoint; when a calendar item is clicked, another request is made to a meeting-detail endpoint.

The test uses a file as the response for the first API request, but every other request is still fetched over the network. I believe we should be stubbing all requests, but I'm not sure how to do that with our current setup. I tried downloading the other responses and wiring them in with code similar to what the Scrapy generator produces, but it didn't seem to help; the tests still fail if I turn my wifi off.

# tests/test_fortx_Fort_Worth_Boards.py
from os.path import dirname, join

from city_scrapers_core.utils import file_response

test_response = file_response(
    join(dirname(__file__), "files", "fortx_Fort_Worth_Boards.json"),
    url="https://www.fortworthtexas.gov/ocapi/calendars/getcalendaritems?Ids=788ffb59-05d1-457d-b9dd-423d4b95a06e&LanguageCode=en-US",
)
other_response = file_response(
    join(dirname(__file__), "files", "other_response.json"),
    url="https://www.fortworthtexas.gov/ocapi/get/contentinfo?calendarId=788ffb59-05d1-457d-b9dd-423d4b95a06e&contentId=e3182d81-2385-4796-809f-8a330d1c7ec9&language=en-US&mainContentId=e3182d81-2385-4796-809f-8a330d1c7ec9",
)
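For illustration, the kind of stubbing I have in mind would look something like the rough sketch below. It assumes the spider keeps calling requests.get directly for the detail endpoint and that spider and test_response are built as in the snippet above; _load_json_fixture is a made-up helper, not something in the repo.

# Sketch: answer the detail requests from disk instead of the network.
import json
from os.path import dirname, join
from unittest.mock import patch


def _load_json_fixture(name):
    # Hypothetical helper: read a saved API response from tests/files.
    with open(join(dirname(__file__), "files", name)) as f:
        return json.load(f)


with patch("requests.get") as mock_get:
    # Every detail request returns the saved other_response.json payload.
    mock_get.return_value.json.return_value = _load_json_fixture("other_response.json")
    parsed_items = [item for item in spider.parse(test_response)]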

Summary by CodeRabbit

  • New Features

    • Introduced a new spider to scrape meeting information from the Fort Worth Boards and Commissions API.
    • Added a structured dataset in JSON format representing scheduled events in Fort Worth for early October 2024.
  • Tests

    • Implemented a comprehensive test suite for the new spider, validating various attributes of the parsed meeting data.


coderabbitai bot commented Oct 8, 2024

Walkthrough

The changes introduce a new spider class, FortxFortWorthBoardsSpider, in the city_scrapers/spiders/fortx_Fort_Worth_Boards.py file, designed to scrape meeting information from the Fort Worth Boards and Commissions API. It includes methods for parsing API responses and constructing Meeting objects. Additionally, a new JSON file, tests/files/fortx_Fort_Worth_Boards.json, has been added to provide structured event data for testing. A corresponding test suite in tests/test_fortx_Fort_Worth_Boards.py validates the functionality of the spider, ensuring accurate parsing and data handling.

Changes

  • city_scrapers/spiders/fortx_Fort_Worth_Boards.py: Added FortxFortWorthBoardsSpider class with methods for parsing API responses and meeting data.
  • tests/files/fortx_Fort_Worth_Boards.json: New JSON file containing structured event data for Fort Worth meetings.
  • tests/test_fortx_Fort_Worth_Boards.py: Introduced a test suite for FortxFortWorthBoardsSpider to validate parsing and data accuracy.

Poem

In Fort Worth where meetings abound,
A spider now scours the ground.
With data so neat,
It’s quite the feat,
Parsing events all around! 🐇✨



@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (6)
tests/test_fortx_Fort_Worth_Boards.py (2)

39-71: LGTM with suggestion: Comprehensive attribute testing

These tests thoroughly verify various attributes of the parsed items, which is crucial for ensuring the spider's accuracy. The multiple-item checks for description and start time are particularly good for catching inconsistencies.

Consider making the test_id function more robust by checking the format of the id rather than an exact string match. This would make the test less brittle to minor changes in the id generation logic.
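For example, a format check along these lines would survive minor changes to the slug or timestamp (a sketch, assuming the id keeps the usual city_scrapers_core layout of spider name, start timestamp, and slug, and that the suite exposes a module-level parsed_items list):

import re


def test_id():
    # Assert the id's shape instead of pinning an exact string.
    assert re.match(
        r"^fortx_Fort_Worth_Boards/\d{12}/\w+/[\w_]+$", parsed_items[0]["id"]
    )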


106-108: LGTM with suggestion: Efficient all-day attribute testing

This test efficiently verifies the all_day attribute for all parsed items using pytest's parametrize decorator. This approach ensures consistency across all items.

Consider adding an assertion message to provide more context if the test fails, e.g., assert item["all_day"] is False, f"Item {item['title']} should not be an all-day event".
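Dropped into the parametrized test, that might look like the following (a sketch, assuming the suite parametrizes over a module-level parsed_items list):

@pytest.mark.parametrize("item", parsed_items)
def test_all_day(item):
    assert item["all_day"] is False, (
        f"Item {item['title']} should not be an all-day event"
    )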

tests/files/fortx_Fort_Worth_Boards.json (2)

1-1: Consider the trade-offs of multiple date representations.

The JSON includes multiple representations of dates (e.g., "Date", "DD", "MM", "YYYY", "Day", "Month"). While this provides flexibility, it also introduces redundancy. Consider if all these fields are necessary for the scraper's functionality and downstream applications.

If some fields are not required, you might want to remove them to reduce data size and potential inconsistencies. However, if all fields are actively used, maintaining them is acceptable.


1-1: LGTM: Event structure is comprehensive. Consider ISO 8601 for DateTime.

The event structure provides necessary details with unique identifiers. However, for improved standardization and easier parsing, consider using ISO 8601 format for the "DateTime" field (e.g., "2024-10-01T00:00:00").

Example of ISO 8601 format:

"DateTime": "2024-10-01T00:00:00"
city_scrapers/spiders/fortx_Fort_Worth_Boards.py (2)

33-33: Set a meaningful User-Agent header

An empty User-Agent header may lead to request denials or violate the server's usage policies. Providing a specific User-Agent helps identify your scraper and promotes responsible web scraping practices.

Update the headers as follows:

- headers = {"User-Agent": ""}
+ headers = {"User-Agent": "city-scrapers-fortx/1.0 (+https://github.com/yourusername/city-scrapers-fortx)"}

63-63: Avoid shadowing built-in function zip

Using zip as a variable name shadows the built-in zip() function, which can lead to unexpected behavior or confusion.

Rename the variable to postal_code:

- zip = obj["PostCode"]
+ postal_code = obj["PostCode"]

Also, update its usage in line 64:

- values = [street, suburb, zip, "TX"]
+ values = [street, suburb, postal_code, "TX"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR, between commits 438faae and cf1f8c4.

📒 Files selected for processing (3)
  • city_scrapers/spiders/fortx_Fort_Worth_Boards.py (1 hunks)
  • tests/files/fortx_Fort_Worth_Boards.json (1 hunks)
  • tests/test_fortx_Fort_Worth_Boards.py (1 hunks)
🧰 Additional context used
🔇 Additional comments (8)
tests/test_fortx_Fort_Worth_Boards.py (6)

1-23: LGTM: Well-structured test setup

The imports and test setup are well-organized. Freezing time and using a mock response from a JSON file are good practices for ensuring consistent and reproducible test results.


25-26: LGTM: Concise count verification

This test effectively verifies that the spider parses the expected number of items from the mock response.


29-36: LGTM: Thorough title verification

This test effectively checks the titles of multiple parsed items, which is a good practice for ensuring consistent title extraction across different entries.


74-83: LGTM: Comprehensive location testing

This test effectively verifies location parsing for multiple items with different formats. It's particularly good that it covers cases both with and without a location name, ensuring robust location extraction.


86-103: LGTM: Thorough metadata verification

These tests effectively verify important metadata of the parsed items, including source URL, links, and classification. The links test is particularly thorough, checking both the title and href of the link.


1-108: Overall: Well-structured and comprehensive test suite

This test file provides excellent coverage of the FortxFortWorthBoardsSpider's functionality. It includes tests for all relevant attributes of the parsed items and uses appropriate testing techniques such as time freezing, mock responses, and parameterized tests.

The structure is clear, and the tests are well-organized. The use of multiple test cases for some attributes helps ensure consistency across different types of parsed items.

While there are a few minor suggestions for improvements (such as making the id test more robust and adding assertion messages), the overall quality of the test suite is high. It should effectively catch any regressions or issues in the spider's parsing logic.

Great job on creating a thorough and well-structured test suite!

tests/files/fortx_Fort_Worth_Boards.json (1)

1-1: LGTM: Overall JSON structure is well-organized.

The JSON structure with "success" and "data" keys is appropriate for representing the collection of events. The "success" boolean is useful for error handling in the scraper.

city_scrapers/spiders/fortx_Fort_Worth_Boards.py (1)

40-40: Verify if dayfirst=True is appropriate in date parsing

Using dayfirst=True assumes dates are in DD/MM/YYYY format. Since the timezone is America/Chicago, the dates from the API might be in MM/DD/YYYY format.

Please confirm the date format returned by the API to ensure accurate date parsing. If the dates are in MM/DD/YYYY format, you should remove dayfirst=True or set it to False.
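For illustration, here is how dateutil treats one of the strings from the fixture below (a quick sketch; the fixture's paired DD/MM fields suggest the feed is day-first, but it is worth confirming):

from dateutil.parser import parse

# "1/10/2024" is ambiguous: dayfirst decides which field is the day.
parse("1/10/2024 12:30:00 PM", dayfirst=True)   # datetime(2024, 10, 1, 12, 30) -> October 1
parse("1/10/2024 12:30:00 PM", dayfirst=False)  # datetime(2024, 1, 10, 12, 30) -> January 10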

@@ -0,0 +1 @@
{"success":true,"data":[{"Date":"01/10/2024","DD":1,"MM":10,"YYYY":2024,"Items":[{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"e3182d81-2385-4796-809f-8a330d1c7ec9","MainContentId":"e3182d81-2385-4796-809f-8a330d1c7ec9","Name":"Notice of Public Comment for Brownsfields Program","DateTime":"1/10/2024 12:00:00 AM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"f812c8f3-336c-44d8-83e0-d6ffa3574d11","MainContentId":"f812c8f3-336c-44d8-83e0-d6ffa3574d11","Name":"DFW International Airport Board Operations Committee","DateTime":"1/10/2024 12:30:00 PM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"c03a2079-94c1-4ae9-b688-88751656ac90","MainContentId":"c03a2079-94c1-4ae9-b688-88751656ac90","Name":"DFW International Airport Board Finance, Audit, and IT Committee","DateTime":"1/10/2024 12:35:00 PM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"dbc79239-74ab-4406-9ec2-2e32784d83d8","MainContentId":"dbc79239-74ab-4406-9ec2-2e32784d83d8","Name":"DFW International Airport Board Concessions/Commercial Development","DateTime":"1/10/2024 12:45:00 PM"}],"Day":"Tuesday","Month":"October"},{"Date":"02/10/2024","DD":2,"MM":10,"YYYY":2024,"Items":[{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"e3182d81-2385-4796-809f-8a330d1c7ec9","MainContentId":"e3182d81-2385-4796-809f-8a330d1c7ec9","Name":"Notice of Public Comment for Brownsfields Program","DateTime":"2/10/2024 12:00:00 AM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"b37500ac-d047-41db-8c9d-c7cd4cd3da00","MainContentId":"b37500ac-d047-41db-8c9d-c7cd4cd3da00","Name":"Tax Increment Reinvestment Zone No. 8 (Lancaster Corridor TIF)","DateTime":"2/10/2024 2:00:00 PM"}],"Day":"Wednesday","Month":"October"},{"Date":"03/10/2024","DD":3,"MM":10,"YYYY":2024,"Items":[{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"e3182d81-2385-4796-809f-8a330d1c7ec9","MainContentId":"e3182d81-2385-4796-809f-8a330d1c7ec9","Name":"Notice of Public Comment for Brownsfields Program","DateTime":"3/10/2024 12:00:00 AM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"7e5b64cf-901b-4a5b-bd25-89c6df16ccc2","MainContentId":"7e5b64cf-901b-4a5b-bd25-89c6df16ccc2","Name":"DFW International Airport Board and Committee Meetings","DateTime":"3/10/2024 8:30:00 AM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"3fcb8be2-8129-42d9-9975-deac97c3abae","MainContentId":"3fcb8be2-8129-42d9-9975-deac97c3abae","Name":"Downtown Design Review Board (DDRB)","DateTime":"3/10/2024 3:00:00 PM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"664cfb35-e16e-4974-bdb0-de6a9b7bfc53","MainContentId":"664cfb35-e16e-4974-bdb0-de6a9b7bfc53","Name":"Library Advisory Board Meeting (LAB)","DateTime":"3/10/2024 6:30:00 PM"}],"Day":"Thursday","Month":"October"},{"Date":"04/10/2024","DD":4,"MM":10,"YYYY":2024,"Items":[{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"e3182d81-2385-4796-809f-8a330d1c7ec9","MainContentId":"e3182d81-2385-4796-809f-8a330d1c7ec9","Name":"Notice of Public Comment for Brownsfields Program","DateTime":"4/10/2024 12:00:00 AM"}],"Day":"Friday","Month":"October"},{"Date":"07/10/2024","DD":7,"MM":10,"YYYY":2024,"Items":[{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"822ad0c4-25c8-4150-ae67-b62d4a0cc706","MainContentId":"822ad0c4-25c8-4150-ae67-b62d4a0cc706","Name":"Business Equity Advisory Board","DateTime":"7/10/2024 11:30:00 
AM"}],"Day":"Monday","Month":"October"},{"Date":"09/10/2024","DD":9,"MM":10,"YYYY":2024,"Items":[{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"88c422cf-e46c-4b77-adb3-dda32d722852","MainContentId":"88c422cf-e46c-4b77-adb3-dda32d722852","Name":"Zoning Commission","DateTime":"9/10/2024 12:00:00 PM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"f0f5f8da-12e6-4c2a-8e4a-a70b13da7fa2","MainContentId":"f0f5f8da-12e6-4c2a-8e4a-a70b13da7fa2","Name":"Collective Bargaining","DateTime":"9/10/2024 2:00:00 PM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"1f32c7ae-4860-4c61-9518-7db1882360dc","MainContentId":"1f32c7ae-4860-4c61-9518-7db1882360dc","Name":"Tax Increment Reinvestment Zone No. 3 (Downtown TIF)","DateTime":"9/10/2024 3:00:00 PM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"a272314f-e865-44ac-ad4a-906d6f1ee3cc","MainContentId":"a272314f-e865-44ac-ad4a-906d6f1ee3cc","Name":"Community Development Council","DateTime":"9/10/2024 6:30:00 PM"}],"Day":"Wednesday","Month":"October"},{"Date":"10/10/2024","DD":10,"MM":10,"YYYY":2024,"Items":[{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"08ce922d-9def-4564-a3ba-72dd3cb62d4a","MainContentId":"08ce922d-9def-4564-a3ba-72dd3cb62d4a","Name":"Mayor\u0027s Committee on Persons with Disabilities (MCPD)","DateTime":"10/10/2024 10:00:00 AM"},{"CalendarId":"788ffb59-05d1-457d-b9dd-423d4b95a06e","Id":"13359613-3931-4d7c-8f2b-d0049cc37e32","MainContentId":"13359613-3931-4d7c-8f2b-d0049cc37e32","Name":"Fort Worth After School Coordinating Board Meeting","DateTime":"10/10/2024 11:30:00 AM"}],"Day":"Thursday","Month":"October"}]}
@coderabbitai bot Oct 8, 2024

💡 Codebase verification

Address Data Completeness and Event Duplication Issues

  1. Missing Dates:

    • Issue: The dataset for October 1-10, 2024, lacks entries for October 5, 6, and 8.
    • Action: Verify if events are expected on these dates. If events should be present, investigate the data scraping process to ensure it captures all necessary dates.
  2. Repeated Events:

    • Issue: The "Notice of Public Comment for Brownsfields Program" is listed on October 1st through 4th with the same ID.
    • Action: Confirm whether this event is intended to span multiple days. If it is a multi-day event, consider adding a duration or recurrence field to the data model to prevent duplication.
🔗 Analysis chain

Verify data completeness and repeated events.

  1. Data Completeness: The JSON covers October 1-10, 2024, but some dates are missing (e.g., October 5, 6, 8). Verify if this is expected behavior or if the scraper should retrieve data for all dates in the range.

  2. Repeated Events: The "Notice of Public Comment for Brownsfields Program" event appears on multiple dates with the same ID. If this is a multi-day event, consider adding a field to indicate its duration or recurring nature.

To check for missing dates and repeated events, you can run the following script:

This script will help you visualize the date coverage and the occurrences of the repeated event.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify data completeness and identify repeated events

# Check for missing dates
echo "Checking for missing dates:"
jq '.data[].Date' tests/files/fortx_Fort_Worth_Boards.json | sort | uniq

# Identify repeated events
echo -e "\nChecking for repeated events:"
jq '.data[].Items[] | select(.Name == "Notice of Public Comment for Brownsfields Program") | {Name, Id, Date: .DateTime[:10]}' tests/files/fortx_Fort_Worth_Boards.json

Length of output: 1027

cruznunez (Author)

This is data retrieved from the API. We do not have control over the dates or IDs. Missing dates represent no meetings on those dates. Duplicate IDs represent the same meeting, just on different dates.

coderabbitai bot

@cruznunez, thank you for the clarification. I've noted this information for future reviews.

(_/)
(•_•)
/>💭


✏️ Learnings added
Learnt from: cruznunez
PR: City-Bureau/city-scrapers-fortx#9
File: tests/files/fortx_Fort_Worth_Boards.json:1-1
Timestamp: 2024-10-08T19:11:25.402Z
Learning: The data retrieved from the API may contain missing dates, which represent days with no meetings.

Learnt from: cruznunez
PR: City-Bureau/city-scrapers-fortx#9
File: tests/files/fortx_Fort_Worth_Boards.json:1-1
Timestamp: 2024-10-08T19:11:25.402Z
Learning: Duplicate IDs in the API data represent the same meeting occurring on different dates.


Comment on lines +33 to +34
headers = {"User-Agent": ""}
info = requests.get(info_url, headers=headers).json()["data"]

⚠️ Potential issue

Use scrapy.Request instead of requests.get

In line 34, replace requests.get with scrapy.Request to leverage Scrapy's asynchronous processing and built-in features.

Refactor the code as follows:

- headers = {"User-Agent": ""}
- info = requests.get(info_url, headers=headers).json()["data"]

+ yield scrapy.Request(
+     url=info_url,
+     headers={"User-Agent": "city-scrapers-fortx/1.0 (+https://github.com/yourusername/city-scrapers-fortx)"},
+     callback=self.parse_meeting_detail,
+     meta={'item': item}
+ )

Then, implement the parse_meeting_detail method to handle the response:

def parse_meeting_detail(self, response):
    item = response.meta['item']
    info = response.json()["data"]

    meeting = Meeting(
        title=item["Name"],
        description=info["Description"],
        classification=COMMISSION,
        start=parse(item["DateTime"]),
        end=None,
        all_day=False,
        time_notes="",
        location=self._parse_location(info),
        links=self._parse_links(info),
        source=self._parse_source(response),
    )

    meeting["status"] = self._get_status(meeting)
    meeting["id"] = self._get_id(meeting)

    yield meeting

Comment on lines +22 to +34
def parse(self, response):
    """
    Parse meetings from multiple API responses.
    """

    data = response.json()

    for calendar_day in data["data"]:
        for item in calendar_day["Items"]:
            # make another API request to get more info about the meeting
            info_url = f"https://www.fortworthtexas.gov/ocapi/get/contentinfo?calendarId={item['CalendarId']}&contentId={item['Id']}&language=en-US&mainContentId={item['Id']}"  # noqa
            headers = {"User-Agent": ""}
            info = requests.get(info_url, headers=headers).json()["data"]
@coderabbitai bot Oct 8, 2024

🛠️ Refactor suggestion

Enhance testability by refactoring network requests

To facilitate testing without relying on network availability, consider refactoring your code to use Scrapy's request mechanisms, which can be more easily mocked in tests. This aligns with Scrapy's design and improves the reliability of your tests.

I can help refactor the code to better support mocking of network requests in your tests. Would you like me to provide an example of how to stub out these requests using Scrapy's facilities?

cruznunez (Author)

yes

coderabbitai bot

@cruznunez, certainly! I can help refactor your spider to utilize Scrapy's request mechanisms, which will make it easier to mock network requests during testing.

Currently, you're making external requests using the requests library within the parse method. In Scrapy, it's recommended to use scrapy.Request and let the framework handle the requests. This approach leverages Scrapy's asynchronous request handling and makes it straightforward to mock requests in your tests.

Here's how you can refactor your spider:

import scrapy
from city_scrapers_core.constants import COMMISSION
from city_scrapers_core.items import Meeting
from city_scrapers_core.spiders import CityScrapersSpider
from dateutil.parser import parse


class FortxFortWorthBoardsSpider(CityScrapersSpider):
    name = "fortx_Fort_Worth_Boards"
    agency = "Fort Worth Boards and Commissions"
    timezone = "America/Chicago"
    start_urls = [
        "https://www.fortworthtexas.gov/ocapi/calendars/getcalendaritems?Ids=788ffb59-05d1-457d-b9dd-423d4b95a06e&LanguageCode=en-US"
    ]

    def parse(self, response):
        data = response.json()

        for calendar_day in data["data"]:
            for item in calendar_day["Items"]:
                # Build the URL for the meeting details
                info_url = (
                    f"https://www.fortworthtexas.gov/ocapi/get/contentinfo?"
                    f"calendarId={item['CalendarId']}&"
                    f"contentId={item['Id']}&"
                    f"language=en-US&"
                    f"mainContentId={item['Id']}"
                )
                # Yield a new request for each meeting
                yield scrapy.Request(
                    url=info_url,
                    callback=self.parse_meeting,
                    cb_kwargs={'item': item}
                )

    def parse_meeting(self, response, item):
        info = response.json()["data"]

        meeting = Meeting(
            title=info.get("Name", ""),
            description=info.get("Description", ""),
            classification=COMMISSION,
            start=parse(info.get("DateTime")),
            end=None,
            all_day=False,
            time_notes="",
            location=self._parse_location(info),
            links=self._parse_links(info),
            source=self._parse_source(response),
        )

        meeting["status"] = self._get_status(meeting)
        meeting["id"] = self._get_id(meeting)

        yield meeting

    def _parse_location(self, info):
        return {
            "address": info.get("Location", "").strip(),
            "name": "",
        }

    def _parse_links(self, info):
        links = []
        for file in info.get("Files", []):
            links.append({
                "href": file.get("Url", ""),
                "title": file.get("Title", ""),
            })
        return links

    def _parse_source(self, response):
        return response.url

Explanation:

  • Use of scrapy.Request: Instead of using requests.get, we use scrapy.Request to schedule new requests for the meeting details.
  • Callback functions: We define a new method parse_meeting to handle the response from each meeting's detail page.
  • Passing data using cb_kwargs: We pass the original item data to the parse_meeting method using cb_kwargs.
  • Response handling: The response in parse_meeting is a Scrapy Response object, and we can call response.json() to get the JSON data.

Benefits:

  • Testability: By using Scrapy's request and response objects, you can easily mock them in your tests without relying on actual network calls.
  • Efficiency: Scrapy handles request scheduling and concurrency efficiently, which can improve the performance of your spider.

Updating Your Tests:

You can now mock the network requests in your tests using Scrapy's facilities. Here's an example of how you might adjust your tests:

import pytest
from unittest.mock import patch
from scrapy.http import TextResponse

from city_scrapers_core.items import Meeting
from city_scrapers.spiders.fortx_Fort_Worth_Boards import FortxFortWorthBoardsSpider


@pytest.fixture
def spider():
    return FortxFortWorthBoardsSpider()


def test_parse(spider):
    with open('tests/files/fortx_Fort_Worth_Boards.json', 'r') as f:
        json_data = f.read()

    response = TextResponse(
        url=spider.start_urls[0],
        body=json_data,
        encoding='utf-8',
    )

    # Mock the scrapy.Request to return a predefined response
    def mock_request(url, callback, cb_kwargs):
        with open('tests/files/meeting_detail.json', 'r') as f:
            meeting_data = f.read()
        meeting_response = TextResponse(
            url=url,
            body=meeting_data,
            encoding='utf-8',
        )
        return callback(meeting_response, **cb_kwargs)

    with patch.object(spider, 'parse_meeting') as mock_parse_meeting:
        with patch('scrapy.Request', side_effect=mock_request):
            results = list(spider.parse(response))
            assert len(results) > 0
            # Add assertions for your meetings

            # Example assertion
            meeting = results[0]
            assert isinstance(meeting, Meeting)
            assert meeting['title'] != ''

Notes:

  • Mocking scrapy.Request: We patch scrapy.Request to return a predefined response from a local file (meeting_detail.json), simulating the network response.
  • Mocking parse_meeting: Alternatively, you can patch parse_meeting if you want to control or inspect how it's called.
  • Assertions: You can now write assertions based on the mocked data to ensure your spider extracts the correct information.

By refactoring your spider and tests this way, you eliminate the dependency on network availability during testing, making your tests more reliable and faster.

Let me know if you have any questions or need further assistance!

@@ -0,0 +1,83 @@
import requests

⚠️ Potential issue

Avoid using the requests library within Scrapy spiders

Using the requests library inside Scrapy spiders bypasses Scrapy's asynchronous request handling, middleware, and caching mechanisms, which can lead to blocking behavior and inefficient resource utilization.

Consider removing the import of requests:

- import requests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import requests
