General-Purpose Web Scraper

This is a flexible, general-purpose web scraper built with Python and Selenium. It can be used to crawl and extract information from any website, with options to restrict crawling to specific domains.

Features

  • Customizable starting URL for any website
  • Adjustable crawl depth
  • Option to restrict crawling to a specific domain or crawl unrestricted
  • Configurable output file
  • Chrome driver path specification
  • Robust error handling and retry mechanism

Requirements

  • Python 3.6+
  • Selenium
  • Chrome WebDriver

Installation

  1. Clone this repository:

    git clone https://github.com/yourusername/general-web-scraper.git
    cd general-web-scraper
    
  2. Install the required packages:

    pip install -r requirements.txt
    
  3. Download the appropriate Chrome WebDriver for your system and Chrome version.
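
As a quick sanity check, you can confirm that Selenium can reach your ChromeDriver before running the scraper. This snippet is not part of the project; it assumes a Selenium 4 install and a headless Chrome setup like the one described under How It Works:

    # smoke_test.py -- verify that Selenium can drive your local ChromeDriver
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless=new")        # run Chrome without a visible window
    service = Service("/path/to/chromedriver")    # same path you will pass via --chrome-driver
    driver = webdriver.Chrome(service=service, options=options)
    driver.get("https://example.com")
    print(driver.title)                           # expected: "Example Domain"
    driver.quit()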

Usage

Run the main script with various options:

  1. Basic usage (restricts crawling to the starting URL's domain):

    python main.py https://example.com --chrome-driver "/path/to/chromedriver"
    
  2. Unrestricted crawling:

    python main.py https://example.com --chrome-driver "/path/to/chromedriver" --unrestricted
    
  3. Set custom max depth and output file:

    python main.py https://example.com --chrome-driver "/path/to/chromedriver" --max-depth 5 --output my_data.json
    
  4. Combine options:

    python main.py https://example.com --chrome-driver "/path/to/chromedriver" --unrestricted --max-depth 4 --output unrestricted_data.json
    

Command-line Arguments

  • start_url: The URL to start scraping from (required)
  • --chrome-driver: Path to the Chrome driver executable (required)
  • --max-depth: Maximum depth to crawl (default: 3)
  • --unrestricted: Disable URL prefix restriction
  • --output: Output file name (default: scraped_data.json)
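
The argument parser in main.py is not reproduced here, but based on the options above it plausibly looks like the following sketch (argument names and defaults are taken from the list; the actual implementation may differ):

    # Hypothetical reconstruction of the CLI described above
    import argparse

    parser = argparse.ArgumentParser(description="General-purpose Selenium web scraper")
    parser.add_argument("start_url", help="The URL to start scraping from")
    parser.add_argument("--chrome-driver", required=True,
                        help="Path to the Chrome driver executable")
    parser.add_argument("--max-depth", type=int, default=3,
                        help="Maximum depth to crawl")
    parser.add_argument("--unrestricted", action="store_true",
                        help="Disable the URL prefix restriction")
    parser.add_argument("--output", default="scraped_data.json",
                        help="Output file name")
    args = parser.parse_args()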

Output

The scraper saves the extracted data in JSON format. Each entry in the JSON file contains:

  • URL
  • Page title
  • Page content
  • Links found on the page
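
The exact field names depend on main.py, but given the fields above and the URL-keyed dictionary described under How It Works, a saved entry might look roughly like this (illustrative values only):

    {
      "https://example.com": {
        "title": "Example Domain",
        "content": "Example Domain. This domain is for use in illustrative examples in documents...",
        "links": ["https://www.iana.org/domains/example"]
      }
    }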

How It Works

This web scraper operates through the following process:

  1. Initialization:

    • The scraper is initialized with a starting URL, Chrome driver path, and other optional parameters like max depth and domain restriction.
    • It sets up a headless Chrome browser using Selenium WebDriver.
  2. Crawling:

    • The scraper starts with the initial URL and adds it to a queue.
    • For each URL in the queue, it first checks whether the URL has already been visited or the maximum depth has been reached; if not, it proceeds to extract information from the page (a sketch of this loop appears after this section).
  3. Information Extraction:

    • The scraper navigates to the URL using the Chrome driver.
    • It waits for the page to load, specifically for the body tag to be present (see the extraction sketch after this list).
    • It extracts the following information:
      • Page title
      • Page content (full text of the body)
      • All links on the page
  4. Link Processing:

    • For each extracted link, the scraper converts relative URLs to absolute URLs, checks that the URL is valid and within the allowed domain (if restriction is enabled), and adds new, unvisited URLs to the queue for future processing.
  5. Data Storage:

    • The extracted information is stored in a dictionary, with URLs as keys.
  6. Error Handling and Retries:

    • If a page fails to load or an error occurs during extraction, the scraper will retry a specified number of times with increasing delays.
  7. Completion:

    • Once all URLs in the queue have been processed or the max depth has been reached, the scraper saves all extracted data to a JSON file.
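
The per-page work in steps 1 and 3 (headless driver setup, waiting for the body tag, and pulling the title, text, and links) could be sketched as follows. This is an illustration of the approach described above, with made-up function names, not the repository's actual code:

    # Sketch of steps 1 and 3: headless driver setup and per-page extraction
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def make_driver(chrome_driver_path):
        options = Options()
        options.add_argument("--headless=new")    # headless Chrome, as in step 1
        return webdriver.Chrome(service=Service(chrome_driver_path), options=options)

    def extract_page(driver, url):
        driver.get(url)
        # Step 3: wait until the body element is present before reading the page
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        return {
            "title": driver.title,
            "content": driver.find_element(By.TAG_NAME, "body").text,
            "links": [a.get_attribute("href")
                      for a in driver.find_elements(By.TAG_NAME, "a")
                      if a.get_attribute("href")],
        }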

The scraper uses a breadth-first search approach for crawling, which means it completely processes one depth level before moving to the next. This approach, combined with the max depth parameter, allows for controlled and systematic exploration of a website or network of websites.
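
Putting steps 2, 4, 5, 6, and 7 together, the breadth-first loop with domain restriction, retries, and JSON output might look like the sketch below. It reuses the hypothetical make_driver and extract_page helpers from the previous sketch and is, again, an approximation of the described behaviour rather than the actual main.py:

    # Sketch of the breadth-first crawl loop described above (illustrative only)
    import json
    import time
    from collections import deque
    from urllib.parse import urljoin, urlparse

    def crawl(start_url, chrome_driver_path, max_depth=3, restricted=True,
              output="scraped_data.json", max_retries=3):
        driver = make_driver(chrome_driver_path)
        start_domain = urlparse(start_url).netloc
        queue = deque([(start_url, 0)])              # step 2: queue of (url, depth)
        visited, data = set(), {}

        while queue:
            url, depth = queue.popleft()
            if url in visited or depth > max_depth:  # step 2a: skip seen or too-deep URLs
                continue
            visited.add(url)

            # Step 6: retry with increasing delays if extraction fails
            for attempt in range(max_retries):
                try:
                    page = extract_page(driver, url)
                    break
                except Exception:
                    time.sleep(2 ** attempt)
            else:
                continue                             # give up on this URL after max_retries

            data[url] = page                         # step 5: store results keyed by URL

            # Step 4: normalise links and enqueue new, in-scope URLs
            for link in page["links"]:
                absolute = urljoin(url, link)
                if urlparse(absolute).scheme not in ("http", "https"):
                    continue
                if restricted and urlparse(absolute).netloc != start_domain:
                    continue
                if absolute not in visited:
                    queue.append((absolute, depth + 1))

        driver.quit()
        with open(output, "w", encoding="utf-8") as f:   # step 7: save everything as JSON
            json.dump(data, f, indent=2, ensure_ascii=False)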

Future Enhancements

  1. Multi-threading support for faster crawling
  2. Additional browser support (Firefox, Safari, Edge)
  3. Custom CSS selector support for targeted content extraction
  4. Integration with popular databases for data storage
  5. Advanced filtering options for crawled content
  6. Respect for robots.txt and implementation of crawl delays
  7. Support for handling JavaScript-rendered content
  8. Proxy rotation for IP management during large-scale crawling
  9. Exportation of data in multiple formats (CSV, XML, etc.)
  10. GUI for easier configuration and real-time crawl monitoring

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

Disclaimer

This web scraper is for educational and research purposes only. Always respect the website's robots.txt file and terms of service when scraping. Use responsibly and ethically.
