Web Content Crawler

A simple yet powerful web crawler that extracts content from websites and saves it in markdown format. Built using crawl4ai and playwright-stealth for reliable web scraping.

Features

  • Asynchronous web crawling for efficient performance
  • Automatic content extraction with smart detection
  • Markdown output format
  • Interactive filename selection
  • Error handling and verbose logging
  • Stealth mode to avoid detection

Installation

  1. Clone the repository:

     git clone <repository-url>
     cd <repository-name>

  2. Install the required dependencies:

     pip install -r requirements.txt

Usage

Run the crawler from the command line:

python crawl.py <url>

Example:

python crawl.py https://example.com

The script will:

  1. Prompt for an output filename (with a default suggestion)
  2. Crawl the specified URL
  3. Extract the content
  4. Save it as a markdown file
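The flow above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `crawl.py`: the `AsyncWebCrawler` usage follows crawl4ai's documented quickstart pattern (which may differ between versions), and `default_filename` and `crawl_to_markdown` are hypothetical helper names.

```python
import asyncio
import re
from urllib.parse import urlparse


def default_filename(url: str) -> str:
    """Suggest a default like 'crawled_example_com.md' from the URL host."""
    host = urlparse(url).netloc or "output"
    return "crawled_" + re.sub(r"[^A-Za-z0-9]+", "_", host).strip("_") + ".md"


async def crawl_to_markdown(url: str, filename: str) -> None:
    """Crawl one URL and save the extracted content as markdown."""
    # Imported lazily so the filename helper works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.markdown:
            raise ValueError(f"No content found at {url}")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(result.markdown)
```

End to end, the script would amount to something like `asyncio.run(crawl_to_markdown(url, default_filename(url)))`, with an `input()` prompt letting the user override the suggested filename.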

Output

The crawler saves the content in markdown format. Output files are named using one of these conventions:

  • User-specified filename
  • Auto-generated filename based on the URL (e.g., crawled_example_com.md)

Requirements

  • Python 3.7+
  • crawl4ai
  • playwright-stealth
  • setuptools

Error Handling

The script handles various error cases:

  • No content found
  • Invalid URLs
  • Network errors
  • File system errors
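For instance, the invalid-URL case can be caught before any network request is made. The `validate_url` helper below is an illustrative sketch, not the script's actual check:

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Reject URLs that lack an http(s) scheme or a host before crawling."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url
```

Network and file system errors would similarly be caught around the crawl and save steps (e.g. `try`/`except OSError`) and reported rather than left as tracebacks.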

License

MIT License
