A simple yet powerful web crawler that extracts content from websites and saves it in Markdown format. Built using `crawl4ai` and `playwright-stealth` for reliable web scraping.
- Asynchronous web crawling for efficient performance
- Automatic content extraction with smart detection
- Markdown output format
- Interactive filename selection
- Error handling and verbose logging
- Stealth mode to avoid detection
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
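If the repository does not ship a `requirements.txt`, one matching the dependencies listed under Requirements would look something like this (version pins are not specified in this README, so none are shown):

```text
crawl4ai
playwright-stealth
setuptools
```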
Run the crawler from the command line:

```bash
python crawl.py <url>
```

Example:

```bash
python crawl.py https://example.com
```
The script will:
- Prompt for an output filename (with a default suggestion)
- Crawl the specified URL
- Extract the content
- Save it as a markdown file
The crawler saves the content in Markdown format. Output files are named using one of these conventions:
- A user-specified filename
- An auto-generated filename based on the URL (e.g., `crawled_example_com.md`)
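The auto-generated fallback can be sketched as a small helper that slugifies the URL's host. The function name and exact slug rules here are assumptions for illustration; the actual script's naming logic may differ:

```python
from urllib.parse import urlparse

def default_filename(url: str) -> str:
    """Derive a Markdown filename from a URL's host, e.g.
    https://example.com -> crawled_example_com.md.
    Hypothetical helper; not taken from crawl.py itself."""
    host = urlparse(url).netloc or "output"
    # Replace characters that are awkward in filenames
    slug = host.replace(".", "_").replace(":", "_")
    return f"crawled_{slug}.md"
```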
- Python 3.7+
- crawl4ai
- playwright-stealth
- setuptools
The script handles various error cases:
- No content found
- Invalid URLs
- Network errors
- File system errors
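A minimal sketch of how these cases might be distinguished. The function names and messages below are assumptions for illustration, not the actual handlers in `crawl.py`:

```python
import sys
from urllib.parse import urlparse

def validate_url(url: str) -> str:
    """Reject obviously malformed URLs before crawling (sketch)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url}")
    return url

def save_markdown(content: str, path: str) -> None:
    """Write extracted content, surfacing 'no content' and file-system errors."""
    if not content or not content.strip():
        raise ValueError("No content found")
    try:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(content)
    except OSError as exc:  # file-system errors (permissions, bad path)
        print(f"Could not write {path}: {exc}", file=sys.stderr)
        raise
```

Network errors would typically surface as exceptions from the crawl itself and can be caught around the crawler call in the same spirit.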