Web Content Crawler

A simple yet powerful web crawler that extracts content from websites and saves it in markdown format. Built using crawl4ai and playwright-stealth for reliable web scraping.

Features

  • Asynchronous web crawling for efficient performance
  • Automatic content extraction with smart detection
  • Markdown output format
  • Interactive filename selection
  • Error handling and verbose logging
  • Stealth mode to avoid detection

Installation

  1. Clone the repository:

     git clone <repository-url>
     cd <repository-name>

  2. Install the required dependencies:

     pip install -r requirements.txt

Usage

Run the crawler from the command line:

python crawl.py <url>

Example:

python crawl.py https://example.com

The script will:

  1. Prompt for an output filename (with a default suggestion)
  2. Crawl the specified URL
  3. Extract the content
  4. Save it as a markdown file
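The flow above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `crawl.py`: the `AsyncWebCrawler` usage follows crawl4ai's documented quickstart pattern (which may differ between versions), and `default_filename` and `crawl_to_markdown` are hypothetical helper names.

```python
import asyncio
import re
from urllib.parse import urlparse


def default_filename(url: str) -> str:
    """Suggest a default like 'crawled_example_com.md' from the URL host."""
    host = urlparse(url).netloc or "output"
    return "crawled_" + re.sub(r"[^A-Za-z0-9]+", "_", host).strip("_") + ".md"


async def crawl_to_markdown(url: str, filename: str) -> None:
    """Crawl one URL and save the extracted content as markdown."""
    # Imported lazily so the filename helper works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.markdown:
            raise ValueError(f"No content found at {url}")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(result.markdown)
```

End to end, the script would amount to something like `asyncio.run(crawl_to_markdown(url, default_filename(url)))`, with an `input()` prompt letting the user override the suggested filename.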

Output

The crawler saves the content in markdown format. Output files are named using one of these conventions:

  • User-specified filename
  • Auto-generated filename based on the URL (e.g., crawled_example_com.md)

Requirements

  • Python 3.7+
  • crawl4ai
  • playwright-stealth
  • setuptools

Error Handling

The script handles various error cases:

  • No content found
  • Invalid URLs
  • Network errors
  • File system errors
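For instance, the invalid-URL case can be caught before any network request is made. The `validate_url` helper below is an illustrative sketch, not the script's actual check:

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Reject URLs that lack an http(s) scheme or a host before crawling."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url
```

Network and file system errors would similarly be caught around the crawl and save steps (e.g. `try`/`except OSError`) and reported rather than left as tracebacks.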

License

MIT License
