
Tarantula: An Event-Driven Multithreaded Web Crawler

Welcome to Tarantula, an event-driven, multithreaded web crawler built in Rust. Tarantula is designed to crawl websites efficiently, respecting robots.txt and rate-limiting rules, and to deliver results to callback endpoints for further processing. It is a project that has been crafted over several months and is the culmination of my journey into Rust programming.

Key Features

  • Event-Driven Crawling: Given a task, Tarantula crawls a website to a specified depth, using an event-driven architecture for efficiency and responsiveness.

  • Respects robots.txt: Tarantula abides by robots.txt rules and honors website-specific crawling restrictions.

  • Multithreaded Performance: Tarantula spreads its crawling work across multiple threads, fetching and processing pages in parallel for faster crawls.

A Learning Journey

Tarantula is more than just a web crawler; it's a testament to my journey in Rust programming. This project reflects my progress, lessons learned, and the evolution of my coding skills. The structure you see here is a result of this learning process, and I've since applied this knowledge to newer projects.

Getting Started

To get started with Tarantula, check out the code and compile it with Rust nightly. cargo run starts the crawler and opens an HTTP server on port 8088. Feel free to explore the code and adapt it to your needs.

The best thing about tarantulas is that they don't even spin webs to catch their food :)

Requirements: Rust nightly and OpenSSL development libraries (libssl-dev or rust-tokio-native-tls+default-devel.noarch, depending on your distribution).
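
If you manage toolchains with rustup (an assumption; any setup that provides a nightly toolchain works), building and running the server looks roughly like this:

    # install the nightly toolchain, then build and start the server on port 8088
    rustup toolchain install nightly
    cargo +nightly run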

Trigger

Once the server is running, e.g., on http://127.0.0.1:8088, fire up a new crawl task by sending the RunConfig structure via PUT to the /crawl endpoint: http://127.0.0.1:8088/crawl

RunConfig structure:

    {
      "url": "https://example.com",
      "ignore_redirects": false,
      "maximum_redirects": 10,
      "maximum_depth": 16,
      "ignore_robots_txt": false,
      "keep_html_in_memory": false,
      "user_agent": "testbanane",
      "callback": "https://yourhost/crawl-results"
    }
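
As a minimal sketch of triggering a crawl from the command line (assuming curl is available; the target URL and callback are placeholders), a request could look like this:

    # send the RunConfig as JSON via PUT to the /crawl endpoint
    curl -X PUT http://127.0.0.1:8088/crawl \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com", "ignore_redirects": false, "maximum_redirects": 10, "maximum_depth": 16, "ignore_robots_txt": false, "keep_html_in_memory": false, "user_agent": "testbanane", "callback": "https://yourhost/crawl-results"}'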

The callback inside the RunConfig will be called via POST with a PageResponse structure (page_loader::PageResponse). After a few seconds, the results should appear on the console and at the callback endpoint (hopefully).

Contributing

This project is not actively maintained or developed further. I have decided to use Tarantula and the learning experience it gave me to build a new project completely from scratch (closed source). I still appreciate all kinds of feedback and collaboration.

Thank you for visiting Tarantula; I hope you find it valuable for your web crawling needs. Enjoy exploring the world of web data with this Rust-powered crawler, and please make sure not to stress the servers you are crawling. Be respectful and crawl responsibly.
