IndexStream

IndexStream is a full-text web search engine that crawls web pages, stores and indexes their content, and serves search queries. The project is written in C++ and Python, with each language handling different components of the engine. Its primary goal is to return fast, relevant search results using a custom-built indexing and search algorithm.

Features

Web Scraping: Starts from Wikipedia and follows links outward, respecting robots.txt.
Indexing: Builds a term-document matrix weighted with TF-IDF for efficient document retrieval.
Multithreaded Search: Handles concurrent search queries with a custom thread pool implementation.
Web Crawler: Limits crawl depth to prevent loops and excessive scraping of any single domain.
Full-Text Search: Ranks results by TF-IDF relevance.
Persistent Storage: Stores indexed data in an SQLite database for fast retrieval.
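
The depth control described above can be sketched as a breadth-first traversal with a depth cap and a visited set. This is a minimal illustration, not the project's crawler: `crawl_order`, `get_links`, and the toy link graph are hypothetical names introduced here, and a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_order(start: str,
                get_links: Callable[[str], Iterable[str]],
                max_depth: int) -> List[str]:
    """Return URLs in breadth-first order, at most max_depth hops from start."""
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in visited:  # the visited set prevents loops
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages (hypothetical data).
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

pages = crawl_order("a", lambda u: GRAPH.get(u, []), max_depth=2)
```

With `max_depth=2`, page `e` is never visited because it sits three hops from the start, and the back-link from `b` to `a` is ignored because `a` is already in the visited set.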

Tech Stack

C++: Core engine for processing, parsing, and indexing web content, as well as running the web server.
Python: Web scraper, which handles the scraping logic and adheres to the robots.txt protocol.
SQLite: Lightweight database for storing and indexing web content.

Architecture

Web Scraper (Python):
- Starts at a base URL (e.g., Wikipedia).
- Scrapes web pages while respecting robots.txt.
- Stores raw web content for further indexing.
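
One way to honor robots.txt in Python is the standard library's `urllib.robotparser`. The sketch below is an assumption about how such a check could look, not the project's scraper code: the rules are parsed from an in-memory string so the example runs offline, and the `IndexStreamBot` user agent and `make_checker` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real scraper would download /robots.txt from each host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt: str, user_agent: str):
    """Build a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


allowed = make_checker(ROBOTS_TXT, "IndexStreamBot")  # hypothetical agent name
can_scrape_wiki = allowed("https://example.org/wiki/Search_engine")
can_scrape_private = allowed("https://example.org/private/data")
```

A scraper would call `allowed(url)` before each request and skip any URL for which it returns `False`.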
Indexer (C++):
- Processes the raw web data.
- Extracts URLs, parses documents, and updates the term-document frequency matrix.
- Calculates TF-IDF scores for efficient document retrieval.
- Stores processed data in an SQLite database.
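
The TF-IDF step can be illustrated with a short sketch. It is written in Python for brevity, although the indexer itself is C++, and it assumes one common weighting (term frequency normalized by document length, and `ln(N / df)` for inverse document frequency). The README does not state which TF-IDF variant IndexStream uses, and the `tf_idf` function and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def tf_idf(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return {doc_id: {term: score}} with tf = count / len and idf = ln(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical sample documents.
docs = {
    "d1": "web search engine",
    "d2": "web crawler",
    "d3": "search index",
}
scores = tf_idf(docs)
```

Terms that appear in fewer documents receive a higher weight, so in `d1` the term `engine` (found in one document) outscores `web` (found in two).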
Web Server (C++):
- Exposes a search interface to users.
- Handles HTTP requests and passes search queries to the indexer.
- Retrieves relevant search results from the indexed data and displays them.
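
Retrieving ranked results from SQLite might look like the sketch below. The `postings(term, url, score)` table, the `search` function, and the sample rows are assumptions made for illustration; the project's actual schema is not described in this README. The sketch sums precomputed per-term scores for each URL and returns the highest totals first.

```python
import sqlite3


def search(conn: sqlite3.Connection, query: str, limit: int = 10):
    """Rank URLs by the summed score of the query terms they contain."""
    terms = query.lower().split()
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    rows = conn.execute(
        "SELECT url, SUM(score) AS total FROM postings "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY url ORDER BY total DESC, url LIMIT ?",
        (*terms, limit),
    ).fetchall()
    return [(url, total) for url, total in rows]


# In-memory database with a hypothetical schema and sample scores.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings (term TEXT, url TEXT, score REAL, "
    "PRIMARY KEY (term, url))"
)
conn.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [
        ("search", "https://example.org/a", 0.5),
        ("engine", "https://example.org/a", 0.25),
        ("search", "https://example.org/b", 0.25),
        ("crawler", "https://example.org/c", 0.75),
    ],
)
results = search(conn, "search engine")
```

Only the terms are passed as bound parameters; the placeholder string contains nothing but `?` marks, so the query text never includes user input.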
Thread Pool:
- Manages multiple tasks such as database updates, query handling, and web scraping.
- Allows efficient concurrent processing without overloading the system.
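
The idea behind a thread pool can be sketched with a fixed set of worker threads pulling tasks from a shared queue. This is an illustrative Python version, not the project's own implementation; the `ThreadPool` class, its methods, and `handle_query` are hypothetical names.

```python
import queue
import threading


class ThreadPool:
    """Fixed-size pool: workers take tasks from a shared FIFO queue."""

    def __init__(self, num_workers: int):
        self._tasks = queue.Queue()
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: this worker should exit
                break
            func, args = task
            func(*args)

    def submit(self, func, *args):
        self._tasks.put((func, args))

    def shutdown(self):
        # One sentinel per worker, queued after all pending tasks.
        for _ in self._workers:
            self._tasks.put(None)
        for worker in self._workers:
            worker.join()


results = []
lock = threading.Lock()


def handle_query(query: str):
    # Stand-in for real query handling; the lock guards the shared list.
    with lock:
        results.append(query.upper())


pool = ThreadPool(4)
for q in ["tf-idf", "crawler", "sqlite"]:
    pool.submit(handle_query, q)
pool.shutdown()
```

Because the queue is first-in, first-out, every submitted task is taken by a worker before any worker sees a shutdown sentinel, so `shutdown()` returns only after all queued work has finished.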

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License.

mush1e/IndexStream