SEE (or see, whatever) is a simple search engine written in Erlang. It provides a web crawler, a search engine, and a web frontend, and is split into two applications: `see_db` and `see_crawler`.
## see_db

The `see_db` application handles indexing and the web interface. It is designed to allow swapping storage backends and ranking algorithms. To start the application, run the `start_db_node` script.
Application parameters:

- `ip` (e.g. `{0,0,0,0}`) -- web server IP address
- `port` (e.g. `8888`) -- web server port
- `domain_filter` (e.g. `"^localhost"`) -- regexp filter for URLs (useful for restricting crawling to a specific domain)
- `storage` -- storage backend (see below for available storage backends)
- `rank` -- ranking algorithm (see below for available ranking algorithms)
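Taken together, these parameters would appear in the application configuration roughly like this. This is a hedged, sys.config-style sketch; the exact layout of the app file in the repository may differ:

```erlang
%% Illustrative sys.config-style sketch; the actual file layout may differ.
[{see_db, [{ip,            {0,0,0,0}},          % web server IP address
           {port,          8888},               % web server port
           {domain_filter, "^localhost"},       % only index matching URLs
           {storage,       see_db_storage_ets}, % storage backend module
           {rank,          see_rank_tfidf}      % ranking algorithm module
          ]}].
```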
The storage backend stores web pages together with the additional data structures that facilitate indexing. It is abstracted away from the engine logic (i.e. computing the final results) and is more or less a key/value store. The storage backend is selected by setting the `storage` option in the app file. Currently only ETS and Mnesia backends are implemented.
ETS storage is easy to set up, but it lacks persistence and distribution. Only one `db` node is allowed, so the entire data set must fit into the RAM of a single machine. To select ETS storage, set the `storage` app option to `see_db_storage_ets`.
Mnesia storage can be used to gain persistence and distribution. There can be as many `db` nodes as needed, though it has only been tested with a single node. All tables are `disc_copies`, so the data must still fit into RAM, as table fragmentation is not yet implemented. To select Mnesia storage, set the `storage` app option to `see_db_storage_mnesia`. Then you need to create the schema and tables; to do this for a single node, run the `create_mnesia_schema` script.
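The `create_mnesia_schema` script handles this for you; purely as an illustration of what single-node Mnesia setup involves, it boils down to something like the following (the table name and attributes here are hypothetical, not SEE's actual schema):

```erlang
-module(mnesia_setup).
-export([init/0]).

%% Illustrative single-node Mnesia setup; SEE's real tables are
%% created by the create_mnesia_schema script.
init() ->
    ok = mnesia:create_schema([node()]),      % write the on-disk schema
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(page,  % hypothetical table name
        [{disc_copies, [node()]},             % persistent copy, kept in RAM
         {attributes, [url, words]}]),        % hypothetical record fields
    ok.
```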
Ranking is the most important part of a search engine: queries may return thousands or even millions of results, far too many for a human to sift through. Users want only a dozen or so of the most relevant results, and producing them is the job of the ranking algorithm. The ranking algorithm is selected by setting the `rank` option in the app file. Currently only tf-idf is implemented.
tf-idf is a simple ranking algorithm that takes into account only word occurrences in a page versus in the whole index. To select it, set the `rank` option in the app file to `see_rank_tfidf`.
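As a sketch of the underlying idea (not the actual `see_rank_tfidf` code), a word's score for a page is its term frequency weighted by the inverse document frequency:

```erlang
-module(tfidf_sketch).
-export([score/3]).

%% Tf:    occurrences of the word in the page
%% Df:    number of indexed pages containing the word
%% Total: total number of indexed pages
%% A word scores high when it is frequent in this page but rare overall.
score(Tf, Df, Total) when Df > 0, Total >= Df ->
    Tf * math:log(Total / Df).
```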
## see_crawler

This application is responsible for crawling the web. There may be many nodes running this application. To start it, run the `start_crawler_node` script.
Application parameters:

- `crawler_num` (e.g. `1`) -- number of crawler workers
- `db_node` (e.g. `'db@localhost'`) -- name of the `see_db` node
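In the same sys.config-style sketch as above (an assumption about the file layout, not the repository's actual app file), the crawler side would look like:

```erlang
%% Illustrative sys.config-style sketch; the actual file layout may differ.
[{see_crawler, [{crawler_num, 1},             % number of crawler workers
                {db_node,     'db@localhost'} % node running see_db
               ]}].
```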
## Usage

By default the web interface is available at http://localhost:8888 on the `db` node. You need to add the first URL to begin crawling. To find a page, type your query into the search box and click "Search" or press Enter. Only the 100 most relevant results are shown.
Each crawler requests an unvisited URL from the `db` node and visits it, extracting the words (as they are) and the links from the page, then sends them back to the `db` node. The words are normalized and saved into the index, and the links are inserted as unvisited URLs.
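As an illustration of the normalization step (this is an assumption about what normalization involves, not SEE's actual code), a typical normalizer lowercases each word and strips punctuation:

```erlang
-module(normalize_sketch).
-export([normalize/1]).

%% Hypothetical word normalization: lowercase, then keep only
%% ASCII letters and digits. SEE's actual rules may differ.
normalize(Word) ->
    [C || C <- string:lowercase(Word),
          (C >= $a andalso C =< $z) orelse (C >= $0 andalso C =< $9)].
```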
Check out this site to see a running version: http://vps238545.ovh.net:8888/
It has indexed the whole Erlang documentation.
## TODO

- HTTPS support
- different encoding support (e.g. ISO-8859, CP-1250)
- tf-idf ranking
- Mnesia storage backend
- Amazon S3 storage backend
- PageRank
- stemming
- complex queries (phrases, logic operators, `inurl:`, `intitle:`, `site:`)
- periodically updating already visited pages