RESTful, in-memory, full-text search engine written in Go.
- Full-text indexing of multiple fields in a document
- Exact phrase search
- Document ranking based on BM25
- Vector similarity search for semantic search
- Stemming-based query expansion for many languages
- Document deletion and updating with index garbage collection
To download and run minisearch from a precompiled binary:
- Download a precompiled version of minisearch from GitHub.
- Run the server binary:
$ ./server
To run minisearch with Docker, use the minisearch Docker image:
$ docker run -d --name minisearch -p 3000:3000 micpst/minisearch:latest
To build and run minisearch from the source code:
- Requirements: go & make
- Install dependencies:
$ make setup
- Build:
$ make build
- Run the server binary:
$ ./bin/server
Create a new document and add it to the index.
$ curl -X POST localhost:3000/api/v1/documents \
-H 'Content-Type: application/json' \
-d '{
"title": "The Silicon Brain",
"url": "https://micpst.com/posts/silicon-brain",
"abstract": "The human brain is often described as complex..."
}'
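The same request can be built from Go with the standard library. This is a minimal client-side sketch, not part of minisearch itself: the Document struct and the newCreateRequest helper are hypothetical names, and the field set is taken only from the curl example above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Document mirrors the JSON fields used in the curl example (assumed field set).
type Document struct {
	Title    string `json:"title"`
	URL      string `json:"url"`
	Abstract string `json:"abstract"`
}

// newCreateRequest builds, but does not send, the POST request.
// The helper name and the baseURL parameter are illustrative.
func newCreateRequest(baseURL string, doc Document) (*http.Request, error) {
	body, err := json.Marshal(doc)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/documents", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newCreateRequest("http://localhost:3000", Document{
		Title:    "The Silicon Brain",
		URL:      "https://micpst.com/posts/silicon-brain",
		Abstract: "The human brain is often described as complex...",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it against a running server: resp, err := http.DefaultClient.Do(req)
}
```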
Fill the index with a large number of documents at once by uploading document dumps.
$ curl -X POST localhost:3000/api/v1/upload \
-H 'Content-Type: multipart/form-data' \
-F 'file[]=@/path/to/dataset1.xml.gz' \
-F 'file[]=@/path/to/dataset2.xml.gz'
The dump should have the following structure:
<docs>
<doc>
<title>...</title>
<url>...</url>
<abstract>...</abstract>
</doc>
<doc>
<title>...</title>
<url>...</url>
<abstract>...</abstract>
</doc>
</docs>
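A dump in this shape can be produced with Go's encoding/xml and compress/gzip packages. The sketch below assumes the server accepts a plain gzip-compressed XML file with exactly the elements shown above. The Doc and Dump types and the encodeDump and decodeDump helpers are hypothetical names used for illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/xml"
	"fmt"
)

// Doc maps one <doc> element of the dump.
type Doc struct {
	Title    string `xml:"title"`
	URL      string `xml:"url"`
	Abstract string `xml:"abstract"`
}

// Dump maps the <docs> root element.
type Dump struct {
	XMLName xml.Name `xml:"docs"`
	Docs    []Doc    `xml:"doc"`
}

// encodeDump serializes the documents as XML and gzip-compresses the result.
func encodeDump(docs []Doc) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	enc := xml.NewEncoder(zw)
	enc.Indent("", "  ")
	if err := enc.Encode(Dump{Docs: docs}); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data into buf.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeDump reverses encodeDump; handy for checking a file before uploading it.
func decodeDump(data []byte) ([]Doc, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var d Dump
	if err := xml.NewDecoder(zr).Decode(&d); err != nil {
		return nil, err
	}
	return d.Docs, nil
}

func main() {
	data, err := encodeDump([]Doc{{
		Title:    "The Silicon Brain",
		URL:      "https://micpst.com/posts/silicon-brain",
		Abstract: "The human brain is often described as complex...",
	}})
	if err != nil {
		panic(err)
	}
	docs, err := decodeDump(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(docs), docs[0].Title)
}
```

Writing the returned bytes to a file such as dataset1.xml.gz gives something that can be passed to the upload command above.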
Update the existing document and re-index it with the new fields.
$ curl -X PUT localhost:3000/api/v1/documents/<id> \
-H 'Content-Type: application/json' \
-d '{
"title": "The Silicon Brain",
"url": "https://micpst.com/posts/silicon-brain",
"abstract": "The human brain is often described as complex..."
}'
Permanently delete the document and remove it from the index.
$ curl -X DELETE localhost:3000/api/v1/documents/<id>
The properties parameter defines which properties the query runs against.
$ curl -X POST localhost:3000/api/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "Brain",
"properties": ["title"]
}'
We are now searching for all the documents that contain the word Brain in the title property.
We can also search through nested properties:
$ curl -X POST localhost:3000/api/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "Mic",
"properties": ["author.name"]
}'
By default, MiniSearch searches in all searchable properties.
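From Go, the search body can be modeled as a small struct. This is a sketch under assumptions: SearchRequest and encodeSearch are hypothetical names, and it is assumed that leaving properties out of the body is what triggers the default of searching all searchable properties.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SearchRequest models the two request fields introduced so far.
// omitempty drops "properties" from the JSON when no property is given.
type SearchRequest struct {
	Query      string   `json:"query"`
	Properties []string `json:"properties,omitempty"`
}

// encodeSearch returns the JSON body for a search request.
func encodeSearch(r SearchRequest) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	scoped, err := encodeSearch(SearchRequest{Query: "Mic", Properties: []string{"author.name"}})
	if err != nil {
		panic(err)
	}
	all, err := encodeSearch(SearchRequest{Query: "Mic"})
	if err != nil {
		panic(err)
	}
	fmt.Println(scoped) // {"query":"Mic","properties":["author.name"]}
	fmt.Println(all)    // {"query":"Mic"}
}
```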
The exact property finds all the documents with an exact match of the query property.
$ curl -X POST localhost:3000/api/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "Brain",
"properties": ["title"],
"exact": true
}'
We are now searching for all the documents that contain exactly the word Brain in the title property.
Without the exact property, for example, the term Brain-busting would be returned as well, as it contains the word Brain.
The tolerance property sets the maximum edit distance (following the Levenshtein algorithm) allowed between the query and a term in the searchable property.
$ curl -X POST localhost:3000/api/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "Brin",
"properties": ["title"],
"tolerance": 1
}'
We are searching for all the documents that contain a term within an edit distance of 1 of the query (e.g. Brain) in the title property.
The tolerance property doesn't work together with the exact parameter. If both are set, exact takes priority.
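For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function below is a textbook two-row dynamic-programming sketch of that distance. It is not taken from the minisearch source, and the engine may compute fuzzy matches differently (for example, directly on its index).

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, compared rune by rune.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // distances for the previous row
	curr := make([]int, len(rb)+1) // distances for the row being filled
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(
				prev[j]+1,      // deletion
				curr[j-1]+1,    // insertion
				prev[j-1]+cost, // substitution or match
			)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	// "Brin" becomes "Brain" with one insertion, so a tolerance of 1 covers it.
	fmt.Println(levenshtein("Brin", "Brain")) // 1
}
```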
The offset and limit properties allow paginating the results.
$ curl -X POST localhost:3000/api/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "Brain",
"properties": ["title"],
"offset": 10,
"limit": 5
}'
By default, MiniSearch limits the search results to 10, without any offset.
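Conceptually, offset and limit select a window over the ranked result list. The sketch below illustrates that windowing on a plain slice. The paginate helper and the constant names are hypothetical, the handling of out-of-range values is an assumption, and only the default values (limit 10, offset 0) come from the text above.

```go
package main

import "fmt"

// Defaults stated in the text above.
const (
	defaultOffset = 0
	defaultLimit  = 10
)

// paginate returns at most limit hits, starting at index offset.
// Out-of-range values yield an empty page (an assumption of this sketch).
func paginate(hits []string, offset, limit int) []string {
	if offset < 0 {
		offset = 0
	}
	if offset >= len(hits) || limit <= 0 {
		return nil
	}
	end := offset + limit
	if end > len(hits) {
		end = len(hits)
	}
	return hits[offset:end]
}

func main() {
	hits := make([]string, 20)
	for i := range hits {
		hits[i] = fmt.Sprintf("doc-%d", i)
	}
	fmt.Println(paginate(hits, 10, 5))                       // [doc-10 doc-11 doc-12 doc-13 doc-14]
	fmt.Println(len(paginate(hits, defaultOffset, defaultLimit))) // 10
}
```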
MiniSearch uses the BM25 algorithm to calculate the relevance of a document when searching.
You can edit the BM25 parameters by using the relevance property in the search configuration object:
- k: term frequency saturation parameter. Default value: 1.2. Recommended value: between 1.2 and 2.
- b: length normalization parameter. Default value: 0.75. Recommended value: > 0.75.
- d: frequency normalization lower bound. Default value: 0.5. Recommended value: between 0.5 and 1.
$ curl -X POST localhost:3000/api/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "Brain",
"properties": ["title"],
"relevance": {
"k": 1.2,
"b": 0.75,
"d": 0.5
}
}'
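To give a feel for what k, b, and d control, here is a sketch of one common BM25 formulation with an additive lower bound (often called BM25+). It is an assumption that this resembles what minisearch computes. The exact formula in the engine may differ, and the Relevance type and the idf and score functions are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// Relevance holds the three tunable parameters described above.
type Relevance struct {
	K float64 // term frequency saturation
	B float64 // length normalization
	D float64 // frequency normalization lower bound
}

// Default values as listed in the text above.
var defaultRelevance = Relevance{K: 1.2, B: 0.75, D: 0.5}

// idf is a smoothed inverse document frequency that never goes negative.
func idf(totalDocs, docsWithTerm int) float64 {
	n, total := float64(docsWithTerm), float64(totalDocs)
	return math.Log(1 + (total-n+0.5)/(n+0.5))
}

// score rates one term against one document (BM25+ style, assumed formulation).
// tf is the term frequency in the document, docLen and avgDocLen are lengths in terms.
func score(tf, docLen, avgDocLen float64, totalDocs, docsWithTerm int, r Relevance) float64 {
	if tf <= 0 || avgDocLen <= 0 {
		return 0 // the term does not occur, so it contributes nothing
	}
	// b scales how strongly longer-than-average documents are penalized.
	norm := 1 - r.B + r.B*docLen/avgDocLen
	// k caps the gain from repeated occurrences; d lifts every match by a fixed amount.
	return idf(totalDocs, docsWithTerm) * (tf*(r.K+1)/(tf+r.K*norm) + r.D)
}

func main() {
	short := score(1, 10, 10, 10, 1, defaultRelevance)
	long := score(1, 20, 10, 10, 1, defaultRelevance)
	fmt.Println(short > long) // true: the same match in a longer document scores lower
}
```

In this formulation, raising k lets repeated terms keep adding to the score for longer, raising b penalizes long documents more, and raising d increases the floor every matching document receives.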
All my code is MIT licensed. Libraries follow their respective licenses.