
Semantic Search

by Brayton Hall

Full Presentation in Repository.

Table of Contents:

  • Motivation
  • Data from Project Gutenberg
  • EDA
  • Doc2Vec Model
  • Implementation
  • Front End on Heroku

Motivation

My aim was to build a specialized semantic search engine that returns the paragraphs most similar to an input string: in other words, a 'search-between-lines' app that searches by connotation and misremembered quotes rather than by exact fragments. It was inspired by the frustrations of the simple search engines built into e-readers and websites, such as the one on a Kindle.

Data from Project Gutenberg

The data includes 100 of the top 'free ebooks' from Project Gutenberg, scraped with BeautifulSoup. It contains approximately 12 million words, cleaned and tokenized, split into approximately 57,000 paragraphs of roughly 1,200 characters each. Features engineered in Pandas include 'lexicon' (unique word count) and 'lexicon ratio' (unique words as a share of total words), a rough marker of literary uniqueness for identifying more difficult or 'wordy' prose.
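A minimal sketch (not the repository's exact code) of the scraping and feature-engineering steps; the example URL, book choice, and column names are illustrative assumptions:

    import re
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    def fetch_gutenberg_text(url):
        # Download one Project Gutenberg HTML ebook and strip its markup
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        return soup.get_text()

    def tokenize(text):
        # Lowercase and keep purely alphabetic tokens
        return re.findall(r"[a-z]+", text.lower())

    # Illustrative single book; the project scraped 100 of these
    tokens = tokenize(fetch_gutenberg_text(
        "https://www.gutenberg.org/files/2701/2701-h/2701-h.htm"))

    df = pd.DataFrame([{"title": "Moby Dick",
                        "word_count": len(tokens),
                        "lexicon": len(set(tokens))}])   # unique word count
    df["lexicon_ratio"] = df["lexicon"] / df["word_count"]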

EDA

The following shows general info about the novels collected.

[figure: novels]

The following graphic shows the sweet spot in the data: books with high lexicon ratios (>5%) that are also longer than 100,000 words, identifying a 'literary' subsection of prototypically difficult or wordy English novels, such as Ulysses and Moby Dick.

[figure: source]
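The cutoff behind the graphic amounts to a simple filter; a sketch assuming the DataFrame from the scraping step above, with one row per book:

    literary = df[(df["lexicon_ratio"] > 0.05) & (df["word_count"] > 100_000)]
    print(literary["title"].tolist())   # e.g. Ulysses, Moby Dick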

Doc2Vec Model

[figure: word2vec]

Tagging

Like Word2Vec, Doc2Vec creates vector representations of words; it also creates vector representations of whole documents, which here are paragraphs (I arbitrarily decided that 1,200 characters == 1 paragraph). For Doc2Vec to vectorize tokenized docs, each doc (paragraph) must be wrapped in a TaggedDocument. For example, the first paragraph of Anna Karenina:

TaggedDocument(words=['happy', 'families', 'are', 'all', 'alike', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way'...'after', 'the', 'quarrel', 'prince'], tags=['47831'])

This is paragraph 47,831 in the corpus of approximately 57,000 paragraphs.
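A minimal sketch of the tagging step with gensim; the two stand-in paragraphs below represent the ~57,000 real ones, and tags are paragraph indices as strings, matching the example above:

    from gensim.models.doc2vec import TaggedDocument

    # Two stand-in tokenized paragraphs; the real corpus has ~57,000
    paragraphs = [["happy", "families", "are", "all", "alike"],
                  ["every", "unhappy", "family", "is", "unhappy"]]

    tagged_docs = [TaggedDocument(words=tokens, tags=[str(i)])
                   for i, tokens in enumerate(paragraphs)]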

Vectorizing

Doc2Vec parameters for modelling (a training sketch follows the list):

  • Vector size of 300 (each paragraph is represented as a 300-dimensional vector)
  • Learning rate of 0.01
  • 200 epochs
  • Window size of 20 tokens (a wide context window, since word order within a paragraph matters)
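
A sketch of training with these parameters, using gensim's Doc2Vec; any hyperparameter not listed above is an assumption:

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec(vector_size=300,   # 300-dimensional paragraph vectors
                    alpha=0.01,        # learning rate
                    window=20,         # wide context window
                    epochs=200,
                    min_count=2)       # assumed; not stated above
    model.build_vocab(tagged_docs)
    model.train(tagged_docs, total_examples=model.corpus_count,
                epochs=model.epochs)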

The trained model (about 150 MB, roughly 20 minutes to build) was pickled and stored in an AWS S3 bucket.
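Persisting and uploading might look like the following; the bucket and key names are placeholders, not the project's real ones:

    import boto3

    model.save("doc2vec_paragraphs.model")   # gensim's pickle-based save

    s3 = boto3.client("s3")
    s3.upload_file("doc2vec_paragraphs.model",
                   "my-semantic-search-bucket",
                   "doc2vec_paragraphs.model")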

Implementation

The following is an example of how the search function can be used locally in a Jupyter notebook; a front end for the app is deployed on Heroku.

[figure: alice]
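The repository's exact search helper isn't reproduced here; a minimal sketch of an equivalent query with gensim 4.x (the function name, the query string, and the reuse of tokenize() from the scraping sketch are illustrative):

    def semantic_search(model, paragraphs, query, topn=5):
        # Infer a vector for the query, then rank stored paragraph vectors
        query_vec = model.infer_vector(tokenize(query))
        # gensim 4.x exposes document vectors as model.dv (model.docvecs in 3.x)
        hits = model.dv.most_similar([query_vec], topn=topn)
        return [(tag, sim, " ".join(paragraphs[int(tag)])) for tag, sim in hits]

    for tag, sim, text in semantic_search(model, paragraphs,
                                          "falling down a rabbit hole"):
        print(f"{sim:.3f}  paragraph {tag}: {text[:80]}")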

Front End on Heroku

Deployed using a Docker container.

frontend

The app can be accessed at https://limitless-plains-12586.herokuapp.com/
