GitHub - okoye/snippet-extractor: A *rudimentary* snippet extractor to retrieve most relevant snippet from a supplied document.

Repository files (9 commits):
testdata
DocumentParser.java
DocumentParserTest.java
HighlightDocument.java
README
RelevanceEngine.java
ScoreComparator.java
ScoreComputer.java
Snippet.java
StartIndexComparator.java
TestMain.java
TrieNode.java
TrieStructure.java
TrieStructureTest.java
Word.java

Snippet Extractor

########Running Snippet Extractor#########
1. Open 'TestMain.java' and ensure that the proper path to a test document
   is specified in the file variable. By default it uses the content of
   'testdata/test1.txt'.

2. When the proper document has been specified, set the search keyword and
   run the program.

3. It should print a list of all snippets extracted from the document and,
   based on these results, the auto-generated most relevant snippet.

#######Brief Description of Program########
When supplied a document, the program parses it exactly once, extracting every
word and storing it in a trie together with the index of each occurrence. Once
parsing completes, each search keyword is looked up in the trie to find every
index at which it occurs in the document. A lookup takes at most O(dm) time,
where d is the size of the alphabet accepted by the trie (62) and m is the
length of the word; most lookups take O(m) time. After collecting all relevant
indexes, the program extracts the snippet surrounding each occurrence and uses
the 'RelevanceEngine' class to compute the relevance of each snippet. The
'RelevanceEngine' class also merges similar snippets, deletes redundant
snippets and creates new snippets. Each snippet is approximately 15 words long,
while the most relevant snippet averages 140-200 characters.
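
The indexing and extraction steps described above can be illustrated with a
minimal sketch. This is not the code from 'TrieStructure.java' or
'RelevanceEngine.java'; the class name 'PositionalTrieSketch' and its methods
are hypothetical. It assumes that the 62-symbol alphabet is a-z, A-Z and 0-9,
that the stored index is the word's position in the document, and that a
snippet is a fixed window of words around a match (a radius of 7 gives the
roughly 15 words mentioned above). Scoring and merging are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a trie that remembers where each word occurs. */
class PositionalTrieSketch {
    // Assumption: 62 symbols means a-z, A-Z and 0-9.
    private static final int ALPHABET = 62;

    private static final class Node {
        final Node[] children = new Node[ALPHABET];
        // Word indexes of every occurrence of the word ending at this node.
        final List<Integer> positions = new ArrayList<>();
    }

    private final Node root = new Node();

    /** Maps a character to its child slot, or -1 if it is not in the alphabet. */
    private static int slot(char c) {
        if (c >= 'a' && c <= 'z') return c - 'a';
        if (c >= 'A' && c <= 'Z') return 26 + (c - 'A');
        if (c >= '0' && c <= '9') return 52 + (c - '0');
        return -1;
    }

    /** Records one occurrence of a word; characters outside the alphabet are skipped. */
    void insert(String word, int wordIndex) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int s = slot(word.charAt(i));
            if (s < 0) continue; // e.g. trailing punctuation
            if (node.children[s] == null) node.children[s] = new Node();
            node = node.children[s];
        }
        if (node != root) node.positions.add(wordIndex);
    }

    /** Returns every word index at which the keyword occurs (case-sensitive). */
    List<Integer> positions(String keyword) {
        Node node = root;
        for (int i = 0; i < keyword.length(); i++) {
            int s = slot(keyword.charAt(i));
            if (s < 0) continue;
            if (node.children[s] == null) return new ArrayList<>();
            node = node.children[s];
        }
        return node.positions;
    }

    /** Builds the trie in a single pass over the document's words. */
    static PositionalTrieSketch index(String[] words) {
        PositionalTrieSketch trie = new PositionalTrieSketch();
        for (int i = 0; i < words.length; i++) {
            trie.insert(words[i], i);
        }
        return trie;
    }

    /** Returns the words within 'radius' positions of 'center', joined by spaces. */
    static String window(String[] words, int center, int radius) {
        int from = Math.max(0, center - radius);
        int to = Math.min(words.length, center + radius + 1);
        return String.join(" ", Arrays.copyOfRange(words, from, to));
    }
}
```

With this sketch, a caller would split the document into words, call
'index' once, then call 'positions' for each keyword and 'window' for each
returned index to obtain the candidate snippets.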

For more information on the methods of each class, generate the Javadoc for
the classes or contact me :).