GitHub - ritikavnair/Search-Engine-Implementation: CS6200 Information Retrieval Project

PROJECT
=========
Goal: Design and Implement a Search Engine

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SYNOPSIS:
--------

We have created an information retrieval system that has the following retrieval models:
1) BM25
2) QLM
3) Lucene
4) tf-idf

These retrieval models have been modified to use stopping, stemming, and pseudo-relevance feedback.
Performance is judged by comparing the rankings generated by these models with the relevance judgements provided with CACM. Snippet generation has also been incorporated to display the results.
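
As a rough illustration of the scoring behind the BM25 model listed above, the sketch below scores one document against a query without using relevance information. The function name, the parameter values (k1 = 1.2, b = 0.75, k2 = 100) and the argument layout are assumptions made for this example; they are not taken from Retriever.py.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, doc_freq, num_docs,
               k1=1.2, b=0.75, k2=100.0):
    """Score one document with BM25 (illustrative; no relevance information).

    query_terms: list of query tokens
    doc_tf:      dict term -> frequency of the term in this document
    doc_freq:    dict term -> number of documents containing the term
    """
    # K normalises term frequency by document length.
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term in set(query_terms):
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # the term does not occur in this document
        n = doc_freq.get(term, 0)
        qf = query_terms.count(term)
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        score += idf * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))
    return score
```

With these assumed defaults, a document that contains a rare query term more often receives a higher score than one that contains it less often, and a document with no query terms scores 0.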

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

GENERAL USAGE NOTES:
---------------------

-- This file contains instructions for installing the required software and running the programs in a Windows environment.
-- The instructions may not match the installation procedures on other operating systems such as macOS or Ubuntu.
-- However, the programs themselves are operating-system independent and should run on any platform once the initial installation is complete.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SETUP: 
------

This code requires the following software packages to be installed:
1. Python 3.6.3
	Download and install from "https://www.python.org/downloads/"
2. Lucene 4.7.2
	Download and install Lucene from
	https://lucene.apache.org/
	https://archive.apache.org/dist/lucene/java/4.7.2/
3. BeautifulSoup package
	Can be downloaded from "https://www.crummy.com/software/BeautifulSoup/"
	Can be installed using pip by entering the following command
	in Terminal or Command Line:

		 pip install beautifulsoup4
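
Before running the programs, a short check such as the one below can confirm that the Python side of the setup is in place. This is a minimal sketch using only the standard library; the function names are made up for this example, and treating any Python 3.6 or newer interpreter as acceptable is an assumption (the project itself names Python 3.6.3).

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

def check_setup():
    """Summarise whether the interpreter and BeautifulSoup look usable."""
    return {
        # Assumption: versions newer than 3.6.3 are also acceptable.
        "python_3_6_or_newer": sys.version_info >= (3, 6),
        # The beautifulsoup4 package is imported under the name 'bs4'.
        "beautifulsoup4": is_installed("bs4"),
    }

if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(name, "OK" if ok else "MISSING")
```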
		 
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

COMPILE AND RUN: 
----------------
Unzip the given solution folder into a local directory. All necessary files required to run 
this project will be extracted.

PHASE 1:
	TASK 1:
	-- For Baseline runs:
		A. Open Windows PowerShell
		B. Navigate to the directory where the solution was unzipped. 
		C. Perform the following steps in order:
		D. Run Retriever.py using the command 'python Retriever.py'.
		   Internally this code invokes Parser.py and Indexer.py to parse the corpus and generate the inverted index.
		E. The program will prompt for user choice of retrieval model.	
		
			=> BM25: If you select BM25, the output will be generated in BM25RelevanceRun.txt in the RunOutputs folder.
			=> TFIDF: If you select TFIDF, the output will be generated in TFIDFRun.txt in the RunOutputs folder.
			=> QLM: If you select QLM, the output will be generated in QLRun.txt in the RunOutputs folder.	
			
		F. The output of the chosen retrieval will be generated in a folder called 'RunOutputs'. This folder is auto-generated by the program.
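
The steps above mention that the corpus is parsed and an inverted index is generated before retrieval. The sketch below shows one common shape for such an index (term -> document -> term frequency). The function name, the input layout and the document IDs are assumptions for illustration and may differ from what Indexer.py actually builds.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Build a unigram inverted index.

    docs: dict mapping doc_id -> list of tokens (already parsed).
    Returns a dict mapping term -> {doc_id: term frequency}.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        # Counter gives the frequency of every term in this document.
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return dict(index)

# Hypothetical two-document corpus.
example_docs = {
    "CACM-0001": ["parallel", "sorting", "parallel"],
    "CACM-0002": ["sorting", "files"],
}
example_index = build_inverted_index(example_docs)
```

The number of entries under a term is its document frequency, which is what BM25, tf-idf and query likelihood scoring all need alongside the per-document term frequency.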
		
		
	
	-- For Lucene
		A. Create a new Java project using the Lucene.java file provided.
		   Create this project in the same directory and set the path in inputLocation to the path containing Lucene.java.
		B. Add the three following jars into your project's list of referenced libraries:
			1. lucene-core-VERSION.jar
			2. lucene-queryparser-VERSION.jar
			3. lucene-analyzers-common-VERSION.jar
		C. TokenizedFile and queries.txt should be in the same directory, and their paths should be set in the program.
		   The program indexes the documents from TokenizedFile and ranks them against the queries.
		   The output is written to the prog folder as LuceneOutput.txt.
		D. Run the java program.
		
	TASK 2:
	A. Open Windows PowerShell
	B. Navigate to the directory where the solution was unzipped. 
	C. Run PseudoRelevance.py using the command 'python PseudoRelevance.py'. 
	   Internally this code invokes Parser.py and Indexer.py to parse the corpus and generate the inverted index.
	   Pseudo-relevance has been performed on the BM25 model.
	D. The output will be generated in a folder called 'RunOutputs'. This folder is auto-generated by the program.
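
One simple form of pseudo-relevance feedback is to treat the top-ranked documents of an initial run as relevant and add their most frequent terms to the query. The sketch below illustrates that idea only; the function name, the cut-offs (top_k, num_new_terms) and the frequency-based term selection are assumptions, and PseudoRelevance.py may use a different expansion method.

```python
from collections import Counter

def expand_query(query_terms, ranked_doc_ids, doc_tokens,
                 top_k=10, num_new_terms=5, stopwords=frozenset()):
    """Expand a query with frequent terms from the top_k ranked documents.

    ranked_doc_ids: doc IDs from an initial run, best first
    doc_tokens:     dict doc_id -> list of tokens
    """
    counts = Counter()
    for doc_id in ranked_doc_ids[:top_k]:
        counts.update(doc_tokens[doc_id])
    original = set(query_terms)
    candidates = [(term, c) for term, c in counts.items()
                  if term not in original and term not in stopwords]
    # Highest count first; ties broken alphabetically for repeatable output.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    return list(query_terms) + [term for term, _ in candidates[:num_new_terms]]
```

The expanded query would then be scored again with the same retrieval model (BM25 in this project) to produce the final ranking.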
	
	TASK 3:
	A. Open Windows PowerShell
	B. Navigate to the directory where the solution was unzipped. 
	Stopping:
	C. Run StoppedRetriever.py using the command 'python StoppedRetriever.py'.
	   Internally this code invokes Parser.py and Indexer.py to parse the corpus and generate the inverted index.
	D. The program will prompt for user choice of retrieval model.
	
		=> BM25: If you select BM25, the output will be generated in StoppedBM25RelevanceRun.txt in the RunOutputs folder.
		=> TFIDF: If you select TFIDF, the output will be generated in StoppedTFIDFRun.txt in the RunOutputs folder.
		=> QLM: If you select QLM, the output will be generated in StoppedQLRun.txt in the RunOutputs folder.

	E. The output of the chosen retrieval will be generated in a folder called 'RunOutputs'. This folder is auto-generated by the program.
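
Stopping removes very common words before indexing and querying. The sketch below shows the idea under the assumption that the stop list (for example the common_words file) holds one word per line; the function names are hypothetical and not taken from StoppedRetriever.py.

```python
def load_stopwords(lines):
    """Build a stop list from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}

def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop list, keeping their order."""
    return [token for token in tokens if token.lower() not in stopwords]

# Hypothetical stop list and query.
example_stopwords = load_stopwords(["the\n", "of\n", "\n"])
example_tokens = remove_stopwords(["the", "design", "of", "compilers"], example_stopwords)
```

An open file object can be passed directly as the lines argument, since iterating over a text file yields its lines.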
	Stemming:
	C. Run StemmedRetriever.py using the command 'python StemmedRetriever.py'.
	   Internally this code invokes StemmedParser.py and StemmedIndexer.py to parse the corpus and generate the inverted index.
	D. The program will prompt for user choice of retrieval model.
	
		=> BM25: If you select BM25, the output will be generated in StemmedBM25RelevanceRun.txt in the RunOutputs folder.
		=> TFIDF: If you select TFIDF, the output will be generated in StemmedTFIDFRun.txt in the RunOutputs folder.
		=> QLM: If you select QLM, the output will be generated in StemmedQLRun.txt in the RunOutputs folder.
	E. The output of the chosen retrieval will be generated in a folder called 'RunOutputs'. 
	
PHASE 2:
	A. Open Windows PowerShell
	B. Navigate to the directory where the solution was unzipped. 
	C. Run RetrievalWithSnippets.py using the command 'python RetrievalWithSnippets.py'.
	   Internally this code invokes Parser.py and Indexer.py to parse the corpus and generate the inverted index.
	   It also invokes SnippetGenerator.py to produce snippets and perform query term highlighting.
	   
	D. The output will be generated on the console, with the query terms highlighted in snippets.
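
Snippet generation with query term highlighting can be sketched as picking the window of words that contains the most query terms and marking those terms. The window size, the square-bracket markers and the function name below are assumptions for illustration; SnippetGenerator.py may select and highlight text differently.

```python
def make_snippet(text, query_terms, window=8, mark=("[", "]")):
    """Return the window of words with the most query terms, with hits marked."""
    words = text.split()
    terms = {term.lower() for term in query_terms}

    def is_hit(word):
        # Compare case-insensitively and ignore surrounding punctuation.
        return word.lower().strip(".,;:!?()\"'") in terms

    best_start, best_hits = 0, -1
    # Slide a fixed-size window over the text; the earliest best window wins.
    for start in range(max(1, len(words) - window + 1)):
        hits = sum(is_hit(word) for word in words[start:start + window])
        if hits > best_hits:
            best_start, best_hits = start, hits
    chosen = words[best_start:best_start + window]
    return " ".join(mark[0] + word + mark[1] if is_hit(word) else word
                    for word in chosen)

example_snippet = make_snippet(
    "We describe a fast parallel algorithm for sorting large files on disk",
    ["parallel", "sorting"],
    window=4,
)
```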
	
PHASE 3:
	A. Open Windows PowerShell
	B. Navigate to the directory where the solution was unzipped.
	C. Navigate to the 'Evaluation' folder.
	D. Copy and paste the name of the output file to be evaluated into Runs1.txt.
	   For example, if Runs1.txt contains LuceneOutput.txt, the evaluation generates LuceneOutputresult.txt and LuceneOutputMAPMRRresult.txt.
	E. Outputs: precision, recall, p@5 and p@20 are written to LuceneOutputresult.txt,
	   and MAP and MRR are written to LuceneOutputMAPMRRresult.txt.
	F. Similarly, the names of the other 7 output files can be pasted into Runs1.txt one at a time.
	   The results are generated as "outputfilename"result.txt for precision, recall, p@5 and p@20,
	   and as "outputfilename"MAPMRRresult.txt for MAP and MRR for that system.
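
For reference, the measures named above can be computed per query as in the sketch below. It uses the standard textbook definitions; the function names and the toy ranking are assumptions, and the scripts in the Evaluation folder may organise the computation differently. MAP and MRR are the means of average precision and reciprocal rank over all queries.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Dividing by the number of relevant documents penalises missed ones.
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking and relevance judgements for one query.
example_ranked = ["d3", "d1", "d7", "d2"]
example_relevant = {"d1", "d2"}
```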
	   
EXTRA CREDIT:
	
	A. Open Windows PowerShell
	B. Navigate to the directory where the solution was unzipped. 
	Without Stopping:
	C. Run ProximityRetriever.py using the command 'python ProximityRetriever.py'.
	   Internally this code invokes ProximityParser.py and ProximityIndexer.py to parse the corpus and generate the inverted index.
	D. By default the program runs the BM25 retrieval model.
	E. The output will be generated in a folder called 'RunOutputs'. This folder is auto-generated by the program.
	With Stopping:
	F. Open the ProximityRetriever.py file and scroll to the main() function at the end.
	G. Update the following call so that 'True' is passed to the unigram_index() function:
			ProximityIndexer.unigram_index(True)
	H. Run ProximityRetriever.py using the command 'python ProximityRetriever.py'.
	   Internally this code invokes ProximityParser.py and ProximityIndexer.py to parse the corpus and generate the inverted index.
	I. By default the program runs the BM25 retrieval model.
	J. The output will be generated in a folder called 'RunOutputs'.
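
Proximity-aware retrieval generally needs an index that records term positions, so that the distance between query terms in a document can be checked. The sketch below shows that idea only; the function names, the ordered-window rule and the window size are assumptions and are not taken from ProximityIndexer.py or ProximityRetriever.py.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for docs given as doc_id -> tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
    return index

def within_window(index, term_a, term_b, doc_id, window=3):
    """True if term_b occurs after term_a, at most `window` positions later."""
    positions_a = index.get(term_a, {}).get(doc_id, [])
    positions_b = index.get(term_b, {}).get(doc_id, [])
    return any(0 < pb - pa <= window
               for pa in positions_a for pb in positions_b)

# Hypothetical two-document corpus.
example_positional = build_positional_index({
    "d1": ["time", "sharing", "operating", "system"],
    "d2": ["system", "for", "real", "time", "and", "batch", "operating"],
})
```

A retriever could use such a check to boost documents in which query terms appear close together and in query order, on top of the BM25 score.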
	
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

CONTRIBUTORS and CITATIONS:
---------------------------

-- https://www.udacity.com/course/intro-to-computer-science--cs101 : Basics of Python Programming and web crawling
-- https://www.crummy.com/software/BeautifulSoup/ : BeautifulSoup has been used for extracting links from web pages
-- https://learnpythonthehardway.org/book/ : Python Programming

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

CONTACT DETAILS:
----------------

SAPTAPARNA DAS:
Phone: (+1) 8572729089
E-Mail: [email protected]

RITIKA NAIR:
Phone: (+1) 9198841551
E-Mail: [email protected]

GRISHMA THAKKAR:
Phone: (+1) 6176376901
E-Mail: [email protected]