Search-Engine-Covid-Papers

A search engine for COVID-19 papers.

Because of the LumiNUS file size limit, this folder contains only a subset of the data, for demonstration purposes.

Dependencies:

Spark version: 3.0.1
Scala version: 2.12
PySpark version: 3.0.1
BERT set-up instructions: https://bert-as-service.readthedocs.io/en/latest/
sparkml-som set-up instructions: https://github.com/FlorentF9/sparkml-som

Execution instructions for each script are given in its main function.

1. Data Pre-processing

The script pre-processes the input JSON files into clean text and, after dimension reduction, into a vectorised format. The vectors are passed to the clustering algorithms in the next step.
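
As a rough illustration of the text-cleaning half of this step, here is a minimal sketch in plain Python rather than PySpark. It assumes each input record has a `paper_id` field and a `body_text` list of objects with a `text` field; that layout is an assumption, and the function name `extract_clean_text` is hypothetical. Vectorisation and dimension reduction are not shown.

```python
import json
import re


def extract_clean_text(raw_json):
    """Return (paper_id, cleaned_text) for one JSON record.

    Assumed record layout (not confirmed by this README):
    {"paper_id": "...", "body_text": [{"text": "..."}, ...]}
    """
    record = json.loads(raw_json)
    paragraphs = [p.get("text", "") for p in record.get("body_text", [])]
    text = " ".join(paragraphs).lower()
    # Replace everything except lowercase letters, digits and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return record.get("paper_id", ""), text
```

Calling the function on one record yields the document id and a lowercase, punctuation-free string that a later vectorisation stage could consume.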

2. Clustering

Each clustering algorithm takes the high-dimensional vectors as input and outputs a dataframe containing each document id and its corresponding cluster number.
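
To make the input and output shape concrete, the following is a minimal k-means sketch in plain Python. It is a stand-in, not the Spark or sparkml-som code used by the scripts; the function name `kmeans_assign` and the deterministic "first k documents" initialisation are assumptions made for illustration.

```python
def kmeans_assign(vectors, k, iterations=10):
    """Cluster documents with a basic k-means loop.

    vectors: dict mapping document id -> list of floats.
    Returns a dict mapping document id -> cluster number (0..k-1).
    """
    ids = list(vectors)
    # Illustrative initialisation: use the first k documents as centroids.
    centroids = [list(vectors[doc_id]) for doc_id in ids[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        for doc_id in ids:
            v = vectors[doc_id]
            assignment[doc_id] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[d] for d in ids if assignment[d] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

The returned mapping plays the role of the two-column dataframe described above: one row per document id with its cluster number.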

3. Topic Modelling

Topic modelling takes the clustered documents and analyses the keywords in each cluster. The top M most important keywords are used as the tags for each topic. The relevant documents for each topic can also be retrieved.
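
This README does not say how keyword importance is scored. The sketch below assumes a TF-IDF-style score in which each cluster is treated as one large document, so words shared by every cluster are weighted down. The function name `top_keywords` and the smoothed IDF formula are illustrative choices, not the scripts' actual method.

```python
import math
from collections import Counter


def top_keywords(cluster_docs, m):
    """Pick the top m keywords per cluster.

    cluster_docs: dict mapping cluster number -> list of cleaned text strings.
    Returns a dict mapping cluster number -> list of up to m keywords.
    """
    # Term frequency per cluster, treating the cluster as one document.
    term_counts = {
        cluster: Counter(word for doc in docs for word in doc.split())
        for cluster, docs in cluster_docs.items()
    }
    n_clusters = len(term_counts)
    # Number of clusters in which each word appears.
    cluster_freq = Counter()
    for counts in term_counts.values():
        cluster_freq.update(counts.keys())

    tags = {}
    for cluster, counts in term_counts.items():
        # Assumed scoring: term frequency times a smoothed inverse cluster frequency.
        scores = {
            word: tf * (math.log((1 + n_clusters) / (1 + cluster_freq[word])) + 1)
            for word, tf in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tags[cluster] = ranked[:m]
    return tags
```

With M set to 2, a cluster whose documents mention "vaccine" more distinctively than "virus" would be tagged with "vaccine" first.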

4. Searching

Searching uses the keyword tags to find the cluster most similar to the input search query.
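
The similarity measure is not specified here. As one possible reading, the sketch below compares the query terms with each cluster's keyword tags using Jaccard overlap. The function name `most_similar_cluster` is hypothetical, and an embedding-based similarity could be substituted for the overlap score.

```python
def most_similar_cluster(query, cluster_tags):
    """Return the cluster whose keyword tags best match the query.

    cluster_tags: dict mapping cluster number -> list of keyword tags.
    Similarity is Jaccard overlap (an assumption, not the scripts' method).
    """
    query_terms = set(query.lower().split())

    def jaccard(tags):
        tag_set = set(tags)
        union = query_terms | tag_set
        return len(query_terms & tag_set) / len(union) if union else 0.0

    # Iterate in sorted order so ties resolve to the lowest cluster number.
    return max(sorted(cluster_tags), key=lambda c: jaccard(cluster_tags[c]))
```

Once the best cluster is known, the documents assigned to it can be returned as the search results.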

