This repository has been archived by the owner on Sep 5, 2023. It is now read-only.

UCLA-BD2K/bd2kcrawler

#BD2K Crawler

Description

Spring MVC web project that provides crawling services for all BD2K websites (as of June 2016) and for center publications found on PubMed. Each service has its own crawler, and the two crawlers can be run in parallel (but multiple instances of the same crawler cannot run at once). Utilizes Crawler4j for crawling web pages.

##Building

This is a standard Maven project, so the easiest way to get a local version of the web service running is through Maven. This walkthrough assumes that Apache Tomcat is the application server/container and that you want to deploy to a locally running instance.

###Dependencies

If you have not already, install Maven and Apache Tomcat. This project also uses MongoDB as its datastore, so be sure to install and start a Mongo server before testing and running the web service. If you are on OS X, you can use something like Homebrew to manage these packages.

To verify that Maven is correctly installed, run:

mvn -v

You should see the Maven version number as well as other metadata, such as the Maven home directory and the Java version found.

Now that Maven is installed, head over to the root directory of the project and run

mvn compile

in order to compile the project source code. By default, the .class files are written under the target/ directory. This step is a useful sanity check that there are no compilation errors, but it is optional.

To compile and package the result into a WAR file for Tomcat deployment, run

mvn package

This should compile the sources, run the tests, and package the compiled bytecode into a WAR file located in the target/ directory.

Copy the WAR file into the webapps directory of your Tomcat installation, either manually or through an IDE like Eclipse. If you are unsure how, see the official Tomcat deployment documentation.
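If you would rather script the copy step, a minimal sketch in Java could look like the following. The class name `WarDeployer` and both paths are assumptions for illustration; nothing like this ships with the project, and a plain file copy works just as well.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of BD2KCrawler.
final class WarDeployer {

    /** Copies warFile into webappsDir, replacing an existing WAR of the same name. */
    static Path copyWar(Path warFile, Path webappsDir) throws IOException {
        if (!Files.isRegularFile(warFile)) {
            throw new IOException("WAR file not found: " + warFile);
        }
        if (!Files.isDirectory(webappsDir)) {
            throw new IOException("webapps directory not found: " + webappsDir);
        }
        Path destination = webappsDir.resolve(warFile.getFileName());
        return Files.copy(warFile, destination, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: WarDeployer <path-to-war> <tomcat-webapps-dir>");
            System.exit(1);
        }
        Path copied = copyWar(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Copied WAR to " + copied);
    }
}
```

The first argument is the WAR file produced by `mvn package` and the second is the webapps directory of your own Tomcat installation.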

After (re)starting Tomcat, you should see the login page for BD2KCrawler.

###Application dependencies

Though the build dependencies should be ready to go, there is one more thing to do to get the web service working locally: creating an authorized user to access the dashboard and initiate crawling. In the future, we can add a registration service, but as of now it must be done manually.

We need to create a new database named BD2KCrawlerDB with a collection named "Users". Add a minimal document such as:

{
	firstName:"",
	lastName:"",
	email:"<your email address>",
	password:"<Some BCRYPT hashed password>",
	role: "ROLE_ADMIN" 
}

Note that it is important to use a BCrypt hashed password, because the authentication service (spring-security) is configured to hash input passwords automatically, so a plain-text password stored in the database will never match. Use a tool such as BCrypt Hash Generator to quickly obtain a hash.
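Before inserting the document, it can help to check that the value you pasted really has the shape of a BCrypt hash. The sketch below is an assumption-laden illustration, not code from this project: `BcryptFormatCheck` is a hypothetical helper that only validates the format (version prefix, two-digit cost, 53 salt and hash characters). It does not verify that the hash matches any particular password.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of BD2KCrawler: checks format only.
final class BcryptFormatCheck {

    // "$2a$", "$2b$" or "$2y$", a two-digit cost, then 53 characters
    // (22 for the salt, 31 for the hash) from the BCrypt base64 alphabet.
    private static final Pattern BCRYPT =
            Pattern.compile("^\\$2[aby]\\$\\d{2}\\$[./A-Za-z0-9]{53}$");

    static boolean looksLikeBcrypt(String value) {
        return value != null && BCRYPT.matcher(value).matches();
    }

    public static void main(String[] args) {
        String candidate = args.length > 0 ? args[0] : "";
        System.out.println(looksLikeBcrypt(candidate)
                ? "looks like a BCrypt hash"
                : "does not look like a BCrypt hash");
    }
}
```

When passing a hash on the command line, wrap it in single quotes so the shell does not expand the `$` characters.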

After this, you are set to log in and access all services from the site.

Note: In src/main/resources/app.properties, there is a property named email.recipients that contains the comma-separated list of email addresses to send crawl results to. Please update this value before initiating any crawls.
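As an illustration of the expected shape of that value, the sketch below loads a properties line and splits `email.recipients` on commas. The addresses are placeholders and the class `EmailRecipients` is hypothetical; how the application itself parses the property is not shown in this README, so treat this only as a way to sanity-check your own value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of BD2KCrawler.
final class EmailRecipients {

    /** Splits the comma-separated email.recipients value, trimming whitespace. */
    static List<String> parse(Properties props) {
        String raw = props.getProperty("email.recipients", "");
        List<String> recipients = new ArrayList<>();
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                recipients.add(trimmed);
            }
        }
        return recipients;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder addresses; replace them with the real recipients.
        Properties props = new Properties();
        props.load(new StringReader(
                "email.recipients=alice@example.org, bob@example.org\n"));
        System.out.println(parse(props));
    }
}
```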

###Misc

Currently, the crawler is publicly available at: http://old.bd2kccc.org:8080/BD2KCrawler/index

###Deployment & Encountered issues

After specifying the desired deployment options in pom.xml, you can run:

mvn deploy

to deploy the application directly to your destination server.

Encountered issues and their fixes:

  1. Tomcat server down (connection refused at URL + port): this is not related to our application. Go to the Tomcat installation directory (e.g. /usr/share/ for AWS Linux AMIs) and run the start/stop scripts as needed.
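To tell a stopped Tomcat apart from an application problem, you can probe the port directly. The following is a minimal sketch, assuming Tomcat listens on port 8080 (the port used in the URL above); `PortProbe` is a hypothetical helper and not part of this project.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of BD2KCrawler.
final class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080; // assumed Tomcat port
        System.out.println(isListening(host, port, 2000)
                ? "Something is listening on " + host + ":" + port
                : "Nothing is listening on " + host + ":" + port + " (is Tomcat started?)");
    }
}
```

If the probe reports nothing listening, start Tomcat with its own scripts before looking for a problem in the application.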

##License

--
