If you are running on Windows, please follow the README here
- Java (OpenJDK 8 / Oracle JDK). (HBase, ZooKeeper, Nutch, Ant, and Gradle are also required, but will be installed for you when you set up the Gradle wrapper.)
- Set the JAVA_HOME environment variable.

  On macOS, JAVA_HOME can be found at (or in a similar location):

  /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home

  On Linux, JAVA_HOME can be found at (or in a similar location):

  /usr/lib/jvm/java-8-openjdk-amd64
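As a sketch, JAVA_HOME can be set from the shell like this; the JDK paths below are assumptions, so point the variable at wherever your JDK actually lives:

```shell
# Assumed JDK locations -- adjust to match your installed JDK.
if [ -x /usr/libexec/java_home ]; then
  # macOS: /usr/libexec/java_home prints the active JDK's home directory
  export JAVA_HOME="$(/usr/libexec/java_home)"
else
  # Linux: a typical OpenJDK 8 install location
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
fi
echo "JAVA_HOME=$JAVA_HOME"
```

Adding the export line to your shell profile (e.g. ~/.bashrc) makes it persist across sessions.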
Run the Gradle wrapper once to download the Gradle distribution and project dependencies:

./gradlew

You can now use the built-in Gradle task to set up HBase:
./gradlew setupHbase
- This creates directories within the project directory to store HBase and ZooKeeper data.
- It downloads hbase-0.98.8-hadoop2 into the build directory.
Now start the HBase service by going into the HBase bin directory, projectDir/build/hbase-0.98.8-hadoop2/bin/, and running:
./start-hbase.sh
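As an optional sanity check (an assumption on my part, not a step from the original setup): a healthy standalone HBase instance runs a JVM process named HMaster, which jps (bundled with the JDK) will list:

```shell
# Check whether HBase's master process (HMaster) is up.
if jps 2>/dev/null | grep -q HMaster; then
  HBASE_STATUS="running"
else
  HBASE_STATUS="not running"
fi
echo "HBase master is $HBASE_STATUS"
```

If the master is not running, the logs under build/hbase-0.98.8-hadoop2/logs/ are the first place to look.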
Download and extract apache-nutch-2.3.1 into the build directory:
./gradlew setupNutch
Then edit conf/nutch-discovery/nutch-site.xml with your Discovery credentials. The values for the first three properties (endpoint, username, and password) come with the Discovery service credentials; the remaining values come from your specific instance of the Discovery service.

Note: If you are using a Discovery service instance that requires IAM authentication, set discovery.username to apikey and discovery.password to the value of the API key.
<property>
  <name>discovery.endpoint</name>
  <value></value>
</property>
<property>
  <name>discovery.username</name>
  <value></value>
</property>
<property>
  <name>discovery.password</name>
  <value></value>
</property>
<property>
  <name>discovery.configuration.id</name>
  <value></value>
</property>
<property>
  <name>discovery.environment.id</name>
  <value></value>
</property>
<property>
  <name>discovery.api.version</name>
  <value></value>
</property>
<property>
  <name>discovery.collection.id</name>
  <value></value>
</property>
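For illustration, with IAM authentication the username and password properties would be filled in like this (the API key value below is a placeholder, not a real credential):

```xml
<property>
  <name>discovery.username</name>
  <value>apikey</value>
</property>
<property>
  <name>discovery.password</name>
  <value>YOUR_IAM_API_KEY</value>
</property>
```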
To build the plugin, run:
./gradlew buildPlugin
This will take about 4-5 minutes to complete; please be patient. That's it. Everything is now set up to crawl websites.
- Edit the text file seed/urls.txt to specify a list of seed URLs.
$ mkdir $projectDir/seed
$ echo "https://en.wikipedia.org/wiki/Apache_Nutch" >> $projectDir/seed/urls.txt
Then run the crawl script:

$projectDir/crawl
Note: On the first run, this will only crawl the injected URLs. Repeat the procedure above regularly to keep the index up to date.
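One way to repeat the crawl automatically is a cron entry; this is a hedged sketch, and CRAWL_CMD is an assumed path that you should point at your project's actual crawl script:

```shell
# Build a cron line that re-runs the crawl nightly at 02:00.
# CRAWL_CMD is an assumption -- substitute your project's crawl script path.
CRAWL_CMD="$HOME/nutch-indexer/crawl"
CRON_LINE="0 2 * * * $CRAWL_CMD >> $HOME/crawl.log 2>&1"
echo "$CRON_LINE"
# To install it into your crontab:
#   (crontab -l 2>/dev/null; echo "$CRON_LINE") | crontab -
```

Any other scheduler (e.g. a systemd timer) works equally well; the important part is re-running the crawl on a regular cadence.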