This project is a simple proof of concept showing how to use Apache Spark to count the words in a document. The initial sample code was taken from the examples bundled with the Spark installation and was then adapted to return only the most frequent words in the document.
I am writing this document because I stumbled over several issues while setting up the test environment.
I opted for a Spark standalone cluster running on Docker, and the input files for the word count are retrieved from HDFS.
With everything set up, we can execute the word count example:
➜ spark-2.0.0-bin-hadoop2.7 bin/spark-submit \
    --class com.samples.spark.JavaWordCount \
    --master spark://172.17.0.1:7077 \
    /home/marius/Downloads/spark-workcount/build/libs/spark-workcount-all-1.0-SNAPSHOT.jar \
    hdfs://10.0.0.2:9000/user/marius/gutenberg/james-joyce-ulysses.txt
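For reference, the JavaWordCount class passed via --class is essentially the word count example shipped with Spark. The sketch below shows the general shape of such a class (the project's actual source may differ): it reads the file given as the first argument, splits each line on spaces, and sums the occurrences of every word.

package com.samples.spark;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkSession spark = SparkSession.builder().appName("JavaWordCount").getOrCreate();

    // Read the input file (local or HDFS path) as an RDD of lines.
    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();

    // Split lines into words and count how often each word occurs.
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(SPACE.split(line)).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    // Collect the result on the driver and print every (word, count) pair.
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    spark.stop();
  }
}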
To retrieve the 20 most frequent words, use the following call:
➜ spark-2.0.0-bin-hadoop2.7 bin/spark-submit \
    --class com.samples.spark.JavaMostFrequentWords \
    --master spark://172.17.0.1:7077 \
    /home/marius/Downloads/spark-workcount/build/libs/spark-workcount-all-1.0-SNAPSHOT.jar \
    hdfs://10.0.0.2:9000/user/marius/gutenberg/james-joyce-ulysses.txt \
    hdfs://10.0.0.2:9000/user/marius/gutenberg/stopwords.txt \
    20
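JavaMostFrequentWords is the adapted version mentioned above. The following is only a minimal sketch of how such a class could look, assuming, purely from the invocation, that the arguments are the input file, a stop word list, and the number of words to print; the project's real code may differ in the details.

package com.samples.spark;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public final class JavaMostFrequentWords {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    if (args.length < 3) {
      System.err.println("Usage: JavaMostFrequentWords <file> <stopwords> <topN>");
      System.exit(1);
    }
    int topN = Integer.parseInt(args[2]);

    SparkSession spark = SparkSession.builder().appName("JavaMostFrequentWords").getOrCreate();

    // Load the stop word list on the driver; it is shipped to the executors inside the closure.
    Set<String> stopWords = new HashSet<>(spark.read().textFile(args[1]).javaRDD().collect());

    // Count words from the input file, skipping empty tokens and stop words.
    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(SPACE.split(line.toLowerCase())).iterator())
        .filter(word -> !word.isEmpty() && !stopWords.contains(word))
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    // Swap to (count, word), sort descending and keep only the requested number of entries.
    List<Tuple2<Integer, String>> top = counts
        .mapToPair(t -> new Tuple2<>(t._2(), t._1()))
        .sortByKey(false)
        .take(topN);

    for (Tuple2<Integer, String> tuple : top) {
      System.out.println(tuple._2() + ": " + tuple._1());
    }
    spark.stop();
  }
}

Collecting the stop words on the driver and capturing them in the closure is fine for a small list; a broadcast variable would be the more scalable choice for larger ones.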