hdfs, hadoop & spark dockers for study


hadoop 3 ecosystem

This project is intended for studying the Hadoop ecosystem for big data, starting with Hadoop 3.1 and moving on to newer versions.


A single image was created that contains most of the Hadoop dependencies. Depending on the command passed to the entrypoint, it starts one service or another. This saves resources and build time in the end.
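As an illustration of that idea, here is a minimal sketch of an entrypoint that picks a service from its first argument. The function name `service_command` and the set of service names are assumptions for this example, not the project's actual entrypoint.

```shell
#!/bin/sh
# Hypothetical sketch: map the first entrypoint argument to a foreground
# service command. The service names are assumptions for illustration.
service_command() {
    case "$1" in
        namenode)        echo "hdfs namenode" ;;
        datanode)        echo "hdfs datanode" ;;
        resourcemanager) echo "yarn resourcemanager" ;;
        nodemanager)     echo "yarn nodemanager" ;;
        *)               echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# A real entrypoint would typically end with something like:
#   exec $(service_command "$1")
```

With this shape, every container runs the same image and only the command differs, for example `docker run hadoop-3 namenode`.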

About Hue

1) Images

# hadoop image
$ docker build -t hadoop-3 hadoop/

2) Compose

$ docker-compose up

3) Open Hue

Open Hue in your favorite web browser

Other UIs

  • Hive
  • Hdfs - Namenode
  • Yarn - Resource Manager
  • Spark - Worker
  • Spark - Master
  • Spark - Livy
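If the links are not at hand, the sketch below builds the URLs from the upstream default ports of each service. This is an assumption: the ports actually published by the compose file of this project may differ, and `ui_url` is a hypothetical helper.

```shell
# Hypothetical helper: print the web UI URL of a service on a given host,
# using upstream default ports (assumption: the compose file may remap them).
ui_url() {
    host=$1
    case "$2" in
        hive)            port=10002 ;;  # HiveServer2 web UI
        namenode)        port=9870 ;;   # HDFS namenode (Hadoop 3)
        resourcemanager) port=8088 ;;   # YARN resource manager
        spark-worker)    port=8081 ;;
        spark-master)    port=8080 ;;
        livy)            port=8998 ;;
        *) echo "unknown service: $2" >&2; return 1 ;;
    esac
    echo "http://$host:$port"
}

# Usage: ui_url localhost namenode
```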



HDFS is the Hadoop Distributed File System. It is divided into two main components that act as master and slaves.


The namenode is the master node. It holds metadata about the data, such as permissions and the locations of data blocks.

# format metadata 
$ hdfs namenode -format -nonInteractive

# start namenode
$ hdfs --daemon start namenode

# set permissions to all users in root folder
$ hdfs dfs -chmod 777 /

# TODO - check whether this is needed
$ hdfs dfs -chown -R dr.who:dr.who /
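Formatting is only needed the first time the namenode starts. A minimal sketch of a guard, assuming a formatted metadata directory contains `current/VERSION`; `namenode_is_formatted` and `NAME_DIR` are hypothetical names, and `NAME_DIR` stands for whatever `dfs.namenode.name.dir` points to.

```shell
# Hypothetical helper: succeed if the given namenode metadata directory
# already looks formatted (assumption: it then contains current/VERSION).
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Usage (NAME_DIR is an assumption, set it to dfs.namenode.name.dir):
#   namenode_is_formatted "$NAME_DIR" || hdfs namenode -format -nonInteractive
```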


The datanode is a slave node. It holds the data, split into blocks, and sends heartbeats to the namenode to keep it up to date.

# start datanode
$ hdfs --daemon start datanode
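To check that a datanode has registered with the namenode, the output of `hdfs dfsadmin -report` can be inspected. The sketch below assumes the report contains a line of the form `Live datanodes (N):`; `live_datanodes` is a hypothetical helper.

```shell
# Hypothetical helper: read `hdfs dfsadmin -report` output on stdin and
# print N from a "Live datanodes (N):" line (assumed report format).
live_datanodes() {
    sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p'
}

# Usage: hdfs dfsadmin -report | live_datanodes
```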


YARN is the resource manager. It is also divided into two main components that act as master and slaves.

resource manager

The master node. It holds information about the cluster's resources (memory and CPUs): the total, how much is in use, and how much is used by each job.

# start resourcemanager
$ yarn --daemon start resourcemanager

node manager

The node manager sends information about its node's resources to the resource manager.

# start nodemanager
$ yarn --daemon start nodemanager
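Daemons take a moment to come up, so scripts that depend on them may need to wait. Here is a minimal sketch of a generic retry helper; `retry` is a hypothetical function, and how quickly `yarn node -list` fails while the resource manager is down depends on the client's own retry settings.

```shell
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Usage: retry 30 2 yarn node -list
```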


Spark is an alternative to the Hadoop MapReduce layer, but it works mainly in memory instead of on disk.

To work on the cluster, Spark needs its dependencies distributed across the system through HDFS.

# make a jar with spark jars
$ jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

# using hdfs cli to send libs to hdfs
$ hdfs dfs -put spark-libs.jar /spark-jars.jar
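Spark on YARN can then be pointed at the uploaded archive through the `spark.yarn.archive` property. A minimal sketch, assuming the property goes into `spark-defaults.conf` and that `hdfs:///` resolves against the cluster's default filesystem; `spark_archive_conf` is a hypothetical helper.

```shell
# Hypothetical helper: print the spark-defaults.conf line that points
# Spark at an archive stored at the given absolute HDFS path.
spark_archive_conf() {
    printf 'spark.yarn.archive hdfs://%s\n' "$1"
}

# Usage:
#   spark_archive_conf /spark-jars.jar >> "$SPARK_HOME/conf/spark-defaults.conf"
```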

Start an interactive shell (Spark shells only run in client deploy mode)

$ pyspark --master yarn --deploy-mode client

or submit a job

$ spark-submit \
    --executor-memory 1G \
    --executor-cores 1 \
    --master yarn \
    --deploy-mode cluster \
    / argument1


Originally part of the Hue project, Livy is a web service that provides interactive Spark sessions through a REST API, so it can be integrated with other UIs to help with the development of Spark jobs.
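A minimal sketch of creating a session through that REST API with `curl`. The URL `http://localhost:8998` assumes Livy's default port on the local host, and both function names are hypothetical.

```shell
# Assumption: Livy listens on its default port on localhost.
LIVY_URL="${LIVY_URL:-http://localhost:8998}"

# Hypothetical helper: build the JSON body for a new session of the
# given kind (for example pyspark or spark).
livy_session_payload() {
    printf '{"kind": "%s"}' "$1"
}

# Hypothetical helper: ask Livy to create a session of the given kind.
livy_create_session() {
    curl -s -X POST -H 'Content-Type: application/json' \
        -d "$(livy_session_payload "$1")" "$LIVY_URL/sessions"
}

# Usage: livy_create_session pyspark
```

The response is a JSON description of the session, whose id is then used to post statements to it.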


TODO - install Oozie in the cluster and explain


Hive is a SQL engine that runs on top of job engines such as MapReduce or Spark.

# choose a metastore database (pg, mysql, derby) to store the metadata and initialize it (derby in this case)
$ schematool -dbType derby -initSchema

# start hive server
$ hive --service hiveserver2
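Once HiveServer2 is running, `beeline` can connect to it over JDBC. A minimal sketch, assuming the default HiveServer2 port 10000 on the local host; `hive_jdbc_url` is a hypothetical helper.

```shell
# Hypothetical helper: build a HiveServer2 JDBC URL.
# Arguments: host, port, database (defaults are assumptions).
hive_jdbc_url() {
    printf 'jdbc:hive2://%s:%s/%s\n' "${1:-localhost}" "${2:-10000}" "${3:-default}"
}

# Usage: beeline -u "$(hive_jdbc_url)" -e 'SHOW DATABASES;'
```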


Hue is a Django-based web UI for managing the cluster and the Hadoop ecosystem services.

Start the Hue web service

$ ./build/env/bin/hue runserver_plus


  • Install Oozie and integrate it with Hue
  • Configure Hue to not show unused services
  • Use a secondary namenode
  • Integrate the build of the images with docker-compose (if possible)


And a special thanks to my lovely companion, Gerusa Fernandes, for her patience during the long hours of study.

