This project aims to study the Hadoop ecosystem for Big Data, starting with Hadoop 3.1 and moving toward the newest versions.
A single image was created that bundles most of the Hadoop dependencies; depending on the command passed to the entrypoint, it starts one service or another. This saves resources and build time overall.
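A minimal sketch of such an entrypoint (the service names are illustrative; the actual script in this repo may differ):

#!/bin/bash
# dispatch on the first argument and start the matching service
case "$1" in
  namenode)        hdfs --daemon start namenode ;;
  datanode)        hdfs --daemon start datanode ;;
  resourcemanager) yarn --daemon start resourcemanager ;;
  nodemanager)     yarn --daemon start nodemanager ;;
  *)               exec "$@" ;;
esac
# the daemons detach, so hold the container in the foreground
tail -f /dev/null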
About Hue: https://www.cloudera.com/documentation/enterprise/6/latest/topics/hue.html
# hadoop image
$ docker build -t hadoop-3 hadoop/
The download steps are the slowest part of the build. If that is a problem for you, try refactoring this image: download the .tar files beforehand and use COPY in the Dockerfile instead of downloading during the build.
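A sketch of that refactor, assuming Hadoop 3.1.1 and that the build context is the hadoop/ directory:

# download the tarball once, outside the image build (version is an example)
$ curl -fSL -o hadoop/hadoop-3.1.1.tar.gz \
    https://archive.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
# then, inside hadoop/Dockerfile, swap the download step for a COPY:
#   COPY hadoop-3.1.1.tar.gz /tmp/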
The data examples are from https://www.kaggle.com/abecklas/fifa-world-cup/data
$ docker-compose up
Open Hue in your favorite web browser (Hue serves on port 8888)
Other UIs
- Hive
- HDFS - Namenode
- YARN - Resource Manager
- Spark - Worker
- Spark - Master
- Spark - Livy
HDFS is the Hadoop Distributed File System. It is divided primarily into two main components that act as master & slaves.
The Namenode is the master node: it holds metadata such as permissions and the locations of data blocks.
# format metadata
$ hdfs namenode -format -nonInteractive
# start namenode
$ hdfs --daemon start namenode
# set permissions to all users in root folder
$ hdfs dfs -chmod 777 /
# TODO - check whether this is needed
$ hdfs dfs -chown -R dr.who:dr.who /
The Datanode is a slave node: it holds the data, split into blocks, and sends heartbeats to the Namenode to keep its information up to date.
# start datanode
$ hdfs --daemon start datanode
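A quick way to confirm the Datanode registered with the Namenode:

# report cluster capacity and live datanodes
$ hdfs dfsadmin -report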
YARN is the resource manager. It is also divided into two main components that act as master & slaves.
The Resource Manager is the master node. It holds info about the cluster's resources (memory and CPUs): total, used, and used by each job.
# start resourcemanager
$ yarn --daemon start resourcemanager
The Node Manager is the slave node; it reports its machine's resources to the Resource Manager.
# start nodemanager
$ yarn --daemon start nodemanager
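Once both daemons are up, the registered Node Managers can be checked with:

# list the node managers known to the resource manager
$ yarn node -list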
Spark is an alternative to Hadoop's MapReduce layer, but it uses memory instead of disk.
To work on YARN, it needs its dependencies distributed across the cluster via HDFS.
# bundle the spark jars into a single archive
$ jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
# use the hdfs cli to upload the archive to hdfs
$ hdfs dfs -put spark-libs.jar /spark-jars.jar
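To make Spark actually reuse the uploaded archive instead of re-shipping the jars on every job, point spark.yarn.archive at it (the spark-defaults.conf path assumes a standard Spark install):

# reuse the jars already stored in hdfs
$ echo "spark.yarn.archive hdfs:///spark-jars.jar" >> $SPARK_HOME/conf/spark-defaults.conf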
Start an interactive shell (Spark shells only run in client deploy mode)
$ pyspark --master yarn --deploy-mode client
or submit a job
$ spark-submit \
--executor-memory 1G \
--executor-cores 1 \
--master yarn \
--deploy-mode cluster \
/pyspark-job.py argument1
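In cluster deploy mode the driver runs inside YARN, so its output ends up in the YARN logs; the application id below is a placeholder:

# find the application id
$ yarn application -list -appStates ALL
# fetch the aggregated logs for it
$ yarn logs -applicationId application_1234567890123_0001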
Originally part of the Hue project, Livy is a web service that provides interactive Spark sessions over a REST API, so it can be integrated with other UIs to ease the development of Spark jobs.
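For example, a pyspark session can be created and used through plain HTTP (assuming Livy's default port, 8998):

# create an interactive pyspark session
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"kind": "pyspark"}' http://localhost:8998/sessions
# run a statement in session 0 (the id comes from the response above)
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"code": "1 + 1"}' http://localhost:8998/sessions/0/statements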
TODO - install Oozie in the cluster and explain it
Hive is a SQL engine that runs queries as MapReduce or Spark jobs.
# choose a metastore engine (pg, mysql, derby) to store Hive's metadata, and initialize it (derby in this case)
$ schematool -dbType derby -initSchema
# init hive server
$ hive --service hiveserver2
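With the server up, one way to test it is connecting with beeline (HiveServer2 listens on port 10000 by default):

# open a SQL shell against hiveserver2
$ beeline -u jdbc:hive2://localhost:10000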
Hue is a Django-based web UI for managing the cluster and the Hadoop ecosystem services.
Start the Hue web service
$ ./build/env/bin/hue runserver_plus 0.0.0.0:8888
- Install Oozie and integrate it with Hue
- Configure Hue to hide unused services
- Use a secondary Namenode
- Integrate the image builds with docker-compose (if possible)
- Luiz Carlos Zamboni: [email protected]
And a special thanks to my lovely companion, Gerusa Fernandes, for her patience during long hours of study.