nixos/hadoop: package rewrite and module improvements #141143

illustris · 2021-10-09T22:05:30Z

Motivation for this change

The current system of building hadoop creates a monolithic fixed-output "build-deps" derivation by running maven in a loop. This makes updating the packages and using custom builds of hadoop much more difficult. It also forces expensive full rebuilds for minor changes. Most of the difficulties in building the package from source are because of maven's unusual way of doing things, such as returning different checksums for the same files, or downloading dynamically linked binaries at build time. The new package can directly accept upstream builds from apache, or binaries from your own custom builds.

The HDFS and YARN modules in their present state require too much manual configuration to spin up a cluster. The changes in this PR add sane defaults that make it possible to start a cluster with very little manual configuration. See nixos/tests/hadoop/hadoop.nix for an example.

The existing tests for HDFS and YARN only check whether the namenode, datanode, resourcemanager and nodemanager services start up and expose their web UIs. This is not enough to verify that the services can communicate, store data and run workloads. The newly added test checks the following:

Does the HDFS cluster exit safemode after startup?
Does the YARN resourcemanager register the nodemanager?
Does a simple mapreduce job using YARN for compute and HDFS for storage succeed?
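
As a rough illustration of what these three checks amount to, here is a minimal Python sketch. It is not the test script from this PR. It assumes that `hdfs dfsadmin -safemode get` prints a line containing "Safe mode is OFF", that `yarn node -list` prints a "Total Nodes:N" line on stdout, and that the caller knows where the mapreduce examples jar lives. The helper names and the `examples_jar` parameter are hypothetical.

```python
import re
import subprocess


def safemode_is_off(dfsadmin_output):
    """Return True if the `hdfs dfsadmin -safemode get` output reports OFF."""
    return "Safe mode is OFF" in dfsadmin_output


def registered_node_count(node_list_output):
    """Extract N from a 'Total Nodes:N' line (assumed format); 0 if absent."""
    match = re.search(r"Total Nodes:\s*(\d+)", node_list_output)
    return int(match.group(1)) if match else 0


def run(cmd):
    """Run a command and return its stdout; raises if it exits non-zero."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def check_cluster(examples_jar):
    """Run the three checks in order against a live cluster.

    `examples_jar` is a hypothetical parameter: the path to the
    hadoop-mapreduce-examples jar shipped with the installed release.
    """
    # 1. HDFS must have left safemode.
    assert safemode_is_off(run(["hdfs", "dfsadmin", "-safemode", "get"]))
    # 2. The resourcemanager must know about at least one nodemanager.
    assert registered_node_count(run(["yarn", "node", "-list"])) >= 1
    # 3. A small mapreduce job (the bundled "pi" example) must succeed.
    run(["yarn", "jar", examples_jar, "pi", "2", "10"])
```

The parsing helpers are kept separate from the subprocess calls so they can be exercised without a running cluster. In a real test the first two checks would normally be retried until they pass, since safemode exit and nodemanager registration take some time after startup.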

Things done

Package:

Update to latest releases as per https://hadoop.apache.org/releases.html
Point the hadoop package to the latest 3.x release
Add hadoop2 pointing to the latest hadoop 2.x release
Replace maven for-loop fixed output builds with binary releases
Add more easily accessible options to selectively enable native libraries
Set defaults for HADOOP_HOME and HADOOP_CONF_DIR with makeWrapper

Module:

Remove HADOOP_HOME from service config as it is now correctly set by the package
Set default restart policy for all services to always
Add an option to add additional files to HADOOP_CONF_DIR
Add hadoop CLI tools to systemPackages when any hadoop service is enabled
Add restartIfChanged option
Add support for LinuxContainerExecutor, make it the default executor type
Add firewall defaults and options
Generate container-executor.cfg from options
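
To make the last item concrete: container-executor.cfg is a flat key=value file, so generating it from options is essentially a rendering step. The sketch below shows that idea in Python. It is not the module's implementation (the module is written in Nix), and the function name, the list and boolean handling, and the example values are assumptions for illustration only.

```python
def render_container_executor_cfg(options):
    """Render a mapping of settings as key=value lines.

    Keys are sorted so the generated file is stable between runs.
    Lists are joined with commas; booleans become true/false.
    """
    lines = []
    for key in sorted(options):
        value = options[key]
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Illustrative values only; real deployments choose their own.
example = {
    "yarn.nodemanager.linux-container-executor.group": "hadoop",
    "min.user.id": 1000,
    "banned.users": ["hdfs", "yarn", "mapred", "bin"],
}

if __name__ == "__main__":
    print(render_container_executor_cfg(example), end="")
```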

Tests:

Add cluster test for HDFS and YARN
Add unified test for HDFS+YARN+mapreduce

Todo

Add documentation

Future work

In its current state, the module doesn't make it easy to spin up an HA HDFS cluster with QJM. Usually this would require a series of manual steps to initialize the cluster. In subsequent PRs I'll try to make a 1-click deployment of a production-ready HA hadoop cluster possible.

While building from source is very inconvenient with nix's currently limited support for maven, it would be nice to provide the option to build hadoop from source eventually.

nixos-discourse · 2021-10-20T21:05:48Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/609