Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/hadoop: package rewrite and module improvements #141143

Merged
merged 3 commits into from
Oct 25, 2021

Conversation

illustris
Copy link
Contributor

@illustris illustris commented Oct 9, 2021

Motivation for this change

The current system of building hadoop creates a monolithic fixed-output "build-deps" derivation by running maven in a loop. This makes updating the packages and using custom builds of hadoop much more difficult. It also forces expensive full rebuilds for minor changes. Most of the difficulties in building the package from source are because of maven's unusual way of doing things, such as returning different checksums for the same files, or downloading dynamically linked binaries at build time. The new package can directly accept upstream builds from apache, or binaries from your own custom builds.

The HDFS and YARN modules in their present state require too much manual configuration to spin up a cluster. The changes in this PR adds many sane defaults that make it possible to start a cluster with very little manual configuration. See nixos/tests/hadoop/hadoop.nix for an example.

The existing tests for HDFS and YARN are simply checking whether the namenode, datanode, resourcemanager and nodemanager services start up and expose their web UIs. This is not enough to check if the services are able to communicate, store data and run workloads. The newly added test will test the following:

  • Does the HDFS cluster exit safemode after startup?
  • Does the YARN resourcemanager register the nodemanager?
  • Does a simple mapreduce job using YARN for compute and HDFS for storage succeed?
Things done
Package:
  • Update to latest releases as per https://hadoop.apache.org/releases.html
  • Point the hadoop package to the latest 3.x release
  • Add hadoop2 pointing to the latest hadoop 2.x release
  • Replace maven for-loop fixed output builds with binary releases
  • Add more easily accessible options to selectively enable native libraries
  • set defaults for HADOOP_HOME and HADOOP_CONF_DIR with makeWrapper
Module:
  • Remove HADOOP_HOME from service config as it is now correctly set by the package
  • Set default restart policy for all services to always
  • Add an option to add additional files to HADOOP_CONF_DIR
  • Add hadoop CLI tools to systemPackages when any hadoop service is enabled
  • Add restartIfChanged option
  • Add support for LinuxContainerExecutor, make it the default executor type
  • Add firewall defaults and options
  • Generate container-executor.cfg from options
Tests
  • Add cluster test for HDFS and YARN
  • Add unified test for HDFS+YARN+mapreduce
Todo
  • Add documentation
Future work

In its current state, the module doesn't make it easy to spin up an HA HDFS cluster with QJM. Usually this would require a series of manual steps to initialize the cluster. In subsequent PRs I'll try to make a 1-click deployment of a production-ready HA hadoop cluster possible.

While building from source is very inconvenient with nix's currently limited support for maven, it would be nice to provide the option to build hadoop from source eventually.

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • 21.11 Release Notes (or backporting 21.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/609

pkgs/applications/networking/cluster/hadoop/default.nix Outdated Show resolved Hide resolved
pkgs/applications/networking/cluster/hadoop/default.nix Outdated Show resolved Hide resolved
outputHashMode = "recursive";
outputHash = dependencies-sha256;
};
with lib;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with lib;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding it at the top level is better than adding with lib for both common and hadoop_3_3

pkgs/applications/networking/cluster/hadoop/default.nix Outdated Show resolved Hide resolved
pname = "hadoop";
version = "3.3.1";
sha256 = "1b3v16ihysqaxw8za1r5jlnphy8dwhivdx2d0z64309w57ihlxxd";
untarDir = "${pname}-${version}";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
untarDir = "${pname}-${version}";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

untarDir is needed here, it's used by libPatches below

pkgs/applications/networking/cluster/hadoop/default.nix Outdated Show resolved Hide resolved
pkgs/top-level/all-packages.nix Outdated Show resolved Hide resolved
pkgs/applications/networking/cluster/hadoop/default.nix Outdated Show resolved Hide resolved
- Make hadoop 3 the default hadoop package
- hadoop_3 -> 3.3.1
- hadoop_3_2 -> 3..2.2
- hadoop_2 -> 2.10.1
The existing tests for HDFS and YARN only check if the services come up and expose their web interfaces.
The new combined hadoop test will also test whether the services and roles work together as intended.
It spin up an HDFS+YARN cluster and submit a demo YARN application that uses the hadoop cluster for
storageand yarn cluster for compute.
mkdir -p $out/share/doc/hadoop
cp -dpR * $out/
mv $out/*.txt $out/share/doc/hadoop/
for n in $(find $out/lib/${untarDir}/bin -type f ! -name "*.*"); do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use the bash builtin glob *.* ?
(subshells are generally frowned upon for added execution time and not so good error propagation).

Copy link
Contributor Author

@illustris illustris Oct 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's doing doing the opposite of *.* here. We want to exclude files with a . in the name. ! -name "*.*" passed to find means files that don't match *.*

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about extended globing then https://www.linuxjournal.com/content/bash-extended-globbing
!(*.*)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to set ( shopt -s extglob) before the expression.

Copy link
Contributor

@happysalada happysalada Oct 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested and it returns directories as well.
You will need to add a check for directory in the for.
I feel the extglob is a little more readable, but it's mostly a style preference. If you feel strongly about keeping the find, just let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it returns directories? -type f ensures find only returns files. As for readability, maybe it's just a matter of familiarity, but I find find's syntax of "find in this path all files that don't match the name pattern *.*" more readable.

[illustris@illustris-thinkpad:/dev/shm]$ perf stat bash -c 'for n in $(find ./bin -type f ! -name "*.*"); do echo $n; done'
./bin/mapred
./bin/yarn
./bin/test-container-executor
./bin/container-executor
./bin/hadoop
./bin/oom-listener
./bin/hdfs

 Performance counter stats for 'bash -c for n in $(find ./bin -type f ! -name "*.*"); do echo $n; done':

              5.25 msec task-clock:u              #    0.899 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               447      page-faults:u             #    0.085 M/sec
         9,854,259      cycles:u                  #    1.878 GHz
        15,911,728      instructions:u            #    1.61  insn per cycle
         3,607,606      branches:u                #  687.434 M/sec
           103,533      branch-misses:u           #    2.87% of all branches

       0.005837019 seconds time elapsed

       0.004626000 seconds user
       0.001255000 seconds sys

The glob on the other hand returns a single line containing paths of all files. This will then need to be split and tested to see if it's a file or a directory.

[illustris@illustris-thinkpad:/dev/shm]$ perf stat bash -c 'shopt -s extglob; for n in bin/!\(*.*\); do echo $n; done | tr " " "\n" | while read line; do [ -d $line ] || echo $line; done'
bin/container-executor
bin/hadoop
bin/hdfs
bin/mapred
bin/oom-listener
bin/test-container-executor
bin/yarn

 Performance counter stats for 'bash -c shopt -s extglob; for n in bin/!\(*.*\); do echo $n; done | tr " " "\n" | while read line; do [ -d $line ] || echo $line; done':

              5.33 msec task-clock:u              #    1.019 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               720      page-faults:u             #    0.135 M/sec
         7,124,080      cycles:u                  #    1.337 GHz
        10,574,787      instructions:u            #    1.48  insn per cycle
         2,309,161      branches:u                #  433.261 M/sec
            58,271      branch-misses:u           #    2.52% of all branches

       0.005228103 seconds time elapsed

       0.003550000 seconds user
       0.002430000 seconds sys

to me the extglob alternative seems a lot harder to read. If you have a better way of writing it, please let me know. In terms of performance, both are within margin of error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for not being clear. The extglob returns directories and those need a test in the for loop.
I don't feel too strongly about the extglob.
Interesting that the performance is better with the find!

@happysalada
Copy link
Contributor

I have one minor nit, but otherwise it's looking good.

@happysalada
Copy link
Contributor

Thank you for your contribution!

@happysalada happysalada merged commit 6688c52 into NixOS:master Oct 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants