This demo shows how to submit the Analysis Grand Challenge (AGC) to REANA using Snakemake as the workflow engine.
For a full explanation, please have a look at this documentation:
The Analysis Grand Challenge (AGC) is about performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. This includes:
- columnar data extraction from large datasets,
- processing of that data (event filtering, construction of observables, evaluation of systematic uncertainties) into histograms,
- statistical model construction and statistical inference,
- relevant visualizations for these steps.
The physics analysis task is a ttbar cross-section measurement with 2015 CMS Open Data (see datasets/cms-open-data-2015).
The current reference implementation can be found in analyses/cms-open-data-ttbar.
We are using 2015 CMS Open Data in this demonstration to showcase an analysis pipeline. The input .root files are listed in nanoAODschema.json.
The current coffea AGC version defines the coffea Processor, which includes a lot of the physics analysis details:
- event filtering and the calculation of observables,
- event weighting,
- calculating systematic uncertainties at the event and object level,
- filling all the information into histograms that get aggregated and ultimately returned to us by coffea.
The analysis takes the following inputs:
- nanoAODschema.json: the input .root files
- Snakefile: the Snakefile for the workflow
- ttbar_analysis_reana.ipynb: the main notebook, where files are processed and analysed
- file_merging.ipynb: notebook to merge each processed .root file into one file with unique keys
- final_merging.ipynb: notebook to merge all of the histograms together
REANA provides support for the Snakemake workflow engine. To ensure the fastest execution of the AGC ttbar workflow, a two-level (multicascading) parallelization approach with Snakemake is implemented.
In the initial step, Snakemake distributes all jobs across separate nodes, each with a single .root file for ttbar_analysis_reana.ipynb.
Subsequently, after the completion of each rule, the merging of individual files into one per sample takes place.
Here is a high-level overview of the AGC workflow:
  +-----------------------------------------+
  | Take the CMS open data from nanoaod.json|
  +-----------------------------------------+
                       |
                       |
                       |
                       v
     +-----------------------------------+
     |rule: Process each file in parallel|
     +-----------------------------------+
                       |
                       |
                       |
                       v
  +-----------------------------------------+
  |rule: Merge created files for each sample|
  +-----------------------------------------+
                       |
                       |
                       |
                       v
+----------------------------------------------+
|rule: Merge sample files into single histogram|
+----------------------------------------------+
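The two-level scheme can be illustrated with a small, dependency-free Python sketch. It is not the actual Snakefile: the sample names, file names and the `process_file` stand-in are hypothetical, and a `Counter` plays the role of a histogram. It only shows the shape of the computation: process every file independently, merge per sample, then merge the samples.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input listing: sample name -> list of input files.
INPUTS = {
    "ttbar": ["ttbar_0.root", "ttbar_1.root"],
    "wjets": ["wjets_0.root"],
}


def process_file(sample, path):
    """Stand-in for running the analysis notebook on one file."""
    # A real job would fill histograms; here each file contributes one count.
    return sample, Counter({f"{sample}_events": 1})


def merge(histograms):
    """Add up an iterable of Counter 'histograms'."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


def run(inputs):
    jobs = [(sample, path) for sample, paths in inputs.items() for path in paths]
    # Level 1: process every file independently, in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda job: process_file(*job), jobs))
    # Level 2: merge the per-file results into one result per sample.
    per_sample = defaultdict(list)
    for sample, hist in results:
        per_sample[sample].append(hist)
    merged_samples = {s: merge(hists) for s, hists in per_sample.items()}
    # Final step: merge the per-sample results into a single output.
    return merge(merged_samples.values())


final = run(INPUTS)
print(dict(final))
```

In the real workflow each level corresponds to a Snakemake rule, and the jobs of the first level run on separate nodes rather than in threads.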
To be able to rerun the AGC after some time, we need to "encapsulate the current compute environment", for example to freeze the ROOT version our analysis is using. We shall achieve this by preparing a Docker container image for our analysis steps.
We are using a modified version of the analysis-systems-base Docker container image with additional packages. The main one is papermill, which allows running a Jupyter notebook from the command line with additional parameters.
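As a rough illustration of how papermill is driven from the command line, the sketch below builds a papermill invocation that passes parameters with the `-p NAME VALUE` option. The output notebook name and the parameter names (`sample_name`, `filename`) are assumptions made for this example; the actual names used by the workflow are defined in the Snakefile and the notebook.

```python
import shlex


def papermill_command(notebook, output, parameters):
    """Build a papermill command line that injects parameters into a notebook."""
    cmd = ["papermill", notebook, output]
    for name, value in parameters.items():
        # Each "-p NAME VALUE" pair sets one notebook parameter.
        cmd += ["-p", name, str(value)]
    return cmd


cmd = papermill_command(
    "ttbar_analysis_reana.ipynb",
    "ttbar_analysis_output.ipynb",  # hypothetical output notebook name
    {"sample_name": "ttbar", "filename": "input.root"},  # hypothetical parameters
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` inside the container, where papermill is installed.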
In our case, the Dockerfile creates a conda virtual environment with all necessary packages for running the AGC analysis.
$ less environment/Dockerfile
Let's go into the environment directory and build the image:
$ cd environment/
We can build our AGC environment image and give it the name docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea:
$ docker build -t docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea .
We can push the image to the DockerHub image registry:
$ docker push docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea
Some of the data are located on EOS (eos/public), so in order to process a large number of files, the user should be authenticated with Kerberos. In our case we achieve this by setting:
workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile
If you are processing a small number of files (fewer than 10), you can set this option to False.
Alternatively, you can set Kerberos authentication via the Snakemake rules.
For a deeper understanding, please refer to the [REANA documentation](https://docs.reana.io/advanced-usage/access-control/kerberos/).
The reana.yaml file describes the above analysis structure with its inputs, code, runtime environment, computational workflow steps and expected outputs:
version: 0.8.0
inputs:
  files:
    - ttbar_analysis_reana.ipynb
    - nanoaod_inputs.json
    - fix-env.sh
    - corrections.json
    - Snakefile
    - file_merging.ipynb
    - final_merging.ipynb
    - prepare_workspace.py
  directories:
    - histograms
    - utils
workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile
outputs:
  files:
    - histograms_merged.root
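Before submitting, it can help to confirm that every input declared in reana.yaml is actually present in the working directory. The sketch below shows one possible check. The helper `missing_inputs` is hypothetical and not part of REANA, and the specification is written as a trimmed-down Python dict so that the example does not depend on a YAML parser.

```python
import tempfile
from pathlib import Path


def missing_inputs(spec, root):
    """Return the declared input files and directories that do not exist under root."""
    inputs = spec.get("inputs", {})
    root = Path(root)
    missing = [f for f in inputs.get("files", []) if not (root / f).is_file()]
    missing += [d for d in inputs.get("directories", []) if not (root / d).is_dir()]
    return missing


# A trimmed-down copy of the specification, written as a Python dict.
spec = {
    "inputs": {
        "files": ["Snakefile", "corrections.json"],
        "directories": ["utils"],
    }
}

# Demonstrate on a temporary directory where corrections.json is absent.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Snakefile").touch()
    (Path(tmp) / "utils").mkdir()
    result = missing_inputs(spec, tmp)

print(result)
```

In a real checkout, the dict would be loaded from reana.yaml and `root` would be the repository directory.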
We can now install the REANA command-line client, run the analysis and download the resulting plots:
$ # create new virtual environment
$ virtualenv ~/.virtualenvs/reana
$ source ~/.virtualenvs/reana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # run AGC workflow
$ reana-client run -w reana-agc-cms-ttbar-coffea
$ # ... should be finished in around 6 minutes if you select all files in the Snakefile
$ reana-client status
$ # list workspace files
$ reana-client ls
Please see the REANA-Client documentation for a more detailed explanation of typical reana-client usage scenarios.
The output is written to histograms_merged.root, which can be further evaluated with a variety of AGC tools.