Denis C. Bauer edited this page Dec 23, 2013 · 18 revisions

CSIRO Garvan

Advanced Production Informatics

AIM: Reduce overhead of set-up and scripting for new projects

The first steps of analysing sequencing data (2GS, NGS) have entered a transitional period: on the one hand, most analysis steps can be automated and standardized (pipeline), while on the other, constantly evolving protocols and software updates make maintaining these analysis pipelines labour-intensive.
NGSANE offers a highly flexible framework that caters for different analysis types on high-performance-compute (HPC) systems, while also being generic enough to efficiently distribute the labour-intensive maintenance and extension work amongst the user community.

### Crowd-sourcing not Wheel-reinvention

Academic tools will remain the methods of choice for cutting-edge data analysis. However, most do not comply with even very basic software-development practices (e.g. poor documentation, lack of legacy support), which makes set-up and maintenance time-consuming. A similar issue applies to reference data sets, which need to be downloaded and often filtered and converted into a usable format.
Summarizing quality control and data yield in a meaningful way remains a labour-intensive expert task. Rather than battling these issues individually, it would be more efficient to set up a centralized system that is collectively maintained by the researchers who use it. The benefits would be:

  • Sharing modular methods/scripts for data analysis and summary
  • Optimized task-packaging for efficient HPC resource utilization
  • Ensuring consistency and reproducibility by keeping scripts separate from data
  • Benchmarking quality amongst other datasets within your organization
  • Enabling collaborative knowledge gain
  • Making developers’ expert knowledge available to users by enforcing scripts to have a self-contained quality control stage


Fig.1: Overview of NGSANE's functionality and structure. Each project (Input) has a project-specific config file (A) holding the necessary customizations for the planned analysis tasks. Note that each project can have multiple config files for each analysis task. Distinct from the project is the NGSANE core, which contains the pan-project configuration file (header.sh, B). This file contains general system variables, platform-specific parameters, and paths to the various software binaries installed on a system. It should be configured once upon initial installation, then modified whenever new software versions are required. Also in the core is the trigger.sh file (C), which is the main executable file in NGSANE. It processes the variables and tasks specified in the configuration files, ensuring that all dependencies are met and invoking the core job submission protocols. It allows the user to selectively launch a test or 'dry' run, launch a full high-performance-computing run, or generate a summary report once the tasks have completed. The mod files (D) contain the generic analytic pipelines that are to be executed on the HPC cluster. Each mod corresponds to a specific analysis, a single task, or a series thereof. They include checkpoints to recover from previously failed executions, as well as comprehensive logging of each step. Advanced users can create customised mods and include them in the framework. After execution, a concise summary of the results and a project card (E) can be generated. This usually includes general statistics of the results, including graphs, potential errors, and an itemised log of the checkpoints for each task.
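The checkpoint idea described for the mods can be illustrated with a minimal shell sketch. This is not NGSANE's actual mod code: the function `run_step`, the variable `CHECKPOINT_FILE`, and the one-step-name-per-line file layout are all hypothetical and only show the general technique of skipping steps that already completed in an earlier, partly failed run.

```shell
# Hypothetical sketch of checkpointed steps; not taken from NGSANE's mods.

# File recording the names of finished steps, one per line (assumed layout).
CHECKPOINT_FILE="${CHECKPOINT_FILE:-./checkpoints.log}"

# run_step NAME COMMAND [ARGS...]
# Skips COMMAND if NAME is already recorded as done; otherwise runs it and
# records NAME only when COMMAND exits successfully.
run_step() {
    local name="$1"
    shift
    if [ -f "$CHECKPOINT_FILE" ] && grep -qxF -- "$name" "$CHECKPOINT_FILE"; then
        echo "skip $name (checkpoint found)"
        return 0
    fi
    echo "run $name"
    if "$@"; then
        echo "$name" >> "$CHECKPOINT_FILE"
    else
        echo "fail $name" >&2
        return 1
    fi
}
```

A step such as `run_step align my_aligner input.fq` (with `my_aligner` standing in for any real command) would then be executed on the first run and skipped on a rerun, while a failed step leaves no checkpoint and is attempted again.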

Fig.2: The figure shows NGSANE's project card (Fig.1 E).

Quickstart

For detailed information see Setup Guide and User Guide.

  • Prepare a config.txt file to specify what needs to be done in a project
  • Trigger the pipelines and the different advanced production informatics tasks
     trigger.sh config.txt armed
  • Generate the project card
     trigger.sh config.txt report
  • If a rerun is necessary, choose the individual files to be rerun with
     trigger.sh config.txt keep
     [delete all filenames in qout//runnow.txt that do not have to be rerun]
     trigger.sh config.txt armed
  • Jobs can also be run directly rather than qsubmitted (e.g. for testing) with
     trigger.sh config.txt direct
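
The Quickstart steps could be wrapped in a small helper script along the following lines. This is only a sketch under stated assumptions: `ngsane_run` and `keep_only` are hypothetical helpers, not part of NGSANE; the mode list is limited to the four modes shown above (trigger.sh may accept others); and `keep_only` assumes that runnow.txt holds one filename per line, with the path to that file supplied by the caller.

```shell
# Hypothetical convenience helpers around the Quickstart commands.

# Command to invoke; defaults to trigger.sh as used in the Quickstart.
TRIGGER="${TRIGGER:-trigger.sh}"

# ngsane_run CONFIG MODE
# Checks that CONFIG exists and MODE is one of the Quickstart modes,
# then calls the trigger command with both arguments.
ngsane_run() {
    local config="$1" mode="$2"
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 2
    fi
    case "$mode" in
        armed|direct|keep|report) ;;
        *) echo "unsupported mode: $mode" >&2; return 2 ;;
    esac
    "$TRIGGER" "$config" "$mode"
}

# keep_only RUNNOW NAME [NAME...]
# Rewrites the file RUNNOW so that only lines exactly equal to one of the
# given NAMEs remain (assumes one filename per line).
keep_only() {
    local runnow="$1"
    shift
    local tmp
    tmp="$(mktemp)" || return 1
    # -F: fixed strings, -x: whole-line match, -f -: read patterns from stdin
    printf '%s\n' "$@" | grep -F -x -f - "$runnow" > "$tmp"
    mv "$tmp" "$runnow"
}
```

With these helpers, a rerun would be `ngsane_run config.txt keep`, then `keep_only` on the generated runnow.txt with the names that still need processing, then `ngsane_run config.txt armed`.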