
Writing GATK Tools that use Python

Chris Norman edited this page Feb 8, 2018 · 2 revisions

Under construction.

General Guidelines

Some GATK tools depend on the use of Python for machine learning tasks. Such tools must have a Java front-end that:

  • uses standard GATK arguments
  • handles reading and writing of user inputs and final outputs, to ensure GCS support and consistent authentication
  • handles temporary file and resource management
  • uses Python only when necessary, as a computational kernel
  • documents all dependencies
  • minimizes the amount of code written in Python
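As a rough illustration of the "computational kernel" idea above, the sketch below shows a Python function that only receives local file paths and does the numeric work, leaving argument parsing, GCS access, and temporary file management to the Java front-end. The function name center_values and the one-number-per-line file format are assumptions made for this example; they are not part of GATK.

```python
def center_values(input_path, output_path):
    """Read one float per line and write each value minus the mean.

    Both paths are assumed to be local files prepared by the Java
    front-end; this kernel never resolves remote (e.g. GCS) locations.
    """
    with open(input_path) as handle:
        values = [float(line) for line in handle if line.strip()]
    if not values:
        # Fail loudly rather than writing an empty result.
        raise ValueError("no values found in " + input_path)
    mean = sum(values) / len(values)
    with open(output_path, "w") as handle:
        for value in values:
            handle.write("{:.6f}\n".format(value - mean))
    return len(values)
```

Under these assumptions the front-end would create both files, invoke the kernel with their paths, and read the output file once the Python call completes.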

Additionally, tool authors should:

  • declare all dependencies in the Conda environment definition file gatkcondaenv.yml
  • not depend on package versions that have Linux or Mac-specific dependencies
  • prefer single-line commands embedded in Java over multiple, serial commands
  • write Python errors to stderr
  • raise exceptions in Python for error conditions
  • ensure that program correctness does not rely on consumption of Python stdout
  • logging: TBD
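The error-handling guidelines above (errors to stderr, exceptions for error conditions, no reliance on stdout) might look something like the following on the Python side. This is a minimal sketch; parse_fraction and run are hypothetical names chosen for the example, not GATK functions.

```python
import sys


def parse_fraction(text):
    """Parse a value in [0, 1]; raise instead of returning a sentinel."""
    value = float(text)  # raises ValueError on malformed input
    if not 0.0 <= value <= 1.0:
        raise ValueError("fraction out of range: {}".format(value))
    return value


def run(text, err=sys.stderr):
    """Return the parsed value, reporting any failure on the error stream."""
    try:
        return parse_fraction(text)
    except ValueError as error:
        # Diagnostics go to stderr; nothing the caller needs is printed to stdout.
        print("ERROR: {}".format(error), file=err)
        # Re-raise so the failure propagates instead of being swallowed.
        raise
```

Because the exception is re-raised rather than swallowed, an uncaught failure ends the Python process with a non-zero exit status, which the calling process can detect without parsing any output.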

Conda Environment

GATK relies on a Conda environment to establish the correct version of Python and underlying required dependencies. This environment is defined declaratively in the file gatkcondaenv.yml, and shared by all GATK Python tools and peripheral code. Removing or changing the version of a dependency in this file should be done with care, and by consensus with all teams that are dependent on that package.

Executors

There are two methods for integrating Python with a Java front end: PythonScriptExecutor and StreamingPythonScriptExecutor. PythonScriptExecutor is an easy-to-use method for synchronously executing a single Python command, script, or module. StreamingPythonScriptExecutor employs a more complex keep-alive model that allows execution of multiple commands, asynchronous commands, and data transfer through named pipes.
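To make the keep-alive idea more concrete, here is a small, self-contained sketch of the Python side of such a session: a loop that reads one command per line, executes it in a namespace that persists between commands, and writes an acknowledgement after each one. It is not GATK's implementation. The serve function and the "ack"/"nck" tokens are assumptions for illustration, and in-memory streams stand in for the named pipes that a real session would use.

```python
import io


def serve(command_stream, ack_stream, namespace=None):
    """Execute newline-delimited Python statements until the stream ends.

    The namespace persists across commands, which is what distinguishes
    a keep-alive session from starting a fresh interpreter per command.
    """
    namespace = {} if namespace is None else namespace
    for line in command_stream:
        command = line.strip()
        if not command:
            continue
        try:
            exec(command, namespace)
            ack_stream.write("ack\n")  # hypothetical success token
        except Exception as error:
            ack_stream.write("nck {}\n".format(error))  # hypothetical failure token
        ack_stream.flush()
    return namespace


# In-memory streams stand in for the command and acknowledgement pipes.
commands = io.StringIO("x = 2\ny = x * 21\n")
acks = io.StringIO()
state = serve(commands, acks)
```

The second command can use x only because the first command's state is still alive. A one-shot executor would instead need both statements in a single command or script.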

PythonScriptExecutor

Under construction.

StreamingPythonScriptExecutor

Under construction.