Creating a Cloudera Cluster in the Cloud

git4impatient edited this page Aug 22, 2016 · 5 revisions

Instant (almost) GATK

Please see this wiki page:

https://github.com/git4impatient/GATK4onCloudera/wiki

This page outlines how to create a Cloudera cluster in the cloud. You would do this if you want to run analyses with GATK but lack a compute environment of your own. The example here creates a 5-node Cloudera Hadoop cluster. If you have a lot of data or want results quickly, you will probably want more nodes. If you are new to the cloud environment, you will likely need to ask your cloud provider to increase your limits for CPU, disk, network, and memory, because genomic files and processing can be very resource intensive.
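Before requesting a limit increase, it can help to add up what the whole cluster will need and compare it with your current limits. The following is a minimal sketch of that arithmetic, not part of the cluster scripts. The per-node sizes, the quota figures, and the names `NodeSpec`, `cluster_totals`, and `quota_shortfalls` are all hypothetical; substitute the machine type you actually plan to use and the limits shown in your provider's console.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Resources of a single cluster node (hypothetical example values)."""
    vcpus: int
    memory_gb: int
    disk_gb: int


def cluster_totals(node: NodeSpec, node_count: int) -> dict:
    """Total resources needed when every node has the same spec."""
    return {
        "vcpus": node.vcpus * node_count,
        "memory_gb": node.memory_gb * node_count,
        "disk_gb": node.disk_gb * node_count,
    }


def quota_shortfalls(required: dict, quota: dict) -> dict:
    """Return how much each resource exceeds its quota (only the ones that do)."""
    return {
        name: needed - quota.get(name, 0)
        for name, needed in required.items()
        if needed > quota.get(name, 0)
    }


# Assumed node size; not taken from the wiki or from any provider's defaults.
node = NodeSpec(vcpus=8, memory_gb=32, disk_gb=500)

# The 5-node cluster described above.
required = cluster_totals(node, node_count=5)

# Assumed current account limits; read the real ones from your provider.
quota = {"vcpus": 24, "memory_gb": 256, "disk_gb": 2048}

shortfalls = quota_shortfalls(required, quota)
print("Required:", required)
print("Request increases for:", shortfalls)
```

Any resource listed in `shortfalls` is one to raise with the provider before creating the cluster. Providers do not all meter the same resources, so treat the three keys here as placeholders.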

The steps are as follows:

  • Create an account with the cloud provider of your choice, for example Google Compute Engine (GCE): https://cloud.google.com/compute/ or Amazon Web Services (AWS): https://aws.amazon.com/
  • Prepare the configuration files
  • Create a virtual computer in the cloud
  • Log into that node and run the appropriate cluster creation script
  • Install GATK4
  • Solve your genomic challenge with GATK :-)