Skip to content

An R package for fast and efficient visualizing of GWAS results using Q-Q and Manhattan plots directly from PLINK output files.

License

Notifications You must be signed in to change notification settings

kaustubhad/fastman

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fastman

Description

An R package for fast and efficient visualizing of GWAS results using Q-Q and Manhattan plots directly from PLINK output files.

  • Fast: Drastically reduces time in plot generation compared to qqman. On a typical imputed PLINK assoc file of 10 million SNPs, plotting time is reduced from 737s in qqman to 60s.
  • Efficient: Optimized memory management
  • Versatile: Can handle various inputs from p-values, and logarithms of p-values to FST scores. Compatible plotting with other genome-wide population genetic parameters (e.g. FST, Ď€ and D statistics). Allows both-sided scores, e.g. scores with negative values.
  • Non-model friendly: Additional support for results from genomes of non-model organisms (often with hundreds of contigs or many scaffolds), alphabetical and other ordering options.
  • Annotation and Highlight versatility: Has a wide set of options to customize annotating and highlighting SNPs of interest.
  • Familiar: Has a similar set of input arguments and code structure compared to qqman.
  • Missing Value Handling: Can handle missing values in the input data frame.
  • Gene Annotation Capability: Can annotate with gene names instead of SNP names if the user prefers. Currently, the package provides in-built support for human builds GRCh36, 37 and 38, but other builds, e.g., for non-model organisms, can also be provided as input by the user.

Installation

If you are using Rstudio you can use the following code to install the package.

devtools::install_github('kaustubhad/fastman',build_vignettes = TRUE)

If you are not using Rstudio, we would recommend that you install the package without building the vignette.

devtools::install_github('kaustubhad/fastman',build_vignettes = FALSE)

Functions:

1. fastman

Description

Creates a Manhattan plot directly from a PLINK assoc output (or any data frame with chromosome, position, and p-value).

Usage

fastman (m, chr = "CHR", bp = "BP", p = "P", snp, chrlabs, speedup=TRUE, logp = TRUE, scattermore = FALSE,
        col="matlab", maxP=14, sortchr=TRUE, bybp=FALSE, chrsubset, bprange, highlight,
        annotateHighlight=FALSE, annotatePval, colAbovePval=FALSE, col2="greys", annotateTop=TRUE,
        annotationWinMb, annotateN, annotationCol, annotationAngle=45, baseline=NULL, suggestiveline,
        genomewideline, cex=0.4, cex.text=0.4, cex.axis=0.6, scattermoresize = c(3000,1800), 
        geneannotate = FALSE, build, sep="|", border=0, xlab, ylab, xlim, ylim, gap.axis=NA, ...)

Parameters:

  • m = A data frame containing data for producing the Manhattan plot. Has to contain a minimum of three columns: base pair position, chromosome ID, and P-value: defaults are "BP", "CHR", and "P" following the Plink assoc file convention. And optionally some ID, e.g. the SNP ID (default "SNP”) if annotations are needed. See explanations of the next four parameters if your column names differ, e.g. for non-model organisms, you can use contid ID instead of chromosome ID.
  • chr = A string denoting the column name for the chromosome. Defaults to “CHR”, which corresponds to the PLINK –assoc command output. For non-model organisms, this could be the contig ID. In case your chromosome column is numeric but has been converted into string during the reading of data in R, you must pay close attention to the sorting order of chromosomes. If you still want the chromosomes to be sorted in an increasing order of chromosome number then you must convert your chromosome column to numeric before using the function. If your data is already sorted then you do not need to convert the column to numeric, you can specify sortchr = FALSE in your input arguments instead.
  • bp = A string denoting the column name for the chromosomal position. Defaults to “BP”, which corresponds to the PLINK –assoc command output. The column must be numeric.
  • p = A string denoting the column name for the p-values or scores for the SNP association tests. Defaults to “P”, which corresponds to the PLINK –assoc command output. The column must be numeric. You can also provide an already-computed log of p-values, e.g. from published summary statistics.
  • snp = A string denoting the column name for the SNP name (rs number). The column must be character.
  • chrlabs = An optional character vector of length equal to the number of chromosomes, specifying the chromosome labels. e.g., you can provide c(1:22, "X", "Y", "MT") to convert the Plink numerical notation of 23=X, 24=Y, etc. This character vector is used to create the axis labels of the Manhattan plot. So, you should sort the character vector in the order you want the chromosome labels to appear in the final plot. For example, if your input data frame has chromosome numbers in a particular order you specifically want, and you have used the option sortchr = FALSE to preserve the order for your final plot, then your chrlabs vector should also have the same order of chromosomes.
  • speedup = A logical value; if TRUE, the function employs the faster method where input values at the extreme 0.2% are rounded to 3 digits, and the rest is rounded to 2 digits. The default value of this parameter is TRUE.
  • logp = A logical value; if TRUE, negative logarithms (base 10) of p-values are plotted. In case the user wants to use FST score type data or logarithm of p-values directly, then logp must be stated to be FALSE, as the default value of this parameter is TRUE.
  • scattermore = A logical value; if TRUE, uses scattermore package to speed up plot generation faster. In case the user wants to use this feature, the scattermore package needs to be installed and loaded before running the command. The default value of this parameter is FALSE.
  • col = A string indicating the colour scheme of the plot. Defaults to “matlab”. There are various options available for users. See below for details.
  • maxP = A numeric value indicating the maximum y-value till which the user wants to visualize. The default value of this parameter is 14. If the data has negative values then both sides are truncated till the absolute value of the parameter. The user can provide NULL as input if truncation is not required.
  • sortchr = A logical value; if TRUE, the table is sorted by chromosome number before plotting. If not specified by the user, the function takes the default value TRUE.
  • bybp = A logical value; if TRUE, the y-values are plotted against chromosome positions. In this case, the table is not sorted by chromosome number before plotting. If not specified by the user, the function takes the default value FALSE. This feature is useful, especially for plots where the user might be interested in studying the association p-values across contigs.
  • chrsubset = The subset of chromosome numbers to be plotted.
  • bprange = The range of chromosome positions to be plotted. In case the user wants to subset the X-axis by region, then this should be the parameter of choice, not xlim.
  • highlight = A character vector of SNPs in the dataset to highlight. These SNPs should all be in the dataset.
  • annotateHighlight = A logical value; if TRUE, annotates all highlighted SNPs in case more specific annotation instructions are not provided.
  • annotatePval = A numeric value, if set, SNPs with p-values below this will be annotated on the plot. In the case of p-value, the user can provide either the p-value or the negative logarithm of the p-value as input for this argument, whichever is convenient. In the case of scores, the user can provide the score cutoff directly as input.
  • colAbovePval = A logical value, if TRUE, will colour all hits above the specified p-value threshold using the colour scheme chosen in col argument (default "matlab"), while the points below the threshold will be coloured using the colour scheme chosen in col2 argument below (default "greys"). Defaults to FALSE.
  • col2 = A string indicating the colour scheme of the part of the plot below the specified p-value threshold. Defaults to “greys”. There are various options available for users. See below for details.
  • annotateTop = A logical value; If TRUE, only annotates the top hit on each chromosome that is below the annotatePval threshold. This is just a modifier, and it works only when used with either annotatePval, annotateHighlight or annotateN.
  • annotationWinMb = A numeric value, if set, will determine the megabase window within which the top SNP will be annotated. This is just a modifier, and it works only when used with either annotatePval, annotateHighlight or annotateN.
  • annotateN = A numeric value, if set, this number of top SNPs will be annotated on the plot.
  • annotationCol = A string indicating the colour of the annotation or the column name containing the colour information for the individual rows. The user can provide a column as a part of the input data frame which contains annotation colour information corresponding to each SNP and specify the column name in this parameter. If the user does not want annotation for some particular SNP, NA can be provided in this column corresponding to that SNP. In case the user provides just a string indicating the annotation colour instead of a column name, then the same colour will be used for annotating all the SNPs. Defaults to grey.
  • annotationAngle = The angle of annotation, defaults to 45 degrees.
  • baseline = The position to draw a baseline in black. Defaults to NULL, as a typical Manhattan plot already has a baseline at y=0. In case the data has a left tail (e.g. two-sided scoes) the user might want to provide a baseline position for reference. In case multiple baselines are required, the user can provide a vector of positions.
  • suggestiveline = The position to draw a GWAS "suggestive significance" line. Defaults to -log10(1e-5). In case multiple suggestive lines are required, the user can provide a vector of positions.
  • genomewideline = The position to draw a GWAS “genome-wide significance” line. Defaults to -log10(5e-8). In case multiple genome-wide significant lines are required, the user can provide a vector of positions.
  • cex = A numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default. This works as a multiple of par("cex"). NULL and NA are equivalent to 1.0. Defaults to 0.4.
  • cex.text = A numerical vector giving the amount by which annotation text should be scaled relative to the default. This works as a multiple of par("cex"). NULL and NA are equivalent to 1.0. Defaults to 0.4.
  • cex.axis = The magnification to be used for axis annotation relative to the current setting of cex. Defaults to 0.6.
  • scattermoresize = A 2-element integer vector to specify the size of scattermore plot in pixels. Applicable only if scattermore = TRUE. The user should make sure that the value of scattermoresize parameter matches the desired size of the final image output for best-quality plots. Defaults to c(3000,1800).
  • geneannotate = A logical value; if TRUE, the function annotates using gene names instead of SNP names. If not specified by the user, the function takes the default value FALSE.
  • build = A numerical value specifying the build number for matching locations to gene names. build=38 corresponds to GRCh38/b38 and Hg20, build=37 corresponds to GRCh37/b37 and Hg19, and build=36 corresponds to GRCh36/b36 and Hg18.
  • sep = A character value specifying the separator to be used in case multiple gene name hits come up in the same annotated SNP. If not specified by the user, the function takes the default value '|'.
  • xlab = A label for the x-axis, defaults to a description of x.
  • ylab = A label for the y-axis, defaults to a description of y.
  • xlim = The x limits of the plot. The user should refrain from changing xlim to subset x-axis by region. The better option is to specify this in the bprange argument, as changing xlim might lead to improper scaling and spacing in the plot.
  • ylim = The y limits of the plot. The user should refrain from changing ylim to truncate the y-axis. The better option is to specify the same in the maxP argument, as changing ylim might lead to improper scaling and spacing in the plot.
  • gap.axis = A numeric factor that determines the minimum gaps between axis labels; defaults to NA to use R's default settings. The default value might lead to labels of the smaller chromosomes being dropped in some cases; if so, setting gap.axis = -2 can help to display all the labels. But if the figure width is not sufficiently large, or if the chromosome/contig names are large, this can cause overlap in the text.

Value

A Manhattan Plot

2. fastqq

Description

Creates a quick quantile-quantile plot from GWAS outputs. On a typical imputed assoc file of 10 million SNPs, it reduces plotting time from 400s in qqman to 24s.

Usage

fastqq (p1, p2=NULL, colour, logtransform=TRUE, pairwisecompare=TRUE, speedup=TRUE, lambda=TRUE, maxP=14,
        fix_zero=TRUE, cex=0.6, cex.axis=0.9, xlab, ylab, ...)

Parameters:

  • p1 = A numeric vector of p-values. If the user has a single set of p-values, then this is the input for that. In case the user wants to compare two sets of p-values, then this argument takes the first set of p-values as input.
  • p2 = A numeric vector of p-values. In case the user wants to compare two sets of p-values, this argument takes the second set of p-values as input. Otherwise, this argument defaults to NULL.
  • logtransform = A logical value; if TRUE, negative logarithms (base 10) of p-values are plotted. In case the user wants to use FST score type data or logarithm of p-values directly, then logtransform must be stated to be FALSE, as the default value of this parameter is TRUE.
  • pairwisecompare = A logical value. If the two sets of p-values are provided in pairs, the user should state this argument as TRUE. Otherwise, both sets of p-values are sorted individually and then plotted against each other.
  • speedup = A logical value; if TRUE, the function employs the faster method where inputs are rounded to 3 digits.
  • lambda = A logical value; if TRUE, the genomic inflation factor lambda is shown in the top-left corner.
  • maxP = A numeric value indicating the maximum negative logarithm of the p-value till which the user wants to visualize. The default value of this parameter is 14. If the data has negative values then both sides are truncated till the absolute value of the parameter. The user can provide NULL as input if truncation is not required.
  • fix_zero = A logical value; if TRUE, zero input values are converted to half of the minimum observed input value; if FALSE, zero input values are removed.
  • cex = A numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default. This works as a multiple of par("cex"). NULL and NA are equivalent to 1.0. Defaults to 0.6.
  • cex.axis = The magnification to be used for axis annotation relative to the current setting of cex. Defaults to 0.9.
  • xlab = A label for the x axis, defaults to a description of x.
  • ylab = A label for the y-axis, defaults to a description of y.

Value

  • A Q-Q Plot
  • The genomic inflation factor lambda

3. fastman_gg

Description

Creates a Manhattan ggplot object directly from a PLINK assoc output (or any data frame with chromosome, position, and p-value). The user needs to install and load the ggplot2 package to use this function.

Usage

fastman_gg (m, chr = "CHR", bp = "BP", p = "P", snp, chrlabs, speedup=TRUE, logp = TRUE,
        scattermore = FALSE, repel = FALSE, col="matlab", maxP=14, sortchr=TRUE, bybp=FALSE, chrsubset,
        bprange, highlight, annotateHighlight=FALSE, annotatePval, colAbovePval=FALSE, col2="greys",
        annotateTop=TRUE, annotationWinMb, annotateN, annotationCol, annotationAngle=45, baseline=NULL,
        suggestiveline, genomewideline, cex=0.9, cex.text=1.8, cex.axis=0.6,
        scattermoresize = c(3000,1800), geneannotate = FALSE, build, sep="|",xlab, ylab, xlim, ylim, ...)

Parameters:

  • m = A data frame containing data for producing the Manhattan plot. Has to contain a minimum of three columns: base pair position, chromosome ID, and P-value: defaults are "BP", "CHR", "P" following the Plink assoc file convention. And optionally some sort of ID, e.g. the SNP ID (default "SNP”) if annotations are needed. See explanations of the next four parameters if your column names are different, e.g. for non-model organisms, you can use contid ID instead of chromosome ID.
  • chr = A string denoting the column name for the chromosome. Defaults to “CHR”, which corresponds to the PLINK –assoc command output. For non-model organisms, this could be the contig ID. In case your chromosome column is numeric but has been converted into string during the reading of data in R, you must pay close attention to the sorting order of chromosomes. If you still want the chromosomes to be sorted in an increasing order of chromosome number then you must convert your chromosome column to numeric before using the function. If your data is already sorted then you do not need to convert the column to numeric, you can specify sortchr = FALSE in your input arguments instead.
  • bp = A string denoting the column name for the chromosomal position. Defaults to “BP”, which corresponds to the PLINK –assoc command output. The column must be numeric.
  • p = A string denoting the column name for the p-values or scores for the SNP association tests. Defaults to “P”, which corresponds to the PLINK –assoc command output. The column must be numeric. You can also provide an already-computed log of p-values, e.g. from published summary statistics.
  • snp = A string denoting the column name for the SNP name (rs number). The column must be character.
  • chrlabs = An optional character vector of length equal to the number of chromosomes, specifying the chromosome labels. e.g., you can provide c(1:22, "X", "Y", "MT") to convert the Plink numerical notation of 23=X, 24=Y, etc. This character vector is used to create the axis labels of the Manhattan plot. So, you must sort the character vector in the order you want the chromosome labels to appear in the final plot. For example, if your input data frame has chromosome numbers in a particular order you specifically want, and you have used the option sortchr = FALSE to preserve the order for your final plot, then your chrlabs vector should also have the same order of chromosomes.
  • speedup = A logical value; if TRUE, the function employs the faster method where input values at the extreme 0.2% are rounded to 3 digits, and the rest is rounded to 2 digits. The default value of this parameter is TRUE.
  • logp = A logical value; if TRUE, negative logarithms (base 10) of p-values are plotted. In case the user wants to use FST score type data or logarithm of p-values directly, then logp must be stated to be FALSE, as the default value of this parameter is TRUE.
  • scattermore = A logical value; if TRUE, uses scattermore package to speed up plot generation faster. In case the user wants to use this feature, the scattermore package needs to be installed and loaded before running the command. The default value of this parameter is FALSE.
  • repel = A logical value; if TRUE, uses ggrepelpackage to repel overlapping text labels. In case the user wants to use this feature, the ggrepel package needs to be installed and loaded before running the command. The default value of this parameter is FALSE.
  • col = A string indicating the colour scheme of the plot. Defaults to “matlab”. There are various options available for users. See below for details.
  • maxP = A numeric value indicating the maximum y-value till which the user wants to visualize. The default value of this parameter is 14. If the data has negative values then both sides are truncated till the absolute value of the parameter. The user can provide NULL as input if truncation is not required.
  • sortchr = A logical value; if TRUE, the table is sorted by chromosome number before plotting. If not specified by the user, the function takes the default value TRUE.
  • bybp = A logical value; if TRUE, the y-values are plotted against chromosome positions. In this case, the table is not sorted by chromosome number before plotting. If not specified by the user, the function takes the default value FALSE. This feature is useful, especially for plots where the user might be interested in studying the association p-values across contigs.
  • chrsubset = The subset of chromosome numbers to be plotted.
  • bprange = The range of chromosome positions to be plotted. In case the user wants to subset the X-axis by region, then this should be the parameter of choice, not xlim.
  • highlight = A character vector of SNPs in the dataset to highlight. These SNPs should all be in the dataset.
  • annotateHighlight = A logical value; if TRUE, annotates all highlighted SNPs in case more specific annotation instructions are not provided.
  • annotatePval = A numeric value, if set, SNPs with p-values below this will be annotated on the plot. In the case of p-value, the user can provide either the p-value or the negative logarithm of the p-value as input for this argument, whichever is convenient. In the case of scores, the user can provide the score cutoff directly as input.
  • colAbovePval = A logical value, if TRUE, will colour all hits above the specified p-value threshold using the colour scheme chosen in col argument (default "matlab"), while the points below the threshold will be coloured using the colour scheme chosen in col2 argument below (default "greys"). Defaults to FALSE.
  • col2 = A string indicating the colour scheme of the part of the plot below the specified p-value threshold. Defaults to “greys”. There are various options available for users. See below for details.
  • annotateTop = A logical value; If TRUE, only annotates the top hit on each chromosome that is below the annotatePval threshold. This is just a modifier, and it works only when used with either annotatePval, annotateHighlight or annotateN.
  • annotationWinMb = A numeric value, if set, will determine the megabase window within which the top SNP will be annotated. This is just a modifier, and it works only when used with either annotatePval, annotateHighlight or annotateN.
  • annotateN = A numeric value, if set, this number of top SNPs will be annotated on the plot.
  • annotationCol = A string indicating the colour of the annotation or the column name containing the colour information for the individual rows. The user can provide a column as a part of the input data frame which contains annotation colour information corresponding to each SNP and specify the column name in this parameter. If the user does not want annotation for some particular SNP, NA can be provided in this column corresponding to that SNP. In case the user provides just a string indicating the annotation colour instead of a column name, then the same colour will be used for annotating all the SNPs. Defaults to grey.
  • annotationAngle = The angle of annotation, defaults to 45 degrees.
  • baseline = The position to draw a baseline in black. Defaults to NULL, as a typical Manhattan plot already has a baseline at y=0. In case the data has a left tail (e.g. two-sided scoes) the user might want to provide a baseline position for reference. In case multiple baselines are required, the user can provide a vector of positions.
  • suggestiveline = The position to draw a GWAS "suggestive significance" line. Defaults to -log10(1e-5). In case multiple suggestive lines are required, the user can provide a vector of positions.
  • genomewideline = The position to draw a GWAS “genome-wide significance” line. Defaults to -log10(5e-8). In case multiple genome-wide significant lines are required, the user can provide a vector of positions.
  • cex = A numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default. This works as a multiple of par("cex"). NULL and NA are equivalent to 1.0. Defaults to 0.9.
  • cex.text = A numerical vector giving the amount by which annotation text should be scaled relative to the default. This works as a multiple of par("cex"). NULL and NA are equivalent to 1.0. Defaults to 1.8.
  • cex.axis = The magnification to be used for axis annotation relative to the current setting of cex. Defaults to 0.6.
  • scattermoresize = A 2-element integer vector to specify the size of the scattermore plot in pixels. Applicable only if scattermore = TRUE. The user should make sure that the value of scattermoresize parameter matches with the desired size of the final image output for best-quality plots. Defaults to c(3000,1800).
  • geneannotate = A logical value; if TRUE, the function annotates using gene names instead of SNP names. If not specified by the user, the function takes the default value FALSE.
  • build = A numerical value specifying the build number for matching locations to gene names. build=38 corresponds to GRCh38/b38 and Hg20, build=37 corresponds to GRCh37/b37 and Hg19, and build=36 corresponds to GRCh36/b36 and Hg18.
  • sep = A character value specifying the separator to be used in case multiple gene name hits come up in the same annotated SNP. If not specified by the user, the function takes the default value '|'.
  • xlab = A label for the x-axis, defaults to a description of x.
  • ylab = A label for the y-axis, defaults to a description of y.
  • xlim = The x limits of the plot. The user should refrain from changing xlim to subset x-axis by region. The better option is to specify this in the bprange argument, as changing xlim might lead to improper scaling and spacing in the plot.
  • ylim = The y limits of the plot. The user should refrain from changing ylim to truncate the y-axis. The better option is to specify the same in the maxP argument, as changing ylim might lead to improper scaling and spacing in the plot.

Value

A ggplot object

Examples

The fastman package includes functions for creating Manhattan plots and Q-Q plots from GWAS results. Let us first try using the package on a regular PLINK assoc output file. This is a GWAS summary statistics output file from a study by the authors (https://doi.org/10.1038/ncomms10815), available from the GWAS Central repository ( https://www.gwascentral.org/study/HGVST2597)

Reading a regular PLINK assoc output dataset

m=read.table("beard.assoc.linear",header=TRUE,stringsAsFactors=FALSE,sep="\t")

This dataset has results for 8,776,423 SNPs on 22 chromosomes. Let us take a look at the data.

str(m)
'data.frame':	8776423 obs. of  9 variables:
 $ CHR  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ SNP  : chr  "rs58108140" "rs180734498" "rs116400033" "rs62637813" ...
 $ BP   : num  10583 13302 51479 52058 52185 ...
 $ A1   : chr  "A" "T" "A" "C" ...
 $ TEST : chr  "ADD" "ADD" "ADD" "ADD" ...
 $ NMISS: num  1551 1911 1552 2348 2421 ...
 $ BETA : num  0.1486 0.07006 0.02572 0.1391 -0.00482 ...
 $ STAT : num  1.793 0.7667 0.3814 2.035 -0.0403 ...
 $ P    : num  0.0731 0.4433 0.703 0.042 0.9678 ...
head(m)
  CHR         SNP    BP A1 TEST NMISS      BETA     STAT       P
1   1  rs58108140 10583  A  ADD  1551  0.148600  1.79300 0.07314
2   1 rs180734498 13302  T  ADD  1911  0.070060  0.76670 0.44330
3   1 rs116400033 51479  A  ADD  1552  0.025720  0.38140 0.70300
4   1  rs62637813 52058  C  ADD  2348  0.139100  2.03500 0.04199
5   1 rs201374420 52185  T  ADD  2421 -0.004824 -0.04033 0.96780
6   1 rs150021059 52238  T  ADD  2290 -0.103600 -0.86640 0.38640
tail(m)
        CHR         SNP       BP A1 TEST NMISS      BETA     STAT      P
8776418  22 rs144549712 51229855  A  ADD  2206  0.029490  0.52490 0.5997
8776419  22  rs62240042 51233300  T  ADD  1536 -0.009488 -0.27510 0.7833
8776420  22 rs200507571 51236013 AT  ADD  1847  0.003518  0.08454 0.9326
8776421  22   rs3896457 51237063  C  ADD  1802 -0.018700 -0.56280 0.5737
8776422  22 rs149733995 51238249  C  ADD  2268  0.064180  0.99660 0.3190
8776423  22 rs181833046 51243297  T  ADD  2250  0.100500  1.06600 0.2866

Creating Manhattan Plots

From the above dataset, lets generate a basic Manhattan plot.

png("md1.png", width=10, height=6, units="in", res=300)
fastman(m)
dev.off()

Before moving into any further details, let us look into the plethora of options that this package provides us in a single plot. We will understand these options more rigorously as we go into the later parts of this exercise.

Let us compare the time of plot generation with qqman. For this purpose, we are going to use a library tictoc which will record the run time.

library(tictoc)
tic(); png("md1.png", width=10, height=6, units="in", res=300); fastman(m); dev.off(); toc();

This will record the run time of fastman.

77.238 sec elapsed

Now, lets run the same code for qqman.

library(qqman)
tic(); png("md1a.png", width=10, height=6, units="in", res=300); manhattan(m); dev.off(); toc();

Lets find out the run time of qqman.

630.995 sec elapsed

As seen above, fastman reduces the run time drastically.

We recommend using CairoPNG() command instead of the regular png() for generating the plots, as it provides high-quality image outputs. It requires the user to install and load the Cairo package beforehand. Let us generate the previous plot again using CairoPNG().

CairoPNG("cmd1.png", width=10, height=6, units="in", res=300)
fastman(m)
dev.off()

We can change some basic graph parameters. Let us reduce the point size (cex=) to 30% and reduce the font size of the axis labels (cex.axis=) to 50%. We can also change the colour palette (col=) and remove the suggestive (suggestiveline=) and genome-wide (genomewideline=) significance lines.

CairoPNG("cmd2.png", width=10, height=6, units="in", res=300)
fastman(m, cex=0.3, cex.axis=0.5, col="rainbow1", suggestiveline=FALSE, genomewideline=FALSE)
dev.off()

In this plot, if we want to show p-values till 1E-10, we can set maxP=10, instead of changing the ylim.

CairoPNG("cmd2a.png", width=10, height=6, units="in", res=300)
fastman(m, cex=0.3, cex.axis=0.5, col="rainbow1", suggestiveline=FALSE, genomewideline=FALSE, maxP=10)
dev.off()

We can now look into the SNPs of a single chromosome.

CairoPNG("cmd3.png", width=10, height=6, units="in", res=300)
fastman(m, chrsubset=3)
dev.off()

Let us say we are interested in highlighting some particular 3876 SNPs in chromosomes 2 and 5. We have the names of the required SNPs in a character vector called snp1.

str(snp1)
chr [1:3876] "rs1317448" "rs66535929" "rs12616526" "rs185377827" "rs28528792" "rs13411576" "rs28473143" ...
CairoPNG("cmd4.png", width=10, height=6, units="in", res=300)
fastman(m, highlight = snp1)
dev.off()

We can also annotate the highlighted SNPs. By default, only the top SNP in every chromosome will be annotated.

CairoPNG("cmd7.png", width=10, height=6, units="in", res=300)
fastman(m, highlight = snp1, annotateHighlight = TRUE)
dev.off()

Let us discuss the annotations in detail. We have seven input parameters with respect to annotations: annotateHighlight, annotatePval, annotateTop, annotationWinMb, annotateN, annotationCol and annotationAngle. We have already discussed about annotateHighlight. Among the rest, annotatePval and annotateN are annotation criteria, while the other four parameters are annotation modifiers.

As stated above, we can annotate our plot in only two ways, either by providing a p-value threshold or by providing the required number of top SNPs. The p-value threshold can be provided using the annotatePval criterion. The user can provide either the p-value or the negative logarithm as an input for this parameter, whichever is convenient. For example, both 1E-5 and 5 are acceptable input values.

CairoPNG("cmd5.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5)
dev.off()

We can see that, among the SNPs that exceed the provided threshold, only the top SNP in every chromosome is annotated. We have observed the same for annotateHighlight as well. We will come to that later, when we discuss the annotation modifiers.

For now, let us try annotating with gene names instead of SNP IDs. This can be accomplished using the geneannotate=TRUE argument. We have to specify gene build for our data as well. We use the same p-value threshold as the previous plot.

CairoPNG("cmd5g.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, geneannotate = TRUE, build=37)
dev.off()

Let us explore the other annotation criterion annotateN. Lets say we want to annotate the top 20 SNPs of the data.

CairoPNG("cmd8.png", width=10, height=6, units="in", res=300)
fastman(m, annotateN=20)
dev.off()

Lets go back to the observation we made while using annotateHighlight and annotatePval. By default, only the top SNP in every chromosome is annotated. We can override this default rule using the annotation modifier annotateTop. This modifier has a default value TRUE unless mentioned otherwise by the user. Lets consider the example where we are annotating the SNPs beyond a specified p-value threshold. By setting annotateTop=FALSE, we can annotate all the SNPs beyond our threshold instead of only the top SNP per chromosome.

CairoPNG("cmd6.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotateTop=FALSE)
dev.off()

Now we move on to the next annotation modifier annotationWinMb. Instead of annotating the top SNP in every chromosome, if we want to annotate the top SNP within our chosen megabase window, then this is the modifier for us. Now, lets specify a 5-megabase window.

CairoPNG("cmd10.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationWinMb=5)
dev.off()

Among the remaining two modifiers, annotationCol sets the annotation colour and annotationAngle specifies the angle of annotation. The default colour is "gray50" and the default angle is 45 degrees. Let us illustrate with an example.

CairoPNG("cmd11.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationCol="red", annotationAngle= 60)
dev.off()

As stated already, the default annotation colour in our function is "gray50". However, if the user chooses the colour palette "greys" for the plot, then the default annotation colour changes to "green4".

CairoPNG("cmd18.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, col = "greys")
dev.off()

We have the option to provide a column indicating the annotation colour intended for each SNP as a part of the input data frame. In that case, we just need to specify the name of the column in the annotationCol argument. Let us show this in the example below. In our example, we have added a separate column named ANNCOL to the input data frame with our intended annotation colours for each SNP. Using the column we have annotated the first 4 chromosomes with green, chromosomes 5-10 with red and the rest with blue as can be seen in the plot below.

CairoPNG("cmd17.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationCol="ANNCOL")
dev.off()

In case we do not want to annotate any particular set of SNPs we can just provide NA values in the annotationCol column for those SNPs. In our example, we have another column ANNCOL2 in our input dataframe that has NA values corresponding to chromosomes 3,4,5 and 6. The rest of the values of ANNCOL are the same as that of the column ANNCOL. We have provided NA values in this column because we do not want to annotate SNPs in chromosomes 3,4,5 and 6. Let us see what the plot looks like when we run the command using this column for annoationCol argument.

CairoPNG("cmd19.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationCol="ANNCOL2")
dev.off()

We can choose to colour only the points above our specified p-value threshold. The rest of the plot will become grey by default.

CairoPNG("cmd12.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, colAbovePval=TRUE)
dev.off()

In the above plot, we can change the colour of the region below our specified threshold as well.

CairoPNG("cmd13.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, colAbovePval=TRUE, col2="greens")
dev.off()

As stated previously, our package also supports plotting of results from genomes of non-model organisms (often with hundreds of contigs or many scaffolds). This is an incremental feature, as the qqman package does not support direct plotting of results from non-model organisms (The chromosome column for the input dataset needs to be numeric for qqman).

m=read.table("plink.assoc.fisher",header=TRUE,stringsAsFactors=FALSE,sep= '')

This dataset has results for 57,712 SNPs on 760 chromosomes. Let us look at the data.

str(m)
'data.frame':	57712 obs. of  9 variables:
 $ CHR: chr  "NC_015762.1" "NC_015762.1" "NC_015762.1" "NC_015762.1" ...
 $ SNP: chr  "1139:88:-" "1139:45:-" "1139:28:-" "2518:86:+" ...
 $ BP : int  32821 32864 32881 58664 68491 83326 83332 92777 92937 92945 ...
 $ A1 : chr  "T" "C" "G" "T" ...
 $ F_A: num  0 0.5217 0 0.0222 0.5217 ...
 $ F_U: num  0.0454 0.375 0.0454 0.0465 0.3636 ...
 $ A2 : chr  "C" "T" "A" "C" ...
 $ P  : num  0.0551 0.0528 0.0551 0.436 0.0366 ...
 $ OR : num  0 1.818 0 0.466 1.909 ...
head(m)
          CHR       SNP    BP A1     F_A     F_U A2       P     OR
1 NC_015762.1 1139:88:- 32821  T 0.00000 0.04545  C 0.05513 0.0000
2 NC_015762.1 1139:45:- 32864  C 0.52170 0.37500  T 0.05276 1.8180
3 NC_015762.1 1139:28:- 32881  G 0.00000 0.04545  A 0.05513 0.0000
4 NC_015762.1 2518:86:+ 58664  T 0.02222 0.04651  C 0.43600 0.4659
5 NC_015762.1 3702:96:- 68491  C 0.52170 0.36360  T 0.03664 1.9090
6 NC_015762.1 4414:33:- 83326  A 0.50000 0.31820  G 0.01556 2.1430
tail(m)
                 CHR           SNP  BP A1     F_A     F_U A2      P     OR
57707 NW_003570988.1 18013690:37:- 547  G 0.06818 0.08333  C 0.7782 0.8049
57708 NW_003570988.1 18013681:25:- 550  A 0.00000 0.01190  G 0.4941 0.0000
57709 NW_003570988.1 18013687:31:- 550  A 0.01282 0.01389  G 1.0000 0.9221
57710 NW_003570988.1  18013670:8:- 556  T 0.12820 0.19230  A 0.3830 0.6176
57711 NW_003570988.1  18013666:3:- 557  A 0.05128 0.08571  G 0.5178 0.5766
57712 NW_003570988.1 18013689:26:- 557  A 0.02632 0.01282  G 0.6176 2.0810

The dataset contains P-values (column P) and two scores (F_A and F_U). From the dataset, lets generate a basic Manhattan plot for P-values.

CairoPNG("cmd14.png", width=10, height=6, units="in", res=300)
fastman(m)
dev.off()

Now let us plot the Manhattan plots for one of the scores. We must note that the scores do not need log transformations and so we must specify that while running.

CairoPNG("cmd15.png", width=10, height=6, units="in", res=300)
fastman(m, p = "F_A", logp = FALSE)
dev.off()

The colour schemes available for this package are provided below.

About

An R package for fast and efficient visualizing of GWAS results using Q-Q and Manhattan plots directly from PLINK output files.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published