-
Notifications
You must be signed in to change notification settings - Fork 3
/
README.Rmd
182 lines (133 loc) · 6.46 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
---
title: "gglogo"
author: "Eric Hare, Heike Hofmann"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.path = "man/figures/")
```
R package for creating sequence logo plots
<!-- badges: start -->
[![CRAN Status](http://www.r-pkg.org/badges/version/gglogo)](https://cran.r-project.org/package=gglogo) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/gglogo?color=blue)](https://r-pkg.org/pkg/gglogo)
[![Last-changedate](https://img.shields.io/badge/last%20change-`r gsub('-', '--', Sys.Date())`-yellowgreen.svg)](https://github.com/heike/gglogo/commits/main)
[![R-CMD-check](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/heike/gglogo/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/heike/gglogo/branch/main/graph/badge.svg)](https://app.codecov.io/gh/heike/gglogo?branch=main)
<!-- badges: end -->
# Installation
`gglogo` is available from CRAN (version 0.1.4):
```{r, eval = FALSE}
install.packages("gglogo")
```
The development version is available from Github (0.1.9000):
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("heike/gglogo", build_vignettes = TRUE)
```
# Getting Started
Load the library
```{r}
library(gglogo)
```
Load a dataset containing a set of peptide sequences
```{r, message = FALSE}
data(sequences)
head(sequences)
```
Now plot the sequences in a(n almost) traditional sequence plot, the Shannon information is shown on the y axis.
```{r warning = FALSE, message=FALSE}
library(ggplot2)
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position = "classic") +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```
(Sequence) Logo plots ([Schneider & Stephens 1990](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316)) are typically used in bioinformatics as a way to visually demonstrate how well a sequence of nucleotides or amino acids are preserved in a certain region.
A cognitively better version of the plot is the default, i.e. without specifying the `position` parameter, the plot defaults to aligning the largest contributor in each position along the y axis and showing all other variants in each position by a tail hanging below the axis. Longer tails indicate more variability in a position.
```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```
## Other variants
Besides the Shannon information, we could also visualize the frequencies of peptides in each position. We can either set `method = frequency`, or calculate the (relative) frequency information ourselves as:
```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = freq/total, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```
Using the classic variant of alignment results in a stacked barchart of amino acids by position:
```{r}
ggplot(data = ggfortify(sequences, peptide, method="shannon")) +
geom_logo(aes(x = position, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position="classic") +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom")
```
## Implementation details
This implementation of sequence logos is a two-step process of data prepping/wrangling followed by the visualization.
The data prepping happens in the function `ggfortify`:
```{r message = FALSE}
library(dplyr)
seq_info <- sequences %>% # data pipeline for processing
ggfortify(
peptide, # variable in which the sequences are stored
treatment = class,
method = "shannon",
missing_encode = c(".", "*", NA)
)
```
`sequences` specifies the variable of the sequences in the data set, `treatment` is a (list) of grouping variables for which the (Shannon) information will be calculated in each position. For peptide sequences, the data set `aacids` is used to provide additional information on properties.
```{r}
head(seq_info)
```
By specifying the `treatment` parameter, the corresponding information methods are now calculated for treatments as well, and we can assess the variability/conservation of the sequence by the treatment:
```{r}
seq_info %>%
ggplot() +
geom_logo(aes(x = class, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom") +
facet_wrap(~position, ncol = 12)
```
## Available alphabets
By default, the font used for logo plots is Helvetica, available as dataset `alphabet`. Each letter is implemented in form of a polygon with `x` and `y` coordinates. The variable `group` contains the corresponding letter.
```{r message = FALSE}
alphabet %>%
filter(group == "B") %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1)
```
Besides the default alphabet, the fonts Comic Sans, xkcd, and braille (for 3d printing) are implemented:
```{r}
alphabet_comic %>%
filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("Comic Sans")
alphabet_xkcd %>%
dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("xkcd font")
alphabet_braille %>%
dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
ggplot(aes(x = x, y = y)) + geom_polygon() +
theme(aspect.ratio = 1) + facet_wrap(~group, ncol = 11) +
ggtitle("Braille (use in 3d prints)")
```
## References
Schneider, TD, Stephens, RM (1990). [Sequence logos: a new way to display consensus sequences](https://academic.oup.com/nar/article-abstract/18/20/6097/1141316). Nucleic Acids Res, 18, 20:6097-100.