Skip to content

Commit

Permalink
fix typos
Browse files Browse the repository at this point in the history
  • Loading branch information
DominikRafacz committed Sep 26, 2024
1 parent 740acbf commit 355bfc1
Show file tree
Hide file tree
Showing 2 changed files with 152 additions and 152 deletions.
4 changes: 2 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,8 @@
* `write_fasta()` and `find_motifs()` accept `data.frame` arguments now; sequences and their names are taken from specified two columns
* more descriptive error messages for non-existing generics that print out classes of the first parameter

## Fixed-ish:
* return to autoexported `Rcpp` catch declaration
## Fixed:
* return to automatically exported `Rcpp` catch declaration

## Quality of code stuff:
* added tests and adjusted vignettes for the changes
Expand Down
300 changes: 150 additions & 150 deletions vignettes/quick-start.Rmd
Original file line number Diff line number Diff line change
@@ -1,150 +1,150 @@
---
title: "Quick Start"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Quick Start}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

`tidysq` package is meant to store and conduct operations on biological sequences. This vignette provides a guide to basic usage of `tidysq`, i.e. reading, manipulating and writing sequences to file.

The most recent version of `tidysq` can be installed with `install_github()` function from `devtools`.

```{r setup}
# devtools::install_github("BioGenies/tidysq")
library(tidysq)
```

## Sequence creation

Biological sequences can be and often are represented as strings -- sequences of letters. For example, a DNA sequence can take the form of `"TAGGCCCTAGACCTG"`, where `A` means adenine, `C` -- cytosine, `G` -- guanine and `T` -- thymine. Exact IUPAC recommendations for one-letter codes can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC341218/).

Within `tidysq` package sequence data is stored in `sq` objects, that is, vectors of biological sequences. They can be created from string vectors as above:

```{r sq_from_string}
sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))
sq_dna
```

There are several thing to note. First, each sequence is an element of `sq` object. Many operations are vectorized --- they are applied to all sequences of a vector --- and `sq` objects are no different in this regard. Second, the first line of output says: `basic DNA sequences list`. This means that all sequences of this object are of DNA type and do not use ambiguous letters (more about that in "Advanced alphabet techniques" vignette).

## Subsetting sequences

Manipulating sequence objects is an integral part of `tidysq`. `sq` objects can be easily subsetted using usual R syntax:

```{r sq_subset}
sq_dna[1]
```

Extracting subsequences is a bit more complicated than that --- because it uses designated function `bite()`. Its syntax, however, closely resembles that of base R --- indexing starts with one and negative indices are interpreted as "anything except that". It returns an `sq` object with all sequences subsetted:

```{r sq_bite}
bite(sq_dna, 5:10)
bite(sq_dna, c(-9, -11, -13))
```

It's possible to reverse sequences using this function:

```{r sq_bite_reversing}
# Don't do it like that!
bite(sq_dna, 15:1)
```

However, this usage is strongly discouraged, because it's both ineffective and works badly with sequences of different lengths. Instead, there is a designated function `reverse()`:

```{r sq_reverse}
reverse(sq_dna)
```

Note that it is very different to base `rev()`, which reverses only the order of sequences, not letters:

```{r sq_rev}
rev(sq_dna)
```

We can combine two or more `sq` objects using base `c()` function:

```{r sq_c}
sq_dna <- c(sq_dna, reverse(sq_dna))
sq_dna
```

## Biological interpretation

`tidysq` offers two functions specific to DNA/RNA sequences, namely `complement()` and `translate()`. The former creates sequences with complementary bases, that is, replaces `A` with `T`, `C` with `G` and *vice versa*. The latter translates input to amino acid sequences using [the translation table with three-letter codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables).

These functions can be called as shown below:

```{r sq_complement_translate}
complement(sq_dna)
translate(sq_dna)
```

One noteworthy feature here is that translation can be done with any genetic code table of those listed [on this Wikipedia page](https://en.wikipedia.org/wiki/List_of_genetic_codes):

```{r sq_translate_other_table}
translate(sq_dna, table = 6)
```

## Finding motifs

Motifs are short subsequences. These are often searched for in biological sequences. `tidysq` has two distinct functions that allow the user to perform such search.

One of them is a `%has%` operator that takes `sq` object and character vector as parameters respectively. It returns a logical vector of the same length as `sq` object, where each element says whether all motifs passed as strings were found in given sequence:

```{r sq_has}
sq_dna %has% "ATC"
# It can be used to subset sq
sq_dna[sq_dna %has% c("AG", "CC")]
```

It says nothing about motif placement within sequence nor it exact form, however. In this case, there is `find_motifs()` function that returns a whole `tibble` (from `tibble` package; basically improved version of `data.frame`) with various info about found motifs. Important thing to note here is that the second argument is a character vector of sequence names to avoid embedding potentially long sequences in resulting `tibble` potentially many times:

```{r sq_find_motifs}
find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG"))
```

You can also provide this function with a `data.frame` (or, what we recommend, `tibble`) containing one column called `sq`, containing the sequences and the other colum `name` containing the names.

```{r sqibble_find_motifs}
sqibble <- tibble::tibble(sq = sq_dna,
name = c("seq1", "seq2", "rev1", "rev2"))
# does the same as the call from previous chunk of code
find_motifs(sqibble, c("ATC", "TAG"))
```

There are ambiguous DNA bases in IUPAC codes and these can be used in motifs. One of them is `"N"` --- its meaning is "any of `A`, `C`, `G` or `T`:

```{r sq_find_motifs_amb}
find_motifs(sqibble, "GNCC")
```

This example displays the difference between `"sought"` and `"found"` columns. The former contains the string representation of motif that the user was looking for, while the latter contains a `tidysq`-encoded sequence with an "instance" of motif.

Two additional characters are reserved because of their special meaning in motifs. `"^"` means that this motif must be found at the start of a sequence, while `"$"` means the same, but with the end instead. They can be mixed with ambiguous letters, of course:

```{r sq_find_motifs_start_end}
find_motifs(sqibble, c("^TAG", "ATN$"))
```

## Exporting sq objects

After doing computations the user might wish to save their sequences for future use. One of the most popular formats for storing biological sequences is FASTA. `tidysq` allows the user to write sequences to FASTA file with `write_fasta()` function. Important thing to remember here that the arguments for the function are analogous to those used in `find_motifs()` -- either `sq` object and a vector of names or a `tibble` with columns of sequences and names:

```{r write_fasta, eval=FALSE}
write_fasta(sq_dna,
c("seq1", "seq2", "rev1", "rev2"),
"just_your_ordinary_fasta_file.fasta")
# or
write_fasta(sqibble,
"just_your_ordinary_fasta_file.fasta")
```
---
title: "Quick Start"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Quick Start}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

`tidysq` package is meant to store and conduct operations on biological sequences. This vignette provides a guide to basic usage of `tidysq`, i.e. reading, manipulating and writing sequences to file.

The most recent version of `tidysq` can be installed with `install_github()` function from `devtools`.

```{r setup}
# devtools::install_github("BioGenies/tidysq")
library(tidysq)
```

## Sequence creation

Biological sequences can be and often are represented as strings -- sequences of letters. For example, a DNA sequence can take the form of `"TAGGCCCTAGACCTG"`, where `A` means adenine, `C` -- cytosine, `G` -- guanine and `T` -- thymine. Exact IUPAC recommendations for one-letter codes can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC341218/).

Within `tidysq` package sequence data is stored in `sq` objects, that is, vectors of biological sequences. They can be created from string vectors as above:

```{r sq_from_string}
sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))
sq_dna
```

There are several thing to note. First, each sequence is an element of `sq` object. Many operations are vectorized --- they are applied to all sequences of a vector --- and `sq` objects are no different in this regard. Second, the first line of output says: `basic DNA sequences list`. This means that all sequences of this object are of DNA type and do not use ambiguous letters (more about that in "Advanced alphabet techniques" vignette).

## Subsetting sequences

Manipulating sequence objects is an integral part of `tidysq`. `sq` objects can be easily subsetted using usual R syntax:

```{r sq_subset}
sq_dna[1]
```

Extracting subsequences is a bit more complicated than that --- because it uses designated function `bite()`. Its syntax, however, closely resembles that of base R --- indexing starts with one and negative indices are interpreted as "anything except that". It returns an `sq` object with all sequences subsetted:

```{r sq_bite}
bite(sq_dna, 5:10)
bite(sq_dna, c(-9, -11, -13))
```

It's possible to reverse sequences using this function:

```{r sq_bite_reversing}
# Don't do it like that!
bite(sq_dna, 15:1)
```

However, this usage is strongly discouraged, because it's both ineffective and works badly with sequences of different lengths. Instead, there is a designated function `reverse()`:

```{r sq_reverse}
reverse(sq_dna)
```

Note that it is very different to base `rev()`, which reverses only the order of sequences, not letters:

```{r sq_rev}
rev(sq_dna)
```

We can combine two or more `sq` objects using base `c()` function:

```{r sq_c}
sq_dna <- c(sq_dna, reverse(sq_dna))
sq_dna
```

## Biological interpretation

`tidysq` offers two functions specific to DNA/RNA sequences, namely `complement()` and `translate()`. The former creates sequences with complementary bases, that is, replaces `A` with `T`, `C` with `G` and *vice versa*. The latter translates input to amino acid sequences using [the translation table with three-letter codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables).

These functions can be called as shown below:

```{r sq_complement_translate}
complement(sq_dna)
translate(sq_dna)
```

One noteworthy feature here is that translation can be done with any genetic code table of those listed [on this Wikipedia page](https://en.wikipedia.org/wiki/List_of_genetic_codes):

```{r sq_translate_other_table}
translate(sq_dna, table = 6)
```

## Finding motifs

Motifs are short subsequences. These are often searched for in biological sequences. `tidysq` has two distinct functions that allow the user to perform such search.

One of them is a `%has%` operator that takes `sq` object and character vector as parameters respectively. It returns a logical vector of the same length as `sq` object, where each element says whether all motifs passed as strings were found in given sequence:

```{r sq_has}
sq_dna %has% "ATC"
# It can be used to subset sq
sq_dna[sq_dna %has% c("AG", "CC")]
```

It says nothing about motif placement within sequence nor it exact form, however. In this case, there is `find_motifs()` function that returns a whole `tibble` (from `tibble` package; basically improved version of `data.frame`) with various info about found motifs. Important thing to note here is that the second argument is a character vector of sequence names to avoid embedding potentially long sequences in resulting `tibble` potentially many times:

```{r sq_find_motifs}
find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG"))
```

You can also provide this function with a `data.frame` (or, what we recommend, `tibble`) containing one column called `sq`, containing the sequences and the other column `name` containing the names.

```{r sqibble_find_motifs}
sqibble <- tibble::tibble(sq = sq_dna,
name = c("seq1", "seq2", "rev1", "rev2"))
# does the same as the call from previous chunk of code
find_motifs(sqibble, c("ATC", "TAG"))
```

There are ambiguous DNA bases in IUPAC codes and these can be used in motifs. One of them is `"N"` --- its meaning is "any of `A`, `C`, `G` or `T`:

```{r sq_find_motifs_amb}
find_motifs(sqibble, "GNCC")
```

This example displays the difference between `"sought"` and `"found"` columns. The former contains the string representation of motif that the user was looking for, while the latter contains a `tidysq`-encoded sequence with an "instance" of motif.

Two additional characters are reserved because of their special meaning in motifs. `"^"` means that this motif must be found at the start of a sequence, while `"$"` means the same, but with the end instead. They can be mixed with ambiguous letters, of course:

```{r sq_find_motifs_start_end}
find_motifs(sqibble, c("^TAG", "ATN$"))
```

## Exporting sq objects

After doing computations the user might wish to save their sequences for future use. One of the most popular formats for storing biological sequences is FASTA. `tidysq` allows the user to write sequences to FASTA file with `write_fasta()` function. Important thing to remember here that the arguments for the function are analogous to those used in `find_motifs()` -- either `sq` object and a vector of names or a `tibble` with columns of sequences and names:

```{r write_fasta, eval=FALSE}
write_fasta(sq_dna,
c("seq1", "seq2", "rev1", "rev2"),
"just_your_ordinary_fasta_file.fasta")
# or
write_fasta(sqibble,
"just_your_ordinary_fasta_file.fasta")
```

0 comments on commit 355bfc1

Please sign in to comment.