
Text files

Andrea Telatin edited this page Oct 5, 2020 · 5 revisions

Commands review

  • cat, head, tail, wc
  • grep
  • cut, sort, uniq
  • advanced: sed, awk

Using nano

Using nano, create a text file called ~/myself.tsv containing the following data, separated by tabs:

  • Surname
  • Name
  • E-mail
  • Research area
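
Depending on its configuration, nano may write spaces instead of tabs, so it can be worth checking the result. The sketch below is one possible check. It writes a record with placeholder values to a temporary file (not to ~/myself.tsv) and counts the tab-separated fields on each line:

```shell
# Write a header and one record to a temporary file (placeholder values).
tmpfile=$(mktemp)
printf 'Surname\tName\tE-mail\tResearch area\n' > "$tmpfile"
printf 'Doe\tJane\tjane.doe@example.org\tMicrobiology\n' >> "$tmpfile"

# Count the fields on each line, splitting on tabs only,
# then collapse identical counts into one value.
fields=$(awk -F '\t' '{ print NF }' "$tmpfile" | sort -u)
echo "Fields per line: $fields"
```

If every line reports the same number of fields as the number of items in the list, the file is tab-separated as intended.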

FASTA Format

There is no single file extension for FASTA files. The most common and generic are “.fasta” and “.fa”; more specific ones are “.faa” for protein (amino acid) sequences and “.fna” for nucleic acids.

Let's start by listing all the files ending with .fna in our home directory:

find ~ -name "*.fna"

How many sequences are in those files? Last time we counted them by first selecting the lines containing '>':

grep  '>' ~/learn_bash/phage/vir_cds_from_genomic.fna

We then passed the output of grep to the wc command through a pipe. That was mainly an excuse to practise pipes: grep has its own switch (-c, for count) for this task:

# Counting sequences in a file:
grep  '>' ~/learn_bash/phage/vir_cds_from_genomic.fna | wc -l 
 
# Counting sequences in multiple files using wildcards:
grep -c  '>' ~/learn_bash/phage/*.fna
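
The same idea can be tried on a toy file without touching the course data. The sketch below builds a made-up FASTA file in a temporary directory. It anchors the pattern with ^ so that only a '>' at the start of a line is counted, a slightly stricter variant of the command above:

```shell
# Build a tiny FASTA file (made-up sequences) in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/toy.fna" <<'EOF'
>seq1 first toy sequence
ACGTACGTAC
GTTTAACC
>seq2 second toy sequence
GGGCCCAAAT
>seq3 third toy sequence
TTTT
EOF

# Count header lines: ^ matches '>' only at the start of a line.
n_seqs=$(grep -c '^>' "$workdir/toy.fna")
echo "Sequences: $n_seqs"

# Print only the sequence names (first word of each header, without '>').
names=$(grep '^>' "$workdir/toy.fna" | cut -d ' ' -f 1 | tr -d '>')
echo "$names"
```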

FASTQ Format

The FASTQ format devotes four lines to each sequence: a header starting with @, the sequence itself, a separator line starting with +, and an encoded version of the quality score for each nucleotide. There are some compressed FASTQ files in the ~/learn_bash/files/ directory. Let's have a look:

# List the (compressed) FASTQ files in a specific directory of this repository
ls -l ~/learn_bash/files/*.fastq.gz

# Decompress them
gunzip ~/learn_bash/files/*.fastq.gz

# Display the first two reads
head -n 8 ~/learn_bash/files/Sample1_R1.fastq

How many reads? We can count the lines and then divide by 4!

wc -l ~/learn_bash/files/*.fastq
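
The division can be done by the shell itself. The sketch below uses a made-up two-read FASTQ file in a temporary directory. Counting lines is safer than counting '@' characters, because a quality line may also start with '@':

```shell
# Build a tiny FASTQ file (made-up reads) in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/toy.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
GGGTTTAA
+
IIIIHHHH
EOF

# Redirecting the file into wc prints the number alone, without the file name.
n_lines=$(wc -l < "$workdir/toy.fastq")

# Shell arithmetic: four lines per read.
n_reads=$(( n_lines / 4 ))
echo "Reads: $n_reads"
```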

Or we can use a specific bioinformatics tool: seqkit. If we don't have it installed we can use Miniconda:

# Install seqkit, if it's not installed
conda install -y -c bioconda seqkit

Use the stats subcommand to count the reads (and to get more details on their lengths):

# Count reads
seqkit stats ~/learn_bash/files/*.fastq

GFF: Annotation

The GFF (General Feature Format) is used to store annotations. An alternative format, called GTF, is more focused on gene annotations, while GFF is more generic. Both are TSV (tab-separated values) files, that is, tables where the cells of a row are separated by a single tab character.

The first lines optionally specify some metadata, and they are preceded by a #.

Let's see an example:

less -S ~/learn_bash/phage/vir_genomic.gff
 
# If we want to remove the header lines:
grep -v '^#' ~/learn_bash/phage/vir_genomic.gff | less -S 
 
# If we want to widen the tab stops (here to 15 characters):
grep -v '^#' ~/learn_bash/phage/vir_genomic.gff | less -S -x 15

To extract all the CDS lines, and then only those that also contain the word capsid:

grep -w CDS ~/learn_bash/phage/vir_genomic.gff
 
grep -w CDS ~/learn_bash/phage/vir_genomic.gff | grep -i capsid

A useful command to extract some columns from a text file is cut:

cut -f 1,3-5 ~/learn_bash/phage/vir_genomic.gff
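
Combining cut with grep, sort and uniq gives a quick summary of an annotation. The sketch below runs on a made-up GFF file written to a temporary directory, so the feature names and coordinates are placeholders. The header lines are removed first because they contain no tabs, and cut prints such lines unchanged:

```shell
workdir=$(mktemp -d)
gff="$workdir/toy.gff"

# Made-up annotation: two header lines, then 9 tab-separated columns per feature.
printf '##gff-version 3\n' > "$gff"
printf '#!made-up example\n' >> "$gff"
printf 'chr1\ttoy\tgene\t10\t90\t.\t+\t.\tID=gene1\n' >> "$gff"
printf 'chr1\ttoy\tCDS\t10\t90\t.\t+\t0\tID=cds1;product=capsid protein\n' >> "$gff"
printf 'chr1\ttoy\tgene\t200\t400\t.\t-\t.\tID=gene2\n' >> "$gff"
printf 'chr1\ttoy\tCDS\t200\t400\t.\t-\t0\tID=cds2;product=tail fiber\n' >> "$gff"

# Drop header lines, keep column 3 (feature type), count each distinct type.
counts=$(grep -v '^#' "$gff" | cut -f 3 | sort | uniq -c)
echo "$counts"

# Sequence name, feature type, start and end of the CDS lines only.
cds_coords=$(grep -v '^#' "$gff" | cut -f 1,3-5 | grep -w CDS)
echo "$cds_coords"
```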

Other TSV

GFF and GTF, but also SAM and VCF, are examples of tabular text files: they all store tab-separated values. A smaller example will be easier to deal with:

# Try using a relative path!
cat ~/learn_bash/files/wine.tsv

If we want to sort by username, which is the third column of the file:

sort -k 3 ~/learn_bash/files/wine.tsv

Sometimes we need to increase the space used by tabs to have a clearer view:

sort -k 3 /homes/2020/binf/data/people.tsv | less -S -x 20
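
By default sort separates fields at blanks, so a cell containing a space counts as two fields, and -k 3 uses everything from the third field to the end of the line as the key. When cells may contain spaces, it can be safer to set the separator explicitly. The sketch below uses a made-up table in a temporary directory, so the names and columns are placeholders:

```shell
workdir=$(mktemp -d)
tsv="$workdir/toy.tsv"

# Made-up table: full name, city, username (column 3).
printf 'Ada Lovelace\tLondon\tzeta\n' > "$tsv"
printf 'Alan Turing\tManchester\talpha\n' >> "$tsv"
printf 'Grace Hopper\tNew York\tmike\n' >> "$tsv"

# A literal tab, stored in a variable so it can be passed to -t.
tab=$(printf '\t')

# -t sets the field separator; -k 3,3 restricts the sort key to column 3 alone.
sorted=$(sort -t "$tab" -k 3,3 "$tsv")
echo "$sorted"
```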

Extra topics

Create a text file in your home directory called ~/reads.fasta and fill it with a few substrings taken from ~/learn_bash/phage/vir_genomic.fna.

You can extract a substring of at least 20 chars from anywhere. You can add errors (i.e. change some letters), small deletions or small insertions…
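
One possible way to take a substring from the command line, instead of copying it by hand, is sketched below on a made-up sequence in a temporary directory. The positions 11 to 30 and the header name read1 are arbitrary choices:

```shell
# Build a tiny FASTA file (made-up sequence) in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/toy.fna" <<'EOF'
>toy made-up genome
ACGTACGTACGGATTACAGG
TTTTCCCCAAAAGGGGACGT
EOF

# Join the sequence lines into one string, then take characters 11 to 30.
fragment=$(grep -v '^>' "$workdir/toy.fna" | tr -d '\n' | cut -c 11-30)

# Write the fragment as a FASTA record with a header of our choosing.
printf '>read1\n%s\n' "$fragment" > "$workdir/reads.fasta"
cat "$workdir/reads.fasta"
```

Errors, deletions and insertions can then be added by editing the resulting file with nano.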

Decompressing archives

In your home directory there should be a couple of archives, in two very popular formats: “zip” and “tar.gz”. They are in your examples/archives/ directory.

Unzipping is done by:

unzip FILENAME

while for tar archives:

tar xvfz FILENAME

The “switches” here are:

  • x, to eXtract
  • v, for Verbose reporting (print files as they are extracted). Don't add it if you are not interested in the list
  • f, extract from a File (sounds crazy)
  • z, the tar archive is also compressed with gzip. Don't add it if the archive is .tar and not .tar.gz
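
To see the switches in action without relying on the course archives, the sketch below creates a small tar.gz archive in a temporary directory, lists its contents and extracts it somewhere else. The file names are made up. The c (Create), t (lisT) and -C (change directory) options are not covered above, but they are standard tar options:

```shell
# Create a small directory with two made-up files.
workdir=$(mktemp -d)
mkdir "$workdir/project"
echo "hello" > "$workdir/project/a.txt"
echo "world" > "$workdir/project/b.txt"

# c = Create, z = compress with gzip, f = write to this File;
# -C changes into the given directory before adding files.
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# t = lisT the contents without extracting anything.
listing=$(tar -tzf "$workdir/project.tar.gz")
echo "$listing"

# Extract into a separate directory to leave the originals untouched.
mkdir "$workdir/out"
tar -xzf "$workdir/project.tar.gz" -C "$workdir/out"
extracted=$(cat "$workdir/out/project/a.txt")
echo "$extracted"
```

Listing an archive before extracting it is a cheap way to check where its files will end up.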
