Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updated Part3 #5

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
149 changes: 149 additions & 0 deletions Degnan_deduper_Pt1.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
---
title: "Deduper Part One"
author: "David Degnan"
date: "October 19th, 2018"
output:
pdf_document: default
---

```{r setup, include=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

<br/ >

## 1. Define the problem

Illumina sequencing libaries contain duplicates due to PCR biases, as certain
sequences are more likely to be amplified than others. This creates a problem
during downstream analyses (differential expression, for example) as certain
reads will be over-represented. These PCR duplicates need to be removed.

<br/ >

## 2. Write examples

Examples have been moved to the end of the file for R Markdown readability.

<br/ >

## 3. Develop your algorithm using psuedocode

*Step One:* The files will undergo Samtools sort and be organized by chromosome and position.

*Step Two:* A 30 line "sliding window" will be generated and stored as a dictionary (key will be line number, and the line itself will be the value, split with the python .split() function). Reads will either be written to a discard file (Step Three) or written to a new file (Step Four). This will continue until the end of the file is reached.

*Step Three:* 30 lines will be read at a time. A line will be written to a discard file (with "_discarded" at the end of the file name before the extension) if:
* If it contains any N's in the UMI.
* The chromosome number, start position, and UMI are the same as any of the following 30 lines.
* Note that start positions will be adjusted according to soft clipping, insertion, and deletion. This information will be taken from the cigar string.
* The bitwise flag will keep track of which strand it's on. If it's only on a different strand once, it is not a duplicate and will not be written to the _discarded file.

*Step Four:* If no conflicts are found, the reads will be written to an output file that will contain the name of the input file with a "_deduped" at the end of it (before the extension).
*Step Five:* The sliding window will then slide down to the next read. In other words, the first entry in the dictionary will become the 31st line of the file (this can be tracked with a modulus, ie 31 % 30 = 1). The second entry in the dictionary will then be compared to all the other lines, testing for any conflicts (Step Three) and if none found, being written to the "_deduped" file (Step Four). This continues for each line of the file until the end is reached.

<br/ >

## 4. Determine high level functions

<br/ >

### STEP ONE: Samtools Sort
Use samtools to sort the file
```{r, eval = FALSE, engine = 'bash'}
purge the modules
load samtools
samtools sort input file
convert sorted input file to a sam file
```

<br/ >

### STEP TWO: Fill the sliding window

Open input file, and generate the two output files
```{r, engine = 'bash', eval = FALSE}
open all three of the files, allowing an argparse for the input file
```

Initialize sliding window and read in the first thirty lines
```{r, engine = 'bash', eval = FALSE}
intialize an open window
for i in range(30):
entry number is key and line is file
```

<br/ >

### STEP THREE/FOUR: Start testing

UMIs with any N's will need to be tossed
```{r, engine = 'bash', eval = FALSE}
define test UMI as a function that
returns true if the UMI has any Ns
returns false if the UMI does not
TEST: UMI with an N and an UMI without
```

Bitwise flag checker. If line passes, this function adds an R to UMI. A singular R will be written to deduped, and multiple will be written to duplicate.
```{r, engine = 'bash', eval = FALSE}
if the 5th bit is on:
then the read is on the reverse strand
add an R to the UMI

TEST: With 5th bit set and without
```

Adjust start position based on soft clipping
```{r, engine = 'bash', eval = FALSE}
given the number preceeding the s:
add to the position number whatever preceeded the s

TEST: given an S adjust the position
```

If necessary, adjust for insertions or deletions.
```{r, engine = 'bash', eval = FALSE}
for deletion:
add to the position number
for insertion:
subtract from the position number

TEST: given an I or a D, adjust the position
```

Given chromosome, start position, and UMI, determine if they all match any other sequence
```{r, engine = 'bash', eval = FALSE}
if chromosome matches any in sliding window
and start position matches the same line in the sliding window
and UMI matches the same line in the sliding window:
write file to duplicate
else:
write file to deduped

TEST: give matches and ones that do not match
```

<br/ >

### STEP FIVE: Read in next line
```{r, engine = 'bash', eval = FALSE}
position in window = line number % 30
save line into that position as value
```

<br/ >

## 2. Examples
```{r, echo = FALSE, message=FALSE, warning=FALSE}
input <- read.table("./Input.txt")
output1 <- read.table("./Output_deduped.txt")
output2 <- read.table("./Output_duplicate.txt")
labels <- c("QNAME", "FLAG", "NAME", "POS", "MAPQ", "CIG", "R", "P", "LEN", "SEQ", "QUAL")
colnames(input) <- labels
colnames(output1) <- labels
colnames(output2) <- labels
knitr::kable(input, caption = "Input Sam File", row.names = FALSE)
knitr::kable(output1, caption = "Output Deduped File", row.names = FALSE)
knitr::kable(output2, caption = "Output Duplicate File", row.names = FALSE)
```
Binary file added Degnan_deduper_Pt1.pdf
Binary file not shown.
173 changes: 173 additions & 0 deletions degnan_deduper.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
#!/usr/bin/python

import argparse
import sys

# Declare tuple of UMIs
UMIList = ("AACGCCAT","AAGGTACG","AATTCCGG","ACACAGAG","ACACTCAG","ACACTGTG",
"ACAGGACA","ACCTGTAG","ACGAAGGT","ACGACTTG","ACGTCAAC","ACGTCATG",
"ACTGTCAG","ACTGTGAC","AGACACTC","AGAGGAGA","AGCATCGT","AGCATGGA",
"AGCTACCA","AGCTCTAG","AGGACAAC","AGGACATG","AGGTTGCT","AGTCGAGA",
"AGTGCTGT","ATAAGCGG","ATCCATGG","ATCGAACC","ATCGCGTA","ATCGTTGG",
"CAACGATC","CAACGTTG","CAACTGGT","CAAGTCGT","CACACACA","CAGTACTG",
"CATCAGCA","CATCGTTC","CCAAGGTT","CCTAGCTT","CGATTACG","CGCCTATT",
"CGTTCCAT","CGTTGGAT","CTACGTTC","CTACTCGT","CTAGAGGA","CTAGGAAG",
"CTAGGTAC","CTCAGTCT","CTGACTGA","CTGAGTGT","CTGATGTG","CTGTTCAC",
"CTTCGTTG","GAACAGGT","GAAGACCA","GAAGTGCA","GACATGAG","GAGAAGAG",
"GAGAAGTC","GATCCTAG","GATGTCGT","GCCGATAT","GCCGATTA","GCGGTATT",
"GGAATTGG","GGATAACG","GGCCTAAT","GGCGTATT","GTCTTGTC","GTGATGAG",
"GTGATGTC","GTGTACTG","GTGTAGTC","GTTCACCT","GTTCTGCT","GTTGTCGA",
"TACGAACC","TAGCAAGG","TAGCTAGC","TAGGTTCG","TATAGCGC","TCAGGACT",
"TCCACATC","TCGACTTC","TCGTAGGT","TCGTCATC","TGAGACTC","TGAGAGTG",
"TGAGTGAG","TGCTTGGA","TGGAGTAG","TGTGTGTG","TTCGCCTA","TTCGTTCG")

# List of chromosomes encountered
chrs = []

def get_arguments():
'''Takes one input file'''
parser = argparse.ArgumentParser(description = "Deduper Inputs")
parser.add_argument("-f", "--file", help= "Designates the Input File", required=True, type=str)
parser.add_argument("-p", "--paired", help= "(Unsupported) Designates if paired end", required=False, type=str)
parser.add_argument("-u", "--umi", help= "(Unsupported) Designates the UMI list", required=False, type=str)
return parser.parse_args()

def TestUMI(UMI):
'''Tests if UMI contains 8nt, no N's, and is in list of 96 UMIs'''
if "N" not in UMI and len(UMI) == 8 and UMI in UMIList:
return True
else:
return False

def ForOrRev(FLAG):
'''Determines from FLAG if the read is on the forward strand (True) or reverse strand (False)'''
FLAG = int(FLAG)
Forward = True
if (FLAG & 16 == 16):
Forward = False
return Forward

def AdjustPosFor(CIGAR, POS):
'''Adjust the position of the alignment based on the cigar for the forward strand'''

# NOTE: No I, D, or N will be reached before the first M
# This keeps track of when the M is reached when reading through the cigar string
M_Reached = False
POS = int(POS)
for i in range(len(CIGAR)):
if M_Reached == True:
break
elif CIGAR[i] == "M":
M_Reached = True
elif CIGAR[i] == "S":
POS -= int(CIGAR[:i])
elif CIGAR[i] == "H":
POS -= int(CIGAR[:i])

return POS

def AddSpaces(CIGAR):
'''Add spaces after characters to allow for splitting'''
# Determine if each character is present
CIGAR = CIGAR.replace("M", "M ")
CIGAR = CIGAR.replace("I", "I ")
CIGAR = CIGAR.replace("N", "N ")
CIGAR = CIGAR.replace("D", "D ")
CIGAR = CIGAR.replace("S", "S ")
CIGAR = CIGAR.replace("H", "H ")
return CIGAR

def AdjustPosRev(CIGAR, POS):
'''Adjust the position of the alignment based on the cigar for the reverse strand'''

# This keeps track of when M is reached when reading through the cigar string
M_Reached = False

POS = int(POS)
ADJ = 0

CIGAR = AddSpaces(CIGAR)
CIGAR = CIGAR.split()
for pos in range(len(CIGAR)):
if "M" in CIGAR[pos]:
ADJ += int(CIGAR[pos].replace("M", ""))
M_Reached = True
elif "N" in CIGAR[pos]:
ADJ += int(CIGAR[pos].replace("N", ""))
elif "D" in CIGAR[pos]:
ADJ += int(CIGAR[pos].replace("D", ""))
elif "S" in CIGAR[pos] and M_Reached == True:
ADJ += int(CIGAR[pos].replace("S", ""))
elif "H" in CIGAR[pos] and M_Reached == True:
ADJ += int(CIGAR[pos].replace("H", ""))

POS += ADJ
return POS


def main():
'''Removes PCR duplicates from SAM file, assuming SAMTOOLS sort has been used first'''

# Initialize forward and reverse dictionaries for duplicates
chr_f = {}
chr_r = {}

# Get file arguments
args = get_arguments()
File = args.file

if args.paired is not None or args.umi is not None:
print("Hey friend! That's not supported. You can literally only use the -f flag which is required.")
sys.exit()

In_File = open(File, "r")

# Create File Names
deduped = File.replace(".sam", "") + "_Deduped.sam"

# Open Files
Deduped = open(deduped, "w")

# Read all the lines in file
while True:
ToWrite = In_File.readline()
Line = ToWrite.strip("\n")
if not Line:
break
if Line.startswith("@") != True:
Line = Line.split()

# Get the variables for each line and test UMI
UMI = Line[0].split(":")[7]
if TestUMI(UMI) == True:
CHR = Line[2]
POS = Line[3]
CIGAR = Line[5]

# Purge both dictionaries if new chromosome
if CHR not in chrs:
chrs.append(CHR)
chr_f = {}
chr_r = {}

# Determine if the read is forward or reverse
Forward = ForOrRev(Line[1])

# Adjust the position, write to appropriate file, and keep record in dictionary
if Forward == True:
POS = AdjustPosFor(CIGAR, POS)
key = str(CHR) + "_" + str(UMI) + "_" + str(POS)
if key not in chr_f:
chr_f[key] = 1
Deduped.write(ToWrite)

elif Forward == False:
POS = AdjustPosRev(CIGAR, POS)
key = str(CHR) + "_" + str(UMI) + "_" + str(POS)
if key not in chr_r:
chr_r[key] = 1
Deduped.write(ToWrite)

Deduped.close()

main()
Loading