2018-BGMP · David-Degnan · Nov 1, 2018 · Nov 1, 2018 · Nov 2, 2018
diff --git a/Degnan_deduper_Pt1.Rmd b/Degnan_deduper_Pt1.Rmd
@@ -0,0 +1,149 @@
+---
+title: "Deduper Part One"
+author: "David Degnan"
+date: "October 19th, 2018"
+output:
+  pdf_document: default
+---
+
+```{r setup, include=FALSE, message=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+<br/ >
+
+## 1. Define the problem
+
+Illumina sequencing libaries contain duplicates due to PCR biases, as certain 
+sequences are more likely to be amplified than others. This creates a problem
+during downstream analyses (differential expression, for example) as certain 
+reads will be over-represented. These PCR duplicates need to be removed. 
+
+<br/ >
+
+## 2. Write examples
+
+Examples have been moved to the end of the file for R Markdown readability. 
+
+<br/ >
+
+## 3. Develop your algorithm using psuedocode
+
+*Step One:* The files will undergo Samtools sort and be organized by chromosome and position.
+
+*Step Two:* A 30 line "sliding window" will be generated and stored as a dictionary (key will be line number, and the line itself will be the value, split with the python .split() function). Reads will either be written to a discard file (Step Three) or written to a new file (Step Four). This will continue until the end of the file is reached.
+
+*Step Three:* 30 lines will be read at a time. A line will be written to a discard file (with "_discarded" at the end of the file name before the extension) if: 
+  * If it contains any N's in the UMI.
+  * The chromosome number, start position, and UMI are the same as any of the following 30 lines. 
+  * Note that start positions will be adjusted according to soft clipping, insertion, and deletion. This information will be taken from the cigar string.
+  * The bitwise flag will keep track of which strand it's on. If it's only on a different strand once, it is not a duplicate and will not be written to the _discarded file. 
+
+*Step Four:* If no conflicts are found, the reads will be written to an output file that will contain the name of the input file with a "_deduped" at the end of it (before the extension).
+*Step Five:* The sliding window will then slide down to the next read. In other words, the first entry in the dictionary will become the 31st line of the file (this can be tracked with a modulus, ie 31 % 30 = 1). The second entry in the dictionary will then be compared to all the other lines, testing for any conflicts (Step Three) and if none found, being written to the "_deduped" file (Step Four). This continues for each line of the file until the end is reached. 
+
+<br/ >
+
+## 4. Determine high level functions
+
+<br/ >
+
+### STEP ONE: Samtools Sort
+Use samtools to sort the file
+```{r, eval = FALSE, engine = 'bash'}
+purge the modules
+load samtools
+samtools sort input file
+convert sorted input file to a sam file
+```
+
+<br/ >
+
+### STEP TWO: Fill the sliding window
+
+Open input file, and generate the two output files
+```{r, engine = 'bash', eval = FALSE}
+open all three of the files, allowing an argparse for the input file
+```
+
+Initialize sliding window and read in the first thirty lines
+```{r, engine = 'bash', eval = FALSE}
+intialize an open window
+for i in range(30):
+  entry number is key and line is file
+```
+
+<br/ >
+
+### STEP THREE/FOUR: Start testing
+
+UMIs with any N's will need to be tossed
+```{r, engine = 'bash', eval = FALSE}
+define test UMI as a function that
+  returns true if the UMI has any Ns
+  returns false if the UMI does not
+TEST: UMI with an N and an UMI without
+```
+
+Bitwise flag checker. If line passes, this function adds an R to UMI. A singular R will be written to deduped, and multiple will be written to duplicate.
+```{r, engine = 'bash', eval = FALSE}
+if the 5th bit is on:
+  then the read is on the reverse strand
+  add an R to the UMI
+
+TEST: With 5th bit set and without
+```
+
+Adjust start position based on soft clipping
+```{r, engine = 'bash', eval = FALSE}
+given the number preceeding the s:
+  add to the position number whatever preceeded the s
+
+TEST: given an S adjust the position
+```
+
+If necessary, adjust for insertions or deletions.
+```{r, engine = 'bash', eval = FALSE}
+for deletion:
+  add to the position number
+for insertion:
+  subtract from the position number
+
+TEST: given an I or a D, adjust the position
+```
+
+Given chromosome, start position, and UMI, determine if they all match any other sequence
+```{r, engine = 'bash', eval = FALSE}
+if chromosome matches any in sliding window
+and start position matches the same line in the sliding window
+and UMI matches the same line in the sliding window:
+  write file to duplicate
+else:
+  write file to deduped
+
+TEST: give matches and ones that do not match
+```
+
+<br/ >
+
+### STEP FIVE: Read in next line
+```{r, engine = 'bash', eval = FALSE}
+position in window = line number % 30
+save line into that position as value
+```
+
+<br/ >
+
+## 2. Examples
+```{r, echo = FALSE, message=FALSE, warning=FALSE}
+input <- read.table("./Input.txt")
+output1 <- read.table("./Output_deduped.txt")
+output2 <- read.table("./Output_duplicate.txt")
+labels <- c("QNAME", "FLAG", "NAME", "POS", "MAPQ", "CIG", "R", "P", "LEN", "SEQ", "QUAL")
+colnames(input) <- labels
+colnames(output1) <- labels
+colnames(output2) <- labels
+knitr::kable(input, caption = "Input Sam File", row.names = FALSE)
+knitr::kable(output1, caption = "Output Deduped File", row.names = FALSE)
+knitr::kable(output2, caption = "Output Duplicate File", row.names = FALSE)
+```
diff --git a/Degnan_deduper_Pt1.pdf b/Degnan_deduper_Pt1.pdf
diff --git a/degnan_deduper.py b/degnan_deduper.py
@@ -0,0 +1,173 @@
+#!/usr/bin/python
+
+import argparse
+import sys
+
+# Declare tuple of UMIs
+UMIList = ("AACGCCAT","AAGGTACG","AATTCCGG","ACACAGAG","ACACTCAG","ACACTGTG",
+           "ACAGGACA","ACCTGTAG","ACGAAGGT","ACGACTTG","ACGTCAAC","ACGTCATG",
+           "ACTGTCAG","ACTGTGAC","AGACACTC","AGAGGAGA","AGCATCGT","AGCATGGA",
+           "AGCTACCA","AGCTCTAG","AGGACAAC","AGGACATG","AGGTTGCT","AGTCGAGA",
+           "AGTGCTGT","ATAAGCGG","ATCCATGG","ATCGAACC","ATCGCGTA","ATCGTTGG",
+           "CAACGATC","CAACGTTG","CAACTGGT","CAAGTCGT","CACACACA","CAGTACTG",
+           "CATCAGCA","CATCGTTC","CCAAGGTT","CCTAGCTT","CGATTACG","CGCCTATT",
+           "CGTTCCAT","CGTTGGAT","CTACGTTC","CTACTCGT","CTAGAGGA","CTAGGAAG",
+           "CTAGGTAC","CTCAGTCT","CTGACTGA","CTGAGTGT","CTGATGTG","CTGTTCAC",
+           "CTTCGTTG","GAACAGGT","GAAGACCA","GAAGTGCA","GACATGAG","GAGAAGAG",
+           "GAGAAGTC","GATCCTAG","GATGTCGT","GCCGATAT","GCCGATTA","GCGGTATT",
+           "GGAATTGG","GGATAACG","GGCCTAAT","GGCGTATT","GTCTTGTC","GTGATGAG",
+           "GTGATGTC","GTGTACTG","GTGTAGTC","GTTCACCT","GTTCTGCT","GTTGTCGA",
+           "TACGAACC","TAGCAAGG","TAGCTAGC","TAGGTTCG","TATAGCGC","TCAGGACT",
+           "TCCACATC","TCGACTTC","TCGTAGGT","TCGTCATC","TGAGACTC","TGAGAGTG",
+           "TGAGTGAG","TGCTTGGA","TGGAGTAG","TGTGTGTG","TTCGCCTA","TTCGTTCG")
+
+# List of chromosomes encountered 
+chrs = []
+
+def get_arguments():
+    '''Takes one input file'''
+    parser = argparse.ArgumentParser(description = "Deduper Inputs")
+    parser.add_argument("-f", "--file", help= "Designates the Input File", required=True, type=str)
+    parser.add_argument("-p", "--paired", help= "(Unsupported) Designates if paired end", required=False, type=str)
+    parser.add_argument("-u", "--umi", help= "(Unsupported) Designates the UMI list", required=False, type=str)
+    return parser.parse_args()
+
+def TestUMI(UMI):
+    '''Tests if UMI contains 8nt, no N's, and is in list of 96 UMIs'''
+    if "N" not in UMI and len(UMI) == 8 and UMI in UMIList:
+        return True
+    else:
+        return False
+
+def ForOrRev(FLAG):
+    '''Determines from FLAG if the read is on the forward strand (True) or reverse strand (False)'''
+    FLAG = int(FLAG)
+    Forward = True
+    if (FLAG & 16 == 16):
+        Forward = False
+    return Forward
+
+def AdjustPosFor(CIGAR, POS):
+    '''Adjust the position of the alignment based on the cigar for the forward strand'''
+
+    # NOTE: No I, D, or N will be reached before the first M
+    # This keeps track of when the M is reached when reading through the cigar string
+    M_Reached = False  
+    POS = int(POS)
+    for i in range(len(CIGAR)):
+        if M_Reached == True:
+            break
+        elif CIGAR[i] == "M":
+            M_Reached = True
+        elif CIGAR[i] == "S":
+            POS -= int(CIGAR[:i])
+        elif CIGAR[i] == "H":
+            POS -= int(CIGAR[:i])
+
+    return POS
+
+def AddSpaces(CIGAR):
+    '''Add spaces after characters to allow for splitting'''
+    # Determine if each character is present
+    CIGAR = CIGAR.replace("M", "M ")
+    CIGAR = CIGAR.replace("I", "I ")
+    CIGAR = CIGAR.replace("N", "N ")
+    CIGAR = CIGAR.replace("D", "D ")
+    CIGAR = CIGAR.replace("S", "S ")
+    CIGAR = CIGAR.replace("H", "H ")
+    return CIGAR
+
+def AdjustPosRev(CIGAR, POS):
+    '''Adjust the position of the alignment based on the cigar for the reverse strand'''
+
+    # This keeps track of when M is reached when reading through the cigar string
+    M_Reached = False
+
+    POS = int(POS)
+    ADJ = 0
+
+    CIGAR = AddSpaces(CIGAR)
+    CIGAR = CIGAR.split()
+    for pos in range(len(CIGAR)):
+        if "M" in CIGAR[pos]:
+            ADJ += int(CIGAR[pos].replace("M", ""))
+            M_Reached = True
+        elif "N" in CIGAR[pos]:
+            ADJ += int(CIGAR[pos].replace("N", ""))
+        elif "D" in CIGAR[pos]: 
+            ADJ += int(CIGAR[pos].replace("D", ""))
+        elif "S" in CIGAR[pos] and M_Reached == True:
+            ADJ += int(CIGAR[pos].replace("S", ""))
+        elif "H" in CIGAR[pos] and M_Reached == True:
+            ADJ += int(CIGAR[pos].replace("H", ""))
+
+    POS += ADJ
+    return POS    
+
+
+def main():
+    '''Removes PCR duplicates from SAM file, assuming SAMTOOLS sort has been used first'''
+
+    # Initialize forward and reverse dictionaries for duplicates
+    chr_f = {}
+    chr_r = {}
+
+    # Get file arguments
+    args = get_arguments()
+    File = args.file
+
+    if args.paired is not None or args.umi is not None:
+        print("Hey friend! That's not supported. You can literally only use the -f flag which is required.")
+        sys.exit()
+
+    In_File = open(File, "r")
+
+    # Create File Names
+    deduped = File.replace(".sam", "") + "_Deduped.sam"
+
+    # Open Files
+    Deduped = open(deduped, "w")
+
+    # Read all the lines in file
+    while True:
+        ToWrite = In_File.readline()
+        Line = ToWrite.strip("\n")
+        if not Line:
+            break
+        if Line.startswith("@") != True:
+            Line = Line.split()
+
+            # Get the variables for each line and test UMI
+            UMI = Line[0].split(":")[7]
+            if TestUMI(UMI) == True:
+                CHR = Line[2]
+                POS = Line[3]
+                CIGAR = Line[5]
+
+                # Purge both dictionaries if new chromosome
+                if CHR not in chrs:
+                    chrs.append(CHR)
+                    chr_f = {}
+                    chr_r = {}
+
+                # Determine if the read is forward or reverse
+                Forward = ForOrRev(Line[1])
+
+                # Adjust the position, write to appropriate file, and keep record in dictionary
+                if Forward == True:
+                    POS = AdjustPosFor(CIGAR, POS)
+                    key = str(CHR) + "_" + str(UMI) + "_" + str(POS)
+                    if key not in chr_f:
+                        chr_f[key] = 1
+                        Deduped.write(ToWrite)             
+
+                elif Forward == False:
+                    POS = AdjustPosRev(CIGAR, POS)
+                    key = str(CHR) + "_" + str(UMI) + "_" + str(POS)
+                    if key not in chr_r:
+                        chr_r[key] = 1
+                        Deduped.write(ToWrite)    
+
+    Deduped.close()
+
+main()