-
Notifications
You must be signed in to change notification settings - Fork 0
/
ATMKO_HSNpt_pheno_QAQC.Rmd
209 lines (157 loc) · 5.64 KB
/
ATMKO_HSNpt_pheno_QAQC.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
---
title: "ATM KO HS/Npt Phenotype QA/QC"
author: "Daniel M. Gatti, Ph.D."
date: "8/22/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(qtl2)
library(readxl)
library(survival)
library(survminer)
library(tidyverse)
base_dir = '/media/dmgatti/data1/ColoState/ATM'
pheno_dir = file.path(base_dir, 'data', 'phenotypes')
pheno_file = file.path(pheno_dir, 'AT phenotype.xlsx')
hap_dir = file.path(base_dir, 'haplo_reconstr')
qtl2_dir = file.path(hap_dir, 'qtl2')
covar_file = file.path(qtl2_dir, 'covar.csv')
```
Read in the data.
```{r read_data}
pheno = readxl::read_xlsx(pheno_file, sheet = 'Sheet1', range = 'A1:R793') %>%
rename_with(str_to_lower) %>%
rename(mouse = `mouse #`,
coat =`coat color`,
necropsy = `necropsy date`,
euth = `fd/euth`,
tumor = `thymic tumor`) %>%
select(-(albino:brown), -`thymus rna`, -tail) %>%
mutate(sex = str_to_upper(sex))
covar = read_csv(covar_file) %>%
separate(id, into = c('batch', 'mouse'), sep = '\\.') %>%
select(-batch, -sample) %>%
mutate(mouse = if_else(str_detect(mouse, '^[0-9]'), str_c('A', mouse), mouse)) %>%
filter(str_detect(mouse, '^A[0-9]+$'))
```
How many mice are there in the phenotypes?
```{r num_mice}
distinct(pheno, mouse) %>%
nrow()
```
How many mice are there in the genotypes?
```{r num_mice_geno}
distinct(covar, mouse) %>%
nrow()
```
Are there any duplicates in the phenotype data?
```{r duplicate_mice}
count(pheno, mouse) %>%
filter(n > 1) %>%
left_join(pheno) %>%
arrange(mouse)
```
Yes, there are 4 mice with duplicated rows. I don't see a way of distinguishing A1540, A1855 or A56 becuase both mice are the same sex and coat color. I may be able to match A3752 with a mouse in the genotype data based on coat color or sex. I estimated the sex using the genotype data and this is in the 'covar' file.
I'm removing A1540, A1855 or A56 because I have no way of assigning them the correct genotype.
```{r remove_dupl}
dupl = count(pheno, mouse) %>%
filter(n > 1 & mouse != 'A3752') %>%
pull(mouse)
pheno = pheno %>%
filter(!mouse %in% dupl)
```
How many mice are in both phenotypes and genotypes?
```{r mouse_intersect}
inner_join(pheno, covar, by = 'mouse') %>%
nrow()
```
Which mice are not in the intersection?
```{r mouse_setdiff}
anti_join(pheno, covar, by = 'mouse')
```
There are 86 mice that were not genotyped.
Compare the sex in the phenotype file with sex from the genotypes.
```{r check_sex}
sex_check = pheno %>%
select(mouse, sex) %>%
rename(sex_p = sex) %>%
full_join(covar, by = 'mouse') %>%
rename(sex_g = sex) %>%
mutate(sex_ok = sex_p == sex_g)
```
Report the number of samples that match, don't match and are NA.?
```{r check_sex2}
count(sex_check, sex_ok)
```
The NA samples are samples that were not genotyped. Which samples have sex mismatches?
```{r check_sex3}
sex_check %>%
select(-sex_p) %>%
filter(!sex_ok) %>%
left_join(pheno) %>%
arrange(mouse)
```
I can resolve A3752 since there is a male sample in the genotype data. I don't see a way to resolve the other samples. They don't have consecutive IDs that might suggest swapping in a tube or plate well.
```{r remove_sample_mismatches}
remove = sex_check %>%
select(-sex_p) %>%
filter(!sex_ok) %>%
left_join(pheno) %>%
filter(mouse != 'A3752') %>%
pull(mouse)
pheno = pheno %>%
filter(!mouse %in% remove) %>%
filter(!(mouse == 'A3752' & sex == 'F'))
```
Verify that the sex matches now or that the samples aren't in the genotype data.
```{r veryfy_sex}
pheno %>%
select(mouse, sex) %>%
rename(sex_p = sex) %>%
left_join(select(covar, mouse, sex_g = sex), by = 'mouse') %>%
mutate(sex_ok = sex_p == sex_g) %>%
count(sex_ok)
```
There are 86 samples in the genotypes that aren't in the phenotypes.
Remove mice with no necropsy date.
```{r remove_no_necropsy}
pheno = pheno %>%
filter(!is.na(necropsy))
```
We now have `r nrow(pheno)` samples.
I will subset the phenotypes later for mapping. But first, I want to make survival plots. There are some notes in the data file about sick and fighting mice, but I'm not sure how to evaluate those. The study is looking for survival due to thymic tumors. I'm not sure how to handle mice that died without thymic tumors.
```{r}
pheno_surv = pheno %>%
select(mouse, days, sex, euth, tumor) %>%
mutate(event = tumor == 'yes')
```
I don't see any survival difference between sexes.
```{r surv_by_sex}
mod = survfit(Surv(days, event) ~ sex, data = pheno_surv)
ggsurvplot(mod, data = pheno_surv)
```
```{r surv_by_sex}
mod = survfit(Surv(days, event) ~ 1, data = pheno_surv)
ggsurvplot(mod, data = pheno_surv)
```
Organize the brain phenotypes.
```{r count_brain}
pheno = pheno %>%
mutate(brain = str_to_upper(brain))
count(pheno, brain)
```
I'm going to create 5 ordered categories based on the categories above: N, AB, CL, CL+, CL++, CL+++.
```{r create_brain_categories}
pheno = pheno %>%
mutate(brain = if_else(str_detect(brain, '---'), NA_character_, brain),
brain = str_replace(brain, '\\?', ''),
brain = if_else(brain == 'CL++;AB', 'CL++', brain))
count(pheno, brain)
```
Write out the cleaned phenotype data.
```{r write_pheno}
pheno = pheno %>%
filter(mouse %in% covar$mouse)
write_csv(pheno, file = file.path(pheno_dir, 'atmko_hs_phenotypes_cleaned.csv'))
```